Compare commits

10 commits — `304cf4fde4...main`

Commits (SHA1):

- 557994f7d9
- e9b2e42893
- 68e8d532be
- 473d02f71a
- 4798edcea7
- 19bfdaecbe
- 4ccd918eb1
- a869e8c7ba
- 340293ab94
- 050cd49124
.env.example — 13 changed lines

```diff
@@ -6,9 +6,12 @@
 
 # --- ChatGPT ---
 # How to get: open chatgpt.com in Chrome → F12 → Application tab
-# → Cookies → https://chatgpt.com → find "__Secure-next-auth.session-token" → copy Value
-# Token type: JWT (starts with "eyJ"). Typically valid for ~7 days.
+# → Cookies → https://chatgpt.com → find the two cookie chunks:
+# __Secure-next-auth.session-token.0 (starts with "eyJ") → CHATGPT_SESSION_TOKEN
+# __Secure-next-auth.session-token.1 (the remainder) → CHATGPT_SESSION_TOKEN_1
+# Token type: JWE. Typically valid for ~7 days.
 CHATGPT_SESSION_TOKEN=
+CHATGPT_SESSION_TOKEN_1=
 
 # ChatGPT Projects (optional): comma-separated list of project gizmo IDs.
 # Project conversations are NOT included in the default /conversations listing.
@@ -46,9 +49,9 @@ JOPLIN_API_URL=http://localhost:41184
 # JOPLIN_REQUEST_TIMEOUT=30
 
 # --- Cache ---
-# Where the sync manifest and logs are stored (default: ~/.ai-chat-exporter)
-CACHE_DIR=~/.ai-chat-exporter
+# Where the sync manifest is stored (default: ./cache, inside the install directory)
+CACHE_DIR=./cache
 
 # --- Logging ---
 # Log file path. Set to "none" to disable file logging.
-LOG_FILE=~/.ai-chat-exporter/logs/exporter.log
+LOG_FILE=./cache/logs/exporter.log
```
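The `.env.example` change above splits the ChatGPT session token into two cookie chunks. As a rough illustration of why both chunks are needed, here is a hypothetical helper (not from this repo) that assembles the `Cookie` header the way a browser would send it; the cookie names come from the diff, the join logic is an assumption:

```python
import os

def chatgpt_cookie_header(token_0: str, token_1: str) -> str:
    """Send both chunks as separate cookies, mirroring what the browser does.

    Hypothetical helper — the exporter's real request code may differ.
    """
    return (
        f"__Secure-next-auth.session-token.0={token_0}; "
        f"__Secure-next-auth.session-token.1={token_1}"
    )

# Chunks normally come from CHATGPT_SESSION_TOKEN / CHATGPT_SESSION_TOKEN_1.
header = chatgpt_cookie_header(
    os.getenv("CHATGPT_SESSION_TOKEN", "eyJhbGciOi"),
    os.getenv("CHATGPT_SESSION_TOKEN_1", "remainder"),
)
```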
.gitignore (vendored) — 4 changed lines

```diff
@@ -25,10 +25,14 @@ exports/
 !CHANGELOG.md
 
 # Cache and logs
+cache/
 .ai-chat-exporter/
 logs/
 *.log
 
+# Test tracking
+test-plan.csv
+
 # Editor / OS
 .DS_Store
 .idea/
```
CHANGELOG.md — 51 changed lines

````diff
@@ -3,6 +3,57 @@
 All notable changes to this project will be documented here.
 Format follows [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
+## [0.4.1] - Unreleased
+### Added
+- ChatGPT `execution_output` (Code Interpreter / `container.exec` / `python`) renders as a `tool_result` block with `tool_name` from `author.name`, `is_error` from `metadata.aggregate_result.status`, and the optional `summary` line populated from `metadata.reasoning_title`. Captured live during planning.
+- ChatGPT `system_error` content (e.g. browse-service 503) renders as an error `tool_result` block with `tool_name` from `author.name` (typically `"web"`).
+- ChatGPT `tether_browsing_display` populated case (defensive, not observed in real data) renders as a `tool_result` block; transient spinner placeholders (empty `result`+`summary`) skip silently with DEBUG log.
+- `tool_result` block schema gains optional `summary: str | None` field, rendered as italic line between header and fenced output.
+- `tool_result` rendering shows `tool_name` in the header when present (e.g. `📤 **Result: container.exec**`); when absent, header stays as `📤 **Result**` (no regression).
+- Markdown exporter: `_ROLE_LABELS["tool"] = ("🔧 Tool", "tool")` so tool-role messages render under a recognisable header instead of the generic fallback.
+- 11 new tests covering all four cases plus the conv_id fallback (192 total, all passing).
+
+### Fixed
+- ChatGPT `normalize_conversation` now reads `conversation_id` as a fallback for `id`. Live ChatGPT detail responses use `conversation_id` at top level; fixtures and listing summaries use `id`. Without the fallback, normalized conversations had empty `id` (visible as blank `conversation_id:` in YAML frontmatter and missing context in WARNING log lines).
+
+### Migration
+- No new schema breaks; `tool_result` blocks gain a `summary` field that defaults to None on legacy data. Existing exports re-render cleanly with the cache-clear-and-export workflow from v0.4.0.
+
+## [0.4.0] - Unreleased
+### Added
+- Rich content support: messages now carry an ordered `blocks` list (text, code, thinking, tool_use, tool_result, citation, image_placeholder, file_placeholder, unknown)
+- ChatGPT voice mode: `audio_transcription` parts render as text blocks; `audio_asset_pointer` and `real_time_user_audio_video_asset_pointer` render as `📎 File attached` placeholders with size and duration metadata
+- ChatGPT Custom Instructions: `user_editable_context` and `model_editable_context` messages now appear in exports (were silently dropped — pre-existing bug fixed); rendered with a `> ℹ️ Hidden context` marker driven by the `is_visually_hidden_from_conversation` flag
+- Image placeholders for `image_asset_pointer` parts (uploads + DALL-E) inside `multimodal_text` and at message level
+- Defensive Claude block extraction: `text`, `thinking`, `tool_use`, `tool_result` (including nested-block flattening), `image` blocks (untested against real data; will fix-forward in v0.4.1 if real shapes diverge)
+- `LossReport` summary table emitted at end of every `export` run, breaking down `unknown blocks` and `extraction failures` by raw type so silently-dropped data becomes visible
+- `_safe_fence` helper picks a fence longer than any backtick run in extracted content, preventing embedded triple-backticks from corrupting downstream rendering (verified live in Joplin during planning)
+- `unknown` blocks render as `> ⚠️ Unsupported content` with the raw type, observed top-level keys, and reason — so future API additions are visible rather than silent
+
+### Changed
+- ChatGPT role filter (previously dropped `tool` and `system` messages) is **lifted**: all roles now route through normal extraction; truly empty messages skip via the existing empty-content guard
+- Markdown rendering moves from provider-time to exporter-write-time. Providers produce blocks; exporters call `render_blocks_to_markdown` at write time. This unblocks future Obsidian/HTML exporters
+- `BaseProvider.normalize_conversation` signature now accepts an optional `LossReport` parameter (breaking change for any future custom subclass; FileProvider hasn't shipped yet)
+- `o1`/`o3` reasoning subparts inside `text` content_type messages remain rendered as plain text (defensive; reclassification to `thinking` block deferred until live shape is captured)
+
+### Fixed
+- `user_editable_context` / `model_editable_context` extraction (parts-vs-direct-fields mismatch) — Custom Instructions are no longer silently dropped from every conversation
+
+### Migration
+- Existing exports are not re-rendered automatically. To pick up v0.4.0 rendering for previously exported conversations:
+```
+python -m src.main cache --clear
+python -m src.main export --provider all
+```
+- JSON exports: messages now contain `blocks` (typed structured content) and may omit the legacy `content` field. External consumers reading JSON should prefer `blocks`.
+- Per-conversation message counts may increase: previously-dropped Custom Instructions, image-only user turns, and tool-only assistant turns now appear.
+
+### Out of scope (deferred to v0.5.0+)
+- Binary downloads of images and audio assets (placeholders show metadata only; `content not preserved in this export`)
+- Joplin resource upload for embedded media
+- Filename resolution for `file-XYZ` / `sediment://` references
+- Speculative ChatGPT types (`tether_browsing_display`, `tether_quote`) and DALL-E assistant images — fall through to `unknown` blocks if encountered
+
 ## [0.2.0] - Unreleased
 ### Added
 - Joplin import automation: `joplin` command syncs exported Markdown files to Joplin as notes
````
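The `_safe_fence` bullet in the v0.4.0 changelog above describes picking a fence longer than any backtick run in the extracted content. A minimal sketch of that idea — the repo's actual implementation may differ in details:

```python
import re

def safe_fence(content: str) -> str:
    """Return a backtick fence at least one longer than any run in content."""
    longest = max((len(run) for run in re.findall(r"`+", content)), default=0)
    return "`" * max(3, longest + 1)

print(safe_fence("plain text"))  # three backticks
print(safe_fence("has ``` inside"))  # four backticks
```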
FUTURE.md — 50 changed lines

```diff
@@ -7,6 +7,7 @@ of these additions straightforward.
 **Completed:**
 - v0.1.0 — Core export: ChatGPT + Claude, incremental sync, Markdown + JSON output
 - v0.2.0 — Joplin import automation (`joplin` command, create/update notes, notebook auto-creation)
+- v0.4.0 — Rich content support: typed message blocks (text, code, thinking, tool_use, tool_result, image_placeholder, file_placeholder, unknown); ChatGPT voice transcripts as text + audio placeholders; Custom Instructions extraction; data-loss visibility via `LossReport` summary and visible `unknown` blocks
 
 ---
 
@@ -58,26 +59,43 @@ export command to accept a pre-downloaded export ZIP or JSON.
 
 ---
 
-## Rich Content Support (v0.4.0)
+## Binary Content Downloads (v0.5.0)
 
-Currently only text content is exported. Future versions should handle:
+v0.4.0 ships placeholders for images and audio assets but does not download
+the binary content. The `_safe_fence`-wrapped placeholders include the asset
+reference (`sediment://...` or `file-service://...`), MIME type, size, and
+duration where available; the actual bytes are not preserved.
 
-### Claude
-- Artifacts (code, documents, HTML) — export as separate files, link from Markdown
-- Uploaded images — download and embed or link
-- Extended thinking/reasoning blocks — include as collapsible sections
-- Tool call results and web search citations — include as footnotes or appendices
+Next steps:
+- Download attached images alongside the Markdown export, save under a
+`media/` sibling directory with a stable filename derived from the asset
+reference.
+- Replace `image_placeholder` rendering with an inline image
+reference once the file is on disk.
+- Joplin integration: upload binaries as Joplin resources via `POST /resources`,
+rewrite the rendered Markdown to use `:/resourceId` references, and track
+the resource ID in the cache manifest so re-syncs stay idempotent.
+- DALL-E images on the assistant side: not observed in this user's data; the
+code path exists (`source = "model_generated"`) but is untested.
 
-### ChatGPT
-- DALL-E generated images — download and embed or link
-- Code Interpreter outputs — export code and results
-- File attachments — download and reference
-- Voice transcripts — include as text
+The block-level schema is already in place — only the file-fetch + rewrite
+layer needs to be added. See the `image_placeholder` and `file_placeholder`
+block definitions in `src/blocks.py`.
 
-Implementation note: the normalized message schema already includes a
-`content_type` field placeholder. When this work begins, extend the schema
-rather than replacing it. Non-text content already logs a WARNING when
-encountered so users can see what was skipped.
+## Reclassify o1/o3 Reasoning Subparts (v0.4.1)
+
+v0.4.0 leaves dict parts inside `text` content_type messages with shape
+`{"summary": ..., "content": ...}` rendered as plain text (defensive — the
+shape was inferred from a code comment, not captured live). Once a real
+reasoning conversation is captured, reclassify these as `thinking` blocks.
+
+## Suppress Hidden Context (v0.4.x)
+
+If Custom Instructions duplication across conversations becomes a storage
+problem, add `EXPORTER_INCLUDE_HIDDEN_CONTEXT=false` env var. The toggle is
+a single `os.getenv()` check at the start of
+`_extract_editable_context_blocks` in `src/providers/chatgpt.py` — return
+empty list if disabled.
 
 ---
 
```
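The Joplin-integration step in the FUTURE.md diff above calls for rewriting rendered Markdown to `:/resourceId` references after each binary is uploaded. A hedged sketch of that rewrite step — the function name, placeholder shape, and resource ID below are illustrative, not the repo's actual code:

```python
def rewrite_asset_refs(markdown: str, ref_to_resource: dict[str, str]) -> str:
    """Swap raw asset references (e.g. sediment://...) for Joplin :/id links."""
    for asset_ref, resource_id in ref_to_resource.items():
        markdown = markdown.replace(asset_ref, f":/{resource_id}")
    return markdown

md = "![image](sediment://file_abc123)"
print(rewrite_asset_refs(md, {"sediment://file_abc123": "4f5ac0ffee"}))
# → ![image](:/4f5ac0ffee)
```

Tracking the mapping in the cache manifest, as the diff suggests, is what keeps re-syncs idempotent: a second run finds the `:/id` already in place and skips the upload.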
114
README.md
114
README.md
@@ -28,6 +28,8 @@ This tool is designed for a single user backing up their own conversations. Do n
|
|||||||
|
|
||||||
## Installation
|
## Installation
|
||||||
|
|
||||||
|
### Linux / macOS
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
git clone <repo-url>
|
git clone <repo-url>
|
||||||
cd ai-chat-exporter
|
cd ai-chat-exporter
|
||||||
@@ -36,6 +38,37 @@ source .venv/bin/activate
|
|||||||
pip install -e ".[dev]"
|
pip install -e ".[dev]"
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### Windows
|
||||||
|
|
||||||
|
No admin access required. Run these in **Command Prompt** (`cmd.exe`) — it's the simplest option on Windows because it doesn't have PowerShell's script execution policy restrictions.
|
||||||
|
|
||||||
|
```bat
|
||||||
|
git clone <repo-url>
|
||||||
|
cd ai-chat-exporter
|
||||||
|
python -m venv .venv
|
||||||
|
.venv\Scripts\activate
|
||||||
|
pip install -e ".[dev]"
|
||||||
|
```
|
||||||
|
|
||||||
|
All `ai-chat-exporter` commands work identically in Command Prompt.
|
||||||
|
|
||||||
|
**Using PowerShell instead?** If you prefer PowerShell, you may need to allow script execution first (one-time, current user only):
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
|
||||||
|
```
|
||||||
|
|
||||||
|
Then activate the venv and run commands the same way.
|
||||||
|
|
||||||
|
**Prerequisites:**
|
||||||
|
- Python 3.11 or later — install from [python.org](https://www.python.org/downloads/windows/). During installation, tick **"Add Python to PATH"**.
|
||||||
|
- Git — install from [git-scm.com](https://git-scm.com/) if not already present.
|
||||||
|
|
||||||
|
**Notes:**
|
||||||
|
- The cache manifest and logs are stored in `cache\` inside the install directory — the same as on Linux.
|
||||||
|
- File permission hardening (`chmod 600`) is silently ignored on Windows — not a concern for single-user desktop use.
|
||||||
|
- Joplin Web Clipper runs on `localhost:41184` on all platforms; no configuration changes needed.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## First Run: Run Doctor
|
## First Run: Run Doctor
|
||||||
@@ -43,7 +76,7 @@ pip install -e ".[dev]"
|
|||||||
Before anything else, validate your setup:
|
Before anything else, validate your setup:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python -m src.main doctor
|
ai-chat-exporter doctor
|
||||||
```
|
```
|
||||||
|
|
||||||
This checks token presence, format, expiry, directory permissions, disk space, and live API connectivity. Fix any failures before proceeding.
|
This checks token presence, format, expiry, directory permissions, disk space, and live API connectivity. Fix any failures before proceeding.
|
||||||
@@ -58,7 +91,7 @@ Session tokens are how your browser stays logged in. This tool uses them to acce
|
|||||||
|
|
||||||
| Provider | Cookie Name | Lifetime | Expiry Detection |
|
| Provider | Cookie Name | Lifetime | Expiry Detection |
|
||||||
|----------|-------------|----------|-----------------|
|
|----------|-------------|----------|-----------------|
|
||||||
| ChatGPT | `__Secure-next-auth.session-token` | ~7 days | JWT `exp` claim (decoded automatically) |
|
| ChatGPT | `__Secure-next-auth.session-token.0` + `.1` | ~7 days | JWT `exp` claim (decoded automatically) |
|
||||||
| Claude | `sessionKey` | ~30 days | Only detectable via 401 response |
|
| Claude | `sessionKey` | ~30 days | Only detectable via 401 response |
|
||||||
|
|
||||||
### Finding Tokens in Chrome DevTools
|
### Finding Tokens in Chrome DevTools
|
||||||
@@ -69,14 +102,18 @@ Session tokens are how your browser stays logged in. This tool uses them to acce
|
|||||||
4. In the left panel, expand **Cookies** and click the site URL
|
4. In the left panel, expand **Cookies** and click the site URL
|
||||||
5. Find the cookie by name and copy its **Value**
|
5. Find the cookie by name and copy its **Value**
|
||||||
|
|
||||||
**ChatGPT:** go to `https://chatgpt.com` → find `__Secure-next-auth.session-token` → copy Value (starts with `eyJ`)
|
**ChatGPT:** go to `https://chatgpt.com` → find **two** cookies:
|
||||||
|
- `__Secure-next-auth.session-token.0` — copy Value (starts with `eyJ`) → `CHATGPT_SESSION_TOKEN`
|
||||||
|
- `__Secure-next-auth.session-token.1` — copy Value → `CHATGPT_SESSION_TOKEN_1`
|
||||||
|
|
||||||
|
ChatGPT splits large session tokens across two cookies to stay under the browser's 4KB cookie limit. Both are required.
|
||||||
|
|
||||||
**Claude:** go to `https://claude.ai` → find `sessionKey` → copy Value
|
**Claude:** go to `https://claude.ai` → find `sessionKey` → copy Value
|
||||||
|
|
||||||
### When Tokens Expire
|
### When Tokens Expire
|
||||||
|
|
||||||
When a token expires you'll see a `401 Unauthorized` error. To refresh:
|
When a token expires you'll see a `401 Unauthorized` error. To refresh:
|
||||||
- Re-run the `auth` wizard: `python -m src.main auth`
|
- Re-run the `auth` wizard: `ai-chat-exporter auth`
|
||||||
- Or manually update the value in your `.env` file
|
- Or manually update the value in your `.env` file
|
||||||
|
|
||||||
---
|
---
|
||||||
@@ -86,7 +123,7 @@ When a token expires you'll see a `401 Unauthorized` error. To refresh:
|
|||||||
The easiest way to configure tokens is the interactive wizard:
|
The easiest way to configure tokens is the interactive wizard:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python -m src.main auth
|
ai-chat-exporter auth
|
||||||
```
|
```
|
||||||
|
|
||||||
This walks you through finding your token, validates it, shows the expiry date (ChatGPT only), and offers to write it to your `.env` automatically. Tokens are never echoed to the terminal.
|
This walks you through finding your token, validates it, shows the expiry date (ChatGPT only), and offers to write it to your `.env` automatically. Tokens are never echoed to the terminal.
|
||||||
@@ -105,7 +142,8 @@ cp .env.example .env
|
|||||||
|
|
||||||
| Variable | Description |
|
| Variable | Description |
|
||||||
|----------|-------------|
|
|----------|-------------|
|
||||||
| `CHATGPT_SESSION_TOKEN` | Your ChatGPT JWT session token (`eyJ…`) |
|
| `CHATGPT_SESSION_TOKEN` | ChatGPT session token chunk `.0` (starts with `eyJ…`) |
|
||||||
|
| `CHATGPT_SESSION_TOKEN_1` | ChatGPT session token chunk `.1` (the remainder) |
|
||||||
| `CHATGPT_PROJECT_IDS` | Comma-separated ChatGPT project IDs (see below) |
|
| `CHATGPT_PROJECT_IDS` | Comma-separated ChatGPT project IDs (see below) |
|
||||||
| `CLAUDE_SESSION_KEY` | Your Claude session key |
|
| `CLAUDE_SESSION_KEY` | Your Claude session key |
|
||||||
|
|
||||||
@@ -128,8 +166,8 @@ cp .env.example .env
|
|||||||
|
|
||||||
| Variable | Default | Description |
|
| Variable | Default | Description |
|
||||||
|----------|---------|-------------|
|
|----------|---------|-------------|
|
||||||
| `CACHE_DIR` | `~/.ai-chat-exporter` | Where to store the sync manifest |
|
| `CACHE_DIR` | `./cache` | Where to store the sync manifest |
|
||||||
| `LOG_FILE` | `~/.ai-chat-exporter/logs/exporter.log` | Log file path (`none` to disable) |
|
| `LOG_FILE` | `./cache/logs/exporter.log` | Log file path (`none` to disable) |
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -218,7 +256,7 @@ Each provider+project combination maps to a flat Joplin notebook created automat
|
|||||||
### `auth` — Interactive token setup
|
### `auth` — Interactive token setup
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python -m src.main auth
|
ai-chat-exporter auth
|
||||||
```
|
```
|
||||||
|
|
||||||
Guided wizard to find and save session tokens and ChatGPT project IDs. Detects OS and shows the correct DevTools shortcut.
|
Guided wizard to find and save session tokens and ChatGPT project IDs. Detects OS and shows the correct DevTools shortcut.
|
||||||
@@ -226,7 +264,7 @@ Guided wizard to find and save session tokens and ChatGPT project IDs. Detects O
|
|||||||
### `doctor` — Health check
|
### `doctor` — Health check
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python -m src.main doctor
|
ai-chat-exporter doctor
|
||||||
```
|
```
|
||||||
|
|
||||||
Checks: token presence, JWT validity and expiry, directory permissions, disk space, live API reachability. Exits with code 0 if all pass, 1 if any fail.
|
Checks: token presence, JWT validity and expiry, directory permissions, disk space, live API reachability. Exits with code 0 if all pass, 1 if any fail.
|
||||||
@@ -235,31 +273,31 @@ Checks: token presence, JWT validity and expiry, directory permissions, disk spa
|
|||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Export everything (new/updated only)
|
# Export everything (new/updated only)
|
||||||
python -m src.main export
|
ai-chat-exporter export
|
||||||
|
|
||||||
# Single provider
|
# Single provider
|
||||||
python -m src.main export --provider claude
|
ai-chat-exporter export --provider claude
|
||||||
|
|
||||||
# JSON output
|
# JSON output
|
||||||
python -m src.main export --format json
|
ai-chat-exporter export --format json
|
||||||
|
|
||||||
# Both Markdown and JSON
|
# Both Markdown and JSON
|
||||||
python -m src.main export --format both
|
ai-chat-exporter export --format both
|
||||||
|
|
||||||
# Only conversations updated since a date
|
# Only conversations updated since a date
|
||||||
python -m src.main export --since 2024-06-01
|
ai-chat-exporter export --since 2024-06-01
|
||||||
|
|
||||||
# Only conversations in a specific project (case-insensitive substring)
|
# Only conversations in a specific project (case-insensitive substring)
|
||||||
python -m src.main export --project "learning python"
|
ai-chat-exporter export --project "learning python"
|
||||||
|
|
||||||
# Only conversations outside any project
|
# Only conversations outside any project
|
||||||
python -m src.main export --project none
|
ai-chat-exporter export --project none
|
||||||
|
|
||||||
# Write to a custom directory
|
# Write to a custom directory
|
||||||
python -m src.main export --output /path/to/my/notes
|
ai-chat-exporter export --output /path/to/my/notes
|
||||||
|
|
||||||
# Preview without writing anything
|
# Preview without writing anything
|
||||||
python -m src.main export --dry-run
|
ai-chat-exporter export --dry-run
|
||||||
```
|
```
|
||||||
|
|
||||||
Options: `--provider [chatgpt|claude|all]`, `--format [markdown|json|both]`, `--output PATH`, `--since YYYY-MM-DD`, `--project NAME`, `--dry-run`
|
Options: `--provider [chatgpt|claude|all]`, `--format [markdown|json|both]`, `--output PATH`, `--since YYYY-MM-DD`, `--project NAME`, `--dry-run`
|
||||||
@@ -268,16 +306,16 @@ Options: `--provider [chatgpt|claude|all]`, `--format [markdown|json|both]`, `--
|
|||||||
|
|
||||||
```bash
|
```bash
|
||||||
# List all conversations for all providers
|
# List all conversations for all providers
|
||||||
python -m src.main list
|
ai-chat-exporter list
|
||||||
|
|
||||||
# Single provider
|
# Single provider
|
||||||
python -m src.main list --provider chatgpt
|
ai-chat-exporter list --provider chatgpt
|
||||||
|
|
||||||
# Filter by project
|
# Filter by project
|
||||||
python -m src.main list --project "learning python"
|
ai-chat-exporter list --project "learning python"
|
||||||
|
|
||||||
# Only conversations outside any project
|
# Only conversations outside any project
|
||||||
python -m src.main list --project none
|
ai-chat-exporter list --project none
|
||||||
```
|
```
|
||||||
|
|
||||||
Fetches and displays all conversations without exporting them. Useful for verifying what the tool can see before running an export.
|
Fetches and displays all conversations without exporting them. Useful for verifying what the tool can see before running an export.
|
||||||
@@ -286,19 +324,19 @@ Fetches and displays all conversations without exporting them. Useful for verify
|
|||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Sync all pending conversations to Joplin
|
# Sync all pending conversations to Joplin
|
||||||
python -m src.main joplin
|
ai-chat-exporter joplin
|
||||||
|
|
||||||
# Preview what would be synced without sending anything
|
# Preview what would be synced without sending anything
|
||||||
python -m src.main joplin --dry-run
|
ai-chat-exporter joplin --dry-run
|
||||||
|
|
||||||
# Sync a single provider
|
# Sync a single provider
|
||||||
python -m src.main joplin --provider chatgpt
|
ai-chat-exporter joplin --provider chatgpt
|
||||||
|
|
||||||
# Sync only conversations in a specific project
|
# Sync only conversations in a specific project
|
||||||
python -m src.main joplin --project "learning python"
|
ai-chat-exporter joplin --project "learning python"
|
||||||
|
|
||||||
# Sync only conversations outside any project
|
# Sync only conversations outside any project
|
||||||
python -m src.main joplin --project none
|
ai-chat-exporter joplin --project none
|
||||||
```
|
```
|
||||||
|
|
||||||
Reads the local export cache and pushes each exported Markdown file to Joplin as a note. Notebooks are created automatically. Re-running is safe — notes are updated (not duplicated).
|
Reads the local export cache and pushes each exported Markdown file to Joplin as a note. Notebooks are created automatically. Re-running is safe — notes are updated (not duplicated).
|
||||||
@@ -315,20 +353,20 @@ Options: `--provider [chatgpt|claude|all]`, `--project NAME`, `--dry-run`
|
|||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Show statistics
|
# Show statistics
|
||||||
python -m src.main cache --show
|
ai-chat-exporter cache --show
|
||||||
|
|
||||||
# Clear all cached entries (forces full re-export next run)
|
# Clear all cached entries (forces full re-export next run)
|
||||||
python -m src.main cache --clear
|
ai-chat-exporter cache --clear
|
||||||
|
|
||||||
# Clear a single provider
|
# Clear a single provider
|
||||||
python -m src.main cache --clear --provider claude
|
ai-chat-exporter cache --clear --provider claude
|
||||||
```
|
```
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## How the Cache Works
|
## How the Cache Works
|
||||||
|
|
||||||
The cache manifest lives at `~/.ai-chat-exporter/manifest.json` and records every exported conversation: its title, project, `updated_at` timestamp, output file path, and (after Joplin sync) the Joplin note ID.
|
The cache manifest lives at `cache/manifest.json` (inside the install directory) and records every exported conversation: its title, project, `updated_at` timestamp, output file path, and (after Joplin sync) the Joplin note ID.
|
||||||
|
|
||||||
On every `export` run:
|
On every `export` run:
|
||||||
1. Fetch the full conversation list from the provider
|
1. Fetch the full conversation list from the provider
|
||||||
@@ -343,7 +381,7 @@ On every `joplin` run:
|
|||||||
|
|
||||||
**This design makes every run inherently resumable.** If the tool is interrupted for any reason — rate limit, network drop, Ctrl+C, crash — simply re-run the same command. It will skip already-processed conversations and continue from where it stopped.
|
**This design makes every run inherently resumable.** If the tool is interrupted for any reason — rate limit, network drop, Ctrl+C, crash — simply re-run the same command. It will skip already-processed conversations and continue from where it stopped.
|
||||||
|
|
||||||
To force a full re-export: `python -m src.main cache --clear` then re-run export.
|
To force a full re-export: `ai-chat-exporter cache --clear` then re-run export.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -351,7 +389,7 @@ To force a full re-export: `python -m src.main cache --clear` then re-run export
|
|||||||
|
|
||||||
### `401 Unauthorized`
|
### `401 Unauthorized`
|
||||||
Your session token has expired.
|
Your session token has expired.
|
||||||
- Run `python -m src.main auth` to get a new token interactively
|
- Run `ai-chat-exporter auth` to get a new token interactively
|
||||||
- Or manually copy a fresh cookie value into your `.env` file
|
- Or manually copy a fresh cookie value into your `.env` file
|
||||||
|
|
||||||
Note: Claude's `sessionKey` is an opaque string — the only way to know it's expired is the 401 error. ChatGPT JWTs have an `exp` claim that the `doctor` command can decode and display.
|
Note: Claude's `sessionKey` is an opaque string — the only way to know it's expired is the 401 error. ChatGPT JWTs have an `exp` claim that the `doctor` command can decode and display.
|
||||||
@@ -388,13 +426,13 @@ Make sure you've added the project IDs to `CHATGPT_PROJECT_IDS` in your `.env`.
|
|||||||
The provider's internal API may have changed. Run with `--debug`, sanitize the output (remove any personal content), and check the project's GitHub Issues for known fixes.
|
The provider's internal API may have changed. Run with `--debug`, sanitize the output (remove any personal content), and check the project's GitHub Issues for known fixes.
|
||||||
|
|
||||||
### Non-text content warnings
|
### Non-text content warnings
|
||||||
Images, code interpreter outputs, DALL-E generations, and Claude artifacts are not exported in v0.2.0. A WARNING is logged for each skipped item. See `FUTURE.md` for the roadmap.
|
Since v0.4.0, rich content is preserved as typed blocks in the export. ChatGPT voice transcripts render as text and audio assets as `📎 File attached` placeholders with size and duration metadata. Images render as `🖼️ Image attached` placeholders showing the asset reference. Custom Instructions appear under a `> ℹ️ Hidden context` marker. Anything the extractor doesn't recognise renders as a visible `> ⚠️ Unsupported content` block naming the type and observed keys, *and* increments a counter in the post-export summary so you can tell whether real content is being silently skipped. Binary downloads (the actual image/audio bytes) are still deferred — see `FUTURE.md` v0.5.0.
|
||||||
|
|
||||||
### Empty export / all conversations skipped

-No new or updated conversations since your last run. To verify: `python -m src.main cache --show`. To force a full re-export: `python -m src.main cache --clear`.
+No new or updated conversations since your last run. To verify: `ai-chat-exporter cache --show`. To force a full re-export: `ai-chat-exporter cache --clear`.

 ### Filing a bug report

-1. Run with `--debug`: `python -m src.main export --debug 2>&1 | tee debug.log`
+1. Run with `--debug`: `ai-chat-exporter export --debug 2>&1 | tee debug.log`
 2. Remove any personal conversation content from `debug.log`
 3. Open a GitHub Issue with the sanitized log and the exact command you ran
@@ -406,7 +444,7 @@ See `FUTURE.md` for planned features:

 - **v0.2.x** — `export --force` flag; `joplin --force` flag; per-conversation cache reset
 - **v0.3.0** — Official API fallback: parse export ZIP files from ChatGPT/Claude settings
-- **v0.4.0** — Rich content: images, artifacts, code interpreter output, extended thinking
+- **v0.4.x / v0.5.0** — Binary content downloads (images, audio bytes) and Joplin resource upload; reclassify o1/o3 reasoning subparts; optional `EXPORTER_INCLUDE_HIDDEN_CONTEXT` toggle
 - **v0.5.0** — Watch/scheduled mode; Obsidian vault output

 ---
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "ai-chat-exporter"
-version = "0.2.0"
+version = "0.4.1"
 description = "Export ChatGPT and Claude conversation history to Markdown for personal archival in Joplin"
 requires-python = ">=3.11"
 dependencies = [
src/blocks.py (new file, 339 lines)
@@ -0,0 +1,339 @@
"""Typed content blocks for normalized messages.

Providers produce ordered lists of blocks; exporters render them. Living outside
``src/providers/`` deliberately — blocks are a separate concern from extraction
or rendering, shared by both layers.

Block dicts always have ``type`` set to one of the BLOCK_TYPE_* constants.
Construct via the ``make_*`` helpers; never build dicts by hand. The ``unknown``
block constructor REQUIRES a corresponding WARNING log + ``LossReport`` tally
at the call site — see plan §Data-loss visibility.
"""

import json
from typing import Any

BLOCK_TYPE_TEXT = "text"
BLOCK_TYPE_CODE = "code"
BLOCK_TYPE_THINKING = "thinking"
BLOCK_TYPE_TOOL_USE = "tool_use"
BLOCK_TYPE_TOOL_RESULT = "tool_result"
BLOCK_TYPE_CITATION = "citation"
BLOCK_TYPE_IMAGE_PLACEHOLDER = "image_placeholder"
BLOCK_TYPE_FILE_PLACEHOLDER = "file_placeholder"
BLOCK_TYPE_UNKNOWN = "unknown"
BLOCK_TYPE_HIDDEN_CONTEXT_MARKER = "hidden_context_marker"

UNKNOWN_REASON_UNKNOWN_TYPE = "unknown_type"
UNKNOWN_REASON_EXTRACTION_FAILED = "extraction_failed"
UNKNOWN_REASON_ALL_BLOCKS_FAILED = "all_blocks_failed"
UNKNOWN_REASON_UNKNOWN_FIELD_IN_KNOWN_TYPE = "unknown_field_in_known_type"

_OBSERVED_KEYS_LIMIT = 10


# ---------------------------------------------------------------------------
# Constructors
# ---------------------------------------------------------------------------


def make_text_block(text: str) -> dict | None:
    """Return a text block, or None if the text is empty/whitespace-only.

    Returning None lets callers do ``if block: blocks.append(block)`` and prune
    empty blocks at construction time. See plan §Finalizing the message dict.
    """
    if not isinstance(text, str) or not text.strip():
        return None
    return {"type": BLOCK_TYPE_TEXT, "text": text}


def make_code_block(code: str, language: str = "") -> dict | None:
    """Return a code block, or None if code is empty."""
    if not isinstance(code, str) or not code.strip():
        return None
    return {"type": BLOCK_TYPE_CODE, "language": language or "", "code": code}


def make_thinking_block(text: str) -> dict | None:
    """Return a thinking block, or None if empty."""
    if not isinstance(text, str) or not text.strip():
        return None
    return {"type": BLOCK_TYPE_THINKING, "text": text}


def make_tool_use_block(name: str, input_data: Any, tool_id: str | None = None) -> dict:
    """Return a tool_use block.

    Always returns a block (no None) — tool calls are meaningful even with
    empty inputs.
    """
    return {
        "type": BLOCK_TYPE_TOOL_USE,
        "name": name or "",
        "input": input_data if input_data is not None else {},
        "tool_id": tool_id,
    }


def make_tool_result_block(
    output: str,
    tool_name: str | None = None,
    is_error: bool = False,
    summary: str | None = None,
) -> dict:
    """Return a tool_result block.

    ``summary`` is an optional short human label rendered between header and
    fence (e.g. ChatGPT's ``metadata.reasoning_title`` for execution_output).
    """
    return {
        "type": BLOCK_TYPE_TOOL_RESULT,
        "tool_name": tool_name,
        "output": output if isinstance(output, str) else str(output),
        "is_error": bool(is_error),
        "summary": summary,
    }


def make_citation_block(
    url: str,
    title: str | None = None,
    snippet: str | None = None,
) -> dict | None:
    if not url:
        return None
    return {
        "type": BLOCK_TYPE_CITATION,
        "url": url,
        "title": title,
        "snippet": snippet,
    }


def make_image_placeholder(
    ref: str,
    source: str = "unknown",
    mime: str | None = None,
) -> dict:
    """source ∈ {'user_upload', 'model_generated', 'unknown'}."""
    return {
        "type": BLOCK_TYPE_IMAGE_PLACEHOLDER,
        "ref": ref or "",
        "source": source,
        "mime": mime,
    }


def make_file_placeholder(
    ref: str,
    filename: str | None = None,
    mime: str | None = None,
    size_bytes: int | None = None,
    duration_seconds: float | None = None,
) -> dict:
    return {
        "type": BLOCK_TYPE_FILE_PLACEHOLDER,
        "ref": ref or "",
        "filename": filename,
        "mime": mime,
        "size_bytes": size_bytes,
        "duration_seconds": duration_seconds,
    }


def make_unknown_block(
    raw_type: str,
    observed_keys: list[str] | None = None,
    reason: str = UNKNOWN_REASON_UNKNOWN_TYPE,
    summary: str | None = None,
) -> dict:
    """Construct an unknown block.

    Every call site MUST also emit a WARNING log and increment a LossReport
    tally — see plan §Data-loss visibility. The block surfaces the loss at
    read time; the WARNING surfaces it at run time. Both signals matter.
    """
    keys = list(observed_keys or [])[:_OBSERVED_KEYS_LIMIT]
    return {
        "type": BLOCK_TYPE_UNKNOWN,
        "raw_type": raw_type,
        "observed_keys": keys,
        "reason": reason,
        "summary": summary,
    }


def make_hidden_context_marker(content_type: str) -> dict:
    """A short prepend block that flags the surrounding message as hidden context.

    Driven by the ``metadata.is_visually_hidden_from_conversation`` flag, not by
    content_type matching. The marker tells the reader "this message is
    hidden in the source UI; we're showing it here for archival fidelity."
    """
    return {
        "type": BLOCK_TYPE_HIDDEN_CONTEXT_MARKER,
        "content_type": content_type or "",
    }


# ---------------------------------------------------------------------------
# Rendering
# ---------------------------------------------------------------------------


def render_blocks_to_markdown(blocks: list[dict]) -> str:
    """Render an ordered list of blocks to a single Markdown string.

    Blocks are joined with one blank line between them. Pure function; no I/O.
    """
    if not blocks:
        return ""

    rendered: list[str] = []
    for block in blocks:
        chunk = _render_one(block)
        if chunk:
            rendered.append(chunk)

    return "\n\n".join(rendered)


def _render_one(block: dict) -> str:
    btype = block.get("type", "")
    if btype == BLOCK_TYPE_TEXT:
        return block.get("text", "")
    if btype == BLOCK_TYPE_CODE:
        lang = block.get("language") or ""
        code = block.get("code", "")
        fence = _safe_fence(code)
        return f"{fence}{lang}\n{code}\n{fence}"
    if btype == BLOCK_TYPE_THINKING:
        text = block.get("text", "")
        quoted = _blockquote_prefix(text)
        return f"**💭 Reasoning**\n\n{quoted}"
    if btype == BLOCK_TYPE_TOOL_USE:
        name = block.get("name", "")
        input_data = block.get("input", {})
        body_json = json.dumps(input_data, indent=2, sort_keys=False, default=str, ensure_ascii=False)
        fence = _safe_fence(body_json)
        body = f"{fence}json\n{body_json}\n{fence}"
        quoted = _blockquote_prefix(f"🔧 **Tool: {name}**\n{body}")
        return quoted
    if btype == BLOCK_TYPE_TOOL_RESULT:
        output = block.get("output", "")
        is_error = bool(block.get("is_error"))
        tool_name = block.get("tool_name") or ""
        summary = block.get("summary") or ""
        icon = "❌" if is_error else "📤"
        label = "Result (error)" if is_error else "Result"
        if tool_name:
            header = f"{icon} **{label}: {tool_name}**"
        else:
            header = f"{icon} **{label}**"
        fence = _safe_fence(output)
        body = f"{fence}\n{output}\n{fence}"
        if summary:
            inner = f"{header}\n*{summary}*\n{body}"
        else:
            inner = f"{header}\n{body}"
        return _blockquote_prefix(inner)
    if btype == BLOCK_TYPE_CITATION:
        url = block.get("url", "")
        title = block.get("title") or url
        return f"[{title}]({url})"
    if btype == BLOCK_TYPE_IMAGE_PLACEHOLDER:
        ref = block.get("ref", "")
        source = block.get("source", "unknown")
        mime = block.get("mime")
        meta_parts = [source] if source else []
        if mime:
            meta_parts.append(mime)
        meta_parts.append("content not preserved in this export")
        meta = ", ".join(meta_parts)
        return f"> 🖼️ **Image attached** — `{ref}` ({meta})"
    if btype == BLOCK_TYPE_FILE_PLACEHOLDER:
        ref = block.get("ref", "")
        filename = block.get("filename")
        label = filename or ref
        mime = block.get("mime")
        size_bytes = block.get("size_bytes")
        duration = block.get("duration_seconds")
        meta_parts: list[str] = []
        if mime:
            meta_parts.append(mime)
        if isinstance(size_bytes, int) and size_bytes > 0:
            kb = size_bytes / 1024
            meta_parts.append(f"{kb:.1f} KB" if kb < 1024 else f"{kb / 1024:.2f} MB")
        if isinstance(duration, (int, float)) and duration > 0:
            meta_parts.append(f"{duration:.2f}s")
        meta_parts.append("content not preserved in this export")
        meta = ", ".join(meta_parts)
        return f"> 📎 **File attached** — `{label}` ({meta})"
    if btype == BLOCK_TYPE_UNKNOWN:
        raw_type = block.get("raw_type", "?")
        reason = block.get("reason", UNKNOWN_REASON_UNKNOWN_TYPE)
        keys = block.get("observed_keys") or []
        summary = block.get("summary")
        first_line = f"⚠️ **Unsupported content** — type `{raw_type}` ({reason})"
        lines = [first_line]
        if summary:
            lines.append(summary)
        if keys:
            keys_str = ", ".join(f"`{k}`" for k in keys)
            lines.append(f"Keys observed: {keys_str}")
        return _blockquote_prefix("\n".join(lines))
    if btype == BLOCK_TYPE_HIDDEN_CONTEXT_MARKER:
        ctype = block.get("content_type", "")
        return f"> ℹ️ **Hidden context** — `{ctype}`"

    # Defensive: a block of unrecognised local type (shouldn't happen if
    # constructors are used). Render as visible warning rather than dropping.
    return f"> ⚠️ **Internal: unrecognised block type** — `{btype}`"


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------


def _safe_fence(text: str) -> str:
    """Return a backtick fence longer than the longest run of backticks in ``text``.

    CommonMark requires the closing fence to be at least as long as the opening
    fence. Picking N+1 (where N = longest run in content) ensures the content's
    own backticks are inert. Minimum is 3.

    Verified live against Joplin during planning — see plan
    §Backtick-corruption defense.
    """
    if not isinstance(text, str):
        return "```"
    longest_run = 0
    current_run = 0
    for ch in text:
        if ch == "`":
            current_run += 1
            if current_run > longest_run:
                longest_run = current_run
        else:
            current_run = 0
    fence_len = max(3, longest_run + 1)
    return "`" * fence_len
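The fence-length rule can be exercised standalone. This sketch re-implements the same scan for clarity (it mirrors the logic of `_safe_fence`, but is not the module itself):

```python
def safe_fence(text: str) -> str:
    # The fence must be longer than the longest backtick run inside the
    # content, with a minimum of three, so embedded fences cannot close
    # ours early.
    longest = 0
    run = 0
    for ch in text:
        run = run + 1 if ch == "`" else 0
        longest = max(longest, run)
    return "`" * max(3, longest + 1)

print(safe_fence("plain text"))            # three backticks (minimum)
print(safe_fence("has ```python inside"))  # four, to stay inert
```

The N+1 choice (rather than a fixed long fence) keeps typical output clean while still handling pathological content that contains long backtick runs.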

def _blockquote_prefix(text: str) -> str:
    """Prefix every line of ``text`` with ``> `` so the whole block renders as a quote.

    Empty source lines become ``>`` (no trailing space) so blockquote continuity
    is preserved without trailing-whitespace noise.
    """
    if not isinstance(text, str):
        return ""
    out_lines: list[str] = []
    for line in text.split("\n"):
        if line == "":
            out_lines.append(">")
        else:
            out_lines.append(f"> {line}")
    return "\n".join(out_lines)
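A quick way to see the empty-line rule in action: blank lines get a bare `>` so the blockquote does not break in two. A standalone sketch of the same transformation:

```python
def blockquote(text: str) -> str:
    # "> " before non-empty lines; bare ">" for empty lines keeps the
    # blockquote continuous without adding trailing whitespace.
    return "\n".join(">" if line == "" else f"> {line}" for line in text.split("\n"))

print(blockquote("first\n\nsecond"))
```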
@@ -87,6 +87,7 @@ class Cache:
         self._data[provider][conv_id] = {
             "title": metadata.get("title", ""),
             "project": metadata.get("project"),
+            "created_at": metadata.get("created_at", ""),
             "updated_at": metadata.get("updated_at", ""),
             "exported_at": datetime.now(tz=timezone.utc).isoformat(),
             "file_path": metadata.get("file_path", ""),
@@ -28,6 +28,7 @@ class ConfigError(Exception):

 @dataclass
 class Config:
     chatgpt_session_token: str | None
+    chatgpt_session_token_1: str | None
     claude_session_key: str | None
     export_dir: Path
     output_structure: str

@@ -55,11 +56,12 @@ def load_config() -> Config:
     load_dotenv(override=False)

     chatgpt_token = os.getenv("CHATGPT_SESSION_TOKEN", "").strip() or None
+    chatgpt_token_1 = os.getenv("CHATGPT_SESSION_TOKEN_1", "").strip() or None
     claude_key = os.getenv("CLAUDE_SESSION_KEY", "").strip() or None
     export_dir = Path(os.getenv("EXPORT_DIR", "./exports")).expanduser()
     output_structure = os.getenv("OUTPUT_STRUCTURE", "provider/project/year").strip()
-    cache_dir = Path(os.getenv("CACHE_DIR", "~/.ai-chat-exporter")).expanduser()
-    log_file = os.getenv("LOG_FILE", "~/.ai-chat-exporter/logs/exporter.log").strip()
+    cache_dir = Path(os.getenv("CACHE_DIR", "./cache")).expanduser()
+    log_file = os.getenv("LOG_FILE", "./cache/logs/exporter.log").strip()

     # Joplin
     joplin_token = os.getenv("JOPLIN_API_TOKEN", "").strip() or None
@@ -101,7 +103,7 @@ def load_config() -> Config:
     if not chatgpt_token and not claude_key:
         logger.warning(
             "Neither CHATGPT_SESSION_TOKEN nor CLAUDE_SESSION_KEY is set. "
-            "Run 'python -m src.main auth' to configure credentials."
+            "Run 'ai-chat-exporter auth' to configure credentials."
         )

     # Create and validate output directory

@@ -127,6 +129,7 @@ def load_config() -> Config:

     config = Config(
         chatgpt_session_token=chatgpt_token,
+        chatgpt_session_token_1=chatgpt_token_1,
         claude_session_key=claude_key,
         export_dir=export_dir,
         output_structure=output_structure,

@@ -173,7 +176,7 @@ def _validate_chatgpt_token(token: str) -> datetime | None:
     if delta.total_seconds() < 0:
         logger.warning(
             "CHATGPT_SESSION_TOKEN expired at %s. "
-            "Run 'python -m src.main auth' to refresh it.",
+            "Run 'ai-chat-exporter auth' to refresh it.",
             expiry.strftime("%Y-%m-%d %H:%M UTC"),
         )
     elif delta.total_seconds() < 86400:
@@ -6,6 +6,7 @@ from datetime import datetime, timezone
 from pathlib import Path
 from typing import Any

+from src.blocks import render_blocks_to_markdown
 from src.utils import build_export_path, generate_filename

 logger = logging.getLogger(__name__)

@@ -15,6 +16,7 @@ _ROLE_LABELS = {
     "user": ("🧑 Human", "user"),
     "assistant": ("🤖 Assistant", "assistant"),
     "system": ("⚙️ System", "system"),
+    "tool": ("🔧 Tool", "tool"),
 }

@@ -125,10 +127,17 @@ class MarkdownExporter:
         # Messages
         for msg in messages:
             role = msg.get("role", "user")
-            content = msg.get("content", "")
+            blocks = msg.get("blocks") or []
             timestamp = msg.get("timestamp")

-            if not content or not content.strip():
+            # Prefer rendering from blocks (v0.4.0+). Backward-compat fallback:
+            # if blocks is missing/empty AND content exists, render content as-is.
+            if blocks:
+                body = render_blocks_to_markdown(blocks)
+            else:
+                body = msg.get("content", "") or ""
+
+            if not body or not body.strip():
                 logger.warning(
                     "[markdown] Skipping empty/whitespace message in conversation %s",
                     conv_id[:8],

@@ -143,7 +152,7 @@ class MarkdownExporter:
             else:
                 lines.append("")

-            lines.append(content)
+            lines.append(body)
             lines.append("")
             lines.append("---")
             lines.append("")
@@ -32,8 +32,8 @@ class JoplinClient:
     def __init__(self, base_url: str, token: str) -> None:
         self._base_url = base_url.rstrip("/")
         self._token = token
-        # In-memory cache of notebook title → ID to avoid repeated GET /folders
-        self._notebook_cache: dict[str, str] = {}
+        # In-memory cache: (parent_id | None, title) → folder ID
+        self._notebook_cache: dict[tuple[str | None, str], str] = {}
         self._notebooks_loaded = False
         logger.debug("[joplin] Client initialised with base_url=%s", self._base_url)

@@ -89,13 +89,13 @@ class JoplinClient:
         """Return all Joplin notebooks (folders), handling pagination.

         Returns:
-            List of folder dicts with at least ``id`` and ``title`` keys.
+            List of folder dicts with at least ``id``, ``title``, and ``parent_id`` keys.
         """
         results: list[dict] = []
         page = 1
         while True:
             logger.debug("[joplin] GET /folders page=%d", page)
-            resp = self._get("/folders", params={"page": page, "fields": "id,title"})
+            resp = self._get("/folders", params={"page": page, "fields": "id,title,parent_id"})
             items = resp.get("items", [])
             results.extend(items)
             logger.debug("[joplin] /folders page=%d → %d items, has_more=%s", page, len(items), resp.get("has_more"))

@@ -104,11 +104,12 @@ class JoplinClient:
             page += 1
         return results

-    def get_or_create_notebook(self, title: str) -> str:
-        """Return the Joplin folder ID for ``title``, creating it if needed.
+    def get_or_create_notebook(self, title: str, parent_id: str | None = None) -> str:
+        """Return the Joplin folder ID for ``title`` under ``parent_id``, creating if needed.

         Args:
-            title: Notebook display name (e.g. "ChatGPT - My Project").
+            title: Notebook display name.
+            parent_id: ID of the parent folder, or None for a root notebook.

         Returns:
             Joplin folder ID string.

@@ -116,19 +117,40 @@ class JoplinClient:
         if not self._notebooks_loaded:
             self._load_notebook_cache()

-        if title in self._notebook_cache:
-            folder_id = self._notebook_cache[title]
-            logger.debug("[joplin] Notebook cache hit: %r → %s", title, folder_id)
+        key = (parent_id, title)
+        if key in self._notebook_cache:
+            folder_id = self._notebook_cache[key]
+            logger.debug("[joplin] Notebook cache hit: %r (parent=%s) → %s", title, parent_id, folder_id)
             return folder_id

         # Not found — create it
-        logger.info("[joplin] Creating notebook: %r", title)
-        resp = self._post("/folders", {"title": title})
+        logger.info("[joplin] Creating notebook: %r (parent=%s)", title, parent_id)
+        data: dict = {"title": title}
+        if parent_id:
+            data["parent_id"] = parent_id
+        resp = self._post("/folders", data)
         folder_id = resp["id"]
-        self._notebook_cache[title] = folder_id
+        self._notebook_cache[key] = folder_id
         logger.debug("[joplin] Notebook created: %r → %s", title, folder_id)
         return folder_id

+    def get_or_create_notebook_path(self, path: list[str]) -> str:
+        """Ensure a nested notebook path exists and return the leaf folder ID.
+
+        Creates intermediate notebooks as needed.
+
+        Args:
+            path: Ordered list of notebook names, e.g. ["AI-ChatGPT", "No Project"].
+
+        Returns:
+            Joplin folder ID of the deepest (leaf) notebook.
+        """
+        parent_id: str | None = None
+        for name in path:
+            parent_id = self.get_or_create_notebook(name, parent_id)
+        assert parent_id is not None
+        return parent_id
+
     # ------------------------------------------------------------------
     # Notes
     # ------------------------------------------------------------------

@@ -233,11 +255,14 @@ class JoplinClient:
     def _load_notebook_cache(self) -> None:
         logger.debug("[joplin] Loading notebook list from Joplin…")
         notebooks = self.list_notebooks()
-        self._notebook_cache = {nb["title"]: nb["id"] for nb in notebooks}
+        self._notebook_cache = {
+            (nb.get("parent_id") or None, nb["title"]): nb["id"]
+            for nb in notebooks
+        }
         self._notebooks_loaded = True
         logger.debug("[joplin] Notebook cache loaded: %d notebooks", len(self._notebook_cache))
-        for title, folder_id in self._notebook_cache.items():
-            logger.debug("[joplin] %r → %s", title, folder_id)
+        for (parent_id, title), folder_id in self._notebook_cache.items():
+            logger.debug("[joplin] (%s) %r → %s", parent_id or "root", title, folder_id)

 # ------------------------------------------------------------------

@@ -285,19 +310,21 @@ def _http_error_message(method: str, path: str, e: requests.exceptions.HTTPError

 _PROVIDER_DISPLAY = {
-    "chatgpt": "ChatGPT",
-    "claude": "Claude",
+    "chatgpt": "AI-ChatGPT",
+    "claude": "AI-Claude",
 }


-def notebook_title(provider: str, project: str | None) -> str:
-    """Derive a flat Joplin notebook title from provider and project name.
+def notebook_path(provider: str, project: str | None) -> tuple[str, str]:
+    """Return (parent_notebook, child_notebook) for the given provider and project.
+
+    The parent is the top-level provider notebook; the child is the project name.

     Examples:
-        notebook_title("chatgpt", "no-project") → "ChatGPT - No Project"
-        notebook_title("claude", "budget-tracker") → "Claude - Budget Tracker"
-        notebook_title("chatgpt", None) → "ChatGPT - No Project"
+        notebook_path("chatgpt", None) → ("AI-ChatGPT", "No Project")
+        notebook_path("chatgpt", "no-project") → ("AI-ChatGPT", "No Project")
+        notebook_path("claude", "budget-tracker") → ("AI-Claude", "Budget Tracker")
     """
-    prov_display = _PROVIDER_DISPLAY.get(provider, provider.capitalize())
-    proj = (project or "no-project").replace("-", " ").title()
-    return f"{prov_display} - {proj}"
+    parent = _PROVIDER_DISPLAY.get(provider, f"AI-{provider.capitalize()}")
+    child = (project or "no-project").replace("-", " ").title()
+    return parent, child
85
src/loss_report.py
Normal file
85
src/loss_report.py
Normal file
@@ -0,0 +1,85 @@
|
|||||||
|
"""Per-export-run tally for content that was dropped or partially extracted.
|
||||||
|
|
||||||
|
Surfaces the loss visibility that the rest of the system promises in its
|
||||||
|
output (visible ``unknown`` blocks). The summary emitted at the end of
|
||||||
|
each export is the load-bearing operator-facing signal: if a real content
|
||||||
|
type starts being silently dropped, this is where it shows up.

Pass a single instance through ``BaseProvider.normalize_conversation`` and
read it back in ``src/main.py`` after the export loop. No global state.
"""

from collections import Counter
from dataclasses import dataclass, field

_TOP_N_BREAKDOWN = 5


@dataclass
class LossReport:
    """Counters for things that didn't render cleanly in an export run."""

    # Type-keyed counters. Values are int counts.
    unknown_blocks: Counter = field(default_factory=Counter)
    extraction_failures: Counter = field(default_factory=Counter)
    filtered_roles: Counter = field(default_factory=Counter)

    # Aggregate counters
    messages_rendered: int = 0
    conversations: int = 0

    # Recording -------------------------------------------------------------

    def record_unknown(self, raw_type: str) -> None:
        self.unknown_blocks[raw_type or "?"] += 1

    def record_extraction_failure(self, raw_type: str) -> None:
        self.extraction_failures[raw_type or "?"] += 1

    def record_filtered_role(self, role: str) -> None:
        self.filtered_roles[role or "?"] += 1

    def record_message(self) -> None:
        self.messages_rendered += 1

    def record_conversation(self) -> None:
        self.conversations += 1

    # Summary ---------------------------------------------------------------

    def format_summary(self) -> str:
        """Return a multi-line summary table suitable for INFO logging.

        Format pinned by plan §Post-export summary — "(none)" sentinel when a
        counter is empty, top-5 breakdown with "+ N more types" overflow.
        """
        lines: list[str] = ["[export] Run summary:"]
        lines.append(f"  conversations: {self.conversations}")
        lines.append(f"  messages rendered: {self.messages_rendered}")
        lines.extend(_format_section("unknown blocks: ", self.unknown_blocks))
        lines.extend(_format_section("extraction failures: ", self.extraction_failures))
        lines.append(
            "  filtered roles: "
            "(filter lifted in v0.4.0 — counter retained for future use, expected 0)"
        )
        if self.filtered_roles:
            for role, count in self.filtered_roles.most_common(_TOP_N_BREAKDOWN):
                lines.append(f"    {role}={count}")
        return "\n".join(lines)


def _format_section(label: str, counter: Counter) -> list[str]:
    """Render one counter section: header line + indented breakdown lines."""
    total = sum(counter.values())
    header = f"  {label} {total}"
    if total == 0:
        return [header, "    (none)"]

    lines = [header]
    most_common = counter.most_common()
    for raw_type, count in most_common[:_TOP_N_BREAKDOWN]:
        lines.append(f"    {raw_type}={count}")
    if len(most_common) > _TOP_N_BREAKDOWN:
        remainder = len(most_common) - _TOP_N_BREAKDOWN
        lines.append(f"    + {remainder} more types")
    return lines
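The summary format can be exercised in isolation. A minimal sketch that mirrors `_format_section` above, re-declared here so it runs standalone (the sample counter values are made up):

```python
from collections import Counter

_TOP_N_BREAKDOWN = 5

def format_section(label: str, counter: Counter) -> list[str]:
    # Mirrors _format_section: header line, "(none)" sentinel for an
    # empty counter, top-5 breakdown with "+ N more types" overflow.
    total = sum(counter.values())
    header = f"  {label} {total}"
    if total == 0:
        return [header, "    (none)"]
    lines = [header]
    most_common = counter.most_common()
    for raw_type, count in most_common[:_TOP_N_BREAKDOWN]:
        lines.append(f"    {raw_type}={count}")
    if len(most_common) > _TOP_N_BREAKDOWN:
        lines.append(f"    + {len(most_common) - _TOP_N_BREAKDOWN} more types")
    return lines

empty = format_section("unknown blocks:", Counter())
print(empty[1])   # sentinel line for an empty counter

# Seven hypothetical block types with counts 1..7 (total 28).
seven_types = Counter({f"type_{i}": i + 1 for i in range(7)})
lines = format_section("unknown blocks:", seven_types)
print(lines[-1])  # overflow line: 7 types, top 5 shown, 2 folded away
```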
137
src/main.py
@@ -16,6 +16,7 @@ from rich.table import Table
 from src.cache import Cache, CacheError
 from src.config import ConfigError
 from src.logging_config import setup_logging
+from src.loss_report import LossReport
 from src.providers.base import ProviderError
 
 console = Console()
@@ -70,7 +71,7 @@ def cli(ctx: click.Context, verbose: bool, quiet: bool, debug: bool, no_log_file
 
     # Determine log file path from env (setup_logging handles "none")
     import os
-    log_file = os.getenv("LOG_FILE", "~/.ai-chat-exporter/logs/exporter.log")
+    log_file = os.getenv("LOG_FILE", "./cache/logs/exporter.log")
 
     setup_logging(level=level, log_file=log_file, no_log_file=no_log_file)
 
@@ -79,7 +80,7 @@ def cli(ctx: click.Context, verbose: bool, quiet: bool, debug: bool, no_log_file
 
     # Initialise cache (needed for ToS gate on every command)
     import os
-    cache_dir = Path(os.getenv("CACHE_DIR", "~/.ai-chat-exporter")).expanduser()
+    cache_dir = Path(os.getenv("CACHE_DIR", "./cache")).expanduser()
     try:
         cache = Cache(cache_dir)
     except CacheError as e:
@@ -140,7 +141,7 @@ def auth(ctx: click.Context) -> None:
     if configure_claude:
         _auth_claude(os_name)
 
-    console.print("\n[green]Done! Run 'python -m src.main doctor' to verify your setup.[/green]")
+    console.print("\n[green]Done! Run 'ai-chat-exporter doctor' to verify your setup.[/green]")
 
 
 def _auth_chatgpt(os_name: str) -> None:
@@ -153,15 +154,19 @@ def _auth_chatgpt(os_name: str) -> None:
     else:
         console.print("2. Press [bold]F12[/bold] to open DevTools → Application tab.")
     console.print("3. Expand [bold]Cookies[/bold] → [bold]https://chatgpt.com[/bold]")
-    console.print("4. Find [bold]__Secure-next-auth.session-token[/bold] → copy the Value.")
-    console.print("   (Token starts with 'eyJ...' — it is a long JWT string)")
-    console.print("5. Paste it below (input is hidden).\n")
+    console.print("4. ChatGPT splits the session token across two cookies:")
+    console.print("   [bold]__Secure-next-auth.session-token.0[/bold] (starts with 'eyJ')")
+    console.print("   [bold]__Secure-next-auth.session-token.1[/bold] (the remainder)")
+    console.print("   Copy each Value in turn and paste below.")
+    console.print("   (If you only see one cookie without a .0/.1 suffix, paste it for .0 and leave .1 blank.)\n")
 
-    token = click.prompt("ChatGPT session token", hide_input=True, default="", show_default=False).strip()
+    token = click.prompt("ChatGPT session token (.0)", hide_input=True, default="", show_default=False).strip()
     if not token:
         console.print("[yellow]Skipped ChatGPT token.[/yellow]")
         return
 
+    token_1 = click.prompt("ChatGPT session token (.1, leave blank if absent)", hide_input=True, default="", show_default=False).strip() or None
+
     # Validate
     if not token.startswith("eyJ"):
         console.print("[yellow]Warning: token doesn't look like a JWT (expected 'eyJ...').[/yellow]")
@@ -178,7 +183,28 @@ def _auth_chatgpt(os_name: str) -> None:
     except Exception:
         console.print("[yellow]Could not decode token expiry.[/yellow]")
 
+    # Live validation — exchange session token for an access token
+    _valid = False
+    _error: str | None = None
+    with console.status("[dim]Validating token with ChatGPT API…[/dim]"):
+        try:
+            from src.providers.chatgpt import ChatGPTProvider
+            _prov = ChatGPTProvider(session_token=token, session_token_1=token_1)
+            _prov._fetch_access_token()
+            _valid = True
+        except ProviderError as e:
+            _error = str(e.original)
+        except Exception as e:
+            _error = str(e)
+
+    if _valid:
+        console.print("[green]✓ Token verified — connected to ChatGPT API.[/green]")
+    else:
+        console.print(f"[red]✗ Token validation failed: {_error}[/red]")
+
     _write_token_to_env("CHATGPT_SESSION_TOKEN", token)
+    if token_1:
+        _write_token_to_env("CHATGPT_SESSION_TOKEN_1", token_1)
 
     # --- ChatGPT Projects ---
     console.print("\n[bold]ChatGPT Projects (optional)[/bold]")
@@ -231,7 +257,25 @@ def _auth_claude(os_name: str) -> None:
         console.print("[yellow]Skipped Claude token.[/yellow]")
         return
 
-    console.print("[green]Claude session key saved.[/green]")
+    # Live validation — fetch org ID (the first call any Claude operation makes)
+    _valid = False
+    _error: str | None = None
+    with console.status("[dim]Validating token with Claude API…[/dim]"):
+        try:
+            from src.providers.claude import ClaudeProvider
+            _prov = ClaudeProvider(session_key=key)
+            _prov._get_org_id()
+            _valid = True
+        except ProviderError as e:
+            _error = str(e.original)
+        except Exception as e:
+            _error = str(e)
+
+    if _valid:
+        console.print("[green]✓ Token verified — connected to Claude API.[/green]")
+    else:
+        console.print(f"[red]✗ Token validation failed: {_error}[/red]")
+
     _write_token_to_env("CLAUDE_SESSION_KEY", key)
 
 
@@ -341,7 +385,7 @@ def _run_doctor_checks() -> list[dict]:
 
     # Directories
     export_dir = Path(os.getenv("EXPORT_DIR", "./exports")).expanduser()
-    cache_dir = Path(os.getenv("CACHE_DIR", "~/.ai-chat-exporter")).expanduser()
+    cache_dir = Path(os.getenv("CACHE_DIR", "./cache")).expanduser()
 
     for label, dirpath in [("Export dir writable", export_dir), ("Cache dir writable", cache_dir)]:
         try:
@@ -365,7 +409,8 @@ def _run_doctor_checks() -> list[dict]:
     if chatgpt_token:
         try:
             from src.providers.chatgpt import ChatGPTProvider
-            p = ChatGPTProvider(chatgpt_token)
+            chatgpt_token_1 = os.getenv("CHATGPT_SESSION_TOKEN_1", "").strip() or None
+            p = ChatGPTProvider(chatgpt_token, session_token_1=chatgpt_token_1)
             results = p.list_conversations(offset=0, limit=1)
             add("ChatGPT API reachable", True, f"Got {len(results)} result(s)")
         except ProviderError as e:
@@ -496,7 +541,7 @@ def export(
     providers_to_run = _resolve_providers(provider, cfg)
     if not providers_to_run:
         err_console.print(
-            "[red]No providers configured. Run 'python -m src.main auth' to set up tokens.[/red]"
+            "[red]No providers configured. Run 'ai-chat-exporter auth' to set up tokens.[/red]"
         )
         sys.exit(1)
 
@@ -510,6 +555,9 @@ def export(
     # Summary counters
     summary: dict[str, dict[str, int]] = {}
 
+    # Single LossReport tracks data-loss visibility across all providers in this run.
+    loss_report = LossReport()
+
     for prov_name, prov_instance in providers_to_run:
         summary[prov_name] = {"exported": 0, "skipped": 0, "failed": 0}
 
@@ -557,7 +605,17 @@ def export(
             conv_id = raw_conv.get("id") or raw_conv.get("uuid", "unknown")
             try:
                 full_raw = prov_instance.get_conversation(conv_id)
-                normalized = prov_instance.normalize_conversation(full_raw)
+                # Propagate metadata from the listing summary into the full
+                # detail so normalize_conversation can use it.
+                # - Keys starting with "_" are provider annotations
+                #   (e.g. _project_name injected by ChatGPT project fetching).
+                # - "project" is included explicitly because Claude's detail
+                #   endpoint omits it even though the listing returns it.
+                _PROPAGATE_KEYS = {"project"}
+                for key, val in raw_conv.items():
+                    if (key.startswith("_") or key in _PROPAGATE_KEYS) and key not in full_raw:
+                        full_raw[key] = val
+                normalized = prov_instance.normalize_conversation(full_raw, loss_report)
 
                 exported_path: Path | None = None
                 if md_exporter:
@@ -569,6 +627,7 @@ def export(
                 cache.mark_exported(prov_name, conv_id, {
                     "title": normalized.get("title", ""),
                     "project": normalized.get("project"),
+                    "created_at": normalized.get("created_at", ""),
                     "updated_at": normalized.get("updated_at", ""),
                     "file_path": str(exported_path) if exported_path else "",
                 })
@@ -588,6 +647,10 @@ def export(
 
     if not dry_run:
         _print_export_summary(summary)
+        # Emit the data-loss summary at INFO level so it lands in the log file
+        # AND the operator's console (default level is INFO).
+        for line in loss_report.format_summary().split("\n"):
+            logger.info(line)
 
 
 def _resolve_providers(provider: str, cfg) -> list[tuple[str, object]]:
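Emitting the summary one `logger.info` call per line, as the hunk above does, gives every line its own log record, so file and console formatters prefix each line instead of producing a single multi-line blob. A standalone sketch (logger name, handler class, and sample summary text are illustrative):

```python
import logging

lines: list[str] = []

class ListHandler(logging.Handler):
    # Collect formatted records so the per-line behaviour is observable.
    def emit(self, record: logging.LogRecord) -> None:
        lines.append(self.format(record))

logger = logging.getLogger("export-demo")
logger.setLevel(logging.INFO)
logger.addHandler(ListHandler())

summary = "[export] Run summary:\n  conversations: 2\n  messages rendered: 10"
# One logger call per summary line → one log record per line.
for line in summary.split("\n"):
    logger.info(line)

print(len(lines))  # 3 records, one per summary line
```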
@@ -618,6 +681,7 @@ def _resolve_providers(provider: str, cfg) -> list[tuple[str, object]]:
             "chatgpt",
             ChatGPTProvider(
                 session_token=cfg.chatgpt_session_token,
+                session_token_1=cfg.chatgpt_session_token_1,
                 project_ids=cfg.chatgpt_project_ids,
             ),
         ))
@@ -757,18 +821,26 @@ def list_conversations(ctx: click.Context, provider: str, project_filter: str |
     if project_filter is not None:
         all_convs = _filter_by_project(all_convs, project_filter)
 
-    table = Table()
-    table.add_column("Title")
-    table.add_column("Project")
-    table.add_column("Updated")
-    table.add_column("ID")
+    # no_wrap + overflow="ellipsis" prevents Rich from wrapping cells to
+    # multiple lines on narrow terminals (e.g. Windows Command Prompt),
+    # which can otherwise make the output look garbled. Widths are tuned
+    # to fit within an 80-column terminal.
+    # Total width budget for 80-column terminals:
+    #   borders (5) + padding (4 cols * 2) = 13 chars of overhead
+    #   remaining 67 chars split: 34 title + 15 project + 10 date + 8 id
+    table = Table(show_lines=False, expand=False, padding=(0, 1))
+    table.add_column("Title", no_wrap=True, overflow="ellipsis", max_width=34)
+    table.add_column("Project", no_wrap=True, overflow="ellipsis", max_width=15)
+    table.add_column("Updated", no_wrap=True, min_width=10)
+    table.add_column("ID", no_wrap=True, min_width=8)
 
     for conv in all_convs:
-        title = conv.get("title") or "Untitled"
+        # ChatGPT uses "title"; Claude uses "name".
+        title = conv.get("title") or conv.get("name") or "Untitled"
         project = _raw_project_name(conv) or ""
         updated = (conv.get("updated_at") or conv.get("update_time") or "")[:10]
         conv_id = (conv.get("id") or conv.get("uuid") or "")[:8]
-        table.add_row(title[:60], project[:30], updated, conv_id)
+        table.add_row(title, project, updated, conv_id)
 
     console.print(table)
     console.print(f"Total: {len(all_convs)} conversations")
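The width budget in the new comment can be sanity-checked with plain arithmetic (the overhead figures come from the comment itself: one vertical border on either side plus one between each of the 4 columns, and `padding=(0, 1)` adding one space per side per column):

```python
# Column widths from the table definition above.
widths = {"Title": 34, "Project": 15, "Updated": 10, "ID": 8}

borders = 5                   # 4 columns → 5 vertical border characters per row
padding = 4 * 2               # padding=(0, 1): one space either side of each column
overhead = borders + padding  # 13 chars, as the comment states

total = sum(widths.values()) + overhead
print(total)  # 80 — fits an 80-column terminal exactly
```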
@@ -845,9 +917,9 @@ def joplin(ctx: click.Context, provider: str, project_filter: str | None, dry_ru
     via its local REST API. Requires Joplin desktop to be running with the
     Web Clipper service enabled.
 
-    Notebooks are created automatically based on provider and project:
-        exports/chatgpt/my-project/ → "ChatGPT - My Project" notebook
-        exports/claude/no-project/  → "Claude - No Project" notebook
+    Notebooks are created automatically as nested folders:
+        chatgpt / my-project → AI-ChatGPT / My Project
+        claude / no-project  → AI-Claude / No Project
 
     Re-running is safe: notes are updated (not duplicated) on subsequent runs.
 
@@ -873,7 +945,7 @@ def joplin(ctx: click.Context, provider: str, project_filter: str | None, dry_ru
         )
         sys.exit(1)
 
-    from src.joplin import JoplinClient, JoplinError, notebook_title
+    from src.joplin import JoplinClient, JoplinError, notebook_path
 
     client = JoplinClient(cfg.joplin_api_url, cfg.joplin_api_token)
 
@@ -953,7 +1025,9 @@ def joplin(ctx: click.Context, provider: str, project_filter: str | None, dry_ru
 
     for conv_id, entry in pending:
         file_path = entry.get("file_path", "")
-        title = entry.get("title") or "Untitled"
+        raw_title = entry.get("title") or "Untitled"
+        created_date = (entry.get("created_at") or "")[:10]
+        title = f"{created_date} {raw_title}" if created_date else raw_title
         project = entry.get("project") or None
         existing_note_id = entry.get("joplin_note_id")
         action = "update" if existing_note_id else "create"
@@ -968,9 +1042,9 @@ def joplin(ctx: click.Context, provider: str, project_filter: str | None, dry_ru
         body = Path(file_path).read_text(encoding="utf-8")
         logger.debug("[joplin] Read %d chars from %s", len(body), file_path)
 
-        # Get or create the notebook
-        nb_title = notebook_title(prov_name, project)
-        notebook_id = client.get_or_create_notebook(nb_title)
+        # Get or create the nested notebook
+        nb_path = notebook_path(prov_name, project)
+        notebook_id = client.get_or_create_notebook_path(list(nb_path))
 
         if existing_note_id:
             client.update_note(existing_note_id, title, body)
@@ -1007,7 +1081,7 @@ def joplin(ctx: click.Context, provider: str, project_filter: str | None, dry_ru
 
 
 def _print_joplin_dry_run_table(prov_name: str, pending: list[tuple[str, dict]]) -> None:
-    from src.joplin import notebook_title
+    from src.joplin import notebook_path
 
     table = Table(title=f"[DRY RUN] {prov_name.upper()} — Would sync {len(pending)} conversation(s)")
    table.add_column("Title")
@@ -1016,9 +1090,12 @@ def _print_joplin_dry_run_table(prov_name: str, pending: list[tuple[str, dict]])
     table.add_column("Action")
 
     for conv_id, entry in pending[:50]:
-        title = entry.get("title") or "Untitled"
+        raw_title = entry.get("title") or "Untitled"
+        created_date = (entry.get("created_at") or "")[:10]
+        title = f"{created_date} {raw_title}" if created_date else raw_title
         project = entry.get("project") or "no-project"
-        nb = notebook_title(prov_name, entry.get("project"))
+        parent, child = notebook_path(prov_name, entry.get("project"))
+        nb = f"{parent} / {child}"
         action = "update" if entry.get("joplin_note_id") else "create"
         table.add_row(title[:50], project[:30], nb, action)
 
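The date-prefix logic that now appears in both the sync loop and the dry-run table above can be sketched as one helper. The name `joplin_title` is invented here for illustration, and chronological sorting inside a notebook is the presumed motivation:

```python
def joplin_title(entry: dict) -> str:
    # Prefix the note title with the conversation's creation date (YYYY-MM-DD),
    # falling back gracefully when created_at is missing from older manifest
    # entries, as in the two hunks above.
    raw_title = entry.get("title") or "Untitled"
    created_date = (entry.get("created_at") or "")[:10]
    return f"{created_date} {raw_title}" if created_date else raw_title

print(joplin_title({"title": "Refactor plan", "created_at": "2024-05-01T12:00:00Z"}))
# → 2024-05-01 Refactor plan
print(joplin_title({"title": None}))
# → Untitled
```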
src/providers/base.py
@@ -9,6 +9,7 @@ from typing import Any
 
 import requests
 
+from src.loss_report import LossReport
 from src.utils import redact_secrets
 
 # curl_cffi has its own exception hierarchy (rooted at CurlError → OSError),
@@ -89,8 +90,14 @@ class BaseProvider(ABC):
         """Return the full conversation detail for a single ID."""
 
     @abstractmethod
-    def normalize_conversation(self, raw: dict) -> dict:
-        """Transform provider-specific schema to the common normalized schema."""
+    def normalize_conversation(self, raw: dict, loss_report: LossReport | None = None) -> dict:
+        """Transform provider-specific schema to the common normalized schema.
+
+        ``loss_report`` accumulates counts of dropped/unhandled content so the
+        export loop can surface a single summary at the end. When None, providers
+        construct a throwaway local report (so calling normalize_conversation in
+        isolation, e.g. from tests, doesn't crash).
+        """
 
     # ------------------------------------------------------------------
     # Concrete helpers
@@ -326,7 +333,7 @@ class BaseProvider(ABC):
             msg = (
                 f"[{self.provider_name}] Authentication failed (401 Unauthorized). "
                 "Your session token has likely expired. "
-                "Run 'python -m src.main auth' to refresh your token."
+                "Run 'ai-chat-exporter auth' to refresh your token."
            )
             logger.error(msg)
             raise ProviderError(
src/providers/chatgpt.py
@@ -25,6 +25,20 @@ from typing import Any
 
 from curl_cffi import requests as curl_requests
 
+from src.blocks import (
+    UNKNOWN_REASON_EXTRACTION_FAILED,
+    UNKNOWN_REASON_UNKNOWN_FIELD_IN_KNOWN_TYPE,
+    UNKNOWN_REASON_UNKNOWN_TYPE,
+    make_code_block,
+    make_file_placeholder,
+    make_hidden_context_marker,
+    make_image_placeholder,
+    make_text_block,
+    make_thinking_block,
+    make_tool_result_block,
+    make_unknown_block,
+)
+from src.loss_report import LossReport
 from src.providers.base import BaseProvider, ProviderError, REQUEST_TIMEOUT
 
 logger = logging.getLogger(__name__)
@@ -56,6 +70,7 @@ class ChatGPTProvider(BaseProvider):
     def __init__(
         self,
         session_token: str | None = None,
+        session_token_1: str | None = None,
         project_ids: list[str] | None = None,
     ) -> None:
         # Pass a curl_cffi session to the base class instead of a requests.Session.
@@ -77,11 +92,15 @@ class ChatGPTProvider(BaseProvider):
                 "init",
                 RuntimeError(
                     "CHATGPT_SESSION_TOKEN is not set. "
-                    "Run 'python -m src.main auth' to configure it."
+                    "Run 'ai-chat-exporter auth' to configure it."
                 ),
             )
         self._session_token = token
 
+        # Second chunk of the session token (ChatGPT splits large cookies into
+        # __Secure-next-auth.session-token.0 and .1 to stay under the 4KB limit).
+        token_1 = session_token_1 or os.getenv("CHATGPT_SESSION_TOKEN_1", "").strip() or None
+
         # Project gizmo IDs (g-p-xxx) whose conversations we'll fetch.
         # ChatGPT project conversations do not appear in the default
         # /conversations listing — they require explicit project IDs.
@@ -93,13 +112,24 @@ class ChatGPTProvider(BaseProvider):
         # Cache of project_id → display name (avoids re-fetching gizmo details)
         self._project_name_cache: dict[str, str] = {}
 
-        # Set the session cookie in the cookie jar
+        # ChatGPT now splits large session cookies into .0 / .1 chunks.
+        # Always send both named chunks; the server reassembles them.
         self._session.cookies.set(
-            "__Secure-next-auth.session-token",
+            "__Secure-next-auth.session-token.0",
             token,
             domain="chatgpt.com",
             path="/",
         )
+        if token_1:
+            self._session.cookies.set(
+                "__Secure-next-auth.session-token.1",
+                token_1,
+                domain="chatgpt.com",
+                path="/",
+            )
+            logger.debug("[chatgpt] Set both session cookie chunks (.0 and .1)")
+        else:
+            logger.debug("[chatgpt] Set session cookie chunk .0 only (no .1 configured)")
 
         # Set only Referer and sec-fetch-* headers for the auth exchange.
         # Origin is intentionally omitted: Chrome does not send Origin on
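The .0/.1 chunking can be illustrated with a round-trip sketch. This is an assumption-laden model: the real chunk boundary and reassembly happen inside NextAuth and the browser, and `split_cookie`, `join_cookie`, and `CHUNK_SIZE` are invented here purely for illustration:

```python
# Browsers cap individual cookies near 4KB, so large tokens get split into
# suffix-numbered chunks that the server re-joins in suffix order.
CHUNK_SIZE = 3000  # illustrative only; the real boundary is chosen server-side

def split_cookie(name: str, value: str) -> dict[str, str]:
    chunks = [value[i:i + CHUNK_SIZE] for i in range(0, len(value), CHUNK_SIZE)]
    if len(chunks) == 1:
        return {name: value}  # small tokens keep the unsuffixed cookie name
    return {f"{name}.{i}": chunk for i, chunk in enumerate(chunks)}

def join_cookie(name: str, jar: dict[str, str]) -> str:
    if name in jar:
        return jar[name]
    parts = sorted((k for k in jar if k.startswith(name + ".")),
                   key=lambda k: int(k.rsplit(".", 1)[1]))
    return "".join(jar[k] for k in parts)

token = "eyJ" + "x" * 5000  # a fake oversized token
jar = split_cookie("__Secure-next-auth.session-token", token)
assert set(jar) == {"__Secure-next-auth.session-token.0",
                    "__Secure-next-auth.session-token.1"}
assert join_cookie("__Secure-next-auth.session-token", jar) == token
```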
@@ -157,7 +187,7 @@ class ChatGPTProvider(BaseProvider):
                 "fetch_access_token",
                 RuntimeError(
                     "No accessToken in /api/auth/session response. "
-                    "Your session token may be expired — run 'python -m src.main auth' to refresh."
+                    "Your session token may be expired — run 'ai-chat-exporter auth' to refresh."
                 ),
             )
         return access_token
@@ -169,7 +199,7 @@ class ChatGPTProvider(BaseProvider):
                 "The session token is used to obtain a short-lived access token via /api/auth/session. "
                 "To refresh: open chatgpt.com in Chrome → F12 → Application → Cookies "
                 "→ find '__Secure-next-auth.session-token' → copy the value. "
-                "Then run 'python -m src.main auth' or update CHATGPT_SESSION_TOKEN in .env."
+                "Then run 'ai-chat-exporter auth' or update CHATGPT_SESSION_TOKEN in .env."
             )
             logger.error(msg)
             raise ProviderError(
@@ -369,7 +399,7 @@ class ChatGPTProvider(BaseProvider):
             logger.info(
                 "[chatgpt] No project IDs configured — skipping project conversations. "
                 "To include projects, set CHATGPT_PROJECT_IDS in .env "
-                "(see 'python -m src.main auth' for instructions)."
+                "(see 'ai-chat-exporter auth' for instructions)."
             )
             return self._apply_since_filter(default_convs, since)
 
@@ -535,7 +565,7 @@ class ChatGPTProvider(BaseProvider):
     # Normalization
     # ------------------------------------------------------------------
 
-    def normalize_conversation(self, raw: dict) -> dict:
+    def normalize_conversation(self, raw: dict, loss_report: LossReport | None = None) -> dict:
         """Transform ChatGPT raw schema to the common normalized schema.
 
         ChatGPT stores messages in a nested ``mapping`` dict where each node
@@ -546,21 +576,32 @@ class ChatGPTProvider(BaseProvider):
         fetch_all_conversations). The conversation detail endpoint does not
         include project information.
         """
-        conv_id = raw.get("id", "")
+        report = loss_report if loss_report is not None else LossReport()
+        # ChatGPT's /backend-api/conversation/<id> response uses ``conversation_id``
+        # at the top level (not ``id``); fixtures and listing summaries use ``id``.
+        # Read both so both code paths populate the normalized ``id`` correctly.
+        conv_id = raw.get("id") or raw.get("conversation_id") or ""
         title = raw.get("title") or "Untitled"
         created_at = _ts_to_iso(raw.get("create_time"))
         updated_at = _ts_to_iso(raw.get("update_time"))
 
-        # Look up project name from the map built during fetch_all_conversations.
-        project = self._project_map.get(conv_id) if conv_id else None
+        # Prefer _project_name annotation injected from the listing summary
+        # (propagated by the export loop). Fall back to _project_map lookup.
+        project = raw.get("_project_name") or (
+            self._project_map.get(conv_id) if conv_id else None
+        )
         logger.debug(
-            "[chatgpt] normalize_conversation[%s]: project_map lookup → %r",
+            "[chatgpt] normalize_conversation[%s]: project=%r (source=%s)",
             conv_id[:8] if conv_id else "?",
             project,
+            "_project_name" if raw.get("_project_name") else "_project_map",
         )
 
         mapping: dict = raw.get("mapping", {})
-        messages = _extract_messages(mapping, raw, conv_id)
+        messages = _extract_messages(mapping, raw, conv_id, report)
+        for _ in messages:
+            report.record_message()
+        report.record_conversation()
 
         return {
             "id": conv_id,
@@ -590,14 +631,18 @@ def _ts_to_iso(ts: float | int | str | None) -> str:


 def _extract_messages(
-    mapping: dict[str, Any], raw: dict, conv_id: str
+    mapping: dict[str, Any], raw: dict, conv_id: str, report: LossReport
 ) -> list[dict]:
-    """Walk the ChatGPT conversation mapping tree to produce an ordered message list."""
+    """Walk the ChatGPT conversation mapping tree to produce an ordered message list.
+
+    All roles (user/assistant/system/tool) are processed; the prior filter that
+    dropped non-user/assistant messages is lifted in v0.4.0 — truly empty
+    messages skip via the empty-content guard, anything with content renders.
+    """
     if not mapping:
         logger.warning("[chatgpt] Conversation %s has empty mapping", conv_id[:8])
         return []

-    # Find the root node (the one that has no parent, or whose parent is None/not in mapping)
     root_id = _find_root(mapping)
     if root_id is None:
         logger.warning(
@@ -615,39 +660,12 @@ def _extract_messages(

         node = mapping.get(node_id, {})
         msg_data = node.get("message")

         if msg_data:
-            role = msg_data.get("author", {}).get("role", "")
-            # Skip system/tool messages silently unless they have visible content
-            if role in ("user", "assistant"):
-                content_obj = msg_data.get("content", {})
-                content_type = content_obj.get("content_type", "text")
-                text = _extract_text(content_obj, conv_id, node_id)
-
-                if content_type != "text":
-                    logger.warning(
-                        "[chatgpt] Skipping %s content in conversation %s message %s "
-                        "— rich content not yet supported (see FUTURE.md)",
-                        content_type,
-                        conv_id[:8],
-                        node_id[:8],
-                    )
-                elif text:
-                    ts = msg_data.get("create_time")
-                    messages.append(
-                        {
-                            "role": role,
-                            "content": text,
-                            "content_type": "text",
-                            "timestamp": _ts_to_iso(ts) if ts else None,
-                        }
-                    )
-                else:
-                    logger.debug(
-                        "[chatgpt] Skipping empty message in conversation %s", conv_id[:8]
-                    )
+            built = _build_message(msg_data, conv_id, node_id, report)
+            if built is not None:
+                messages.append(built)

-        # Walk children in order (ChatGPT typically has one child per node in a linear chat)
+        # Walk children in order (linear in typical conversations)
         for child_id in node.get("children", []):
             walk(child_id)

@@ -669,27 +687,529 @@ def _find_root(mapping: dict[str, Any]) -> str | None:
     return None


-def _extract_text(content_obj: dict, conv_id: str, node_id: str) -> str:
-    """Extract plain text from a ChatGPT content object."""
-    parts = content_obj.get("parts", [])
-    if not parts:
-        return ""
-
-    text_parts = []
-    for part in parts:
-        if isinstance(part, str):
-            text_parts.append(part)
-        elif isinstance(part, dict):
-            # Could be an image or file reference — skip and warn
-            part_type = part.get("content_type", "unknown")
-            if part_type != "text":
-                logger.warning(
-                    "[chatgpt] Skipping %s attachment in conversation %s "
-                    "— rich content not yet supported (see FUTURE.md)",
-                    part_type,
-                    conv_id[:8],
-                )
-            else:
-                text_parts.append(part.get("text", ""))
-
-    return "\n".join(t for t in text_parts if t)
+def _build_message(
+    msg_data: dict, conv_id: str, node_id: str, report: LossReport
+) -> dict | None:
+    """Construct a normalized message dict (with ``blocks``) for one ChatGPT node.
+
+    Returns None for messages that should be skipped (truly empty). Otherwise
+    returns a dict with ``role``, ``content_type``, ``timestamp``, ``blocks``.
+    """
+    author = msg_data.get("author") or {}
+    role = author.get("role", "") or ""
+    if role not in ("user", "assistant", "system", "tool"):
+        # Unrecognised role — log and surface, but pass through so role metadata
+        # is preserved for the reader.
+        logger.debug(
+            "[chatgpt] Unrecognised role %r in conversation %s message %s",
+            role,
+            conv_id[:8],
+            node_id[:8],
+        )
+
+    content_obj = msg_data.get("content") or {}
+    content_type = content_obj.get("content_type", "text")
+    ts = msg_data.get("create_time")
+    metadata = msg_data.get("metadata") or {}
+    is_hidden = bool(metadata.get("is_visually_hidden_from_conversation"))
+    author_name = author.get("name") or None
+
+    blocks = _extract_blocks_for_content(
+        content_type, content_obj, role, conv_id, node_id, report,
+        author_name=author_name, msg_metadata=metadata,
+    )
+
+    if not blocks:
+        logger.debug(
+            "[chatgpt] Skipping empty %s message in conversation %s",
+            content_type,
+            conv_id[:8],
+        )
+        return None
+
+    if is_hidden:
+        # Prepend a marker so the reader knows this message is hidden in the
+        # source UI. The marker is content-type-agnostic.
+        blocks = [make_hidden_context_marker(content_type)] + blocks
+
+    # Vestigial content_type: "code" for code-only messages, otherwise "text"
+    msg_content_type = "code" if (
+        len(blocks) == 1 and blocks[0].get("type") == "code"
+    ) else "text"
+
+    return {
+        "role": role or "user",
+        "content_type": msg_content_type,
+        "timestamp": _ts_to_iso(ts) if ts else None,
+        "blocks": blocks,
+    }
+
+
+# Content types whose ``parts`` are plain text strings.
+_PLAIN_TEXT_PARTS_TYPES = {"text"}
+# Content types that carry inline reasoning/thoughts.
+_THINKING_TYPES = {"thoughts", "reasoning_recap"}
+# Custom-Instructions / model-context types — direct fields, NOT parts.
+_DIRECT_FIELD_CONTEXT_TYPES = {
+    "user_editable_context",
+    "model_editable_context",
+}
+# Known direct fields per context type. Anything not listed but non-null
+# becomes an `unknown` block per the no-silent-drop-of-non-null-fields rule.
+_USER_EDITABLE_CONTEXT_KNOWN_FIELDS = ("user_profile", "user_instructions")
+_MODEL_EDITABLE_CONTEXT_KNOWN_FIELDS = (
+    "model_set_context",
+    "repository",
+    "repo_summary",
+    "structured_context",
+)
+
+
+def _extract_blocks_for_content(
+    content_type: str,
+    content_obj: dict,
+    role: str,
+    conv_id: str,
+    node_id: str,
+    report: LossReport,
+    author_name: str | None = None,
+    msg_metadata: dict | None = None,
+) -> list[dict]:
+    """Dispatch on content_type and return a list of blocks for one message."""
+
+    if content_type in _PLAIN_TEXT_PARTS_TYPES:
+        return _extract_text_content_type_blocks(content_obj, conv_id, node_id, report)
+
+    if content_type == "multimodal_text":
+        return _extract_multimodal_blocks(content_obj, role, conv_id, node_id, report)
+
+    if content_type == "execution_output":
+        return _extract_execution_output_blocks(
+            content_obj, author_name, msg_metadata or {}, conv_id, node_id
+        )
+
+    if content_type == "system_error":
+        return _extract_system_error_blocks(content_obj, author_name)
+
+    if content_type == "tether_browsing_display":
+        return _extract_tether_browsing_display_blocks(
+            content_obj, author_name, conv_id, node_id
+        )
+
+    if content_type == "code":
+        code_text = content_obj.get("text") or "\n".join(
+            p for p in content_obj.get("parts", []) if isinstance(p, str)
+        )
+        language = content_obj.get("language", "") or ""
+        block = make_code_block(code_text, language)
+        return [block] if block else []
+
+    if content_type in _THINKING_TYPES:
+        text = _join_string_parts(content_obj)
+        block = make_thinking_block(text)
+        return [block] if block else []
+
+    if content_type in _DIRECT_FIELD_CONTEXT_TYPES:
+        return _extract_editable_context_blocks(
+            content_type, content_obj, conv_id, node_id, report
+        )
+
+    if content_type == "image_asset_pointer":
+        # Top-level image (rare — usually nested inside multimodal_text).
+        ref = content_obj.get("asset_pointer", "")
+        source = "user_upload" if role == "user" else "model_generated"
+        return [make_image_placeholder(ref=ref, source=source)]
+
+    # Unknown content_type → visible unknown block + WARNING + tally
+    keys = list(content_obj.keys())
+    logger.warning(
+        "[chatgpt] Unknown content_type %r in conversation %s message %s "
+        "— see plan §Data-loss visibility (rendering as unknown block)",
+        content_type,
+        conv_id[:8],
+        node_id[:8],
+    )
+    report.record_unknown(content_type or "?")
+    return [
+        make_unknown_block(
+            raw_type=content_type or "?",
+            observed_keys=keys,
+            reason=UNKNOWN_REASON_UNKNOWN_TYPE,
+        )
+    ]
+
+
+def _extract_text_content_type_blocks(
+    content_obj: dict, conv_id: str, node_id: str, report: LossReport
+) -> list[dict]:
+    """Extract blocks for ``content_type == "text"``.
+
+    Plural-parts rule: emit ONE text block per message with all string parts
+    joined by ``\\n``. Don't emit one block per part.
+
+    Dict parts inside a text content_type message (the suspected o1/o3 reasoning
+    subpart shape ``{"summary": ..., "content": ...}``) are preserved as text
+    today — defensive behavior pending real-data capture in v0.4.1.
+    """
+    parts = content_obj.get("parts", []) or []
+    string_chunks: list[str] = []
+
+    for part in parts:
+        if isinstance(part, str):
+            string_chunks.append(part)
+        elif isinstance(part, dict):
+            part_type = part.get("content_type", "")
+            if part_type == "text":
+                txt = part.get("text", "") or ""
+                if txt:
+                    string_chunks.append(txt)
+            elif "content" in part:
+                # Suspected o1/o3 reasoning subpart. Defensive: preserve as text
+                # block (matches current behavior). v0.4.1 reclassifies once
+                # the real shape is captured live.
+                content_val = part.get("content", "") or ""
+                if content_val:
+                    string_chunks.append(content_val)
+            elif part_type:
+                # Non-text dict part inside a text content_type — surface it.
+                logger.warning(
+                    "[chatgpt] Unexpected %s part inside text content_type "
+                    "in conversation %s message %s — rendering as unknown block",
+                    part_type,
+                    conv_id[:8],
+                    node_id[:8],
+                )
+                report.record_unknown(part_type)
+                # Inline mark in the joined text so order is preserved.
+                string_chunks.append(
+                    f"\n[Unknown part: type={part_type}; "
+                    f"keys={list(part.keys())[:10]}]\n"
+                )
+
+    joined = "\n".join(c for c in string_chunks if c)
+    block = make_text_block(joined)
+    return [block] if block else []
+
+
+def _join_string_parts(content_obj: dict) -> str:
+    """Helper: join all string parts in ``parts`` with newlines."""
+    parts = content_obj.get("parts", []) or []
+    return "\n".join(p for p in parts if isinstance(p, str) and p)
+
+
+def _extract_multimodal_blocks(
+    content_obj: dict, role: str, conv_id: str, node_id: str, report: LossReport
+) -> list[dict]:
+    """Extract blocks from a ``multimodal_text`` content object.
+
+    Walks ``parts`` in array order — order varies between user and assistant
+    turns, and the extractor preserves source ordering. Emits text +
+    image_placeholder + file_placeholder blocks per part.
+    """
+    parts = content_obj.get("parts", []) or []
+    blocks: list[dict] = []
+
+    for part in parts:
+        if isinstance(part, str):
+            block = make_text_block(part)
+            if block:
+                blocks.append(block)
+            continue
+
+        if not isinstance(part, dict):
+            continue
+
+        part_type = part.get("content_type", "")
+
+        if part_type == "audio_transcription":
+            txt = part.get("text", "") or ""
+            block = make_text_block(txt)
+            if block:
+                blocks.append(block)
+            elif "text" not in part:
+                logger.warning(
+                    "[chatgpt] audio_transcription part missing 'text' key "
+                    "in conversation %s message %s",
+                    conv_id[:8],
+                    node_id[:8],
+                )
+                report.record_extraction_failure("audio_transcription")
+                blocks.append(
+                    make_unknown_block(
+                        raw_type="audio_transcription",
+                        observed_keys=list(part.keys()),
+                        reason=UNKNOWN_REASON_EXTRACTION_FAILED,
+                        summary="expected key 'text' not found",
+                    )
+                )
+            continue
+
+        if part_type == "image_asset_pointer":
+            ref = part.get("asset_pointer", "")
+            source = "user_upload" if role == "user" else "model_generated"
+            mime = None
+            blocks.append(make_image_placeholder(ref=ref, source=source, mime=mime))
+            continue
+
+        if part_type == "audio_asset_pointer":
+            blocks.append(_audio_asset_placeholder(part))
+            continue
+
+        if part_type == "real_time_user_audio_video_asset_pointer":
+            # Wrapper carrying a nested audio_asset_pointer + optional video frames.
+            nested_audio = part.get("audio_asset_pointer")
+            if isinstance(nested_audio, dict):
+                blocks.append(_audio_asset_placeholder(nested_audio))
+            else:
+                logger.warning(
+                    "[chatgpt] real_time_user_audio_video_asset_pointer missing "
+                    "nested audio_asset_pointer in conversation %s message %s",
+                    conv_id[:8],
+                    node_id[:8],
+                )
+                report.record_extraction_failure(
+                    "real_time_user_audio_video_asset_pointer"
+                )
+                blocks.append(
+                    make_unknown_block(
+                        raw_type="real_time_user_audio_video_asset_pointer",
+                        observed_keys=list(part.keys()),
+                        reason=UNKNOWN_REASON_EXTRACTION_FAILED,
+                        summary="expected nested 'audio_asset_pointer' not found",
+                    )
+                )
+
+            frames = part.get("frames_asset_pointers") or []
+            if frames:
+                # Defensive: empty in all observed cases, but if non-empty
+                # surface as a separate file placeholder.
+                video_ref = part.get("video_container_asset_pointer") or "(video frames)"
+                blocks.append(
+                    make_file_placeholder(
+                        ref=str(video_ref),
+                        mime="video/unknown",
+                    )
+                )
+            continue
+
+        # Anything else inside multimodal_text — visible unknown block
+        logger.warning(
+            "[chatgpt] Unknown multimodal_text part type %r in conversation %s message %s",
+            part_type,
+            conv_id[:8],
+            node_id[:8],
+        )
+        report.record_unknown(part_type or "?")
+        blocks.append(
+            make_unknown_block(
+                raw_type=part_type or "?",
+                observed_keys=list(part.keys()),
+                reason=UNKNOWN_REASON_UNKNOWN_TYPE,
+            )
+        )
+
+    return blocks
+
+
+def _audio_asset_placeholder(audio_part: dict) -> dict:
+    """Build a file_placeholder for an audio_asset_pointer dict.
+
+    Handles missing/zero metadata defensively.
+    """
+    ref = audio_part.get("asset_pointer", "") or ""
+    fmt = audio_part.get("format") or "unknown"
+    size_bytes = audio_part.get("size_bytes")
+    if not isinstance(size_bytes, int) or size_bytes <= 0:
+        size_bytes = None
+    metadata = audio_part.get("metadata") or {}
+    start = metadata.get("start") if isinstance(metadata, dict) else None
+    end = metadata.get("end") if isinstance(metadata, dict) else None
+    duration: float | None = None
+    if isinstance(start, (int, float)) and isinstance(end, (int, float)):
+        diff = float(end) - float(start)
+        if diff > 0:
+            duration = diff
+    return make_file_placeholder(
+        ref=ref,
+        mime=f"audio/{fmt}" if fmt else "audio/unknown",
+        size_bytes=size_bytes,
+        duration_seconds=duration,
+    )
+
+
+def _extract_editable_context_blocks(
+    content_type: str, content_obj: dict, conv_id: str, node_id: str, report: LossReport
+) -> list[dict]:
+    """Extract blocks from user_editable_context / model_editable_context messages.
+
+    These have no ``parts`` field — they carry direct keys. Read all known
+    fields, emit one labeled fenced block per non-null known field, and emit an
+    ``unknown`` block for any unrecognised non-null direct field (no-silent-drop
+    rule).
+    """
+    if content_type == "user_editable_context":
+        known_fields: tuple[str, ...] = _USER_EDITABLE_CONTEXT_KNOWN_FIELDS
+    elif content_type == "model_editable_context":
+        known_fields = _MODEL_EDITABLE_CONTEXT_KNOWN_FIELDS
+    else:
+        known_fields = ()
+
+    blocks: list[dict] = []
+    label_kind = "Custom Instructions" if content_type == "user_editable_context" else "Model Context"
+
+    for field in known_fields:
+        value = content_obj.get(field)
+        if value is None or (isinstance(value, str) and not value.strip()):
+            continue
+        if isinstance(value, (dict, list)):
+            # Render as a JSON-rendered text block. _safe_fence will wrap it.
+            import json as _json
+            rendered = _json.dumps(value, indent=2, default=str, ensure_ascii=False)
+        else:
+            rendered = str(value)
+        label = f"**{label_kind} — {field}:**"
+        # Emit as text block; the renderer's _safe_fence wraps the raw value.
+        # We use a "labeled fenced block" pattern: header line + raw content
+        # joined inside one text block, where the renderer will leave it alone.
+        # To get the safe-fence wrap we use a code block (which calls _safe_fence
+        # internally and renders without language-hint corruption risk).
+        blocks.append(make_text_block(label))
+        code_block = make_code_block(rendered, language="")
+        if code_block:
+            blocks.append(code_block)
+
+    # Catch unknown non-null direct fields (no-silent-drop rule).
+    structural_keys = {"content_type", "parts"}
+    for key, value in content_obj.items():
+        if key in structural_keys or key in known_fields:
+            continue
+        if value is None:
+            continue
+        # Reject null/empty containers.
+        if isinstance(value, (str, list, dict)) and not value:
+            continue
+        logger.warning(
+            "[chatgpt] Unknown non-null field %r in %s message %s/%s",
+            key,
+            content_type,
+            conv_id[:8],
+            node_id[:8],
+        )
+        report.record_unknown(f"{content_type}.{key}")
+        blocks.append(
+            make_unknown_block(
+                raw_type=f"{content_type}.{key}",
+                observed_keys=list(content_obj.keys()),
+                reason=UNKNOWN_REASON_UNKNOWN_FIELD_IN_KNOWN_TYPE,
+                summary=f"unknown non-null field '{key}' in {content_type}",
+            )
+        )
+
+    return blocks
+
+
+def _extract_execution_output_blocks(
+    content_obj: dict,
+    author_name: str | None,
+    msg_metadata: dict,
+    conv_id: str,
+    node_id: str,
+) -> list[dict]:
+    """Map a ChatGPT ``execution_output`` content (Code Interpreter / container.exec
+    / python tool output) onto a ``tool_result`` block.
+
+    Locked shape (captured live during planning v0.4.1):
+        content.text → output
+        author.name → tool_name
+        metadata.aggregate_result.status → "error" → is_error=True
+        metadata.reasoning_title → summary
+
+    Empty ``content.text`` → skip (DEBUG log) — a tool that emits no output is
+    a transient artifact, not archival content.
+    """
+    text = content_obj.get("text") or ""
+    if not text.strip():
+        logger.debug(
+            "[chatgpt] Skipping empty execution_output in conversation %s message %s",
+            conv_id[:8],
+            node_id[:8],
+        )
+        return []
+
+    aggregate = msg_metadata.get("aggregate_result") or {}
+    status = aggregate.get("status") if isinstance(aggregate, dict) else None
+    is_error = isinstance(status, str) and status.lower() == "error"
+    summary = msg_metadata.get("reasoning_title") or None
+
+    return [
+        make_tool_result_block(
+            output=text,
+            tool_name=author_name,
+            is_error=is_error,
+            summary=summary,
+        )
+    ]
+
+
+def _extract_system_error_blocks(
+    content_obj: dict,
+    author_name: str | None,
+) -> list[dict]:
+    """Map a ChatGPT ``system_error`` content onto an error ``tool_result`` block.
+
+    Captured shape: ``{content_type, name, text}`` where ``text`` is the error
+    message (e.g. ``"Error: Error from browse service: 503"``). ``author.name``
+    identifies the failing tool (e.g. ``"web"``).
+    """
+    text = content_obj.get("text") or ""
+    if not text:
+        text = "(error with no message)"
+    return [
+        make_tool_result_block(
+            output=text,
+            tool_name=author_name,
+            is_error=True,
+        )
+    ]
+
+
+def _extract_tether_browsing_display_blocks(
+    content_obj: dict,
+    author_name: str | None,
+    conv_id: str,
+    node_id: str,
+) -> list[dict]:
+    """Handle ChatGPT's ``tether_browsing_display`` content.
+
+    Captured live: most instances are **spinner placeholders** (transient UI
+    state — empty fields, ``metadata.command == "spinner"``). The actual
+    retrieval content arrives as a sibling/child ``multimodal_text`` message
+    that already extracts cleanly via the existing handler.
+
+    Locked behavior:
+    - If ``result`` AND ``summary`` are both empty → skip silently (DEBUG).
+      These are spinners; the real content is elsewhere.
+    - Otherwise (defensive: never observed populated in real data) → render
+      as a ``tool_result`` block carrying ``result`` as output and
+      ``summary`` as the optional summary line.
+    """
+    result = content_obj.get("result") or ""
+    summary = content_obj.get("summary") or ""
+
+    if not result.strip() and not summary.strip():
+        logger.debug(
+            "[chatgpt] Skipping tether_browsing_display spinner in "
+            "conversation %s message %s (empty result/summary)",
+            conv_id[:8],
+            node_id[:8],
+        )
+        return []
+
+    return [
+        make_tool_result_block(
+            output=result or summary,
+            tool_name=author_name,
+            is_error=False,
+            summary=summary if result and summary else None,
+        )
+    ]
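The ChatGPT extractor above is, at its core, a dispatch on `content_type` with a no-silent-drop fallback. A minimal standalone sketch of that pattern (block dicts and function name here are illustrative simplifications, not the project's actual `make_*` API):

```python
def extract_blocks_sketch(content: dict) -> list[dict]:
    """Dispatch on content_type; unrecognised types become visible placeholders."""
    ctype = content.get("content_type", "text")

    if ctype == "text":
        # Plural-parts rule: one text block per message, string parts joined by newline.
        joined = "\n".join(
            p for p in content.get("parts", []) if isinstance(p, str) and p
        )
        return [{"type": "text", "text": joined}] if joined else []

    if ctype == "code":
        return [{
            "type": "code",
            "text": content.get("text", "") or "",
            "language": content.get("language", "") or "",
        }]

    # No silent drops: anything unrecognised is surfaced, not discarded.
    return [{
        "type": "unknown",
        "raw_type": ctype,
        "observed_keys": sorted(content.keys()),
    }]
```

Each new `content_type` handler in the hunk above slots in as another branch of this dispatch, always ending in the visible-unknown fallback.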
@@ -5,6 +5,17 @@ import os

 from curl_cffi import requests as curl_requests

+from src.blocks import (
+    UNKNOWN_REASON_EXTRACTION_FAILED,
+    UNKNOWN_REASON_UNKNOWN_TYPE,
+    make_image_placeholder,
+    make_text_block,
+    make_thinking_block,
+    make_tool_result_block,
+    make_tool_use_block,
+    make_unknown_block,
+)
+from src.loss_report import LossReport
 from src.providers.base import BaseProvider, ProviderError

 logger = logging.getLogger(__name__)
@@ -39,7 +50,7 @@ class ClaudeProvider(BaseProvider):
                 "init",
                 RuntimeError(
                     "CLAUDE_SESSION_KEY is not set. "
-                    "Run 'python -m src.main auth' to configure it."
+                    "Run 'ai-chat-exporter auth' to configure it."
                 ),
             )
         # Set sessionKey in the cookie jar
@@ -60,7 +71,7 @@ class ClaudeProvider(BaseProvider):
                 "Note: Claude session keys are opaque — a 401 is the only expiry signal. "
                 "To refresh: open claude.ai in Chrome → F12 → Application → Cookies "
                 "→ find 'sessionKey' → copy the value. "
-                "Then run 'python -m src.main auth' or update CLAUDE_SESSION_KEY in .env."
+                "Then run 'ai-chat-exporter auth' or update CLAUDE_SESSION_KEY in .env."
             )
             logger.error(msg)
             raise ProviderError(
@@ -161,8 +172,9 @@ class ClaudeProvider(BaseProvider):

         return data

-    def normalize_conversation(self, raw: dict) -> dict:
+    def normalize_conversation(self, raw: dict, loss_report: LossReport | None = None) -> dict:
         """Transform Claude raw schema to the common normalized schema."""
+        report = loss_report if loss_report is not None else LossReport()
         conv_id = raw.get("uuid") or raw.get("id", "")
         title = raw.get("name") or raw.get("title") or "Untitled"
         created_at = raw.get("created_at") or raw.get("create_time") or ""
@@ -178,40 +190,37 @@ class ClaudeProvider(BaseProvider):

         # Messages
         raw_messages = raw.get("chat_messages") or raw.get("messages") or []
-        messages = []
+        messages: list[dict] = []

         for msg in raw_messages:
             role = _map_role(msg.get("sender") or msg.get("role", ""))
             if not role:
                 continue

-            # Content can be a string or a list of content blocks
-            content_raw = msg.get("content") or msg.get("text") or ""
-            content, skipped_types = _extract_claude_text(content_raw, conv_id)
-
-            for ctype in skipped_types:
-                logger.warning(
-                    "[claude] Skipping %s content in conversation %s "
-                    "— rich content not yet supported (see FUTURE.md)",
-                    ctype,
-                    conv_id[:8],
-                )
-
+            content_raw = msg.get("content") if "content" in msg else msg.get("text", "")
+            blocks = _extract_claude_blocks(content_raw, conv_id, report)

             timestamp = msg.get("created_at") or msg.get("timestamp") or None

-            if content is None:
+            if not blocks:
                 logger.debug("[claude] Skipping empty message in conversation %s", conv_id[:8])
                 continue

+            content_type = "text"

             messages.append(
                 {
                     "role": role,
-                    "content": content,
-                    "content_type": "text",
+                    "content_type": content_type,
                     "timestamp": timestamp,
+                    "blocks": blocks,
                 }
             )

+        for _ in messages:
+            report.record_message()
+        report.record_conversation()

         return {
             "id": conv_id,
             "title": title,
@@ -242,43 +251,134 @@ def _map_role(sender: str) -> str | None:
     return mapping.get(sender.lower()) if sender else None


-def _extract_claude_text(
-    content: str | list | dict, conv_id: str
-) -> tuple[str | None, list[str]]:
-    """Extract plain text from a Claude content field.
-
-    Returns:
-        (text_or_None, list_of_skipped_content_types)
-    """
-    skipped: list[str] = []
+def _extract_claude_blocks(
+    content: str | list | dict | None, conv_id: str, report: LossReport
+) -> list[dict]:
+    """Extract typed blocks from a Claude content field.
+
+    Defensive dispatch — zero observed cases of rich Claude content in the
+    user's archive at planning time, so this is theory-only. Real shapes
+    will be locked in v0.4.1 once captured. Any unrecognised block type
+    surfaces as an `unknown` block + WARNING + tally.
+    """
+    if content is None:
+        return []
+
     if isinstance(content, str):
-        text = content.strip()
-        return (text if text else None), skipped
+        block = make_text_block(content)
+        return [block] if block else []

     if isinstance(content, list):
-        parts: list[str] = []
-        for block in content:
-            if isinstance(block, str):
-                parts.append(block)
-            elif isinstance(block, dict):
-                btype = block.get("type", "text")
-                if btype == "text":
-                    t = block.get("text", "").strip()
-                    if t:
-                        parts.append(t)
-                else:
-                    skipped.append(btype)
-        text = "\n".join(parts).strip()
-        return (text if text else None), skipped
+        blocks: list[dict] = []
+        for item in content:
+            if isinstance(item, str):
+                block = make_text_block(item)
+                if block:
+                    blocks.append(block)
+            elif isinstance(item, dict):
+                blocks.extend(_dispatch_claude_block(item, conv_id, report))
+        return blocks

     if isinstance(content, dict):
-        btype = content.get("type", "text")
-        if btype == "text":
-            text = content.get("text", "").strip()
-            return (text if text else None), skipped
-        else:
-            skipped.append(btype)
-            return None, skipped
+        return _dispatch_claude_block(content, conv_id, report)

-    return None, skipped
+    return []
+
+
+def _dispatch_claude_block(block: dict, conv_id: str, report: LossReport) -> list[dict]:
+    """Translate one raw Claude content block into normalized blocks."""
+    btype = block.get("type", "text")
+
+    if btype == "text":
+        block_obj = make_text_block(block.get("text", "") or "")
+        return [block_obj] if block_obj else []
+
+    if btype == "thinking":
+        # Claude extended-thinking blocks may use 'thinking' or 'text' field.
+        text = block.get("thinking") or block.get("text") or ""
+        block_obj = make_thinking_block(text)
+        return [block_obj] if block_obj else []
+
+    if btype == "tool_use":
+        return [
+            make_tool_use_block(
+                name=block.get("name", "") or "",
+                input_data=block.get("input"),
+                tool_id=block.get("id"),
+            )
+        ]
+
+    if btype == "tool_result":
+        # ``content`` may be a string or a list of nested blocks (recursive).
+        nested = block.get("content")
+        output = _flatten_tool_result_content(nested, conv_id, report)
+        return [
+            make_tool_result_block(
+                output=output,
+                tool_name=None,
+                is_error=bool(block.get("is_error")),
+            )
+        ]
+
+    if btype == "image":
+        # Source shape is unverified; try the most likely fields.
+        source = block.get("source") or {}
+        ref = ""
+        if isinstance(source, dict):
+            ref = (
+                source.get("file_uuid")
+                or source.get("media_type")
+                or source.get("url")
+                or ""
+            )
+        return [make_image_placeholder(ref=ref or "(unknown)", source="user_upload")]
+
+    # Unknown block type
+    keys = list(block.keys())
+    logger.warning(
+        "[claude] Unknown block type %r in conversation %s "
+        "— see plan §Data-loss visibility (rendering as unknown block)",
+        btype,
+        conv_id[:8],
+    )
+    report.record_unknown(btype or "?")
+    return [
+        make_unknown_block(
+            raw_type=btype or "?",
+            observed_keys=keys,
+            reason=UNKNOWN_REASON_UNKNOWN_TYPE,
+        )
+    ]
+
+
+def _flatten_tool_result_content(
+    nested: object, conv_id: str, report: LossReport
+) -> str:
+    """Flatten Claude tool_result content (string OR list of nested blocks) to text.
+
+    Recurses into nested text blocks; any non-text nested block becomes a
+    visible inline marker so non-text content isn't silently dropped.
+    """
+    if nested is None:
+        return ""
+    if isinstance(nested, str):
+        return nested
+    if isinstance(nested, list):
+        chunks: list[str] = []
+        for item in nested:
+            if isinstance(item, str):
+                chunks.append(item)
+            elif isinstance(item, dict):
+                btype = item.get("type", "text")
+                if btype == "text":
+                    chunks.append(item.get("text", "") or "")
+                else:
+                    keys = list(item.keys())[:10]
+                    report.record_extraction_failure(f"tool_result.{btype}")
+                    chunks.append(
+                        f"[Unsupported nested {btype} block; keys={keys}]"
+                    )
+        return "\n".join(c for c in chunks if c)
+    if isinstance(nested, dict):
+        return _flatten_tool_result_content([nested], conv_id, report)
+    return str(nested)
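As a rough illustration of the normalisation contract above — strings and text dicts become text blocks, anything unrecognised survives as an `unknown` block rather than being dropped — here is a self-contained sketch. The helper names here are simplified stand-ins, not the real `src.blocks` constructors:

```python
def make_text_block(text):
    # Stand-in for the real constructor: empty/whitespace content yields None.
    text = text.strip()
    return {"type": "text", "text": text} if text else None


def extract_blocks(content):
    """Normalise a mixed Claude-style content field into typed blocks."""
    if content is None:
        return []
    if isinstance(content, str):
        b = make_text_block(content)
        return [b] if b else []
    if isinstance(content, list):
        out = []
        for item in content:
            if isinstance(item, str):
                b = make_text_block(item)
            elif isinstance(item, dict) and item.get("type", "text") == "text":
                b = make_text_block(item.get("text", ""))
            elif isinstance(item, dict):
                # Unrecognised type: keep a visible marker instead of dropping it.
                b = {"type": "unknown", "raw_type": item.get("type", "?")}
            else:
                b = None
            if b:
                out.append(b)
        return out
    return []


print(extract_blocks(["hi", {"type": "text", "text": "there"}, {"type": "mystery"}]))
```

The key property the tests rely on is that the output list length never silently shrinks for non-text content; unknown types stay visible.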
@@ -50,7 +50,7 @@ def build_export_path(
         created_at: ISO8601 creation timestamp (used for year folder).
         filename: Already-generated filename from generate_filename().
         structure: OUTPUT_STRUCTURE value. One of:
-            "provider/project/year" (default)
+            "provider/project/year" (default) — project and year combined, e.g. no-project.2025/
             "provider/project"
             "provider/year"

@@ -64,14 +64,14 @@ def build_export_path(
     parts: list[str] = [provider]

     if structure == "provider/project/year":
-        parts += [project_slug, year]
+        parts += [f"{project_slug}.{year}"]
     elif structure == "provider/project":
         parts += [project_slug]
     elif structure == "provider/year":
         parts += [year]
     else:
         # Unknown structure — fall back to default
-        parts += [project_slug, year]
+        parts += [f"{project_slug}.{year}"]

     return base_dir.joinpath(*parts) / filename
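Under the combined layout above, the default structure now yields a single `project.year` folder instead of two nested ones. A minimal sketch of the resulting path logic (parameter names taken from the hunk; the real function lives alongside `generate_filename()`):

```python
from pathlib import Path


def build_export_path(base_dir: Path, provider: str, project_slug: str,
                      year: str, filename: str,
                      structure: str = "provider/project/year") -> Path:
    # Default and fallback both use the combined "project.year" folder.
    parts = [provider]
    if structure == "provider/project/year":
        parts += [f"{project_slug}.{year}"]
    elif structure == "provider/project":
        parts += [project_slug]
    elif structure == "provider/year":
        parts += [year]
    else:
        parts += [f"{project_slug}.{year}"]
    return base_dir.joinpath(*parts) / filename


print(build_export_path(Path("exports"), "chatgpt", "no-project", "2025", "chat.md"))
```

So a conversation with no project exported in 2025 lands under `chatgpt/no-project.2025/`, which is what the updated `test_year_in_path` assertion (`".2024/" in str(path)`) checks.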
157  tests/fixtures/chatgpt_conversation.json  (vendored)
@@ -8,12 +8,30 @@
     "node-root": {
       "id": "node-root",
       "parent": null,
-      "children": ["node-1"],
+      "children": ["node-uec"],
       "message": null
     },
+    "node-uec": {
+      "id": "node-uec",
+      "parent": "node-root",
+      "children": ["node-1"],
+      "message": {
+        "id": "node-uec",
+        "author": {"role": "user"},
+        "create_time": null,
+        "content": {
+          "content_type": "user_editable_context",
+          "user_profile": "Preferred name: Jesse",
+          "user_instructions": "The user provided the additional info about how they would like you to respond:\n```Always cite sources.```"
+        },
+        "metadata": {
+          "is_visually_hidden_from_conversation": true
+        }
+      }
+    },
     "node-1": {
       "id": "node-1",
-      "parent": "node-root",
+      "parent": "node-uec",
       "children": ["node-2"],
       "message": {
         "id": "node-1",
@@ -28,7 +46,7 @@
     "node-2": {
       "id": "node-2",
       "parent": "node-1",
-      "children": ["node-3"],
+      "children": ["node-mm-user"],
       "message": {
         "id": "node-2",
         "author": {"role": "assistant"},
@@ -39,18 +57,139 @@
         }
       }
     },
-    "node-3": {
-      "id": "node-3",
+    "node-mm-user": {
+      "id": "node-mm-user",
       "parent": "node-2",
-      "children": [],
+      "children": ["node-mm-assistant"],
       "message": {
-        "id": "node-3",
+        "id": "node-mm-user",
         "author": {"role": "user"},
         "create_time": 1704067300.0,
         "content": {
-          "content_type": "image_asset_pointer",
-          "parts": [{"content_type": "image_asset_pointer", "asset_pointer": "file://some-image"}]
+          "content_type": "multimodal_text",
+          "parts": [
+            {"content_type": "audio_transcription", "text": "What is the capital of France?", "direction": "in", "decoding_id": null},
+            {"content_type": "real_time_user_audio_video_asset_pointer", "frames_asset_pointers": [], "video_container_asset_pointer": null, "audio_asset_pointer": {"content_type": "audio_asset_pointer", "asset_pointer": "sediment://file_user001", "size_bytes": 50000, "format": "wav", "metadata": {"start": 0.0, "end": 2.5}}, "audio_start_timestamp": 1.0}
+          ]
+        },
+        "metadata": {"voice_mode_message": true}
       }
+    },
+    "node-mm-assistant": {
+      "id": "node-mm-assistant",
+      "parent": "node-mm-user",
+      "children": ["node-mm-user-rev"],
+      "message": {
+        "id": "node-mm-assistant",
+        "author": {"role": "assistant"},
+        "create_time": 1704067305.0,
+        "content": {
+          "content_type": "multimodal_text",
+          "parts": [
+            {"content_type": "audio_transcription", "text": "The capital of France is Paris.", "direction": "out", "decoding_id": null},
+            {"content_type": "audio_asset_pointer", "asset_pointer": "sediment://file_assistant001", "size_bytes": 80000, "format": "wav", "metadata": {"start": 0.0, "end": 3.2}}
+          ]
+        }
+      }
+    },
+    "node-mm-user-rev": {
+      "id": "node-mm-user-rev",
+      "parent": "node-mm-assistant",
+      "children": ["node-image-only"],
+      "message": {
+        "id": "node-mm-user-rev",
+        "author": {"role": "user"},
+        "create_time": 1704067400.0,
+        "content": {
+          "content_type": "multimodal_text",
+          "parts": [
+            {"content_type": "real_time_user_audio_video_asset_pointer", "frames_asset_pointers": [], "video_container_asset_pointer": null, "audio_asset_pointer": {"content_type": "audio_asset_pointer", "asset_pointer": "sediment://file_user002", "size_bytes": 30000, "format": "wav", "metadata": {"start": 0.0, "end": 1.5}}, "audio_start_timestamp": 5.0},
+            {"content_type": "audio_transcription", "text": "Tell me more please.", "direction": "in", "decoding_id": null}
+          ]
+        }
+      }
+    },
+    "node-image-only": {
+      "id": "node-image-only",
+      "parent": "node-mm-user-rev",
+      "children": ["node-exec-output"],
+      "message": {
+        "id": "node-image-only",
+        "author": {"role": "user"},
+        "create_time": 1704067500.0,
+        "content": {
+          "content_type": "multimodal_text",
+          "parts": [
+            {"content_type": "image_asset_pointer", "asset_pointer": "file-service://image001"}
+          ]
+        }
+      }
+    },
+    "node-exec-output": {
+      "id": "node-exec-output",
+      "parent": "node-image-only",
+      "children": ["node-exec-output-empty"],
+      "message": {
+        "id": "node-exec-output",
+        "author": {"role": "tool", "name": "container.exec", "metadata": {}},
+        "create_time": 1704067600.0,
+        "content": {
+          "content_type": "execution_output",
+          "text": "Hello from container.exec\nLine 2 of output"
+        },
+        "metadata": {
+          "aggregate_result": {"status": "success", "messages": []},
+          "reasoning_title": "Reading skill documentation"
+        }
+      }
+    },
+    "node-exec-output-empty": {
+      "id": "node-exec-output-empty",
+      "parent": "node-exec-output",
+      "children": ["node-system-error"],
+      "message": {
+        "id": "node-exec-output-empty",
+        "author": {"role": "tool", "name": "python", "metadata": {}},
+        "create_time": 1704067610.0,
+        "content": {
+          "content_type": "execution_output",
+          "text": ""
+        },
+        "metadata": {}
+      }
+    },
+    "node-system-error": {
+      "id": "node-system-error",
+      "parent": "node-exec-output-empty",
+      "children": ["node-tether-spinner"],
+      "message": {
+        "id": "node-system-error",
+        "author": {"role": "tool", "name": "web", "metadata": {}},
+        "create_time": 1704067620.0,
+        "content": {
+          "content_type": "system_error",
+          "name": "tool_error",
+          "text": "Error: Error from browse service: Error calling browse service: 503"
+        },
+        "metadata": {}
+      }
+    },
+    "node-tether-spinner": {
+      "id": "node-tether-spinner",
+      "parent": "node-system-error",
+      "children": [],
+      "message": {
+        "id": "node-tether-spinner",
+        "author": {"role": "tool", "name": "file_search", "metadata": {}},
+        "create_time": 1704067630.0,
+        "content": {
+          "content_type": "tether_browsing_display",
+          "result": "",
+          "summary": "",
+          "assets": null,
+          "tether_id": null
+        },
+        "metadata": {"command": "spinner", "status": "running"}
+      }
     }
   }
 }
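The fixture above forms a single linear chain (root → node-uec → node-1 → … → node-tether-spinner), each node pointing at its children. A standalone sketch of how such a mapping can be linearised by following each node's first child — the function name and simplified node shapes here are illustrative, not the exporter's actual traversal:

```python
def walk(mapping, root_id="node-root"):
    """Return message-bearing node ids in conversation order."""
    order, node_id = [], root_id
    while node_id:
        node = mapping[node_id]
        if node.get("message"):  # root and hidden stubs may carry message: null
            order.append(node_id)
        children = node.get("children") or []
        node_id = children[0] if children else None  # linear chain: first child
    return order


mapping = {
    "node-root": {"message": None, "children": ["a"]},
    "a": {"message": {"id": "a"}, "children": ["b"]},
    "b": {"message": {"id": "b"}, "children": []},
}
print(walk(mapping))  # → ['a', 'b']
```

Real exports can branch (message regenerations create multiple children), so a production traversal would follow `current_node` back to the root instead of always taking the first child.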
9  tests/fixtures/claude_conversation.json  (vendored)
@@ -30,6 +30,15 @@
       "sender": "human",
       "created_at": "2024-06-10T14:45:00.000Z",
       "content": "Thank you, that helped!"
+    },
+    {
+      "uuid": "msg-004",
+      "sender": "human",
+      "created_at": "2024-06-10T14:50:00.000Z",
+      "content": [
+        {"type": "text", "text": "What about this image?"},
+        {"type": "image", "source": {"file_uuid": "claude-image-uuid-1", "media_type": "image/png"}}
+      ]
     }
   ]
 }
176  tests/test_cli.py  (new file)
@@ -0,0 +1,176 @@
+"""CLI-level tests using Click's CliRunner — no live API calls required."""
+
+import pytest
+from click.testing import CliRunner
+
+from src.cache import Cache
+from src.main import _filter_by_project, cli
+
+
+# ---------------------------------------------------------------------------
+# _filter_by_project (T-27)
+# ---------------------------------------------------------------------------
+
+
+class TestFilterByProject:
+    """Unit tests for the project filter logic used by export/list/joplin."""
+
+    # ChatGPT conversations use the _project_name annotation key
+    def _chatgpt(self, conv_id, project_name):
+        return {"id": conv_id, "_project_name": project_name}
+
+    # Claude conversations use the project dict key
+    def _claude(self, conv_id, project_name):
+        proj = {"name": project_name} if project_name else None
+        return {"id": conv_id, "project": proj}
+
+    def test_none_filter_keeps_no_project_chatgpt(self):
+        convs = [self._chatgpt("a", None), self._chatgpt("b", "Python Course")]
+        result = _filter_by_project(convs, "none")
+        assert len(result) == 1
+        assert result[0]["id"] == "a"
+
+    def test_none_filter_keeps_no_project_claude(self):
+        convs = [self._claude("a", None), self._claude("b", "Python Course")]
+        result = _filter_by_project(convs, "none")
+        assert len(result) == 1
+        assert result[0]["id"] == "a"
+
+    def test_name_filter_case_insensitive(self):
+        convs = [
+            self._chatgpt("a", "Python Course"),
+            self._chatgpt("b", "Java Course"),
+            self._chatgpt("c", None),
+        ]
+        result = _filter_by_project(convs, "PYTHON")
+        assert len(result) == 1
+        assert result[0]["id"] == "a"
+
+    def test_name_filter_substring_match(self):
+        convs = [
+            self._chatgpt("a", "Python Advanced Course"),
+            self._chatgpt("b", "Python Basics"),
+            self._chatgpt("c", "JavaScript"),
+        ]
+        result = _filter_by_project(convs, "python")
+        assert len(result) == 2
+        assert {c["id"] for c in result} == {"a", "b"}
+
+    def test_no_matches_returns_empty(self):
+        convs = [self._chatgpt("a", "Python Course"), self._chatgpt("b", None)]
+        result = _filter_by_project(convs, "ruby")
+        assert result == []
+
+    def test_none_filter_excludes_all_with_projects(self):
+        convs = [self._chatgpt("a", "Project A"), self._chatgpt("b", "Project B")]
+        result = _filter_by_project(convs, "none")
+        assert result == []
+
+    def test_empty_string_project_treated_as_no_project(self):
+        convs = [{"id": "a", "_project_name": ""}, {"id": "b", "_project_name": "Real"}]
+        result = _filter_by_project(convs, "none")
+        assert len(result) == 1
+        assert result[0]["id"] == "a"
+
+    def test_claude_project_string_matched(self):
+        # Claude can also have project as a plain string
+        convs = [{"id": "a", "project": "python-course"}, {"id": "b", "project": None}]
+        result = _filter_by_project(convs, "python")
+        assert len(result) == 1
+        assert result[0]["id"] == "a"
+
+
+# ---------------------------------------------------------------------------
+# export --since validation (T-25)
+# ---------------------------------------------------------------------------
+
+
+class TestExportSinceValidation:
+    """Test that --since with an invalid date exits cleanly with an error message."""
+
+    def _pre_populated_cache(self, tmp_path) -> Cache:
+        """Create a cache that passes the ToS gate and first-run doctor check."""
+        cache = Cache(tmp_path)
+        cache.acknowledge_tos()
+        cache.mark_exported("chatgpt", "dummy-conv", {"updated_at": "2024-01-01T00:00:00Z"})
+        return cache
+
+    def test_invalid_since_date_exits_with_error(self, tmp_path):
+        self._pre_populated_cache(tmp_path)
+
+        runner = CliRunner(mix_stderr=True)
+        result = runner.invoke(
+            cli,
+            ["--no-log-file", "export", "--since", "notadate"],
+            env={
+                "CHATGPT_SESSION_TOKEN": "eyJtesttoken",
+                "CACHE_DIR": str(tmp_path),
+                "EXPORT_DIR": str(tmp_path / "exports"),
+            },
+        )
+        assert result.exit_code == 1
+        assert "Invalid --since date" in result.output
+        assert "YYYY-MM-DD" in result.output
+
+    def test_valid_since_date_does_not_error(self, tmp_path):
+        """A valid date should not produce the invalid-date error (may fail later on API)."""
+        self._pre_populated_cache(tmp_path)
+
+        runner = CliRunner(mix_stderr=True)
+        result = runner.invoke(
+            cli,
+            ["--no-log-file", "export", "--since", "2024-01-01"],
+            env={
+                "CHATGPT_SESSION_TOKEN": "eyJtesttoken",
+                "CACHE_DIR": str(tmp_path),
+                "EXPORT_DIR": str(tmp_path / "exports"),
+            },
+        )
+        assert "Invalid --since date" not in result.output
+
+
+# ---------------------------------------------------------------------------
+# LossReport summary
+# ---------------------------------------------------------------------------
+
+
+class TestLossReportSummary:
+    """The LossReport's format_summary() pinned format covers zero, top-5, and overflow cases."""
+
+    def test_zero_summary_uses_none_sentinel(self):
+        from src.loss_report import LossReport
+
+        report = LossReport()
+        out = report.format_summary()
+        assert "[export] Run summary:" in out
+        assert "conversations: 0" in out
+        assert "messages rendered: 0" in out
+        # Both "(none)" sentinels present — never empty parens
+        assert out.count("(none)") == 2
+
+    def test_top_5_breakdown(self):
+        from src.loss_report import LossReport
+
+        report = LossReport()
+        for raw_type in ("a", "b", "c", "d", "e", "f", "g"):
+            report.record_unknown(raw_type)
+            if raw_type == "a":
+                # Make 'a' the most common
+                for _ in range(4):
+                    report.record_unknown("a")
+        out = report.format_summary()
+        # Top entry shown
+        assert "a=5" in out
+        # Overflow line present (7 types, top 5 + 2 more)
+        assert "+ 2 more types" in out
+
+    def test_messages_and_conversations_recorded(self):
+        from src.loss_report import LossReport
+
+        report = LossReport()
+        report.record_conversation()
+        report.record_message()
+        report.record_message()
+        out = report.format_summary()
+        assert "conversations: 1" in out
+        assert "messages rendered: 2" in out
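The top-5-plus-overflow behaviour those tests pin down can be sketched with a plain `Counter`. This is an illustrative stand-in for the tally part of `format_summary()`, not the real LossReport output format:

```python
from collections import Counter


def summarize(unknown_types, limit=5):
    """Render a top-N tally like 'a=5, b=1, ... + 2 more types'."""
    counts = Counter(unknown_types)
    top = counts.most_common(limit)  # sorted by count, ties keep insertion order
    parts = [f"{t}={n}" for t, n in top]
    line = ", ".join(parts) if parts else "(none)"  # never emit empty parens
    extra = len(counts) - len(top)
    if extra > 0:
        line += f" + {extra} more types"
    return line


print(summarize(["a"] * 5 + list("bcdefg")))
```

With seven distinct types the top five are listed and the remaining two collapse into the `+ 2 more types` overflow, matching the assertions in `test_top_5_breakdown`.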
56  tests/test_config.py  (new file)
@@ -0,0 +1,56 @@
+"""Tests for src/config.py — token validation logic (T-14)."""
+
+import logging
+import time
+
+import jwt
+import pytest
+
+from src.config import _validate_chatgpt_token
+
+
+class TestValidateChatGPTToken:
+    def test_expired_token_logs_warning(self, caplog):
+        # T-14: expired JWT must produce a clear warning
+        payload = {"exp": int(time.time()) - 3600}  # expired 1 hour ago
+        token = jwt.encode(payload, "secret", algorithm="HS256")
+        with caplog.at_level(logging.WARNING, logger="src.config"):
+            result = _validate_chatgpt_token(token)
+        assert any("expired" in r.message.lower() for r in caplog.records)
+        assert result is not None  # still returns the expiry datetime
+
+    def test_expiring_within_24h_logs_warning(self, caplog):
+        payload = {"exp": int(time.time()) + 3600}  # expires in 1 hour
+        token = jwt.encode(payload, "secret", algorithm="HS256")
+        with caplog.at_level(logging.WARNING, logger="src.config"):
+            _validate_chatgpt_token(token)
+        assert any("less than 24 hours" in r.message for r in caplog.records)
+
+    def test_valid_token_no_expiry_warning(self, caplog):
+        payload = {"exp": int(time.time()) + 86400 * 5}  # valid for 5 days
+        token = jwt.encode(payload, "secret", algorithm="HS256")
+        with caplog.at_level(logging.WARNING, logger="src.config"):
+            result = _validate_chatgpt_token(token)
+        assert not any("expired" in r.message.lower() for r in caplog.records)
+        assert result is not None
+
+    def test_token_without_exp_claim_logs_warning(self, caplog):
+        payload = {"sub": "user123"}  # no exp
+        token = jwt.encode(payload, "secret", algorithm="HS256")
+        with caplog.at_level(logging.WARNING, logger="src.config"):
+            result = _validate_chatgpt_token(token)
+        assert any("'exp'" in r.message or "no 'exp'" in r.message for r in caplog.records)
+        assert result is None
+
+    def test_jwe_encrypted_token_returns_none(self, caplog):
+        # JWE tokens (alg=dir) cannot be decoded client-side — this is normal for ChatGPT
+        jwe_like = "eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIn0.fake.token.data.here"
+        with caplog.at_level(logging.DEBUG, logger="src.config"):
+            result = _validate_chatgpt_token(jwe_like)
+        assert result is None  # cannot decode, but not an error
+
+    def test_non_jwt_string_logs_warning(self, caplog):
+        with caplog.at_level(logging.WARNING, logger="src.config"):
+            result = _validate_chatgpt_token("notajwttoken")
+        assert any("does not look like a JWT" in r.message for r in caplog.records)
+        assert result is None
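The validation these tests exercise hinges on reading the JWT payload's `exp` claim without verifying the signature, and on giving up gracefully for JWE chunks and non-JWT strings. A standalone sketch of that idea using only the stdlib — `jwt_expiry` and `fake_jwt` are illustrative helpers, not the real `_validate_chatgpt_token`:

```python
import base64
import json


def jwt_expiry(token):
    """Return the 'exp' claim of a JWT, or None if the token can't be decoded."""
    try:
        payload_b64 = token.split(".")[1]
        payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64url padding
        payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    except Exception:
        # JWE session-token chunks and random strings land here — not an error.
        return None
    return payload.get("exp")


def fake_jwt(payload):
    # Build an unsigned JWT-shaped string for demonstration only.
    seg = lambda d: base64.urlsafe_b64encode(
        json.dumps(d).encode()
    ).rstrip(b"=").decode()
    return f"{seg({'alg': 'HS256'})}.{seg(payload)}.sig"


print(jwt_expiry(fake_jwt({"exp": 123})))  # → 123
print(jwt_expiry("notajwttoken"))          # → None
```

The real validator additionally compares `exp` against the current time to emit the "expired" and "less than 24 hours" warnings asserted above.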
@@ -1,4 +1,4 @@
|
|||||||
"""Unit tests for src/exporters/."""
|
"""Unit tests for src/exporters/ and src/blocks.py."""
|
||||||
|
|
||||||
import json
|
import json
|
||||||
import os
|
import os
|
||||||
@@ -7,6 +7,23 @@ from pathlib import Path
|
|||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
from src.blocks import (
|
||||||
|
BLOCK_TYPE_TEXT,
|
||||||
|
UNKNOWN_REASON_EXTRACTION_FAILED,
|
||||||
|
UNKNOWN_REASON_UNKNOWN_TYPE,
|
||||||
|
_blockquote_prefix,
|
||||||
|
_safe_fence,
|
||||||
|
make_code_block,
|
||||||
|
make_file_placeholder,
|
||||||
|
make_hidden_context_marker,
|
||||||
|
make_image_placeholder,
|
||||||
|
make_text_block,
|
||||||
|
make_thinking_block,
|
||||||
|
make_tool_result_block,
|
||||||
|
make_tool_use_block,
|
||||||
|
make_unknown_block,
|
||||||
|
render_blocks_to_markdown,
|
||||||
|
)
|
||||||
from src.exporters.markdown import MarkdownExporter, _yaml_escape, _format_timestamp
|
from src.exporters.markdown import MarkdownExporter, _yaml_escape, _format_timestamp
|
||||||
from src.exporters.json_export import JSONExporter
|
from src.exporters.json_export import JSONExporter
|
||||||
|
|
||||||
@@ -122,7 +139,7 @@ class TestMarkdownFilenameGeneration:
|
|||||||
def test_year_in_path(self, tmp_path):
|
def test_year_in_path(self, tmp_path):
|
||||||
exp = MarkdownExporter(tmp_path)
|
exp = MarkdownExporter(tmp_path)
|
||||||
path = exp.export(SAMPLE_CONV)
|
path = exp.export(SAMPLE_CONV)
|
||||||
assert "/2024/" in str(path)
|
assert ".2024/" in str(path)
|
||||||
|
|
||||||
def test_output_structure_provider_project(self, tmp_path):
|
def test_output_structure_provider_project(self, tmp_path):
|
||||||
exp = MarkdownExporter(tmp_path, output_structure="provider/project")
|
exp = MarkdownExporter(tmp_path, output_structure="provider/project")
|
||||||
@@ -199,6 +216,34 @@ class TestJSONExporter:
|
|||||||
assert " " in raw
|
assert " " in raw
|
||||||
|
|
||||||
|
|
||||||
|
class TestBothFormats:
|
||||||
|
"""T-38: Markdown and JSON exporters produce matching filenames for the same conversation."""
|
||||||
|
|
||||||
|
def test_both_formats_produce_files(self, tmp_path):
|
||||||
|
md_exp = MarkdownExporter(tmp_path)
|
||||||
|
json_exp = JSONExporter(tmp_path)
|
||||||
|
md_path = md_exp.export(SAMPLE_CONV)
|
||||||
|
json_path = json_exp.export(SAMPLE_CONV)
|
||||||
|
assert md_path.exists()
|
||||||
|
assert json_path.exists()
|
||||||
|
|
||||||
|
def test_both_formats_have_matching_stems(self, tmp_path):
|
||||||
|
md_exp = MarkdownExporter(tmp_path)
|
||||||
|
json_exp = JSONExporter(tmp_path)
|
||||||
|
md_path = md_exp.export(SAMPLE_CONV)
|
||||||
|
json_path = json_exp.export(SAMPLE_CONV)
|
||||||
|
assert md_path.suffix == ".md"
|
||||||
|
assert json_path.suffix == ".json"
|
||||||
|
assert md_path.stem == json_path.stem
|
||||||
|
|
||||||
|
def test_both_formats_same_directory(self, tmp_path):
|
||||||
|
md_exp = MarkdownExporter(tmp_path)
|
||||||
|
json_exp = JSONExporter(tmp_path)
|
||||||
|
md_path = md_exp.export(SAMPLE_CONV)
|
||||||
|
json_path = json_exp.export(SAMPLE_CONV)
|
||||||
|
assert md_path.parent == json_path.parent
|
||||||
|
|
||||||
|
|
||||||
class TestYamlEscape:
|
class TestYamlEscape:
|
||||||
def test_escapes_double_quotes(self):
|
def test_escapes_double_quotes(self):
|
||||||
assert _yaml_escape('Say "hello"') == 'Say \\"hello\\"'
|
assert _yaml_escape('Say "hello"') == 'Say \\"hello\\"'
|
||||||
@@ -222,3 +267,271 @@ class TestFormatTimestamp:
|
|||||||
|
|
||||||
def test_empty_string(self):
|
def test_empty_string(self):
|
||||||
assert _format_timestamp("") == ""
|
assert _format_timestamp("") == ""
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Block helpers and rendering
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
class TestSafeFence:
|
||||||
|
def test_minimum_three_backticks(self):
|
||||||
|
assert _safe_fence("plain text") == "```"
|
||||||
|
|
||||||
|
def test_four_backticks_when_three_in_content(self):
|
||||||
|
assert _safe_fence("here ``` is a fence") == "````"
|
||||||
|
|
||||||
|
def test_five_backticks_when_four_in_content(self):
|
||||||
|
assert _safe_fence("here ```` is four") == "`````"
|
||||||
|
|
||||||
|
def test_handles_empty_string(self):
|
||||||
|
assert _safe_fence("") == "```"
|
||||||
|
|
||||||
|
def test_handles_run_at_end(self):
|
||||||
|
# Trailing run still counted
|
||||||
|
assert _safe_fence("text ending in ```") == "````"
|
||||||
|
|
||||||
|
|
||||||
|
class TestBlockquotePrefix:
|
||||||
|
def test_single_line(self):
|
||||||
|
assert _blockquote_prefix("hello") == "> hello"
|
||||||
|
|
||||||
|
def test_multi_line(self):
|
||||||
|
assert _blockquote_prefix("a\nb\nc") == "> a\n> b\n> c"
|
||||||
|
|
||||||
|
def test_empty_lines_become_naked_quote_marker(self):
|
||||||
|
assert _blockquote_prefix("a\n\nb") == "> a\n>\n> b"
|
||||||
|
|
||||||
|
def test_empty_string(self):
|
||||||
|
assert _blockquote_prefix("") == ">"
|
||||||
|
|
||||||
|
|
||||||
|
class TestBlockConstructors:
|
||||||
|
def test_make_text_block_returns_none_for_empty(self):
|
||||||
|
assert make_text_block("") is None
|
||||||
|
assert make_text_block(" ") is None
|
||||||
|
|
||||||
|
def test_make_text_block_returns_dict(self):
|
||||||
|
b = make_text_block("hello")
|
||||||
|
assert b == {"type": "text", "text": "hello"}
|
||||||
|
|
||||||
|
def test_make_code_block_returns_none_for_empty(self):
|
||||||
|
assert make_code_block("") is None
|
||||||
|
|
||||||
|
def test_make_thinking_block_returns_none_for_empty(self):
|
||||||
|
assert make_thinking_block("") is None
|
||||||
|
|
||||||
|
|
||||||
|
class TestRenderBlocks:
    def test_text_block_renders_as_paragraph(self):
        out = render_blocks_to_markdown([make_text_block("Hello world")])
        assert out == "Hello world"

    def test_blocks_separated_by_blank_line(self):
        out = render_blocks_to_markdown(
            [make_text_block("first"), make_text_block("second")]
        )
        assert out == "first\n\nsecond"

    def test_code_block_with_language(self):
        out = render_blocks_to_markdown([make_code_block("print(1)", language="python")])
        assert "```python" in out
        assert "print(1)" in out

    def test_thinking_block_uses_blockquote(self):
        out = render_blocks_to_markdown([make_thinking_block("step 1\nstep 2")])
        assert "**💭 Reasoning**" in out
        assert "> step 1" in out
        assert "> step 2" in out

    def test_tool_use_renders_as_blockquote_with_safe_fence(self):
        out = render_blocks_to_markdown(
            [make_tool_use_block("search", {"query": "test"})]
        )
        assert "> 🔧 **Tool: search**" in out
        # Every line of the body is blockquote-prefixed
        assert "> ```json" in out
        assert "> }" in out

    def test_tool_use_with_multiline_input(self):
        out = render_blocks_to_markdown(
            [make_tool_use_block("complex", {"a": 1, "b": [{"x": "y"}]})]
        )
        # Prefix every line of multi-line JSON
        for line in out.split("\n"):
            assert line.startswith(">") or line == ""

    def test_tool_result_success_uses_outbox_icon(self):
        out = render_blocks_to_markdown([make_tool_result_block("OK")])
        assert "📤 **Result**" in out
        assert "❌" not in out

    def test_tool_result_error_uses_x_icon(self):
        out = render_blocks_to_markdown([make_tool_result_block("oops", is_error=True)])
        assert "❌ **Result (error)**" in out
        assert "📤" not in out

    def test_tool_result_with_tool_name_in_header(self):
        out = render_blocks_to_markdown(
            [make_tool_result_block("done", tool_name="container.exec")]
        )
        assert "📤 **Result: container.exec**" in out

    def test_tool_result_error_with_tool_name(self):
        out = render_blocks_to_markdown(
            [make_tool_result_block("503", tool_name="web", is_error=True)]
        )
        assert "❌ **Result (error): web**" in out

    def test_tool_result_summary_renders_as_italic_line(self):
        out = render_blocks_to_markdown(
            [
                make_tool_result_block(
                    "output",
                    tool_name="container.exec",
                    summary="Reading skill documentation",
                )
            ]
        )
        # Summary line is italic, lives between header and fence,
        # all inside the blockquote prefix.
        assert "> *Reading skill documentation*" in out
        # Order: header before summary before fence
        header_idx = out.index("Result: container.exec")
        summary_idx = out.index("Reading skill documentation")
        fence_idx = out.index("output")
        assert header_idx < summary_idx < fence_idx

    def test_image_placeholder_rendering(self):
        out = render_blocks_to_markdown(
            [make_image_placeholder(ref="file-123", source="user_upload")]
        )
        assert "🖼️ **Image attached**" in out
        assert "`file-123`" in out
        assert "user_upload" in out
        assert "content not preserved" in out

    def test_file_placeholder_with_metadata(self):
        out = render_blocks_to_markdown(
            [make_file_placeholder(ref="sediment://x", mime="audio/wav", size_bytes=10240, duration_seconds=2.5)]
        )
        assert "📎 **File attached**" in out
        assert "audio/wav" in out
        assert "KB" in out
        assert "2.50s" in out

    def test_unknown_block_renders_with_keys(self):
        out = render_blocks_to_markdown(
            [
                make_unknown_block(
                    raw_type="future_x",
                    observed_keys=["foo", "bar"],
                    reason=UNKNOWN_REASON_UNKNOWN_TYPE,
                )
            ]
        )
        assert "⚠️ **Unsupported content**" in out
        assert "future_x" in out
        assert "`foo`" in out
        assert "`bar`" in out

    def test_unknown_extraction_failed_includes_summary(self):
        out = render_blocks_to_markdown(
            [
                make_unknown_block(
                    raw_type="audio_transcription",
                    observed_keys=["asset_pointer"],
                    reason=UNKNOWN_REASON_EXTRACTION_FAILED,
                    summary="expected key 'text' not found",
                )
            ]
        )
        assert "extraction_failed" in out
        assert "expected key 'text' not found" in out

    def test_hidden_context_marker(self):
        out = render_blocks_to_markdown(
            [make_hidden_context_marker("user_editable_context")]
        )
        assert "ℹ️ **Hidden context**" in out
        assert "`user_editable_context`" in out

    def test_safe_fence_prevents_runaway_code_block(self):
        # Content contains an unbalanced opening fence — without _safe_fence
        # this would corrupt downstream rendering.
        evil_content = "before\n```Follow\ntext\nraw is: \"```"
        block = make_code_block(evil_content)
        out = render_blocks_to_markdown([block, make_text_block("after")])
        # The 4-backtick wrap should be present
        assert "````" in out
        # The "after" text should appear OUTSIDE any code block — it follows
        # the closing ```` fence.
        assert out.endswith("after")

    def test_block_order_preserved(self):
        blocks = [
            make_text_block("a"),
            make_image_placeholder(ref="r1", source="user_upload"),
            make_text_block("b"),
        ]
        out = render_blocks_to_markdown(blocks)
        assert out.index("a") < out.index("Image attached")
        assert out.index("Image attached") < out.index("b")

# ---------------------------------------------------------------------------
# Markdown exporter with blocks
# ---------------------------------------------------------------------------


SAMPLE_CONV_BLOCKS = {
    "id": "blocks12345",
    "title": "Blocks Conversation",
    "provider": "claude",
    "project": None,
    "created_at": "2024-06-10T14:32:00Z",
    "updated_at": "2024-06-10T15:00:00Z",
    "message_count": 1,
    "messages": [
        {
            "role": "assistant",
            "content_type": "text",
            "timestamp": None,
            "blocks": [
                {"type": "text", "text": "Here is the answer."},
                {"type": "tool_use", "name": "search", "input": {"q": "x"}, "tool_id": "t1"},
            ],
        }
    ],
}

class TestMarkdownExporterWithBlocks:
    def test_renders_blocks(self, tmp_path):
        exp = MarkdownExporter(tmp_path)
        path = exp.export(SAMPLE_CONV_BLOCKS)
        body = path.read_text()
        assert "Here is the answer." in body
        assert "🔧 **Tool: search**" in body

    def test_falls_back_to_content_when_blocks_missing(self, tmp_path):
        # Backward-compat: messages with `content` only (no `blocks`) still render.
        exp = MarkdownExporter(tmp_path)
        path = exp.export(SAMPLE_CONV)  # SAMPLE_CONV has content only, no blocks
        body = path.read_text()
        assert "Hello, how are you?" in body

    def test_skips_messages_with_neither_blocks_nor_content(self, tmp_path):
        conv = {
            **SAMPLE_CONV_BLOCKS,
            "messages": [
                {"role": "user", "content_type": "text", "timestamp": None, "blocks": []},
                {"role": "assistant", "content_type": "text", "timestamp": None, "blocks": [
                    {"type": "text", "text": "I am here."}
                ]},
            ],
        }
        exp = MarkdownExporter(tmp_path)
        path = exp.export(conv)
        body = path.read_text()
        assert "I am here." in body
@@ -5,7 +5,7 @@ from unittest.mock import MagicMock, patch
 import pytest
 import requests
 
-from src.joplin import JoplinClient, JoplinError, _http_error_message, _timeout_message, notebook_title
+from src.joplin import JoplinClient, JoplinError, _http_error_message, _timeout_message, notebook_path
 
 
 # ---------------------------------------------------------------------------
@@ -31,25 +31,29 @@ def _mock_response(json_data=None, text="", status_code=200):
 
 
 # ---------------------------------------------------------------------------
-# notebook_title helper
+# notebook_path helper
 # ---------------------------------------------------------------------------
 
 
-class TestNotebookTitle:
+class TestNotebookPath:
     def test_no_project(self):
-        assert notebook_title("chatgpt", None) == "ChatGPT - No Project"
+        assert notebook_path("chatgpt", None) == ("AI-ChatGPT", "No Project")
 
     def test_no_project_string(self):
-        assert notebook_title("chatgpt", "no-project") == "ChatGPT - No Project"
+        assert notebook_path("chatgpt", "no-project") == ("AI-ChatGPT", "No Project")
 
     def test_project_with_hyphens(self):
-        assert notebook_title("chatgpt", "my-project") == "ChatGPT - My Project"
+        assert notebook_path("chatgpt", "my-project") == ("AI-ChatGPT", "My Project")
 
     def test_claude_provider(self):
-        assert notebook_title("claude", "budget-tracker") == "Claude - Budget Tracker"
+        assert notebook_path("claude", "budget-tracker") == ("AI-Claude", "Budget Tracker")
 
     def test_multi_word_project(self):
-        assert notebook_title("claude", "ai-research-notes") == "Claude - Ai Research Notes"
+        assert notebook_path("claude", "ai-research-notes") == ("AI-Claude", "Ai Research Notes")
 
+    def test_returns_tuple(self):
+        result = notebook_path("chatgpt", "some-project")
+        assert isinstance(result, tuple) and len(result) == 2
+
 
 # ---------------------------------------------------------------------------
@@ -236,18 +240,30 @@ class TestListNotebooks:
 
 
 class TestGetOrCreateNotebook:
-    def test_returns_existing_notebook_id(self):
+    def test_returns_existing_root_notebook_id(self):
         client = _make_client()
         with patch("requests.get") as mock_get:
             mock_get.return_value = _mock_response(
                 json_data={
-                    "items": [{"id": "nb-existing", "title": "ChatGPT - No Project"}],
+                    "items": [{"id": "nb-existing", "title": "AI-ChatGPT", "parent_id": ""}],
                     "has_more": False,
                 }
             )
-            nb_id = client.get_or_create_notebook("ChatGPT - No Project")
+            nb_id = client.get_or_create_notebook("AI-ChatGPT")
             assert nb_id == "nb-existing"
 
+    def test_returns_existing_child_notebook_id(self):
+        client = _make_client()
+        with patch("requests.get") as mock_get:
+            mock_get.return_value = _mock_response(
+                json_data={
+                    "items": [{"id": "nb-child", "title": "No Project", "parent_id": "nb-parent"}],
+                    "has_more": False,
+                }
+            )
+            nb_id = client.get_or_create_notebook("No Project", parent_id="nb-parent")
+            assert nb_id == "nb-child"
+
     def test_creates_new_notebook_when_not_found(self):
         client = _make_client()
         with patch("requests.get") as mock_get, patch("requests.post") as mock_post:
@@ -255,26 +271,103 @@ class TestGetOrCreateNotebook:
                 json_data={"items": [], "has_more": False}
             )
             mock_post.return_value = _mock_response(
-                json_data={"id": "nb-new", "title": "ChatGPT - New Project"}
+                json_data={"id": "nb-new", "title": "AI-ChatGPT"}
             )
-            nb_id = client.get_or_create_notebook("ChatGPT - New Project")
+            nb_id = client.get_or_create_notebook("AI-ChatGPT")
             assert nb_id == "nb-new"
             mock_post.assert_called_once()
 
+    def test_creates_child_notebook_with_parent_id(self):
+        client = _make_client()
+        with patch("requests.get") as mock_get, patch("requests.post") as mock_post:
+            mock_get.return_value = _mock_response(
+                json_data={"items": [], "has_more": False}
+            )
+            mock_post.return_value = _mock_response(
+                json_data={"id": "nb-child", "title": "My Project"}
+            )
+            nb_id = client.get_or_create_notebook("My Project", parent_id="nb-parent")
+            assert nb_id == "nb-child"
+            _, kwargs = mock_post.call_args
+            assert kwargs["json"]["parent_id"] == "nb-parent"
+
+    def test_does_not_include_parent_id_for_root(self):
+        client = _make_client()
+        with patch("requests.get") as mock_get, patch("requests.post") as mock_post:
+            mock_get.return_value = _mock_response(json_data={"items": [], "has_more": False})
+            mock_post.return_value = _mock_response(json_data={"id": "nb-root", "title": "AI-Claude"})
+            client.get_or_create_notebook("AI-Claude")
+            _, kwargs = mock_post.call_args
+            assert "parent_id" not in kwargs["json"]
+
     def test_caches_notebook_after_first_load(self):
         client = _make_client()
         with patch("requests.get") as mock_get:
             mock_get.return_value = _mock_response(
                 json_data={
-                    "items": [{"id": "nb1", "title": "Claude - No Project"}],
+                    "items": [{"id": "nb1", "title": "AI-Claude", "parent_id": ""}],
                     "has_more": False,
                 }
             )
             # Call twice — GET /folders should only happen once
-            client.get_or_create_notebook("Claude - No Project")
-            client.get_or_create_notebook("Claude - No Project")
+            client.get_or_create_notebook("AI-Claude")
+            client.get_or_create_notebook("AI-Claude")
             assert mock_get.call_count == 1
 
+    def test_different_parent_ids_are_distinct_cache_entries(self):
+        """Same title under different parents are different notebooks."""
+        client = _make_client()
+        with patch("requests.get") as mock_get:
+            mock_get.return_value = _mock_response(
+                json_data={
+                    "items": [
+                        {"id": "nb-a", "title": "No Project", "parent_id": "parent-chatgpt"},
+                        {"id": "nb-b", "title": "No Project", "parent_id": "parent-claude"},
+                    ],
+                    "has_more": False,
+                }
+            )
+            id_a = client.get_or_create_notebook("No Project", parent_id="parent-chatgpt")
+            id_b = client.get_or_create_notebook("No Project", parent_id="parent-claude")
+            assert id_a == "nb-a"
+            assert id_b == "nb-b"
+
+
+class TestGetOrCreateNotebookPath:
+    def test_creates_two_level_path(self):
+        client = _make_client()
+        with patch("requests.get") as mock_get, patch("requests.post") as mock_post:
+            mock_get.return_value = _mock_response(json_data={"items": [], "has_more": False})
+            mock_post.side_effect = [
+                _mock_response(json_data={"id": "nb-parent", "title": "AI-ChatGPT"}),
+                _mock_response(json_data={"id": "nb-child", "title": "No Project"}),
+            ]
+            leaf_id = client.get_or_create_notebook_path(["AI-ChatGPT", "No Project"])
+            assert leaf_id == "nb-child"
+            assert mock_post.call_count == 2
+            # Second POST should use the parent's ID
+            _, kwargs = mock_post.call_args_list[1]
+            assert kwargs["json"]["parent_id"] == "nb-parent"
+
+    def test_reuses_existing_parent_for_new_child(self):
+        client = _make_client()
+        with patch("requests.get") as mock_get, patch("requests.post") as mock_post:
+            mock_get.return_value = _mock_response(
+                json_data={
+                    "items": [{"id": "nb-parent", "title": "AI-Claude", "parent_id": ""}],
+                    "has_more": False,
+                }
+            )
+            mock_post.return_value = _mock_response(
+                json_data={"id": "nb-child", "title": "Budget Tracker"}
+            )
+            leaf_id = client.get_or_create_notebook_path(["AI-Claude", "Budget Tracker"])
+            assert leaf_id == "nb-child"
+            # Only one POST — the parent already existed
+            assert mock_post.call_count == 1
+            _, kwargs = mock_post.call_args
+            assert kwargs["json"]["parent_id"] == "nb-parent"
+
 
 # ---------------------------------------------------------------------------
 # create_note
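The `TestGetOrCreateNotebookPath` cases above suggest the path helper is a simple fold over the title list. A hypothetical standalone sketch, with the per-level notebook lookup abstracted as a callable (the real method lives on `JoplinClient` and calls `get_or_create_notebook`):

```python
def get_or_create_notebook_path(get_or_create, titles):
    # Create or look up each level under the previous one and return
    # the leaf notebook's ID; the first level has no parent.
    parent_id = None
    for title in titles:
        parent_id = get_or_create(title, parent_id=parent_id)
    return parent_id
```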
|||||||
@@ -1,19 +1,53 @@
|
|||||||
"""Unit tests for src/providers/ using fixture files."""
|
"""Unit tests for src/providers/ using fixture files."""
|
||||||
|
|
||||||
import json
|
import json
|
||||||
|
import logging
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
from src.blocks import (
|
||||||
|
BLOCK_TYPE_FILE_PLACEHOLDER,
|
||||||
|
BLOCK_TYPE_HIDDEN_CONTEXT_MARKER,
|
||||||
|
BLOCK_TYPE_IMAGE_PLACEHOLDER,
|
||||||
|
BLOCK_TYPE_TEXT,
|
||||||
|
BLOCK_TYPE_THINKING,
|
||||||
|
BLOCK_TYPE_TOOL_RESULT,
|
||||||
|
BLOCK_TYPE_TOOL_USE,
|
||||||
|
BLOCK_TYPE_UNKNOWN,
|
||||||
|
render_blocks_to_markdown,
|
||||||
|
)
|
||||||
|
from src.loss_report import LossReport
|
||||||
|
|
||||||
FIXTURES = Path(__file__).parent / "fixtures"
|
FIXTURES = Path(__file__).parent / "fixtures"
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Helpers
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
def _block_types(message: dict) -> list[str]:
|
||||||
|
return [b.get("type") for b in (message.get("blocks") or [])]
|
||||||
|
|
||||||
|
|
||||||
|
def _first_block(message: dict, block_type: str) -> dict | None:
|
||||||
|
for b in message.get("blocks") or []:
|
||||||
|
if b.get("type") == block_type:
|
||||||
|
return b
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# ChatGPT
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
class TestChatGPTNormalization:
|
class TestChatGPTNormalization:
|
||||||
"""Test ChatGPTProvider.normalize_conversation() using fixture data."""
|
"""ChatGPT normalize_conversation block-extraction behavior."""
|
||||||
|
|
||||||
def _get_provider(self):
|
def _get_provider(self):
|
||||||
from src.providers.chatgpt import ChatGPTProvider
|
from src.providers.chatgpt import ChatGPTProvider
|
||||||
# Bypass __init__ token check
|
|
||||||
p = ChatGPTProvider.__new__(ChatGPTProvider)
|
p = ChatGPTProvider.__new__(ChatGPTProvider)
|
||||||
import requests
|
import requests
|
||||||
p._session = requests.Session()
|
p._session = requests.Session()
|
||||||
@@ -31,7 +65,6 @@ class TestChatGPTNormalization:
|
|||||||
assert result["id"] == "chatgpt-conv-001"
|
assert result["id"] == "chatgpt-conv-001"
|
||||||
assert result["title"] == "Python Async Tutorial"
|
assert result["title"] == "Python Async Tutorial"
|
||||||
assert result["provider"] == "chatgpt"
|
assert result["provider"] == "chatgpt"
|
||||||
# No entry in _project_map → project is None
|
|
||||||
assert result["project"] is None
|
assert result["project"] is None
|
||||||
assert result["created_at"] != ""
|
assert result["created_at"] != ""
|
||||||
assert result["updated_at"] != ""
|
assert result["updated_at"] != ""
|
||||||
@@ -46,7 +79,6 @@ class TestChatGPTNormalization:
|
|||||||
assert result["id"] == "chatgpt-conv-002"
|
assert result["id"] == "chatgpt-conv-002"
|
||||||
|
|
||||||
def test_normalizes_with_project_from_map(self):
|
def test_normalizes_with_project_from_map(self):
|
||||||
"""Project name from _project_map (populated by fetch_all_conversations) flows through."""
|
|
||||||
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
|
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
|
||||||
p = self._get_provider()
|
p = self._get_provider()
|
||||||
p._project_map["chatgpt-conv-001"] = "My Research Project"
|
p._project_map["chatgpt-conv-001"] = "My Research Project"
|
||||||
@@ -54,33 +86,134 @@ class TestChatGPTNormalization:
|
|||||||
|
|
||||||
assert result["project"] == "My Research Project"
|
assert result["project"] == "My Research Project"
|
||||||
|
|
||||||
def test_extracts_text_messages(self):
|
def test_text_message_emits_text_block(self):
|
||||||
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
|
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
|
||||||
p = self._get_provider()
|
p = self._get_provider()
|
||||||
result = p.normalize_conversation(raw)
|
result = p.normalize_conversation(raw)
|
||||||
|
|
||||||
assert len(result["messages"]) >= 2
|
|
||||||
user_msgs = [m for m in result["messages"] if m["role"] == "user"]
|
user_msgs = [m for m in result["messages"] if m["role"] == "user"]
|
||||||
assert any("async" in m["content"].lower() for m in user_msgs)
|
# The "How does async/await..." message
|
||||||
|
async_msgs = [
|
||||||
|
m for m in user_msgs
|
||||||
|
if any(
|
||||||
|
"async" in (b.get("text") or "").lower()
|
||||||
|
for b in (m.get("blocks") or [])
|
||||||
|
)
|
||||||
|
]
|
||||||
|
assert async_msgs, "expected a user message about async/await"
|
||||||
|
assert _block_types(async_msgs[0]) == [BLOCK_TYPE_TEXT]
|
||||||
|
|
||||||
def test_skips_non_text_content_with_warning(self, caplog):
|
def test_code_block_preserved_with_language(self):
|
||||||
import logging
|
|
||||||
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
|
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
|
||||||
p = self._get_provider()
|
p = self._get_provider()
|
||||||
with caplog.at_level(logging.WARNING):
|
|
||||||
result = p.normalize_conversation(raw)
|
result = p.normalize_conversation(raw)
|
||||||
# The fixture has an image_asset_pointer node — should be warned about
|
|
||||||
assert any(
|
assistant_msgs = [m for m in result["messages"] if m["role"] == "assistant"]
|
||||||
"image_asset_pointer" in r.message or "rich content" in r.message
|
# The first assistant message is the async/await answer with a python fence
|
||||||
for r in caplog.records
|
text_block = _first_block(assistant_msgs[0], BLOCK_TYPE_TEXT)
|
||||||
|
assert text_block is not None
|
||||||
|
assert "```python" in text_block["text"]
|
||||||
|
|
||||||
|
def test_multimodal_voice_user_message(self):
|
||||||
|
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
|
||||||
|
p = self._get_provider()
|
||||||
|
result = p.normalize_conversation(raw)
|
||||||
|
|
||||||
|
# node-mm-user: audio_transcription "What is the capital of France?"
|
||||||
|
# + real_time_user_audio_video_asset_pointer wrapping a sediment:// URL
|
||||||
|
capital_msgs = [
|
||||||
|
m for m in result["messages"]
|
||||||
|
if any(
|
||||||
|
"capital of france" in (b.get("text") or "").lower()
|
||||||
|
for b in (m.get("blocks") or [])
|
||||||
)
|
)
|
||||||
|
]
|
||||||
|
assert capital_msgs, "expected the audio_transcription text to surface"
|
||||||
|
types = _block_types(capital_msgs[0])
|
||||||
|
assert BLOCK_TYPE_TEXT in types
|
||||||
|
assert BLOCK_TYPE_FILE_PLACEHOLDER in types
|
||||||
|
|
||||||
|
file_block = _first_block(capital_msgs[0], BLOCK_TYPE_FILE_PLACEHOLDER)
|
||||||
|
assert file_block["ref"].startswith("sediment://")
|
||||||
|
assert file_block["mime"] == "audio/wav"
|
||||||
|
assert file_block["size_bytes"] == 50000
|
||||||
|
assert file_block["duration_seconds"] == pytest.approx(2.5)
|
||||||
|
|
||||||
|
def test_multimodal_voice_reverse_order_preserved(self):
|
||||||
|
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
|
||||||
|
p = self._get_provider()
|
||||||
|
result = p.normalize_conversation(raw)
|
||||||
|
|
||||||
|
# node-mm-user-rev has parts in REVERSE order: asset first, transcription second.
|
||||||
|
rev_msgs = [
|
||||||
|
m for m in result["messages"]
|
||||||
|
if any(
|
||||||
|
"tell me more" in (b.get("text") or "").lower()
|
||||||
|
for b in (m.get("blocks") or [])
|
||||||
|
)
|
||||||
|
]
|
||||||
|
assert rev_msgs, "expected the reverse-order voice message"
|
||||||
|
types = _block_types(rev_msgs[0])
|
||||||
|
# Order preserved: file_placeholder before text
|
||||||
|
assert types == [BLOCK_TYPE_FILE_PLACEHOLDER, BLOCK_TYPE_TEXT]
|
||||||
|
|
||||||
|
def test_image_only_user_message_renders(self):
|
||||||
|
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
|
||||||
|
p = self._get_provider()
|
||||||
|
result = p.normalize_conversation(raw)
|
||||||
|
|
||||||
|
image_msgs = [
|
||||||
|
m for m in result["messages"]
|
||||||
|
if any(b.get("type") == BLOCK_TYPE_IMAGE_PLACEHOLDER for b in (m.get("blocks") or []))
|
||||||
|
]
|
||||||
|
assert image_msgs, "image-only user message should now render"
|
||||||
|
|
||||||
|
def test_user_editable_context_emits_blocks(self):
|
||||||
|
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
|
||||||
|
p = self._get_provider()
|
||||||
|
result = p.normalize_conversation(raw)
|
||||||
|
|
||||||
|
# The user_editable_context message has user_profile + user_instructions.
|
||||||
|
# It should now appear (was silently dropped pre-v0.4.0).
|
||||||
|
uec_msgs = [
|
||||||
|
m for m in result["messages"]
|
||||||
|
if any(
|
||||||
|
"Custom Instructions" in (b.get("text") or "")
|
||||||
|
for b in (m.get("blocks") or [])
|
||||||
|
)
|
||||||
|
]
|
||||||
|
assert uec_msgs, "user_editable_context should be visible in output"
|
||||||
|
# Hidden context marker should be prepended.
|
||||||
|
assert uec_msgs[0]["blocks"][0]["type"] == BLOCK_TYPE_HIDDEN_CONTEXT_MARKER
|
||||||
|
|
||||||
|
def test_user_editable_context_uses_safe_fence(self):
|
||||||
|
"""The user_instructions value contains embedded triple-backticks; the rendered
|
||||||
|
Markdown must use a fence longer than 3 backticks so embedded fences are inert.
|
||||||
|
"""
|
||||||
|
from src.blocks import render_blocks_to_markdown
|
||||||
|
|
||||||
|
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
|
||||||
|
p = self._get_provider()
|
||||||
|
result = p.normalize_conversation(raw)
|
||||||
|
|
||||||
|
uec_msgs = [
|
||||||
|
m for m in result["messages"]
|
||||||
|
if any(
|
||||||
|
"Custom Instructions" in (b.get("text") or "")
|
||||||
|
for b in (m.get("blocks") or [])
|
||||||
|
)
|
||||||
|
]
|
||||||
|
assert uec_msgs
|
||||||
|
rendered = render_blocks_to_markdown(uec_msgs[0]["blocks"])
|
||||||
|
# Content has ``` inside, so the wrap fence must be at least 4 backticks.
|
||||||
|
assert "````" in rendered, "expected a 4+ backtick safe-fence wrap"
|
||||||
|
|
||||||
def test_message_roles_are_valid(self):
|
def test_message_roles_are_valid(self):
|
||||||
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
|
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
|
||||||
p = self._get_provider()
|
p = self._get_provider()
|
||||||
result = p.normalize_conversation(raw)
|
result = p.normalize_conversation(raw)
|
||||||
for msg in result["messages"]:
|
for msg in result["messages"]:
|
||||||
assert msg["role"] in ("user", "assistant", "system")
|
assert msg["role"] in ("user", "assistant", "system", "tool")
|
||||||
|
|
||||||
def test_message_count_matches(self):
|
def test_message_count_matches(self):
|
||||||
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
|
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
|
||||||
@@ -88,16 +221,82 @@ class TestChatGPTNormalization:
|
|||||||
result = p.normalize_conversation(raw)
|
result = p.normalize_conversation(raw)
|
||||||
assert result["message_count"] == len(result["messages"])
|
assert result["message_count"] == len(result["messages"])
|
||||||
|
|
||||||
def test_code_fence_preserved(self):
|
def test_loss_report_records_messages(self):
|
||||||
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
|
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
|
||||||
p = self._get_provider()
|
p = self._get_provider()
|
||||||
result = p.normalize_conversation(raw)
|
report = LossReport()
|
||||||
all_content = " ".join(m["content"] for m in result["messages"])
|
result = p.normalize_conversation(raw, report)
|
||||||
assert "```python" in all_content
|
assert report.messages_rendered == len(result["messages"])
|
||||||
|
assert report.conversations == 1
|
||||||
|
|
||||||
|
|
||||||
|
class TestChatGPTUnknownContent:
    """Unrecognised content types should produce visible unknown blocks + WARNING + tally."""

    def _get_provider(self):
        from src.providers.chatgpt import ChatGPTProvider
        p = ChatGPTProvider.__new__(ChatGPTProvider)
        import requests
        p._session = requests.Session()
        p._org_id = None
        p._project_ids = []
        p._project_map = {}
        p._project_name_cache = {}
        return p

    def _make_unknown_conv(self):
        return {
            "id": "test-unknown",
            "title": "Test",
            "create_time": 1700000000.0,
            "update_time": 1700000001.0,
            "mapping": {
                "root": {"id": "root", "message": None, "parent": None, "children": ["msg1"]},
                "msg1": {
                    "id": "msg1",
                    "message": {
                        "id": "msg1",
                        "author": {"role": "user"},
                        "content": {
                            "content_type": "future_unknown_type_xyz",
                            "some_field": "value",
                        },
                    },
                    "parent": "root",
                    "children": [],
                },
            },
        }

    def test_unknown_content_type_produces_unknown_block(self):
        p = self._get_provider()
        result = p.normalize_conversation(self._make_unknown_conv())
        assert any(
            b.get("type") == BLOCK_TYPE_UNKNOWN
            for m in result["messages"]
            for b in (m.get("blocks") or [])
        )

    def test_unknown_content_type_logs_warning(self, caplog):
        p = self._get_provider()
        with caplog.at_level(logging.WARNING):
            p.normalize_conversation(self._make_unknown_conv())
        assert any("future_unknown_type_xyz" in r.message for r in caplog.records)

    def test_unknown_content_type_increments_loss_report(self):
        p = self._get_provider()
        report = LossReport()
        p.normalize_conversation(self._make_unknown_conv(), report)
        assert report.unknown_blocks["future_unknown_type_xyz"] == 1
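The tally these three tests rely on can be sketched minimally. The real `LossReport` lives in `src` and is not shown in this diff, so the class below is a hypothetical stand-in carrying only the attributes the tests touch:

```python
from collections import Counter

class LossReport:
    """Hypothetical stand-in: only the fields the tests above touch."""
    def __init__(self):
        self.unknown_blocks = Counter()  # content_type -> occurrence count
        self.messages_rendered = 0
        self.conversations = 0

def record_unknown(report, content_type):
    # Unrecognised content is tallied, never silently dropped.
    report.unknown_blocks[content_type] += 1

report = LossReport()
record_unknown(report, "future_unknown_type_xyz")
```

Using `Counter` means a type that was never seen reads back as 0, so the sync summary can iterate the tally without key checks.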

# ---------------------------------------------------------------------------
# Claude
# ---------------------------------------------------------------------------

class TestClaudeNormalization:
    """Claude normalize_conversation block-extraction behavior."""

    def _get_provider(self):
        from src.providers.claude import ClaudeProvider
@@ -117,55 +316,138 @@ class TestClaudeNormalization:
        assert result["provider"] == "claude"
        assert result["project"] == "StarTOS Packaging"
        assert result["created_at"] == "2024-06-10T14:32:00.000Z"
        assert isinstance(result["messages"], list)

    def test_normalizes_without_project(self):
        raw = json.loads((FIXTURES / "claude_no_project.json").read_text())
        p = self._get_provider()
        result = p.normalize_conversation(raw)
        assert result["project"] is None
        assert result["id"] == "claude-conv-002"

    def test_string_content_emits_text_block(self):
        raw = json.loads((FIXTURES / "claude_conversation.json").read_text())
        p = self._get_provider()
        result = p.normalize_conversation(raw)
        thanks_msgs = [
            m for m in result["messages"]
            if any(
                "thank you" in (b.get("text") or "").lower()
                for b in (m.get("blocks") or [])
            )
        ]
        assert thanks_msgs

    def test_list_content_emits_blocks_in_order(self):
        raw = json.loads((FIXTURES / "claude_conversation.json").read_text())
        p = self._get_provider()
        result = p.normalize_conversation(raw)
        assistant_msgs = [m for m in result["messages"] if m["role"] == "assistant"]
        # msg-002 has text + tool_use, in that order.
        assert assistant_msgs
        types = _block_types(assistant_msgs[0])
        assert BLOCK_TYPE_TEXT in types
        assert BLOCK_TYPE_TOOL_USE in types
        # Order preserved
        assert types.index(BLOCK_TYPE_TEXT) < types.index(BLOCK_TYPE_TOOL_USE)

    def test_tool_use_block_fields(self):
        raw = json.loads((FIXTURES / "claude_conversation.json").read_text())
        p = self._get_provider()
        result = p.normalize_conversation(raw)
        assistant_msgs = [m for m in result["messages"] if m["role"] == "assistant"]
        tool_block = _first_block(assistant_msgs[0], BLOCK_TYPE_TOOL_USE)
        assert tool_block["name"] == "search"
        assert tool_block["input"] == {"query": "startOS docs"}
        assert tool_block["tool_id"] == "tool-001"

    def test_image_block_emits_image_placeholder(self):
        raw = json.loads((FIXTURES / "claude_conversation.json").read_text())
        p = self._get_provider()
        result = p.normalize_conversation(raw)
        msg004 = [
            m for m in result["messages"]
            if any(b.get("type") == BLOCK_TYPE_IMAGE_PLACEHOLDER for b in (m.get("blocks") or []))
        ]
        assert msg004
        img = _first_block(msg004[0], BLOCK_TYPE_IMAGE_PLACEHOLDER)
        assert img["ref"] == "claude-image-uuid-1"

    def test_unknown_block_type_records_loss(self):
        from src.blocks import BLOCK_TYPE_UNKNOWN as _UNK
        raw = {
            "uuid": "test-unknown",
            "name": "T",
            "chat_messages": [
                {
                    "uuid": "m1",
                    "sender": "human",
                    "content": [{"type": "future_block_xyz", "data": "..."}],
                }
            ],
        }
        p = self._get_provider()
        report = LossReport()
        result = p.normalize_conversation(raw, report)
        assert any(
            b.get("type") == _UNK
            for m in result["messages"]
            for b in (m.get("blocks") or [])
        )
        assert report.unknown_blocks["future_block_xyz"] == 1

    def test_thinking_block(self):
        raw = {
            "uuid": "thinking-test",
            "name": "T",
            "chat_messages": [
                {
                    "uuid": "m1",
                    "sender": "assistant",
                    "content": [
                        {"type": "thinking", "thinking": "Let me reason about this."},
                        {"type": "text", "text": "Here's the answer."},
                    ],
                }
            ],
        }
        p = self._get_provider()
        result = p.normalize_conversation(raw)
        types = _block_types(result["messages"][0])
        assert BLOCK_TYPE_THINKING in types
        assert BLOCK_TYPE_TEXT in types

    def test_tool_result_with_nested_text_blocks(self):
        raw = {
            "uuid": "tool-result-test",
            "name": "T",
            "chat_messages": [
                {
                    "uuid": "m1",
                    "sender": "assistant",
                    "content": [
                        {
                            "type": "tool_result",
                            "tool_use_id": "tool-001",
                            "content": [
                                {"type": "text", "text": "search hit 1"},
                                {"type": "text", "text": "search hit 2"},
                            ],
                            "is_error": False,
                        }
                    ],
                }
            ],
        }
        p = self._get_provider()
        result = p.normalize_conversation(raw)
        tool_result = _first_block(result["messages"][0], BLOCK_TYPE_TOOL_RESULT)
        assert tool_result is not None
        assert "search hit 1" in tool_result["output"]
        assert "search hit 2" in tool_result["output"]
        assert tool_result["is_error"] is False

    def test_human_sender_maps_to_user(self):
        raw = json.loads((FIXTURES / "claude_conversation.json").read_text())
@@ -174,3 +456,188 @@ class TestClaudeNormalization:
        roles = {m["role"] for m in result["messages"]}
        assert "user" in roles
        assert "human" not in roles

    def test_loss_report_messages_recorded(self):
        raw = json.loads((FIXTURES / "claude_conversation.json").read_text())
        p = self._get_provider()
        report = LossReport()
        result = p.normalize_conversation(raw, report)
        assert report.messages_rendered == len(result["messages"])
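The Claude tests above pin down a type-dispatch shape: each raw content item becomes a typed block dict, and unrecognised types become "unknown" blocks rather than being dropped. A minimal sketch of such a dispatcher (names and exact fields are assumptions for illustration, not the project's actual implementation):

```python
def to_block(item):
    """Map one raw Claude content item to a typed block dict (sketch)."""
    kind = item.get("type")
    if kind == "text":
        return {"type": "text", "text": item.get("text", "")}
    if kind == "thinking":
        return {"type": "thinking", "text": item.get("thinking", "")}
    if kind == "tool_use":
        return {
            "type": "tool_use",
            "name": item.get("name"),
            "input": item.get("input"),
            "tool_id": item.get("id"),
        }
    # Fallback: surface the unrecognised type instead of losing it.
    return {"type": "unknown", "raw_type": kind}

blocks = [to_block(b) for b in [
    {"type": "thinking", "thinking": "Let me reason."},
    {"type": "text", "text": "Here's the answer."},
    {"type": "future_block_xyz"},
]]
```

The fallback branch is what makes the loss-report tests possible: every unhandled type yields a visible block whose `raw_type` can be tallied.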

# ---------------------------------------------------------------------------
# v0.4.1 — execution_output, system_error, tether_browsing_display, conv_id
# ---------------------------------------------------------------------------

class TestChatGPTToolOutputs:
    """v0.4.1 ChatGPT tool-role content_types map onto tool_result blocks."""

    def _get_provider(self):
        from src.providers.chatgpt import ChatGPTProvider
        p = ChatGPTProvider.__new__(ChatGPTProvider)
        import requests
        p._session = requests.Session()
        p._org_id = None
        p._project_ids = []
        p._project_map = {}
        p._project_name_cache = {}
        return p

    def test_execution_output_emits_tool_result_with_metadata(self):
        raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
        p = self._get_provider()
        result = p.normalize_conversation(raw)
        exec_msgs = [
            m for m in result["messages"]
            if any(
                b.get("type") == BLOCK_TYPE_TOOL_RESULT
                and b.get("tool_name") == "container.exec"
                for b in (m.get("blocks") or [])
            )
        ]
        assert exec_msgs, "expected execution_output to render as tool_result"
        block = next(
            b for b in exec_msgs[0]["blocks"] if b.get("type") == BLOCK_TYPE_TOOL_RESULT
        )
        assert block["output"].startswith("Hello from container.exec")
        assert block["is_error"] is False
        assert block["summary"] == "Reading skill documentation"

    def test_execution_output_message_role_is_tool(self):
        raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
        p = self._get_provider()
        result = p.normalize_conversation(raw)
        tool_msgs = [m for m in result["messages"] if m["role"] == "tool"]
        assert tool_msgs, "tool-role messages must pass through (filter lifted in v0.4.0)"

    def test_empty_execution_output_skipped(self, caplog):
        raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
        p = self._get_provider()
        with caplog.at_level(logging.DEBUG, logger="src.providers.chatgpt"):
            result = p.normalize_conversation(raw)
        # The empty execution_output (author.name="python") must NOT appear.
        python_msgs = [
            m for m in result["messages"]
            if any(
                b.get("type") == BLOCK_TYPE_TOOL_RESULT and b.get("tool_name") == "python"
                for b in (m.get("blocks") or [])
            )
        ]
        assert not python_msgs, "empty execution_output should be skipped"
        assert any("Skipping empty execution_output" in r.message for r in caplog.records)

    def test_system_error_emits_error_tool_result(self):
        raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
        p = self._get_provider()
        result = p.normalize_conversation(raw)
        web_err = [
            m for m in result["messages"]
            if any(
                b.get("type") == BLOCK_TYPE_TOOL_RESULT
                and b.get("tool_name") == "web"
                and b.get("is_error") is True
                for b in (m.get("blocks") or [])
            )
        ]
        assert web_err, "system_error should render as tool_result with is_error=True"
        block = next(b for b in web_err[0]["blocks"] if b.get("tool_name") == "web")
        assert "503" in block["output"]

    def test_tether_browsing_display_spinner_skipped(self, caplog):
        raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
        p = self._get_provider()
        with caplog.at_level(logging.DEBUG, logger="src.providers.chatgpt"):
            result = p.normalize_conversation(raw)
        spinner_msgs = [
            m for m in result["messages"]
            if any(
                b.get("type") == BLOCK_TYPE_TOOL_RESULT and b.get("tool_name") == "file_search"
                for b in (m.get("blocks") or [])
            )
        ]
        assert not spinner_msgs, "spinner tether_browsing_display should be skipped"
        assert any("tether_browsing_display spinner" in r.message for r in caplog.records)

    def test_tether_browsing_display_populated_renders_defensively(self):
        """Defensive case (never observed in real data) — populated browse renders."""
        conv = {
            "id": "test-tether",
            "title": "T",
            "create_time": 1700000000.0,
            "update_time": 1700000001.0,
            "mapping": {
                "root": {"id": "root", "message": None, "parent": None, "children": ["m1"]},
                "m1": {
                    "id": "m1",
                    "parent": "root",
                    "children": [],
                    "message": {
                        "id": "m1",
                        "author": {"role": "tool", "name": "browser"},
                        "content": {
                            "content_type": "tether_browsing_display",
                            "result": "Found 3 results about kubernetes ingress.",
                            "summary": "ingress search",
                            "assets": None,
                            "tether_id": None,
                        },
                    },
                },
            },
        }
        p = self._get_provider()
        result = p.normalize_conversation(conv)
        assert any(
            b.get("type") == BLOCK_TYPE_TOOL_RESULT and b.get("tool_name") == "browser"
            for m in result["messages"]
            for b in (m.get("blocks") or [])
        )
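The execution_output behaviour these tests assert (render non-empty output as a tool_result block, skip empty output entirely) can be sketched as below; the function name and exact field names are assumptions for illustration:

```python
def execution_output_to_block(message):
    """Sketch: map a ChatGPT execution_output message to a tool_result
    block, or return None when the output is empty (skip it)."""
    content = message.get("content") or {}
    text = (content.get("text") or "").strip()
    if not text:
        return None  # empty execution_output: skipped, per the tests
    return {
        "type": "tool_result",
        "tool_name": (message.get("author") or {}).get("name"),
        "output": text,
        "is_error": False,
    }

block = execution_output_to_block({
    "author": {"role": "tool", "name": "container.exec"},
    "content": {"content_type": "execution_output", "text": "Hello from container.exec"},
})
```

Returning `None` for empty output (rather than an empty block) is what lets the caller drop the message and log the "Skipping empty execution_output" debug line the tests look for.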

class TestChatGPTConvIdFallback:
    """v0.4.1: live ChatGPT detail responses use conversation_id, not id."""

    def _get_provider(self):
        from src.providers.chatgpt import ChatGPTProvider
        p = ChatGPTProvider.__new__(ChatGPTProvider)
        import requests
        p._session = requests.Session()
        p._org_id = None
        p._project_ids = []
        p._project_map = {}
        p._project_name_cache = {}
        return p

    def test_falls_back_to_conversation_id(self):
        raw = {
            "conversation_id": "live-chatgpt-uuid",
            "title": "T",
            "create_time": 1700000000.0,
            "update_time": 1700000001.0,
            "mapping": {
                "root": {"id": "root", "message": None, "parent": None, "children": []},
            },
        }
        p = self._get_provider()
        result = p.normalize_conversation(raw)
        assert result["id"] == "live-chatgpt-uuid"

    def test_id_takes_precedence_when_both_present(self):
        raw = {
            "id": "from-id",
            "conversation_id": "from-conversation-id",
            "title": "T",
            "create_time": 1700000000.0,
            "update_time": 1700000001.0,
            "mapping": {
                "root": {"id": "root", "message": None, "parent": None, "children": []},
            },
        }
        p = self._get_provider()
        result = p.normalize_conversation(raw)
        assert result["id"] == "from-id"
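Both tests reduce to a one-line precedence rule. A sketch, assuming plain dict access (the provider's actual helper is not shown in this diff):

```python
def resolve_conversation_id(raw):
    # "id" (export-style payloads) wins when both keys are present;
    # live detail responses may only carry "conversation_id".
    return raw.get("id") or raw.get("conversation_id")
```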
147 tests/test_utils.py Normal file
@@ -0,0 +1,147 @@
"""Tests for src/utils.py — filename generation, path building, redaction."""

from pathlib import Path

import pytest

from src.utils import (
    build_export_path,
    format_token_status,
    generate_filename,
    redact_secrets,
)


class TestGenerateFilename:
    def test_basic_format(self):
        name = generate_filename("Hello World", "abc12345def", "2024-06-10T14:00:00Z")
        assert name == "2024-06-10_hello-world_abc12345.md"

    def test_special_chars_slugified(self):
        # T-36: titles with punctuation must produce safe, OS-compatible filenames
        name = generate_filename("What's this?! A test.", "abc12345", "2024-06-01T00:00:00Z")
        assert "?" not in name
        assert "!" not in name
        assert "'" not in name
        assert " " not in name
        assert name.startswith("2024-06-01_")
        assert name.endswith("_abc12345.md")

    def test_unicode_chars_handled(self):
        name = generate_filename("Héllo Wörld", "abc12345", "2024-06-01T00:00:00Z")
        assert " " not in name
        assert name.endswith("_abc12345.md")

    def test_empty_title_becomes_untitled(self):
        name = generate_filename("", "abc12345", "2024-06-01T00:00:00Z")
        assert "untitled" in name

    def test_id_truncated_to_8_chars(self):
        name = generate_filename("Test", "abcdefghijklmnop", "2024-06-01T00:00:00Z")
        assert name.endswith("_abcdefgh.md")

    def test_long_title_truncated(self):
        long_title = "a" * 200
        name = generate_filename(long_title, "abc12345", "2024-06-01T00:00:00Z")
        # Slug is capped at 60 chars by max_length
        slug_part = name.split("_")[1]
        assert len(slug_part) <= 60

    def test_date_comes_from_created_at(self):
        name = generate_filename("Test", "abc12345", "2023-11-25T00:00:00Z")
        assert name.startswith("2023-11-25_")
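A minimal sketch of a `generate_filename` that satisfies the assertions above; the real implementation in src/utils.py may differ, and the exact slug rule here is an assumption:

```python
import re

def generate_filename(title, conv_id, created_at, max_length=60):
    """Sketch: <YYYY-MM-DD>_<slug>_<first 8 id chars>.md"""
    date = created_at[:10]  # ISO timestamps start with the date
    # Collapse every run of non-alphanumeric characters to a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")[:max_length]
    if not slug:
        slug = "untitled"
    return f"{date}_{slug}_{conv_id[:8]}.md"
```

Note the ASCII-only character class also strips accented letters ("Héllo" becomes "h-llo"), which is enough for the tests above but loses information a transliterating slugifier would keep.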
class TestBuildExportPath:
    def test_default_structure_provider_project_year(self):
        path = build_export_path(
            Path("/exports"), "claude", "my-project", "2024-06-01T00:00:00Z", "file.md"
        )
        assert str(path) == "/exports/claude/my-project.2024/file.md"

    def test_no_project_uses_no_project_slug(self):
        path = build_export_path(
            Path("/exports"), "chatgpt", None, "2024-06-01T00:00:00Z", "file.md"
        )
        assert "no-project.2024" in str(path)

    def test_provider_project_structure_omits_year(self):
        path = build_export_path(
            Path("/exports"), "claude", "proj", "2024-06-01T00:00:00Z", "file.md",
            structure="provider/project",
        )
        assert "2024" not in str(path)
        assert "proj" in str(path)

    def test_provider_year_structure_omits_project(self):
        path = build_export_path(
            Path("/exports"), "claude", "proj", "2024-06-01T00:00:00Z", "file.md",
            structure="provider/year",
        )
        assert "proj" not in str(path)
        assert "2024" in str(path)

    def test_project_name_with_spaces_is_slugified(self):
        path = build_export_path(
            Path("/exports"), "claude", "My Project Name!", "2024-06-01T00:00:00Z", "file.md"
        )
        assert " " not in str(path)
        assert "!" not in str(path)
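A sketch of a `build_export_path` consistent with these layout tests; the `structure` keys are taken from the tests, while the default structure name and the `_slug` helper are assumptions:

```python
import re
from pathlib import Path

def _slug(s):
    # Hypothetical helper: lowercase, hyphens for anything non-alphanumeric.
    return re.sub(r"[^a-z0-9]+", "-", s.lower()).strip("-")

def build_export_path(root, provider, project, created_at, filename,
                      structure="provider/project.year"):
    """Sketch of the directory layout the tests pin down."""
    year = created_at[:4]
    proj = _slug(project) if project else "no-project"
    if structure == "provider/project":
        sub = proj                 # year omitted
    elif structure == "provider/year":
        sub = year                 # project omitted
    else:                          # default: provider/project.year
        sub = f"{proj}.{year}"
    return root / provider / sub / filename
```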
class TestRedactSecrets:
    def test_token_value_redacted(self):
        data = {"token": "supersecret"}
        result = redact_secrets(data)
        assert result["token"] == "[REDACTED]"

    def test_session_key_redacted(self):
        result = redact_secrets({"sessionKey": "abc123"})
        assert result["sessionKey"] == "[REDACTED]"

    def test_non_sensitive_key_unchanged(self):
        result = redact_secrets({"title": "My Chat", "id": "abc123"})
        assert result["title"] == "My Chat"
        assert result["id"] == "abc123"

    def test_nested_dict_redacted(self):
        data = {"user": {"token": "secret", "name": "Alice"}}
        result = redact_secrets(data)
        assert result["user"]["token"] == "[REDACTED]"
        assert result["user"]["name"] == "Alice"

    def test_list_of_dicts(self):
        data = [{"password": "p@ss"}, {"title": "chat"}]
        result = redact_secrets(data)
        assert result[0]["password"] == "[REDACTED]"
        assert result[1]["title"] == "chat"
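A sketch of a recursive `redact_secrets` that satisfies the tests; the exact sensitive-key list is an assumption (the tests only require token, sessionKey, and password to be caught):

```python
SENSITIVE = {"token", "sessionkey", "password", "secret", "cookie"}

def redact_secrets(data):
    """Sketch: replace values of sensitive-looking keys, recursing into
    nested dicts and lists; everything else passes through unchanged."""
    if isinstance(data, dict):
        return {
            k: "[REDACTED]" if k.lower() in SENSITIVE else redact_secrets(v)
            for k, v in data.items()
        }
    if isinstance(data, list):
        return [redact_secrets(v) for v in data]
    return data
```

Matching on `k.lower()` is what makes camelCase keys like `sessionKey` hit the same rule as `sessionkey`.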
class TestFormatTokenStatus:
    def test_none_token_returns_not_set(self):
        assert format_token_status(None) == "[NOT SET]"

    def test_empty_token_returns_not_set(self):
        assert format_token_status("") == "[NOT SET]"

    def test_set_token_no_expiry(self):
        assert format_token_status("sometoken") == "[SET]"

    def test_expired_token(self):
        from datetime import datetime, timezone, timedelta
        expiry = datetime.now(tz=timezone.utc) - timedelta(days=1)
        result = format_token_status("tok", expiry)
        assert "EXPIRED" in result

    def test_expiring_today_shows_hours(self):
        from datetime import datetime, timezone, timedelta
        expiry = datetime.now(tz=timezone.utc) + timedelta(hours=3)
        result = format_token_status("tok", expiry)
        assert "expires in" in result
        assert "h" in result

    def test_expiring_in_days(self):
        from datetime import datetime, timezone, timedelta
        expiry = datetime.now(tz=timezone.utc) + timedelta(days=10, hours=12)
        result = format_token_status("tok", expiry)
        assert "10 days" in result