Compare commits

...

10 Commits

Author SHA1 Message Date
JesseMarkowitz
557994f7d9 fix: persist created_at in cache so Joplin note titles get date prefix
mark_exported() was discarding created_at from the metadata dict because
it wasn't in the hardcoded stored-key list, so the Joplin sync always
saw an empty date and omitted the prefix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 11:36:21 -04:00
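The fix this commit describes can be sketched roughly as follows. This is a hypothetical reconstruction — `mark_exported`, `STORED_KEYS`, and the key names are illustrative, not the project's actual code:

```python
# Hypothetical reconstruction of the bug: mark_exported() persisted only
# a hardcoded whitelist of metadata keys, and "created_at" was missing
# from it. Adding it lets the Joplin sync build the YYYY-MM-DD prefix.
STORED_KEYS = ("title", "project", "updated_at", "created_at", "output_path")

def mark_exported(manifest: dict, conv_id: str, metadata: dict) -> None:
    """Persist only whitelisted metadata keys into the cache manifest."""
    manifest[conv_id] = {k: metadata[k] for k in STORED_KEYS if k in metadata}

manifest: dict = {}
mark_exported(manifest, "abc123", {
    "title": "Demo",
    "created_at": "2026-05-01T09:00:00Z",
    "transient_field": "not persisted",
})
```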
JesseMarkowitz
e9b2e42893 feat: v0.5.0 — nested Joplin notebooks, date-prefixed note titles, flat year folders
Joplin notebooks now use a two-level hierarchy: AI-ChatGPT / <project> and
AI-Claude / <project> instead of a single flat title. Note titles are prefixed
with the conversation created_at date (YYYY-MM-DD). Export folders collapse
provider/project/year into a single provider/project.year directory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 11:05:39 -04:00
JesseMarkowitz
68e8d532be feat: v0.4.1 — ChatGPT tool-output content types and conv_id fix
First real-data export against v0.4.0 surfaced 66 unknown blocks across
three content types — captured live and added.

Added:
- execution_output (Code Interpreter / container.exec / python tool
  output) → tool_result block. output=content.text,
  tool_name=author.name, is_error=metadata.aggregate_result.status,
  summary=metadata.reasoning_title
- system_error → error tool_result with tool_name=author.name
- tether_browsing_display: spinner placeholders (empty result+summary)
  skip silently with DEBUG log; defensive populated-case branch maps
  to tool_result (untested in real data)
- tool_result block schema: optional `summary` field rendered as
  italic line between header and fence
- tool_result rendering: tool_name appears in header when present
  (e.g. `📤 Result: container.exec`); existing tool_name=None calls
  unchanged
- _ROLE_LABELS["tool"] = ("🔧 Tool", "tool")

Fixed:
- chatgpt.normalize_conversation reads `conversation_id` as fallback
  for `id`. Live API uses conversation_id; fixtures use id.
  Pre-fix: empty id in YAML frontmatter and missing context in
  WARNING logs.

Tests: 11 new (192 total, 0 failures). Fixture extended with 4
tool-output cases (execution_output success, empty execution_output
that should skip, system_error, tether_browsing_display spinner).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-05 09:25:55 -04:00
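The conv_id fix amounts to a one-line fallback. A minimal sketch, with the function name assumed for illustration (the field names come from the commit message):

```python
def resolve_conversation_id(payload: dict) -> str:
    # Live ChatGPT detail responses expose "conversation_id" at top
    # level, while fixtures and listing summaries expose "id". Prefer
    # "id" and fall back, so neither shape yields an empty identifier.
    return payload.get("id") or payload.get("conversation_id") or ""
```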
JesseMarkowitz
473d02f71a feat: v0.4.0 — rich content support with typed blocks and loss visibility
Extracts per-message content into a typed `blocks` list (text, code,
thinking, tool_use, tool_result, image_placeholder, file_placeholder,
unknown) and renders them at exporter write time. Voice transcripts,
Custom Instructions, and image references now appear in exports
instead of being silently dropped.

Foundation:
- src/blocks.py: pure block constructors, _safe_fence (fence-corruption
  defense, verified live in Joplin), _blockquote_prefix, render
- src/loss_report.py: per-run tally surfaced as INFO summary at end of
  export so silently-dropped data becomes visible

Providers:
- ChatGPT: dispatch on content_type produces typed blocks; voice shapes
  (audio_transcription, audio_asset_pointer, real_time_user_audio_video_
  asset_pointer) locked from live DevTools capture; Custom Instructions
  bug fix (parts-vs-direct-fields); role filter lifted; hidden-context
  marker driven by is_visually_hidden_from_conversation flag
- Claude: defensive dispatch for text/thinking/tool_use/tool_result/image
  with recursive nested-block flattening; untested against real rich-
  content data — fix-forward in v0.4.1

Exporter:
- Markdown renders from blocks at write time via render_blocks_to_markdown;
  backward-compat fallback to content for any pre-v0.4.0 cached data

Tests:
- 27 new tests across providers, exporters, CLI; fixtures rebuilt with
  real-shape ChatGPT voice + Custom Instructions cases
- 181/181 pass

Behavior changes (intentional):
- JSON output omits content; consumers should read blocks
- Per-conversation message counts increase (Custom Instructions, image-
  only, tool-only messages now appear)
- Existing exports not auto-re-rendered; users wanting fresh output run
  cache --clear then export

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-04 23:17:18 -04:00
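The fence-corruption defense mentioned under Foundation can be approximated as below — a standalone sketch of what `_safe_fence` does, not the shipped implementation:

```python
import re

def safe_fence(content: str, min_ticks: int = 3) -> str:
    """Wrap content in a backtick fence longer than any backtick run
    inside it, so embedded ``` sequences cannot close the fence early."""
    longest = max((len(run) for run in re.findall(r"`+", content)), default=0)
    ticks = "`" * max(min_ticks, longest + 1)
    return f"{ticks}\n{content}\n{ticks}"
```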
JesseMarkowitz
4798edcea7 docs: update README for chunked ChatGPT session cookies
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 22:32:01 -04:00
JesseMarkowitz
19bfdaecbe fix: v0.2.1 — chunked ChatGPT cookies and Claude project path
- Support __Secure-next-auth.session-token.0/.1 split cookies; ChatGPT
  now issues tokens that exceed the 4KB per-cookie limit and must be
  sent as two named chunks or the auth endpoint returns no accessToken.
  Add CHATGPT_SESSION_TOKEN_1 env var; update auth wizard instructions.

- Fix Claude conversations exported to wrong directory when project name
  is present in the listing but absent from the detail endpoint response.
  Explicitly propagate "project" alongside _-prefixed annotation keys.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 22:32:14 -04:00
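A rough sketch of how the two cookie chunks might be assembled for the request. The cookie and env var names come from the commit; the function itself is an assumption, not the project's code:

```python
import os

def chatgpt_cookie_jar() -> dict:
    """Assemble the chunked session cookies for the ChatGPT client.
    Tokens over the 4KB per-cookie limit arrive as two named chunks;
    both must be sent or the auth endpoint returns no accessToken."""
    jar = {"__Secure-next-auth.session-token.0": os.environ["CHATGPT_SESSION_TOKEN"]}
    chunk1 = os.environ.get("CHATGPT_SESSION_TOKEN_1")
    if chunk1:
        jar["__Secure-next-auth.session-token.1"] = chunk1
    return jar
```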
JesseMarkowitz
4ccd918eb1 fix: list command shows Claude titles and fits 80-col terminals
Claude's list endpoint returns conversations with a `name` field rather
than `title`, so every Claude row was falling through to "Untitled".
Also set no_wrap + ellipsis overflow and tune column widths so the table
renders one row per conversation in Windows Command Prompt (80 cols).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-08 14:49:58 -04:00
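The title fix is a simple field fallback; a minimal sketch (function name illustrative):

```python
def display_title(conv: dict) -> str:
    # Claude's list endpoint returns "name"; ChatGPT returns "title".
    # Without the fallback, every Claude row rendered as "Untitled".
    return conv.get("title") or conv.get("name") or "Untitled"
```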
JesseMarkowitz
a869e8c7ba fix for project files written to wrong directory 2026-03-30 15:25:18 -04:00
JesseMarkowitz
340293ab94 fix for project files not extracted 2026-03-30 13:22:05 -04:00
JesseMarkowitz
050cd49124 updated to run on Windows and add est capabilities 2026-03-30 11:08:05 -04:00
25 changed files with 2992 additions and 310 deletions

.env.example

@@ -6,9 +6,12 @@
 # --- ChatGPT ---
 # How to get: open chatgpt.com in Chrome → F12 → Application tab
-# → Cookies → https://chatgpt.com → find "__Secure-next-auth.session-token" → copy Value
-# Token type: JWT (starts with "eyJ"). Typically valid for ~7 days.
+# → Cookies → https://chatgpt.com → find the two cookie chunks:
+#   __Secure-next-auth.session-token.0 (starts with "eyJ") → CHATGPT_SESSION_TOKEN
+#   __Secure-next-auth.session-token.1 (the remainder) → CHATGPT_SESSION_TOKEN_1
+# Token type: JWE. Typically valid for ~7 days.
 CHATGPT_SESSION_TOKEN=
+CHATGPT_SESSION_TOKEN_1=
 # ChatGPT Projects (optional): comma-separated list of project gizmo IDs.
 # Project conversations are NOT included in the default /conversations listing.
@@ -46,9 +49,9 @@ JOPLIN_API_URL=http://localhost:41184
 # JOPLIN_REQUEST_TIMEOUT=30
 # --- Cache ---
-# Where the sync manifest and logs are stored (default: ~/.ai-chat-exporter)
-CACHE_DIR=~/.ai-chat-exporter
+# Where the sync manifest is stored (default: ./cache, inside the install directory)
+CACHE_DIR=./cache
 # --- Logging ---
 # Log file path. Set to "none" to disable file logging.
-LOG_FILE=~/.ai-chat-exporter/logs/exporter.log
+LOG_FILE=./cache/logs/exporter.log

.gitignore

@@ -25,10 +25,14 @@ exports/
 !CHANGELOG.md
 # Cache and logs
+cache/
 .ai-chat-exporter/
 logs/
 *.log
+# Test tracking
+test-plan.csv
 # Editor / OS
 .DS_Store
 .idea/

CHANGELOG.md

@@ -3,6 +3,57 @@
 All notable changes to this project will be documented here.
 Format follows [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
+## [0.4.1] - Unreleased
+### Added
+- ChatGPT `execution_output` (Code Interpreter / `container.exec` / `python`) renders as a `tool_result` block with `tool_name` from `author.name`, `is_error` from `metadata.aggregate_result.status`, and the optional `summary` line populated from `metadata.reasoning_title`. Captured live during planning.
+- ChatGPT `system_error` content (e.g. browse-service 503) renders as an error `tool_result` block with `tool_name` from `author.name` (typically `"web"`).
+- ChatGPT `tether_browsing_display` populated case (defensive, not observed in real data) renders as a `tool_result` block; transient spinner placeholders (empty `result`+`summary`) skip silently with DEBUG log.
+- `tool_result` block schema gains optional `summary: str | None` field, rendered as italic line between header and fenced output.
+- `tool_result` rendering shows `tool_name` in the header when present (e.g. `📤 **Result: container.exec**`); when absent, header stays as `📤 **Result**` (no regression).
+- Markdown exporter: `_ROLE_LABELS["tool"] = ("🔧 Tool", "tool")` so tool-role messages render under a recognisable header instead of the generic fallback.
+- 11 new tests covering all four cases plus the conv_id fallback (192 total, all passing).
+### Fixed
+- ChatGPT `normalize_conversation` now reads `conversation_id` as a fallback for `id`. Live ChatGPT detail responses use `conversation_id` at top level; fixtures and listing summaries use `id`. Without the fallback, normalized conversations had empty `id` (visible as blank `conversation_id:` in YAML frontmatter and missing context in WARNING log lines).
+### Migration
+- No new schema breaks; `tool_result` blocks gain a `summary` field that defaults to None on legacy data. Existing exports re-render cleanly with the cache-clear-and-export workflow from v0.4.0.
+## [0.4.0] - Unreleased
+### Added
+- Rich content support: messages now carry an ordered `blocks` list (text, code, thinking, tool_use, tool_result, citation, image_placeholder, file_placeholder, unknown)
+- ChatGPT voice mode: `audio_transcription` parts render as text blocks; `audio_asset_pointer` and `real_time_user_audio_video_asset_pointer` render as `📎 File attached` placeholders with size and duration metadata
+- ChatGPT Custom Instructions: `user_editable_context` and `model_editable_context` messages now appear in exports (were silently dropped — pre-existing bug fixed); rendered with a `> Hidden context` marker driven by the `is_visually_hidden_from_conversation` flag
+- Image placeholders for `image_asset_pointer` parts (uploads + DALL-E) inside `multimodal_text` and at message level
+- Defensive Claude block extraction: `text`, `thinking`, `tool_use`, `tool_result` (including nested-block flattening), `image` blocks (untested against real data; will fix-forward in v0.4.1 if real shapes diverge)
+- `LossReport` summary table emitted at end of every `export` run, breaking down `unknown blocks` and `extraction failures` by raw type so silently-dropped data becomes visible
+- `_safe_fence` helper picks a fence longer than any backtick run in extracted content, preventing embedded triple-backticks from corrupting downstream rendering (verified live in Joplin during planning)
+- `unknown` blocks render as `> ⚠️ Unsupported content` with the raw type, observed top-level keys, and reason — so future API additions are visible rather than silent
+### Changed
+- ChatGPT role filter (previously dropped `tool` and `system` messages) is **lifted**: all roles now route through normal extraction; truly empty messages skip via the existing empty-content guard
+- Markdown rendering moves from provider-time to exporter-write-time. Providers produce blocks; exporters call `render_blocks_to_markdown` at write time. This unblocks future Obsidian/HTML exporters
+- `BaseProvider.normalize_conversation` signature now accepts an optional `LossReport` parameter (breaking change for any future custom subclass; FileProvider hasn't shipped yet)
+- `o1`/`o3` reasoning subparts inside `text` content_type messages remain rendered as plain text (defensive; reclassification to `thinking` block deferred until live shape is captured)
+### Fixed
+- `user_editable_context` / `model_editable_context` extraction (parts-vs-direct-fields mismatch) — Custom Instructions are no longer silently dropped from every conversation
+### Migration
+- Existing exports are not re-rendered automatically. To pick up v0.4.0 rendering for previously exported conversations:
+```
+python -m src.main cache --clear
+python -m src.main export --provider all
+```
+- JSON exports: messages now contain `blocks` (typed structured content) and may omit the legacy `content` field. External consumers reading JSON should prefer `blocks`.
+- Per-conversation message counts may increase: previously-dropped Custom Instructions, image-only user turns, and tool-only assistant turns now appear.
+### Out of scope (deferred to v0.5.0+)
+- Binary downloads of images and audio assets (placeholders show metadata only; `content not preserved in this export`)
+- Joplin resource upload for embedded media
+- Filename resolution for `file-XYZ` / `sediment://` references
+- Speculative ChatGPT types (`tether_browsing_display`, `tether_quote`) and DALL-E assistant images — fall through to `unknown` blocks if encountered
 ## [0.2.0] - Unreleased
 ### Added
 - Joplin import automation: `joplin` command syncs exported Markdown files to Joplin as notes

FUTURE.md

@@ -7,6 +7,7 @@ of these additions straightforward.
 **Completed:**
 - v0.1.0 — Core export: ChatGPT + Claude, incremental sync, Markdown + JSON output
 - v0.2.0 — Joplin import automation (`joplin` command, create/update notes, notebook auto-creation)
+- v0.4.0 — Rich content support: typed message blocks (text, code, thinking, tool_use, tool_result, image_placeholder, file_placeholder, unknown); ChatGPT voice transcripts as text + audio placeholders; Custom Instructions extraction; data-loss visibility via `LossReport` summary and visible `unknown` blocks
 ---
@@ -58,26 +59,43 @@ export command to accept a pre-downloaded export ZIP or JSON.
 ---
-## Rich Content Support (v0.4.0)
-Currently only text content is exported. Future versions should handle:
-### Claude
-- Artifacts (code, documents, HTML) — export as separate files, link from Markdown
-- Uploaded images — download and embed or link
-- Extended thinking/reasoning blocks — include as collapsible sections
-- Tool call results and web search citations — include as footnotes or appendices
-### ChatGPT
-- DALL-E generated images — download and embed or link
-- Code Interpreter outputs — export code and results
-- File attachments — download and reference
-- Voice transcripts — include as text
-Implementation note: the normalized message schema already includes a
-`content_type` field placeholder. When this work begins, extend the schema
-rather than replacing it. Non-text content already logs a WARNING when
-encountered so users can see what was skipped.
+## Binary Content Downloads (v0.5.0)
+v0.4.0 ships placeholders for images and audio assets but does not download
+the binary content. The `_safe_fence`-wrapped placeholders include the asset
+reference (`sediment://...` or `file-service://...`), MIME type, size, and
+duration where available; the actual bytes are not preserved.
+Next steps:
+- Download attached images alongside the Markdown export, save under a
+  `media/` sibling directory with a stable filename derived from the asset
+  reference.
+- Replace `image_placeholder` rendering with an inline `![](relative/path)`
+  reference once the file is on disk.
+- Joplin integration: upload binaries as Joplin resources via `POST /resources`,
+  rewrite the rendered Markdown to use `:/resourceId` references, and track
+  the resource ID in the cache manifest so re-syncs stay idempotent.
+- DALL-E images on the assistant side: not observed in this user's data; the
+  code path exists (`source = "model_generated"`) but is untested.
+The block-level schema is already in place — only the file-fetch + rewrite
+layer needs to be added. See the `image_placeholder` and `file_placeholder`
+block definitions in `src/blocks.py`.
+## Reclassify o1/o3 Reasoning Subparts (v0.4.1)
+v0.4.0 leaves dict parts inside `text` content_type messages with shape
+`{"summary": ..., "content": ...}` rendered as plain text (defensive — the
+shape was inferred from a code comment, not captured live). Once a real
+reasoning conversation is captured, reclassify these as `thinking` blocks.
+## Suppress Hidden Context (v0.4.x)
+If Custom Instructions duplication across conversations becomes a storage
+problem, add `EXPORTER_INCLUDE_HIDDEN_CONTEXT=false` env var. The toggle is
+a single `os.getenv()` check at the start of
+`_extract_editable_context_blocks` in `src/providers/chatgpt.py` — return
+empty list if disabled.
 ---

README.md

@@ -28,6 +28,8 @@ This tool is designed for a single user backing up their own conversations. Do n
## Installation ## Installation
### Linux / macOS
```bash ```bash
git clone <repo-url> git clone <repo-url>
cd ai-chat-exporter cd ai-chat-exporter
@@ -36,6 +38,37 @@ source .venv/bin/activate
pip install -e ".[dev]" pip install -e ".[dev]"
``` ```
### Windows
No admin access required. Run these in **Command Prompt** (`cmd.exe`) — it's the simplest option on Windows because it doesn't have PowerShell's script execution policy restrictions.
```bat
git clone <repo-url>
cd ai-chat-exporter
python -m venv .venv
.venv\Scripts\activate
pip install -e ".[dev]"
```
All `ai-chat-exporter` commands work identically in Command Prompt.
**Using PowerShell instead?** If you prefer PowerShell, you may need to allow script execution first (one-time, current user only):
```powershell
Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
```
Then activate the venv and run commands the same way.
**Prerequisites:**
- Python 3.11 or later — install from [python.org](https://www.python.org/downloads/windows/). During installation, tick **"Add Python to PATH"**.
- Git — install from [git-scm.com](https://git-scm.com/) if not already present.
**Notes:**
- The cache manifest and logs are stored in `cache\` inside the install directory — the same as on Linux.
- File permission hardening (`chmod 600`) is silently ignored on Windows — not a concern for single-user desktop use.
- Joplin Web Clipper runs on `localhost:41184` on all platforms; no configuration changes needed.
--- ---
## First Run: Run Doctor ## First Run: Run Doctor
@@ -43,7 +76,7 @@ pip install -e ".[dev]"
Before anything else, validate your setup: Before anything else, validate your setup:
```bash ```bash
python -m src.main doctor ai-chat-exporter doctor
``` ```
This checks token presence, format, expiry, directory permissions, disk space, and live API connectivity. Fix any failures before proceeding. This checks token presence, format, expiry, directory permissions, disk space, and live API connectivity. Fix any failures before proceeding.
@@ -58,7 +91,7 @@ Session tokens are how your browser stays logged in. This tool uses them to acce
| Provider | Cookie Name | Lifetime | Expiry Detection | | Provider | Cookie Name | Lifetime | Expiry Detection |
|----------|-------------|----------|-----------------| |----------|-------------|----------|-----------------|
| ChatGPT | `__Secure-next-auth.session-token` | ~7 days | JWT `exp` claim (decoded automatically) | | ChatGPT | `__Secure-next-auth.session-token.0` + `.1` | ~7 days | JWT `exp` claim (decoded automatically) |
| Claude | `sessionKey` | ~30 days | Only detectable via 401 response | | Claude | `sessionKey` | ~30 days | Only detectable via 401 response |
### Finding Tokens in Chrome DevTools ### Finding Tokens in Chrome DevTools
@@ -69,14 +102,18 @@ Session tokens are how your browser stays logged in. This tool uses them to acce
4. In the left panel, expand **Cookies** and click the site URL 4. In the left panel, expand **Cookies** and click the site URL
5. Find the cookie by name and copy its **Value** 5. Find the cookie by name and copy its **Value**
**ChatGPT:** go to `https://chatgpt.com` → find `__Secure-next-auth.session-token` → copy Value (starts with `eyJ`) **ChatGPT:** go to `https://chatgpt.com` → find **two** cookies:
- `__Secure-next-auth.session-token.0` — copy Value (starts with `eyJ`) → `CHATGPT_SESSION_TOKEN`
- `__Secure-next-auth.session-token.1` — copy Value → `CHATGPT_SESSION_TOKEN_1`
ChatGPT splits large session tokens across two cookies to stay under the browser's 4KB cookie limit. Both are required.
**Claude:** go to `https://claude.ai` → find `sessionKey` → copy Value **Claude:** go to `https://claude.ai` → find `sessionKey` → copy Value
### When Tokens Expire ### When Tokens Expire
When a token expires you'll see a `401 Unauthorized` error. To refresh: When a token expires you'll see a `401 Unauthorized` error. To refresh:
- Re-run the `auth` wizard: `python -m src.main auth` - Re-run the `auth` wizard: `ai-chat-exporter auth`
- Or manually update the value in your `.env` file - Or manually update the value in your `.env` file
--- ---
@@ -86,7 +123,7 @@ When a token expires you'll see a `401 Unauthorized` error. To refresh:
The easiest way to configure tokens is the interactive wizard: The easiest way to configure tokens is the interactive wizard:
```bash ```bash
python -m src.main auth ai-chat-exporter auth
``` ```
This walks you through finding your token, validates it, shows the expiry date (ChatGPT only), and offers to write it to your `.env` automatically. Tokens are never echoed to the terminal. This walks you through finding your token, validates it, shows the expiry date (ChatGPT only), and offers to write it to your `.env` automatically. Tokens are never echoed to the terminal.
@@ -105,7 +142,8 @@ cp .env.example .env
| Variable | Description | | Variable | Description |
|----------|-------------| |----------|-------------|
| `CHATGPT_SESSION_TOKEN` | Your ChatGPT JWT session token (`eyJ…`) | | `CHATGPT_SESSION_TOKEN` | ChatGPT session token chunk `.0` (starts with `eyJ…`) |
| `CHATGPT_SESSION_TOKEN_1` | ChatGPT session token chunk `.1` (the remainder) |
| `CHATGPT_PROJECT_IDS` | Comma-separated ChatGPT project IDs (see below) | | `CHATGPT_PROJECT_IDS` | Comma-separated ChatGPT project IDs (see below) |
| `CLAUDE_SESSION_KEY` | Your Claude session key | | `CLAUDE_SESSION_KEY` | Your Claude session key |
@@ -128,8 +166,8 @@ cp .env.example .env
| Variable | Default | Description | | Variable | Default | Description |
|----------|---------|-------------| |----------|---------|-------------|
| `CACHE_DIR` | `~/.ai-chat-exporter` | Where to store the sync manifest | | `CACHE_DIR` | `./cache` | Where to store the sync manifest |
| `LOG_FILE` | `~/.ai-chat-exporter/logs/exporter.log` | Log file path (`none` to disable) | | `LOG_FILE` | `./cache/logs/exporter.log` | Log file path (`none` to disable) |
--- ---
@@ -218,7 +256,7 @@ Each provider+project combination maps to a flat Joplin notebook created automat
### `auth` — Interactive token setup ### `auth` — Interactive token setup
```bash ```bash
python -m src.main auth ai-chat-exporter auth
``` ```
Guided wizard to find and save session tokens and ChatGPT project IDs. Detects OS and shows the correct DevTools shortcut. Guided wizard to find and save session tokens and ChatGPT project IDs. Detects OS and shows the correct DevTools shortcut.
@@ -226,7 +264,7 @@ Guided wizard to find and save session tokens and ChatGPT project IDs. Detects O
### `doctor` — Health check ### `doctor` — Health check
```bash ```bash
python -m src.main doctor ai-chat-exporter doctor
``` ```
Checks: token presence, JWT validity and expiry, directory permissions, disk space, live API reachability. Exits with code 0 if all pass, 1 if any fail. Checks: token presence, JWT validity and expiry, directory permissions, disk space, live API reachability. Exits with code 0 if all pass, 1 if any fail.
@@ -235,31 +273,31 @@ Checks: token presence, JWT validity and expiry, directory permissions, disk spa
```bash ```bash
# Export everything (new/updated only) # Export everything (new/updated only)
python -m src.main export ai-chat-exporter export
# Single provider # Single provider
python -m src.main export --provider claude ai-chat-exporter export --provider claude
# JSON output # JSON output
python -m src.main export --format json ai-chat-exporter export --format json
# Both Markdown and JSON # Both Markdown and JSON
python -m src.main export --format both ai-chat-exporter export --format both
# Only conversations updated since a date # Only conversations updated since a date
python -m src.main export --since 2024-06-01 ai-chat-exporter export --since 2024-06-01
# Only conversations in a specific project (case-insensitive substring) # Only conversations in a specific project (case-insensitive substring)
python -m src.main export --project "learning python" ai-chat-exporter export --project "learning python"
# Only conversations outside any project # Only conversations outside any project
python -m src.main export --project none ai-chat-exporter export --project none
# Write to a custom directory # Write to a custom directory
python -m src.main export --output /path/to/my/notes ai-chat-exporter export --output /path/to/my/notes
# Preview without writing anything # Preview without writing anything
python -m src.main export --dry-run ai-chat-exporter export --dry-run
``` ```
Options: `--provider [chatgpt|claude|all]`, `--format [markdown|json|both]`, `--output PATH`, `--since YYYY-MM-DD`, `--project NAME`, `--dry-run` Options: `--provider [chatgpt|claude|all]`, `--format [markdown|json|both]`, `--output PATH`, `--since YYYY-MM-DD`, `--project NAME`, `--dry-run`
@@ -268,16 +306,16 @@ Options: `--provider [chatgpt|claude|all]`, `--format [markdown|json|both]`, `--
```bash ```bash
# List all conversations for all providers # List all conversations for all providers
python -m src.main list ai-chat-exporter list
# Single provider # Single provider
python -m src.main list --provider chatgpt ai-chat-exporter list --provider chatgpt
# Filter by project # Filter by project
python -m src.main list --project "learning python" ai-chat-exporter list --project "learning python"
# Only conversations outside any project # Only conversations outside any project
python -m src.main list --project none ai-chat-exporter list --project none
``` ```
Fetches and displays all conversations without exporting them. Useful for verifying what the tool can see before running an export. Fetches and displays all conversations without exporting them. Useful for verifying what the tool can see before running an export.
@@ -286,19 +324,19 @@ Fetches and displays all conversations without exporting them. Useful for verify
```bash ```bash
# Sync all pending conversations to Joplin # Sync all pending conversations to Joplin
python -m src.main joplin ai-chat-exporter joplin
# Preview what would be synced without sending anything # Preview what would be synced without sending anything
python -m src.main joplin --dry-run ai-chat-exporter joplin --dry-run
# Sync a single provider # Sync a single provider
python -m src.main joplin --provider chatgpt ai-chat-exporter joplin --provider chatgpt
# Sync only conversations in a specific project # Sync only conversations in a specific project
python -m src.main joplin --project "learning python" ai-chat-exporter joplin --project "learning python"
# Sync only conversations outside any project # Sync only conversations outside any project
python -m src.main joplin --project none ai-chat-exporter joplin --project none
``` ```
Reads the local export cache and pushes each exported Markdown file to Joplin as a note. Notebooks are created automatically. Re-running is safe — notes are updated (not duplicated). Reads the local export cache and pushes each exported Markdown file to Joplin as a note. Notebooks are created automatically. Re-running is safe — notes are updated (not duplicated).
@@ -315,20 +353,20 @@ Options: `--provider [chatgpt|claude|all]`, `--project NAME`, `--dry-run`
```bash ```bash
# Show statistics # Show statistics
python -m src.main cache --show ai-chat-exporter cache --show
# Clear all cached entries (forces full re-export next run) # Clear all cached entries (forces full re-export next run)
python -m src.main cache --clear ai-chat-exporter cache --clear
# Clear a single provider # Clear a single provider
python -m src.main cache --clear --provider claude ai-chat-exporter cache --clear --provider claude
``` ```
--- ---
## How the Cache Works ## How the Cache Works
The cache manifest lives at `~/.ai-chat-exporter/manifest.json` and records every exported conversation: its title, project, `updated_at` timestamp, output file path, and (after Joplin sync) the Joplin note ID. The cache manifest lives at `cache/manifest.json` (inside the install directory) and records every exported conversation: its title, project, `updated_at` timestamp, output file path, and (after Joplin sync) the Joplin note ID.
On every `export` run: On every `export` run:
1. Fetch the full conversation list from the provider 1. Fetch the full conversation list from the provider
@@ -343,7 +381,7 @@ On every `joplin` run:
**This design makes every run inherently resumable.** If the tool is interrupted for any reason — rate limit, network drop, Ctrl+C, crash — simply re-run the same command. It will skip already-processed conversations and continue from where it stopped. **This design makes every run inherently resumable.** If the tool is interrupted for any reason — rate limit, network drop, Ctrl+C, crash — simply re-run the same command. It will skip already-processed conversations and continue from where it stopped.
To force a full re-export: `python -m src.main cache --clear` then re-run export. To force a full re-export: `ai-chat-exporter cache --clear` then re-run export.
--- ---
@@ -351,7 +389,7 @@ To force a full re-export: `python -m src.main cache --clear` then re-run export
### `401 Unauthorized` ### `401 Unauthorized`
Your session token has expired. Your session token has expired.
- Run `python -m src.main auth` to get a new token interactively - Run `ai-chat-exporter auth` to get a new token interactively
- Or manually copy a fresh cookie value into your `.env` file - Or manually copy a fresh cookie value into your `.env` file
Note: Claude's `sessionKey` is an opaque string — the only way to know it's expired is the 401 error. ChatGPT JWTs have an `exp` claim that the `doctor` command can decode and display. Note: Claude's `sessionKey` is an opaque string — the only way to know it's expired is the 401 error. ChatGPT JWTs have an `exp` claim that the `doctor` command can decode and display.
@@ -388,13 +426,13 @@ Make sure you've added the project IDs to `CHATGPT_PROJECT_IDS` in your `.env`.
The provider's internal API may have changed. Run with `--debug`, sanitize the output (remove any personal content), and check the project's GitHub Issues for known fixes.
### Non-text content warnings
-Images, code interpreter outputs, DALL-E generations, and Claude artifacts are not exported in v0.2.0. A WARNING is logged for each skipped item. See `FUTURE.md` for the roadmap.
+Since v0.4.0, rich content is preserved as typed blocks in the export. ChatGPT voice transcripts render as text and audio assets as `📎 File attached` placeholders with size and duration metadata. Images render as `🖼️ Image attached` placeholders showing the asset reference. Custom Instructions appear under a `> Hidden context` marker. Anything the extractor doesn't recognise renders as a visible `> ⚠️ Unsupported content` block naming the type and observed keys, *and* increments a counter in the post-export summary so you can tell whether real content is being silently skipped. Binary downloads (the actual image/audio bytes) are still deferred — see `FUTURE.md` v0.5.0.
### Empty export / all conversations skipped
-No new or updated conversations since your last run. To verify: `python -m src.main cache --show`. To force a full re-export: `python -m src.main cache --clear`.
+No new or updated conversations since your last run. To verify: `ai-chat-exporter cache --show`. To force a full re-export: `ai-chat-exporter cache --clear`.
### Filing a bug report
-1. Run with `--debug`: `python -m src.main export --debug 2>&1 | tee debug.log`
+1. Run with `--debug`: `ai-chat-exporter export --debug 2>&1 | tee debug.log`
2. Remove any personal conversation content from `debug.log`
3. Open a GitHub Issue with the sanitized log and the exact command you ran
@@ -406,7 +444,7 @@ See `FUTURE.md` for planned features:
- **v0.2.x** — `export --force` flag; `joplin --force` flag; per-conversation cache reset
- **v0.3.0** — Official API fallback: parse export ZIP files from ChatGPT/Claude settings
-- **v0.4.0** — Rich content: images, artifacts, code interpreter output, extended thinking
+- **v0.4.x / v0.5.0** — Binary content downloads (images, audio bytes) and Joplin resource upload; reclassify o1/o3 reasoning subparts; optional `EXPORTER_INCLUDE_HIDDEN_CONTEXT` toggle
- **v0.5.0** — Watch/scheduled mode; Obsidian vault output
---

View File

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
[project]
name = "ai-chat-exporter"
-version = "0.2.0"
+version = "0.4.1"
description = "Export ChatGPT and Claude conversation history to Markdown for personal archival in Joplin"
requires-python = ">=3.11"
dependencies = [

src/blocks.py Normal file
View File

@@ -0,0 +1,339 @@
"""Typed content blocks for normalized messages.
Providers produce ordered lists of blocks; exporters render them. Living outside
``src/providers/`` deliberately — blocks are a separate concern from extraction
or rendering, shared by both layers.
Block dicts always have ``type`` set to one of the BLOCK_TYPE_* constants.
Construct via the ``make_*`` helpers; never build dicts by hand. The ``unknown``
block constructor REQUIRES a corresponding WARNING log + ``LossReport`` tally
at the call site — see plan §Data-loss visibility.
"""
import json
from typing import Any
BLOCK_TYPE_TEXT = "text"
BLOCK_TYPE_CODE = "code"
BLOCK_TYPE_THINKING = "thinking"
BLOCK_TYPE_TOOL_USE = "tool_use"
BLOCK_TYPE_TOOL_RESULT = "tool_result"
BLOCK_TYPE_CITATION = "citation"
BLOCK_TYPE_IMAGE_PLACEHOLDER = "image_placeholder"
BLOCK_TYPE_FILE_PLACEHOLDER = "file_placeholder"
BLOCK_TYPE_UNKNOWN = "unknown"
BLOCK_TYPE_HIDDEN_CONTEXT_MARKER = "hidden_context_marker"
UNKNOWN_REASON_UNKNOWN_TYPE = "unknown_type"
UNKNOWN_REASON_EXTRACTION_FAILED = "extraction_failed"
UNKNOWN_REASON_ALL_BLOCKS_FAILED = "all_blocks_failed"
UNKNOWN_REASON_UNKNOWN_FIELD_IN_KNOWN_TYPE = "unknown_field_in_known_type"
_OBSERVED_KEYS_LIMIT = 10
# ---------------------------------------------------------------------------
# Constructors
# ---------------------------------------------------------------------------
def make_text_block(text: str) -> dict | None:
"""Return a text block, or None if the text is empty/whitespace-only.
Returning None lets callers do ``if block: blocks.append(block)`` and prune
empty blocks at construction time. See plan §Finalizing the message dict.
"""
if not isinstance(text, str) or not text.strip():
return None
return {"type": BLOCK_TYPE_TEXT, "text": text}
def make_code_block(code: str, language: str = "") -> dict | None:
"""Return a code block, or None if code is empty."""
if not isinstance(code, str) or not code.strip():
return None
return {"type": BLOCK_TYPE_CODE, "language": language or "", "code": code}
def make_thinking_block(text: str) -> dict | None:
"""Return a thinking block, or None if empty."""
if not isinstance(text, str) or not text.strip():
return None
return {"type": BLOCK_TYPE_THINKING, "text": text}
def make_tool_use_block(name: str, input_data: Any, tool_id: str | None = None) -> dict:
"""Return a tool_use block.
Always returns a block (no None) — tool calls are meaningful even with
empty inputs.
"""
return {
"type": BLOCK_TYPE_TOOL_USE,
"name": name or "",
"input": input_data if input_data is not None else {},
"tool_id": tool_id,
}
def make_tool_result_block(
output: str,
tool_name: str | None = None,
is_error: bool = False,
summary: str | None = None,
) -> dict:
"""Return a tool_result block.
``summary`` is an optional short human label rendered between header and
fence (e.g. ChatGPT's ``metadata.reasoning_title`` for execution_output).
"""
return {
"type": BLOCK_TYPE_TOOL_RESULT,
"tool_name": tool_name,
"output": output if isinstance(output, str) else str(output),
"is_error": bool(is_error),
"summary": summary,
}
def make_citation_block(
url: str,
title: str | None = None,
snippet: str | None = None,
) -> dict | None:
if not url:
return None
return {
"type": BLOCK_TYPE_CITATION,
"url": url,
"title": title,
"snippet": snippet,
}
def make_image_placeholder(
ref: str,
source: str = "unknown",
mime: str | None = None,
) -> dict:
"""source ∈ {'user_upload', 'model_generated', 'unknown'}."""
return {
"type": BLOCK_TYPE_IMAGE_PLACEHOLDER,
"ref": ref or "",
"source": source,
"mime": mime,
}
def make_file_placeholder(
ref: str,
filename: str | None = None,
mime: str | None = None,
size_bytes: int | None = None,
duration_seconds: float | None = None,
) -> dict:
return {
"type": BLOCK_TYPE_FILE_PLACEHOLDER,
"ref": ref or "",
"filename": filename,
"mime": mime,
"size_bytes": size_bytes,
"duration_seconds": duration_seconds,
}
def make_unknown_block(
raw_type: str,
observed_keys: list[str] | None = None,
reason: str = UNKNOWN_REASON_UNKNOWN_TYPE,
summary: str | None = None,
) -> dict:
"""Construct an unknown block.
Every call site MUST also emit a WARNING log and increment a LossReport
tally — see plan §Data-loss visibility. The block surfaces the loss at
read time; the WARNING surfaces it at run time. Both signals matter.
"""
keys = list(observed_keys or [])[:_OBSERVED_KEYS_LIMIT]
return {
"type": BLOCK_TYPE_UNKNOWN,
"raw_type": raw_type,
"observed_keys": keys,
"reason": reason,
"summary": summary,
}
def make_hidden_context_marker(content_type: str) -> dict:
"""A short prepend block that flags the surrounding message as hidden context.
Driven by the ``metadata.is_visually_hidden_from_conversation`` flag, not by
content_type matching. The marker tells the reader "this message is
hidden in the source UI; we're showing it here for archival fidelity."
"""
return {
"type": BLOCK_TYPE_HIDDEN_CONTEXT_MARKER,
"content_type": content_type or "",
}
# ---------------------------------------------------------------------------
# Rendering
# ---------------------------------------------------------------------------
def render_blocks_to_markdown(blocks: list[dict]) -> str:
"""Render an ordered list of blocks to a single Markdown string.
Blocks are joined with one blank line between them. Pure function; no I/O.
"""
if not blocks:
return ""
rendered: list[str] = []
for block in blocks:
chunk = _render_one(block)
if chunk:
rendered.append(chunk)
return "\n\n".join(rendered)
def _render_one(block: dict) -> str:
btype = block.get("type", "")
if btype == BLOCK_TYPE_TEXT:
return block.get("text", "")
if btype == BLOCK_TYPE_CODE:
lang = block.get("language") or ""
code = block.get("code", "")
fence = _safe_fence(code)
return f"{fence}{lang}\n{code}\n{fence}"
if btype == BLOCK_TYPE_THINKING:
text = block.get("text", "")
quoted = _blockquote_prefix(text)
return f"**💭 Reasoning**\n\n{quoted}"
if btype == BLOCK_TYPE_TOOL_USE:
name = block.get("name", "")
input_data = block.get("input", {})
body_json = json.dumps(input_data, indent=2, sort_keys=False, default=str, ensure_ascii=False)
fence = _safe_fence(body_json)
body = f"{fence}json\n{body_json}\n{fence}"
quoted = _blockquote_prefix(f"🔧 **Tool: {name}**\n{body}")
return quoted
if btype == BLOCK_TYPE_TOOL_RESULT:
output = block.get("output", "")
is_error = bool(block.get("is_error"))
tool_name = block.get("tool_name") or ""
summary = block.get("summary") or ""
icon = "❌" if is_error else "📤"
label = "Result (error)" if is_error else "Result"
if tool_name:
header = f"{icon} **{label}: {tool_name}**"
else:
header = f"{icon} **{label}**"
fence = _safe_fence(output)
body = f"{fence}\n{output}\n{fence}"
if summary:
inner = f"{header}\n*{summary}*\n{body}"
else:
inner = f"{header}\n{body}"
return _blockquote_prefix(inner)
if btype == BLOCK_TYPE_CITATION:
url = block.get("url", "")
title = block.get("title") or url
return f"[{title}]({url})"
if btype == BLOCK_TYPE_IMAGE_PLACEHOLDER:
ref = block.get("ref", "")
source = block.get("source", "unknown")
mime = block.get("mime")
meta_parts = [source] if source else []
if mime:
meta_parts.append(mime)
meta_parts.append("content not preserved in this export")
meta = ", ".join(meta_parts)
return f"> 🖼️ **Image attached** — `{ref}` ({meta})"
if btype == BLOCK_TYPE_FILE_PLACEHOLDER:
ref = block.get("ref", "")
filename = block.get("filename")
label = filename or ref
mime = block.get("mime")
size_bytes = block.get("size_bytes")
duration = block.get("duration_seconds")
meta_parts: list[str] = []
if mime:
meta_parts.append(mime)
if isinstance(size_bytes, int) and size_bytes > 0:
kb = size_bytes / 1024
meta_parts.append(f"{kb:.1f} KB" if kb < 1024 else f"{kb / 1024:.2f} MB")
if isinstance(duration, (int, float)) and duration > 0:
meta_parts.append(f"{duration:.2f}s")
meta_parts.append("content not preserved in this export")
meta = ", ".join(meta_parts)
return f"> 📎 **File attached** — `{label}` ({meta})"
if btype == BLOCK_TYPE_UNKNOWN:
raw_type = block.get("raw_type", "?")
reason = block.get("reason", UNKNOWN_REASON_UNKNOWN_TYPE)
keys = block.get("observed_keys") or []
summary = block.get("summary")
first_line = f"⚠️ **Unsupported content** — type `{raw_type}` ({reason})"
lines = [first_line]
if summary:
lines.append(summary)
if keys:
keys_str = ", ".join(f"`{k}`" for k in keys)
lines.append(f"Keys observed: {keys_str}")
return _blockquote_prefix("\n".join(lines))
if btype == BLOCK_TYPE_HIDDEN_CONTEXT_MARKER:
ctype = block.get("content_type", "")
return f"> **Hidden context** — `{ctype}`"
# Defensive: a block of unrecognised local type (shouldn't happen if
# constructors are used). Render as visible warning rather than dropping.
return f"> ⚠️ **Internal: unrecognised block type** — `{btype}`"
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _safe_fence(text: str) -> str:
"""Return a backtick fence longer than the longest run of backticks in ``text``.
CommonMark requires the closing fence to be at least as long as the opening
fence. Picking N+1 (where N = longest run in content) ensures the content's
own backticks are inert. Minimum is 3.
Verified live against Joplin during planning — see plan
§Backtick-corruption defense.
"""
if not isinstance(text, str):
return "```"
longest_run = 0
current_run = 0
for ch in text:
if ch == "`":
current_run += 1
if current_run > longest_run:
longest_run = current_run
else:
current_run = 0
fence_len = max(3, longest_run + 1)
return "`" * fence_len
def _blockquote_prefix(text: str) -> str:
"""Prefix every line of ``text`` with ``> `` so the whole block renders as a quote.
Empty source lines become ``>`` (no trailing space) so blockquote continuity
is preserved without trailing-whitespace noise.
"""
if not isinstance(text, str):
return ""
out_lines: list[str] = []
for line in text.split("\n"):
if line == "":
out_lines.append(">")
else:
out_lines.append(f"> {line}")
return "\n".join(out_lines)
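The backtick-corruption defense in `_safe_fence` can be checked in isolation. A standalone sketch of the same rule, using a regex instead of the character loop above:

```python
import re

def safe_fence(text: str) -> str:
    # The fence must be one backtick longer than the longest backtick
    # run in the content, with a minimum length of three (CommonMark).
    runs = re.findall(r"`+", text)
    longest = max((len(r) for r in runs), default=0)
    return "`" * max(3, longest + 1)

print(len(safe_fence("plain text")))                # 3
print(len(safe_fence("inline " + "`" * 4 + " x")))  # 5
```

Either implementation satisfies CommonMark's requirement that a fenced block's content may not contain a backtick run as long as the fence itself.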

View File

@@ -87,6 +87,7 @@ class Cache:
self._data[provider][conv_id] = {
"title": metadata.get("title", ""),
"project": metadata.get("project"),
"created_at": metadata.get("created_at", ""),
"updated_at": metadata.get("updated_at", ""),
"exported_at": datetime.now(tz=timezone.utc).isoformat(),
"file_path": metadata.get("file_path", ""),
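The bug this hunk fixes is the classic hardcoded-key-list pitfall: any metadata field not explicitly whitelisted is silently dropped at write time. A minimal illustration (hypothetical helper, not the project's `Cache` class):

```python
# Pre-fix behaviour: created_at is absent from the stored-key list,
# so it never survives the round trip through the cache.
STORED_KEYS = ("title", "project", "updated_at")

def store_entry(metadata: dict) -> dict:
    # Only whitelisted keys survive; everything else is silently dropped.
    return {k: metadata.get(k, "") for k in STORED_KEYS}

entry = store_entry({
    "title": "Chat",
    "created_at": "2026-05-05",
    "updated_at": "2026-05-05",
})
print("created_at" in entry)  # False — downstream date-prefix logic sees ""
```

Adding `"created_at"` to the stored-key dict, as the hunk above does, is the whole fix.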

View File

@@ -28,6 +28,7 @@ class ConfigError(Exception):
@dataclass
class Config:
chatgpt_session_token: str | None
chatgpt_session_token_1: str | None
claude_session_key: str | None
export_dir: Path
output_structure: str
@@ -55,11 +56,12 @@ def load_config() -> Config:
load_dotenv(override=False)
chatgpt_token = os.getenv("CHATGPT_SESSION_TOKEN", "").strip() or None
chatgpt_token_1 = os.getenv("CHATGPT_SESSION_TOKEN_1", "").strip() or None
claude_key = os.getenv("CLAUDE_SESSION_KEY", "").strip() or None
export_dir = Path(os.getenv("EXPORT_DIR", "./exports")).expanduser()
output_structure = os.getenv("OUTPUT_STRUCTURE", "provider/project/year").strip()
-cache_dir = Path(os.getenv("CACHE_DIR", "~/.ai-chat-exporter")).expanduser()
+cache_dir = Path(os.getenv("CACHE_DIR", "./cache")).expanduser()
-log_file = os.getenv("LOG_FILE", "~/.ai-chat-exporter/logs/exporter.log").strip()
+log_file = os.getenv("LOG_FILE", "./cache/logs/exporter.log").strip()
# Joplin
joplin_token = os.getenv("JOPLIN_API_TOKEN", "").strip() or None
@@ -101,7 +103,7 @@ def load_config() -> Config:
if not chatgpt_token and not claude_key:
logger.warning(
"Neither CHATGPT_SESSION_TOKEN nor CLAUDE_SESSION_KEY is set. "
-"Run 'python -m src.main auth' to configure credentials."
+"Run 'ai-chat-exporter auth' to configure credentials."
)
# Create and validate output directory
@@ -127,6 +129,7 @@ def load_config() -> Config:
config = Config(
chatgpt_session_token=chatgpt_token,
chatgpt_session_token_1=chatgpt_token_1,
claude_session_key=claude_key,
export_dir=export_dir,
output_structure=output_structure,
@@ -173,7 +176,7 @@ def _validate_chatgpt_token(token: str) -> datetime | None:
if delta.total_seconds() < 0:
logger.warning(
"CHATGPT_SESSION_TOKEN expired at %s. "
-"Run 'python -m src.main auth' to refresh it.",
+"Run 'ai-chat-exporter auth' to refresh it.",
expiry.strftime("%Y-%m-%d %H:%M UTC"),
)
elif delta.total_seconds() < 86400:

View File

@@ -6,6 +6,7 @@ from datetime import datetime, timezone
from pathlib import Path
from typing import Any
from src.blocks import render_blocks_to_markdown
from src.utils import build_export_path, generate_filename
logger = logging.getLogger(__name__)
@@ -15,6 +16,7 @@ _ROLE_LABELS = {
"user": ("🧑 Human", "user"),
"assistant": ("🤖 Assistant", "assistant"),
"system": ("⚙️ System", "system"),
"tool": ("🔧 Tool", "tool"),
}
@@ -125,10 +127,17 @@ class MarkdownExporter:
# Messages
for msg in messages:
role = msg.get("role", "user")
-content = msg.get("content", "")
+blocks = msg.get("blocks") or []
timestamp = msg.get("timestamp")
-if not content or not content.strip():
# Prefer rendering from blocks (v0.4.0+). Backward-compat fallback:
# if blocks is missing/empty AND content exists, render content as-is.
if blocks:
body = render_blocks_to_markdown(blocks)
else:
body = msg.get("content", "") or ""
if not body or not body.strip():
logger.warning(
"[markdown] Skipping empty/whitespace message in conversation %s",
conv_id[:8],
@@ -143,7 +152,7 @@ class MarkdownExporter:
else:
lines.append("")
-lines.append(content)
+lines.append(body)
lines.append("")
lines.append("---")
lines.append("")
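The blocks-first fallback in this hunk reduces to a small pure function. A self-contained sketch, with a trivial stand-in renderer in place of `render_blocks_to_markdown` (the real one handles all block types, not just text):

```python
def message_body(msg: dict) -> str:
    # Prefer typed blocks (v0.4.0+); fall back to legacy plain content.
    blocks = msg.get("blocks") or []
    if blocks:
        # Stand-in renderer: join each block's text field with blank lines.
        return "\n\n".join(b.get("text", "") for b in blocks)
    return msg.get("content", "") or ""

print(message_body({"blocks": [{"text": "a"}, {"text": "b"}]}))  # a, blank line, b
print(message_body({"blocks": [], "content": "legacy"}))         # legacy
```

The `or []` guard matters: a message with `"blocks": None` must take the legacy path rather than crash.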

View File

@@ -32,8 +32,8 @@ class JoplinClient:
def __init__(self, base_url: str, token: str) -> None:
self._base_url = base_url.rstrip("/")
self._token = token
-# In-memory cache of notebook title → ID to avoid repeated GET /folders
+# In-memory cache: (parent_id | None, title) → folder ID
-self._notebook_cache: dict[str, str] = {}
+self._notebook_cache: dict[tuple[str | None, str], str] = {}
self._notebooks_loaded = False
logger.debug("[joplin] Client initialised with base_url=%s", self._base_url)
@@ -89,13 +89,13 @@ class JoplinClient:
"""Return all Joplin notebooks (folders), handling pagination.
Returns:
-List of folder dicts with at least ``id`` and ``title`` keys.
+List of folder dicts with at least ``id``, ``title``, and ``parent_id`` keys.
"""
results: list[dict] = []
page = 1
while True:
logger.debug("[joplin] GET /folders page=%d", page)
-resp = self._get("/folders", params={"page": page, "fields": "id,title"})
+resp = self._get("/folders", params={"page": page, "fields": "id,title,parent_id"})
items = resp.get("items", [])
results.extend(items)
logger.debug("[joplin] /folders page=%d → %d items, has_more=%s", page, len(items), resp.get("has_more"))
@@ -104,11 +104,12 @@ class JoplinClient:
page += 1
return results
-def get_or_create_notebook(self, title: str) -> str:
+def get_or_create_notebook(self, title: str, parent_id: str | None = None) -> str:
-"""Return the Joplin folder ID for ``title``, creating it if needed.
+"""Return the Joplin folder ID for ``title`` under ``parent_id``, creating if needed.
Args:
-title: Notebook display name (e.g. "ChatGPT - My Project").
+title: Notebook display name.
parent_id: ID of the parent folder, or None for a root notebook.
Returns:
Joplin folder ID string.
@@ -116,19 +117,40 @@ class JoplinClient:
if not self._notebooks_loaded:
self._load_notebook_cache()
-if title in self._notebook_cache:
+key = (parent_id, title)
-folder_id = self._notebook_cache[title]
+if key in self._notebook_cache:
-logger.debug("[joplin] Notebook cache hit: %r → %s", title, folder_id)
+folder_id = self._notebook_cache[key]
logger.debug("[joplin] Notebook cache hit: %r (parent=%s) → %s", title, parent_id, folder_id)
return folder_id
# Not found — create it
-logger.info("[joplin] Creating notebook: %r", title)
+logger.info("[joplin] Creating notebook: %r (parent=%s)", title, parent_id)
-resp = self._post("/folders", {"title": title})
+data: dict = {"title": title}
if parent_id:
data["parent_id"] = parent_id
resp = self._post("/folders", data)
folder_id = resp["id"]
-self._notebook_cache[title] = folder_id
+self._notebook_cache[key] = folder_id
logger.debug("[joplin] Notebook created: %r → %s", title, folder_id)
return folder_id
def get_or_create_notebook_path(self, path: list[str]) -> str:
"""Ensure a nested notebook path exists and return the leaf folder ID.
Creates intermediate notebooks as needed.
Args:
path: Ordered list of notebook names, e.g. ["AI-ChatGPT", "No Project"].
Returns:
Joplin folder ID of the deepest (leaf) notebook.
"""
parent_id: str | None = None
for name in path:
parent_id = self.get_or_create_notebook(name, parent_id)
assert parent_id is not None
return parent_id
# ------------------------------------------------------------------
# Notes
# ------------------------------------------------------------------
@@ -233,11 +255,14 @@ class JoplinClient:
def _load_notebook_cache(self) -> None:
logger.debug("[joplin] Loading notebook list from Joplin…")
notebooks = self.list_notebooks()
-self._notebook_cache = {nb["title"]: nb["id"] for nb in notebooks}
+self._notebook_cache = {
(nb.get("parent_id") or None, nb["title"]): nb["id"]
for nb in notebooks
}
self._notebooks_loaded = True
logger.debug("[joplin] Notebook cache loaded: %d notebooks", len(self._notebook_cache))
-for title, folder_id in self._notebook_cache.items():
+for (parent_id, title), folder_id in self._notebook_cache.items():
-logger.debug("[joplin] %r → %s", title, folder_id)
+logger.debug("[joplin] (%s) %r → %s", parent_id or "root", title, folder_id)
# ------------------------------------------------------------------
@@ -285,19 +310,21 @@ def _http_error_message(method: str, path: str, e: requests.exceptions.HTTPError
_PROVIDER_DISPLAY = {
-"chatgpt": "ChatGPT",
+"chatgpt": "AI-ChatGPT",
-"claude": "Claude",
+"claude": "AI-Claude",
}
-def notebook_title(provider: str, project: str | None) -> str:
+def notebook_path(provider: str, project: str | None) -> tuple[str, str]:
-"""Derive a flat Joplin notebook title from provider and project name.
+"""Return (parent_notebook, child_notebook) for the given provider and project.
The parent is the top-level provider notebook; the child is the project name.
Examples:
-notebook_title("chatgpt", "no-project") → "ChatGPT - No Project"
+notebook_path("chatgpt", None) → ("AI-ChatGPT", "No Project")
-notebook_title("claude", "budget-tracker") → "Claude - Budget Tracker"
+notebook_path("chatgpt", "no-project") → ("AI-ChatGPT", "No Project")
-notebook_title("chatgpt", None) → "ChatGPT - No Project"
+notebook_path("claude", "budget-tracker") → ("AI-Claude", "Budget Tracker")
"""
-prov_display = _PROVIDER_DISPLAY.get(provider, provider.capitalize())
+parent = _PROVIDER_DISPLAY.get(provider, f"AI-{provider.capitalize()}")
-proj = (project or "no-project").replace("-", " ").title()
+child = (project or "no-project").replace("-", " ").title()
-return f"{prov_display} - {proj}"
+return parent, child

src/loss_report.py Normal file
View File

@@ -0,0 +1,85 @@
"""Per-export-run tally for content that was dropped or partially extracted.
Surfaces the loss visibility that the rest of the system promises in its
output (visible ``unknown`` blocks). The summary emitted at the end of
each export is the load-bearing operator-facing signal: if a real content
type starts being silently dropped, this is where it shows up.
Pass a single instance through ``BaseProvider.normalize_conversation`` and
read it back in ``src/main.py`` after the export loop. No global state.
"""
from collections import Counter
from dataclasses import dataclass, field
_TOP_N_BREAKDOWN = 5
@dataclass
class LossReport:
"""Counters for things that didn't render cleanly in an export run."""
# Type-keyed counters. Values are int counts.
unknown_blocks: Counter = field(default_factory=Counter)
extraction_failures: Counter = field(default_factory=Counter)
filtered_roles: Counter = field(default_factory=Counter)
# Aggregate counters
messages_rendered: int = 0
conversations: int = 0
# Recording -------------------------------------------------------------
def record_unknown(self, raw_type: str) -> None:
self.unknown_blocks[raw_type or "?"] += 1
def record_extraction_failure(self, raw_type: str) -> None:
self.extraction_failures[raw_type or "?"] += 1
def record_filtered_role(self, role: str) -> None:
self.filtered_roles[role or "?"] += 1
def record_message(self) -> None:
self.messages_rendered += 1
def record_conversation(self) -> None:
self.conversations += 1
# Summary ---------------------------------------------------------------
def format_summary(self) -> str:
"""Return a multi-line summary table suitable for INFO logging.
Format pinned by plan §Post-export summary — "(none)" sentinel when a
counter is empty, top-5 breakdown with "+ N more types" overflow.
"""
lines: list[str] = ["[export] Run summary:"]
lines.append(f" conversations: {self.conversations}")
lines.append(f" messages rendered: {self.messages_rendered}")
lines.extend(_format_section("unknown blocks: ", self.unknown_blocks))
lines.extend(_format_section("extraction failures: ", self.extraction_failures))
lines.append(
" filtered roles: "
"(filter lifted in v0.4.0 — counter retained for future use, expected 0)"
)
if self.filtered_roles:
for role, count in self.filtered_roles.most_common(_TOP_N_BREAKDOWN):
lines.append(f" {role}={count}")
return "\n".join(lines)
def _format_section(label: str, counter: Counter) -> list[str]:
"""Render one counter section: header line + indented breakdown lines."""
total = sum(counter.values())
header = f" {label} {total}"
if total == 0:
return [header, " (none)"]
lines = [header]
most_common = counter.most_common()
for raw_type, count in most_common[:_TOP_N_BREAKDOWN]:
lines.append(f" {raw_type}={count}")
if len(most_common) > _TOP_N_BREAKDOWN:
remainder = len(most_common) - _TOP_N_BREAKDOWN
lines.append(f" + {remainder} more types")
return lines
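The tally-then-summarise pattern `LossReport` implements reduces to a `Counter` per failure kind. A self-contained sketch of the flow (record during the export loop, summarise after; the content types below are examples from this repo's commits):

```python
from collections import Counter

# Record phase: one increment per unknown block encountered.
unknown_blocks: Counter = Counter()
for raw_type in ["tether_browsing_display", "execution_output", "execution_output"]:
    unknown_blocks[raw_type or "?"] += 1

# Summary phase: total plus a most-common breakdown, as format_summary does.
total = sum(unknown_blocks.values())
breakdown = ", ".join(f"{t}={c}" for t, c in unknown_blocks.most_common(5))
print(f"unknown blocks: {total} ({breakdown})")
```

Passing one instance through the pipeline instead of using module-level counters keeps runs independent, which is why the file's docstring insists on "no global state".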

View File

@@ -16,6 +16,7 @@ from rich.table import Table
from src.cache import Cache, CacheError
from src.config import ConfigError
from src.logging_config import setup_logging
from src.loss_report import LossReport
from src.providers.base import ProviderError
console = Console()
@@ -70,7 +71,7 @@ def cli(ctx: click.Context, verbose: bool, quiet: bool, debug: bool, no_log_file
# Determine log file path from env (setup_logging handles "none")
import os
-log_file = os.getenv("LOG_FILE", "~/.ai-chat-exporter/logs/exporter.log")
+log_file = os.getenv("LOG_FILE", "./cache/logs/exporter.log")
setup_logging(level=level, log_file=log_file, no_log_file=no_log_file)
@@ -79,7 +80,7 @@ def cli(ctx: click.Context, verbose: bool, quiet: bool, debug: bool, no_log_file
# Initialise cache (needed for ToS gate on every command)
import os
-cache_dir = Path(os.getenv("CACHE_DIR", "~/.ai-chat-exporter")).expanduser()
+cache_dir = Path(os.getenv("CACHE_DIR", "./cache")).expanduser()
try:
cache = Cache(cache_dir)
except CacheError as e:
@@ -140,7 +141,7 @@ def auth(ctx: click.Context) -> None:
if configure_claude:
_auth_claude(os_name)
-console.print("\n[green]Done! Run 'python -m src.main doctor' to verify your setup.[/green]")
+console.print("\n[green]Done! Run 'ai-chat-exporter doctor' to verify your setup.[/green]")
def _auth_chatgpt(os_name: str) -> None:
@@ -153,15 +154,19 @@ def _auth_chatgpt(os_name: str) -> None:
else:
console.print("2. Press [bold]F12[/bold] to open DevTools → Application tab.")
console.print("3. Expand [bold]Cookies[/bold] → [bold]https://chatgpt.com[/bold]")
-console.print("4. Find [bold]__Secure-next-auth.session-token[/bold] → copy the Value.")
+console.print("4. ChatGPT splits the session token across two cookies:")
-console.print(" (Token starts with 'eyJ...' — it is a long JWT string)")
+console.print(" [bold]__Secure-next-auth.session-token.0[/bold] (starts with 'eyJ')")
-console.print("5. Paste it below (input is hidden).\n")
+console.print(" [bold]__Secure-next-auth.session-token.1[/bold] (the remainder)")
console.print(" Copy each Value in turn and paste below.")
console.print(" (If you only see one cookie without a .0/.1 suffix, paste it for .0 and leave .1 blank.)\n")
-token = click.prompt("ChatGPT session token", hide_input=True, default="", show_default=False).strip()
+token = click.prompt("ChatGPT session token (.0)", hide_input=True, default="", show_default=False).strip()
if not token:
console.print("[yellow]Skipped ChatGPT token.[/yellow]")
return
token_1 = click.prompt("ChatGPT session token (.1, leave blank if absent)", hide_input=True, default="", show_default=False).strip() or None
# Validate
if not token.startswith("eyJ"):
console.print("[yellow]Warning: token doesn't look like a JWT (expected 'eyJ...').[/yellow]")
@@ -178,7 +183,28 @@ def _auth_chatgpt(os_name: str) -> None:
except Exception:
console.print("[yellow]Could not decode token expiry.[/yellow]")
# Live validation — exchange session token for an access token
_valid = False
_error: str | None = None
with console.status("[dim]Validating token with ChatGPT API…[/dim]"):
try:
from src.providers.chatgpt import ChatGPTProvider
_prov = ChatGPTProvider(session_token=token, session_token_1=token_1)
_prov._fetch_access_token()
_valid = True
except ProviderError as e:
_error = str(e.original)
except Exception as e:
_error = str(e)
if _valid:
console.print("[green]✓ Token verified — connected to ChatGPT API.[/green]")
else:
console.print(f"[red]✗ Token validation failed: {_error}[/red]")
_write_token_to_env("CHATGPT_SESSION_TOKEN", token)
if token_1:
_write_token_to_env("CHATGPT_SESSION_TOKEN_1", token_1)
# --- ChatGPT Projects ---
console.print("\n[bold]ChatGPT Projects (optional)[/bold]")
@@ -231,7 +257,25 @@ def _auth_claude(os_name: str) -> None:
console.print("[yellow]Skipped Claude token.[/yellow]")
return
-console.print("[green]Claude session key saved.[/green]")
# Live validation — fetch org ID (the first call any Claude operation makes)
_valid = False
_error: str | None = None
with console.status("[dim]Validating token with Claude API…[/dim]"):
try:
from src.providers.claude import ClaudeProvider
_prov = ClaudeProvider(session_key=key)
_prov._get_org_id()
_valid = True
except ProviderError as e:
_error = str(e.original)
except Exception as e:
_error = str(e)
if _valid:
console.print("[green]✓ Token verified — connected to Claude API.[/green]")
else:
console.print(f"[red]✗ Token validation failed: {_error}[/red]")
_write_token_to_env("CLAUDE_SESSION_KEY", key)
@@ -341,7 +385,7 @@ def _run_doctor_checks() -> list[dict]:
# Directories
export_dir = Path(os.getenv("EXPORT_DIR", "./exports")).expanduser()
-cache_dir = Path(os.getenv("CACHE_DIR", "~/.ai-chat-exporter")).expanduser()
cache_dir = Path(os.getenv("CACHE_DIR", "./cache")).expanduser()
for label, dirpath in [("Export dir writable", export_dir), ("Cache dir writable", cache_dir)]:
try:
@@ -365,7 +409,8 @@ def _run_doctor_checks() -> list[dict]:
if chatgpt_token:
try:
from src.providers.chatgpt import ChatGPTProvider
-p = ChatGPTProvider(chatgpt_token)
chatgpt_token_1 = os.getenv("CHATGPT_SESSION_TOKEN_1", "").strip() or None
p = ChatGPTProvider(chatgpt_token, session_token_1=chatgpt_token_1)
results = p.list_conversations(offset=0, limit=1)
add("ChatGPT API reachable", True, f"Got {len(results)} result(s)")
except ProviderError as e:
@@ -496,7 +541,7 @@ def export(
providers_to_run = _resolve_providers(provider, cfg)
if not providers_to_run:
err_console.print(
-"[red]No providers configured. Run 'python -m src.main auth' to set up tokens.[/red]"
"[red]No providers configured. Run 'ai-chat-exporter auth' to set up tokens.[/red]"
)
sys.exit(1)
@@ -510,6 +555,9 @@ def export(
# Summary counters
summary: dict[str, dict[str, int]] = {}
# Single LossReport tracks data-loss visibility across all providers in this run.
loss_report = LossReport()
for prov_name, prov_instance in providers_to_run:
summary[prov_name] = {"exported": 0, "skipped": 0, "failed": 0}
@@ -557,7 +605,17 @@ def export(
conv_id = raw_conv.get("id") or raw_conv.get("uuid", "unknown")
try:
full_raw = prov_instance.get_conversation(conv_id)
-normalized = prov_instance.normalize_conversation(full_raw)
# Propagate metadata from the listing summary into the full
# detail so normalize_conversation can use it.
# - Keys starting with "_" are provider annotations
# (e.g. _project_name injected by ChatGPT project fetching).
# - "project" is included explicitly because Claude's detail
# endpoint omits it even though the listing returns it.
_PROPAGATE_KEYS = {"project"}
for key, val in raw_conv.items():
if (key.startswith("_") or key in _PROPAGATE_KEYS) and key not in full_raw:
full_raw[key] = val
normalized = prov_instance.normalize_conversation(full_raw, loss_report)
exported_path: Path | None = None
if md_exporter:
@@ -569,6 +627,7 @@ def export(
cache.mark_exported(prov_name, conv_id, {
"title": normalized.get("title", ""),
"project": normalized.get("project"),
"created_at": normalized.get("created_at", ""),
"updated_at": normalized.get("updated_at", ""),
"file_path": str(exported_path) if exported_path else "",
})
@@ -588,6 +647,10 @@ def export(
if not dry_run:
_print_export_summary(summary)
# Emit the data-loss summary at INFO level so it lands in the log file
# AND the operator's console (default level is INFO).
for line in loss_report.format_summary().split("\n"):
logger.info(line)
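The diff never shows `src/loss_report.py` itself; from the call sites visible here (`record_conversation`, `record_message`, `record_unknown`, `record_extraction_failure`, `format_summary`) its interface is roughly the following. This is a sketch reconstructed from usage, not the project's actual implementation:

```python
from collections import Counter

class LossReport:
    """Tallies dropped/unhandled content so the export loop can print one summary."""

    def __init__(self) -> None:
        self.conversations = 0
        self.messages = 0
        self.unknown_types: Counter[str] = Counter()
        self.extraction_failures: Counter[str] = Counter()

    def record_conversation(self) -> None:
        self.conversations += 1

    def record_message(self) -> None:
        self.messages += 1

    def record_unknown(self, raw_type: str) -> None:
        self.unknown_types[raw_type] += 1

    def record_extraction_failure(self, raw_type: str) -> None:
        self.extraction_failures[raw_type] += 1

    def format_summary(self) -> str:
        lines = [
            f"Processed {self.conversations} conversation(s), {self.messages} message(s).",
        ]
        for name, count in sorted(self.unknown_types.items()):
            lines.append(f"  unknown content_type {name!r}: {count}")
        for name, count in sorted(self.extraction_failures.items()):
            lines.append(f"  extraction failure in {name!r}: {count}")
        return "\n".join(lines)

report = LossReport()
report.record_conversation()
report.record_message()
report.record_unknown("tether_quote")
print(report.format_summary())
```

A single shared instance threaded through all providers is what lets the run end with one consolidated tally instead of per-conversation noise.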
def _resolve_providers(provider: str, cfg) -> list[tuple[str, object]]:
@@ -618,6 +681,7 @@ def _resolve_providers(provider: str, cfg) -> list[tuple[str, object]]:
"chatgpt",
ChatGPTProvider(
session_token=cfg.chatgpt_session_token,
session_token_1=cfg.chatgpt_session_token_1,
project_ids=cfg.chatgpt_project_ids,
),
))
@@ -757,18 +821,26 @@ def list_conversations(ctx: click.Context, provider: str, project_filter: str |
if project_filter is not None:
all_convs = _filter_by_project(all_convs, project_filter)
-table = Table()
-table.add_column("Title")
-table.add_column("Project")
-table.add_column("Updated")
-table.add_column("ID")
# no_wrap + overflow="ellipsis" prevents Rich from wrapping cells to
# multiple lines on narrow terminals (e.g. Windows Command Prompt),
# which can otherwise make the output look garbled. Widths are tuned
# to fit within an 80-column terminal.
# Total width budget for 80-column terminals:
# borders (5) + padding (4 cols * 2) = 13 chars of overhead
# remaining 67 chars split: 34 title + 15 project + 10 date + 8 id
table = Table(show_lines=False, expand=False, padding=(0, 1))
table.add_column("Title", no_wrap=True, overflow="ellipsis", max_width=34)
table.add_column("Project", no_wrap=True, overflow="ellipsis", max_width=15)
table.add_column("Updated", no_wrap=True, min_width=10)
table.add_column("ID", no_wrap=True, min_width=8)
for conv in all_convs:
-title = conv.get("title") or "Untitled"
# ChatGPT uses "title"; Claude uses "name".
title = conv.get("title") or conv.get("name") or "Untitled"
project = _raw_project_name(conv) or ""
updated = (conv.get("updated_at") or conv.get("update_time") or "")[:10]
conv_id = (conv.get("id") or conv.get("uuid") or "")[:8]
-table.add_row(title[:60], project[:30], updated, conv_id)
table.add_row(title, project, updated, conv_id)
console.print(table)
console.print(f"Total: {len(all_convs)} conversations")
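The 80-column budget in the new comment can be sanity-checked: a 4-column Rich table with box borders draws 5 vertical rules (one per inter-column gap plus both edges), and `padding=(0, 1)` adds one space on each side of every cell. The arithmetic, in plain Python with no Rich dependency:

```python
columns = 4
borders = columns + 1          # vertical rules: 3 inner gaps + 2 outer edges
padding = columns * 2          # padding=(0, 1): 1 space either side of each cell
overhead = borders + padding   # characters of table chrome
content = [34, 15, 10, 8]      # title, project, updated, id column widths
total = overhead + sum(content)
print(overhead, total)         # 13 80
```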
@@ -845,9 +917,9 @@ def joplin(ctx: click.Context, provider: str, project_filter: str | None, dry_ru
via its local REST API. Requires Joplin desktop to be running with the
Web Clipper service enabled.
-Notebooks are created automatically based on provider and project:
-exports/chatgpt/my-project/ "ChatGPT - My Project" notebook
-exports/claude/no-project/ "Claude - No Project" notebook
Notebooks are created automatically as nested folders:
chatgpt / my-project → AI-ChatGPT / My Project
claude / no-project → AI-Claude / No Project
Re-running is safe: notes are updated (not duplicated) on subsequent runs.
@@ -873,7 +945,7 @@ def joplin(ctx: click.Context, provider: str, project_filter: str | None, dry_ru
)
sys.exit(1)
-from src.joplin import JoplinClient, JoplinError, notebook_title
from src.joplin import JoplinClient, JoplinError, notebook_path
client = JoplinClient(cfg.joplin_api_url, cfg.joplin_api_token)
@@ -953,7 +1025,9 @@ def joplin(ctx: click.Context, provider: str, project_filter: str | None, dry_ru
for conv_id, entry in pending:
file_path = entry.get("file_path", "")
-title = entry.get("title") or "Untitled"
raw_title = entry.get("title") or "Untitled"
created_date = (entry.get("created_at") or "")[:10]
title = f"{created_date} {raw_title}" if created_date else raw_title
project = entry.get("project") or None
existing_note_id = entry.get("joplin_note_id")
action = "update" if existing_note_id else "create"
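The date-prefix logic can be exercised in isolation. This mirrors the two branches in the sync loop: a cache entry with `created_at` (post-fix) gets the YYYY-MM-DD prefix, while the empty string the pre-fix cache produced falls through unprefixed (the helper name is ours, for illustration):

```python
def joplin_title(entry: dict) -> str:
    # Mirrors the sync loop: prefix with YYYY-MM-DD when created_at is cached.
    raw_title = entry.get("title") or "Untitled"
    created_date = (entry.get("created_at") or "")[:10]
    return f"{created_date} {raw_title}" if created_date else raw_title

print(joplin_title({"title": "My Chat", "created_at": "2026-05-05T11:05:39Z"}))
# 2026-05-05 My Chat
print(joplin_title({"title": "My Chat"}))  # old cache entries: My Chat, no prefix
```

This is exactly the bug the top commit fixes: `mark_exported()` silently dropped `created_at`, so the second branch always ran.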
@@ -968,9 +1042,9 @@ def joplin(ctx: click.Context, provider: str, project_filter: str | None, dry_ru
body = Path(file_path).read_text(encoding="utf-8")
logger.debug("[joplin] Read %d chars from %s", len(body), file_path)
-# Get or create the notebook
-nb_title = notebook_title(prov_name, project)
-notebook_id = client.get_or_create_notebook(nb_title)
# Get or create the nested notebook
nb_path = notebook_path(prov_name, project)
notebook_id = client.get_or_create_notebook_path(list(nb_path))
if existing_note_id:
client.update_note(existing_note_id, title, body)
@@ -1007,7 +1081,7 @@ def joplin(ctx: click.Context, provider: str, project_filter: str | None, dry_ru
def _print_joplin_dry_run_table(prov_name: str, pending: list[tuple[str, dict]]) -> None:
-from src.joplin import notebook_title
from src.joplin import notebook_path
table = Table(title=f"[DRY RUN] {prov_name.upper()} — Would sync {len(pending)} conversation(s)")
table.add_column("Title")
@@ -1016,9 +1090,12 @@ def _print_joplin_dry_run_table(prov_name: str, pending: list[tuple[str, dict]])
table.add_column("Action")
for conv_id, entry in pending[:50]:
-title = entry.get("title") or "Untitled"
raw_title = entry.get("title") or "Untitled"
created_date = (entry.get("created_at") or "")[:10]
title = f"{created_date} {raw_title}" if created_date else raw_title
project = entry.get("project") or "no-project"
-nb = notebook_title(prov_name, entry.get("project"))
parent, child = notebook_path(prov_name, entry.get("project"))
nb = f"{parent} / {child}"
action = "update" if entry.get("joplin_note_id") else "create"
table.add_row(title[:50], project[:30], nb, action)


@@ -9,6 +9,7 @@ from typing import Any
import requests
from src.loss_report import LossReport
from src.utils import redact_secrets
# curl_cffi has its own exception hierarchy (rooted at CurlError → OSError),
@@ -89,8 +90,14 @@ class BaseProvider(ABC):
"""Return the full conversation detail for a single ID."""
@abstractmethod
-def normalize_conversation(self, raw: dict) -> dict:
def normalize_conversation(self, raw: dict, loss_report: LossReport | None = None) -> dict:
-"""Transform provider-specific schema to the common normalized schema."""
"""Transform provider-specific schema to the common normalized schema.
``loss_report`` accumulates counts of dropped/unhandled content so the
export loop can surface a single summary at the end. When None, providers
construct a throwaway local report (so calling normalize_conversation in
isolation, e.g. from tests, doesn't crash).
"""
# ------------------------------------------------------------------
# Concrete helpers
@@ -326,7 +333,7 @@ class BaseProvider(ABC):
msg = (
f"[{self.provider_name}] Authentication failed (401 Unauthorized). "
"Your session token has likely expired. "
-"Run 'python -m src.main auth' to refresh your token."
"Run 'ai-chat-exporter auth' to refresh your token."
)
logger.error(msg)
raise ProviderError(


@@ -25,6 +25,20 @@ from typing import Any
from curl_cffi import requests as curl_requests
from src.blocks import (
UNKNOWN_REASON_EXTRACTION_FAILED,
UNKNOWN_REASON_UNKNOWN_FIELD_IN_KNOWN_TYPE,
UNKNOWN_REASON_UNKNOWN_TYPE,
make_code_block,
make_file_placeholder,
make_hidden_context_marker,
make_image_placeholder,
make_text_block,
make_thinking_block,
make_tool_result_block,
make_unknown_block,
)
from src.loss_report import LossReport
from src.providers.base import BaseProvider, ProviderError, REQUEST_TIMEOUT
logger = logging.getLogger(__name__)
@@ -56,6 +70,7 @@ class ChatGPTProvider(BaseProvider):
def __init__(
self,
session_token: str | None = None,
session_token_1: str | None = None,
project_ids: list[str] | None = None,
) -> None:
# Pass a curl_cffi session to the base class instead of a requests.Session.
@@ -77,11 +92,15 @@ class ChatGPTProvider(BaseProvider):
"init",
RuntimeError(
"CHATGPT_SESSION_TOKEN is not set. "
-"Run 'python -m src.main auth' to configure it."
"Run 'ai-chat-exporter auth' to configure it."
),
)
self._session_token = token
# Second chunk of the session token (ChatGPT splits large cookies into
# __Secure-next-auth.session-token.0 and .1 to stay under the 4KB limit).
token_1 = session_token_1 or os.getenv("CHATGPT_SESSION_TOKEN_1", "").strip() or None
# Project gizmo IDs (g-p-xxx) whose conversations we'll fetch.
# ChatGPT project conversations do not appear in the default
# /conversations listing — they require explicit project IDs.
@@ -93,13 +112,24 @@ class ChatGPTProvider(BaseProvider):
# Cache of project_id → display name (avoids re-fetching gizmo details)
self._project_name_cache: dict[str, str] = {}
-# Set the session cookie in the cookie jar
# ChatGPT now splits large session cookies into .0 / .1 chunks.
# Always send both named chunks; the server reassembles them.
self._session.cookies.set(
-"__Secure-next-auth.session-token",
"__Secure-next-auth.session-token.0",
token,
domain="chatgpt.com",
path="/",
)
if token_1:
self._session.cookies.set(
"__Secure-next-auth.session-token.1",
token_1,
domain="chatgpt.com",
path="/",
)
logger.debug("[chatgpt] Set both session cookie chunks (.0 and .1)")
else:
logger.debug("[chatgpt] Set session cookie chunk .0 only (no .1 configured)")
# Set only Referer and sec-fetch-* headers for the auth exchange.
# Origin is intentionally omitted: Chrome does not send Origin on
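Browsers cap individual cookies near 4KB, which is why next-auth splits a long session token into `.0`/`.1` chunks that the server concatenates back together. A sketch of that split/reassemble round-trip (the chunk size is illustrative, not next-auth's exact threshold):

```python
CHUNK = 3900  # a bit under the ~4096-byte per-cookie browser limit

def split_cookie(value: str) -> dict[str, str]:
    """Split a long cookie value into numbered .0/.1/... chunks."""
    base = "__Secure-next-auth.session-token"
    chunks = [value[i:i + CHUNK] for i in range(0, len(value), CHUNK)] or [""]
    return {f"{base}.{n}": chunk for n, chunk in enumerate(chunks)}

def reassemble(cookies: dict[str, str]) -> str:
    """Concatenate numbered chunks in key order (fine for single-digit suffixes)."""
    base = "__Secure-next-auth.session-token"
    parts = sorted((k, v) for k, v in cookies.items() if k.startswith(base + "."))
    return "".join(v for _, v in parts)

token = "eyJ" + "x" * 5000        # long enough to need two chunks
jar = split_cookie(token)
print(len(jar))                   # 2
print(reassemble(jar) == token)   # True
```

This also explains the auth-flow change above: a short token fits entirely in `.0`, so `.1` is optional.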
@@ -157,7 +187,7 @@ class ChatGPTProvider(BaseProvider):
"fetch_access_token",
RuntimeError(
"No accessToken in /api/auth/session response. "
-"Your session token may be expired — run 'python -m src.main auth' to refresh."
"Your session token may be expired — run 'ai-chat-exporter auth' to refresh."
),
)
return access_token
@@ -169,7 +199,7 @@ class ChatGPTProvider(BaseProvider):
"The session token is used to obtain a short-lived access token via /api/auth/session. "
"To refresh: open chatgpt.com in Chrome → F12 → Application → Cookies "
"→ find '__Secure-next-auth.session-token' → copy the value. "
-"Then run 'python -m src.main auth' or update CHATGPT_SESSION_TOKEN in .env."
"Then run 'ai-chat-exporter auth' or update CHATGPT_SESSION_TOKEN in .env."
)
logger.error(msg)
raise ProviderError(
@@ -369,7 +399,7 @@ class ChatGPTProvider(BaseProvider):
logger.info(
"[chatgpt] No project IDs configured — skipping project conversations. "
"To include projects, set CHATGPT_PROJECT_IDS in .env "
-"(see 'python -m src.main auth' for instructions)."
"(see 'ai-chat-exporter auth' for instructions)."
)
return self._apply_since_filter(default_convs, since)
@@ -535,7 +565,7 @@ class ChatGPTProvider(BaseProvider):
# Normalization
# ------------------------------------------------------------------
-def normalize_conversation(self, raw: dict) -> dict:
def normalize_conversation(self, raw: dict, loss_report: LossReport | None = None) -> dict:
"""Transform ChatGPT raw schema to the common normalized schema.
ChatGPT stores messages in a nested ``mapping`` dict where each node
@@ -546,21 +576,32 @@ class ChatGPTProvider(BaseProvider):
fetch_all_conversations). The conversation detail endpoint does not
include project information.
"""
-conv_id = raw.get("id", "")
report = loss_report if loss_report is not None else LossReport()
# ChatGPT's /backend-api/conversation/<id> response uses ``conversation_id``
# at the top level (not ``id``); fixtures and listing summaries use ``id``.
# Read both so both code paths populate the normalized ``id`` correctly.
conv_id = raw.get("id") or raw.get("conversation_id") or ""
title = raw.get("title") or "Untitled"
created_at = _ts_to_iso(raw.get("create_time"))
updated_at = _ts_to_iso(raw.get("update_time"))
-# Look up project name from the map built during fetch_all_conversations.
-project = self._project_map.get(conv_id) if conv_id else None
# Prefer _project_name annotation injected from the listing summary
# (propagated by the export loop). Fall back to _project_map lookup.
project = raw.get("_project_name") or (
self._project_map.get(conv_id) if conv_id else None
)
logger.debug(
-"[chatgpt] normalize_conversation[%s]: project_map lookup → %r",
"[chatgpt] normalize_conversation[%s]: project=%r (source=%s)",
conv_id[:8] if conv_id else "?",
project,
"_project_name" if raw.get("_project_name") else "_project_map",
)
mapping: dict = raw.get("mapping", {})
-messages = _extract_messages(mapping, raw, conv_id)
messages = _extract_messages(mapping, raw, conv_id, report)
for _ in messages:
report.record_message()
report.record_conversation()
return {
"id": conv_id,
@@ -590,14 +631,18 @@ def _ts_to_iso(ts: float | int | str | None) -> str:
def _extract_messages(
-mapping: dict[str, Any], raw: dict, conv_id: str
mapping: dict[str, Any], raw: dict, conv_id: str, report: LossReport
) -> list[dict]:
-"""Walk the ChatGPT conversation mapping tree to produce an ordered message list."""
"""Walk the ChatGPT conversation mapping tree to produce an ordered message list.
All roles (user/assistant/system/tool) are processed; the prior filter that
dropped non-user/assistant messages is lifted in v0.4.0 — truly empty
messages skip via the empty-content guard, anything with content renders.
"""
if not mapping:
logger.warning("[chatgpt] Conversation %s has empty mapping", conv_id[:8])
return []
-# Find the root node (the one that has no parent, or whose parent is None/not in mapping)
root_id = _find_root(mapping)
if root_id is None:
logger.warning(
@@ -615,39 +660,12 @@ def _extract_messages(
node = mapping.get(node_id, {})
msg_data = node.get("message")
if msg_data:
-role = msg_data.get("author", {}).get("role", "")
-# Skip system/tool messages silently unless they have visible content
-if role in ("user", "assistant"):
-content_obj = msg_data.get("content", {})
-content_type = content_obj.get("content_type", "text")
-text = _extract_text(content_obj, conv_id, node_id)
-if content_type != "text":
-logger.warning(
-"[chatgpt] Skipping %s content in conversation %s message %s "
-"— rich content not yet supported (see FUTURE.md)",
-content_type,
-conv_id[:8],
-node_id[:8],
-)
-elif text:
-ts = msg_data.get("create_time")
-messages.append(
-{
-"role": role,
-"content": text,
-"content_type": "text",
-"timestamp": _ts_to_iso(ts) if ts else None,
-}
-)
-else:
-logger.debug(
-"[chatgpt] Skipping empty message in conversation %s", conv_id[:8]
-)
-# Walk children in order (ChatGPT typically has one child per node in a linear chat)
built = _build_message(msg_data, conv_id, node_id, report)
if built is not None:
messages.append(built)
# Walk children in order (linear in typical conversations)
for child_id in node.get("children", []):
walk(child_id)
@@ -669,27 +687,529 @@ def _find_root(mapping: dict[str, Any]) -> str | None:
return None
-def _extract_text(content_obj: dict, conv_id: str, node_id: str) -> str:
-"""Extract plain text from a ChatGPT content object."""
-parts = content_obj.get("parts", [])
-if not parts:
-return ""
-text_parts = []
-for part in parts:
-if isinstance(part, str):
-text_parts.append(part)
-elif isinstance(part, dict):
-# Could be an image or file reference — skip and warn
-part_type = part.get("content_type", "unknown")
-if part_type != "text":
-logger.warning(
-"[chatgpt] Skipping %s attachment in conversation %s "
-"— rich content not yet supported (see FUTURE.md)",
-part_type,
-conv_id[:8],
-)
-else:
-text_parts.append(part.get("text", ""))
-return "\n".join(t for t in text_parts if t)
def _build_message(
msg_data: dict, conv_id: str, node_id: str, report: LossReport
) -> dict | None:
"""Construct a normalized message dict (with ``blocks``) for one ChatGPT node.
Returns None for messages that should be skipped (truly empty). Otherwise
returns a dict with ``role``, ``content_type``, ``timestamp``, ``blocks``.
"""
author = msg_data.get("author") or {}
role = author.get("role", "") or ""
if role not in ("user", "assistant", "system", "tool"):
# Unrecognised role — log and surface, but pass through so role metadata
# is preserved for the reader.
logger.debug(
"[chatgpt] Unrecognised role %r in conversation %s message %s",
role,
conv_id[:8],
node_id[:8],
)
content_obj = msg_data.get("content") or {}
content_type = content_obj.get("content_type", "text")
ts = msg_data.get("create_time")
metadata = msg_data.get("metadata") or {}
is_hidden = bool(metadata.get("is_visually_hidden_from_conversation"))
author_name = author.get("name") or None
blocks = _extract_blocks_for_content(
content_type, content_obj, role, conv_id, node_id, report,
author_name=author_name, msg_metadata=metadata,
)
if not blocks:
logger.debug(
"[chatgpt] Skipping empty %s message in conversation %s",
content_type,
conv_id[:8],
)
return None
if is_hidden:
# Prepend a marker so the reader knows this message is hidden in the
# source UI. The marker is content-type-agnostic.
blocks = [make_hidden_context_marker(content_type)] + blocks
# Vestigial content_type: "code" for code-only messages, otherwise "text"
msg_content_type = "code" if (
len(blocks) == 1 and blocks[0].get("type") == "code"
) else "text"
return {
"role": role or "user",
"content_type": msg_content_type,
"timestamp": _ts_to_iso(ts) if ts else None,
"blocks": blocks,
}
# Content types whose ``parts`` are plain text strings.
_PLAIN_TEXT_PARTS_TYPES = {"text"}
# Content types that carry inline reasoning/thoughts.
_THINKING_TYPES = {"thoughts", "reasoning_recap"}
# Custom-Instructions / model-context types — direct fields, NOT parts.
_DIRECT_FIELD_CONTEXT_TYPES = {
"user_editable_context",
"model_editable_context",
}
# Known direct fields per context type. Anything not listed but non-null
# becomes an `unknown` block per the no-silent-drop-of-non-null-fields rule.
_USER_EDITABLE_CONTEXT_KNOWN_FIELDS = ("user_profile", "user_instructions")
_MODEL_EDITABLE_CONTEXT_KNOWN_FIELDS = (
"model_set_context",
"repository",
"repo_summary",
"structured_context",
)
def _extract_blocks_for_content(
content_type: str,
content_obj: dict,
role: str,
conv_id: str,
node_id: str,
report: LossReport,
author_name: str | None = None,
msg_metadata: dict | None = None,
) -> list[dict]:
"""Dispatch on content_type and return a list of blocks for one message."""
if content_type in _PLAIN_TEXT_PARTS_TYPES:
return _extract_text_content_type_blocks(content_obj, conv_id, node_id, report)
if content_type == "multimodal_text":
return _extract_multimodal_blocks(content_obj, role, conv_id, node_id, report)
if content_type == "execution_output":
return _extract_execution_output_blocks(
content_obj, author_name, msg_metadata or {}, conv_id, node_id
)
if content_type == "system_error":
return _extract_system_error_blocks(content_obj, author_name)
if content_type == "tether_browsing_display":
return _extract_tether_browsing_display_blocks(
content_obj, author_name, conv_id, node_id
)
if content_type == "code":
code_text = content_obj.get("text") or "\n".join(
p for p in content_obj.get("parts", []) if isinstance(p, str)
)
language = content_obj.get("language", "") or ""
block = make_code_block(code_text, language)
return [block] if block else []
if content_type in _THINKING_TYPES:
text = _join_string_parts(content_obj)
block = make_thinking_block(text)
return [block] if block else []
if content_type in _DIRECT_FIELD_CONTEXT_TYPES:
return _extract_editable_context_blocks(
content_type, content_obj, conv_id, node_id, report
)
if content_type == "image_asset_pointer":
# Top-level image (rare — usually nested inside multimodal_text).
ref = content_obj.get("asset_pointer", "")
source = "user_upload" if role == "user" else "model_generated"
return [make_image_placeholder(ref=ref, source=source)]
# Unknown content_type → visible unknown block + WARNING + tally
keys = list(content_obj.keys())
logger.warning(
"[chatgpt] Unknown content_type %r in conversation %s message %s "
"— see plan §Data-loss visibility (rendering as unknown block)",
content_type,
conv_id[:8],
node_id[:8],
)
report.record_unknown(content_type or "?")
return [
make_unknown_block(
raw_type=content_type or "?",
observed_keys=keys,
reason=UNKNOWN_REASON_UNKNOWN_TYPE,
)
]
def _extract_text_content_type_blocks(
content_obj: dict, conv_id: str, node_id: str, report: LossReport
) -> list[dict]:
"""Extract blocks for ``content_type == "text"``.
Plural-parts rule: emit ONE text block per message with all string parts
joined by ``\\n``. Don't emit one block per part.
Dict parts inside a text content_type message (the suspected o1/o3 reasoning
subpart shape ``{"summary": ..., "content": ...}``) are preserved as text
today — defensive behavior pending real-data capture in v0.4.1.
"""
parts = content_obj.get("parts", []) or []
string_chunks: list[str] = []
for part in parts:
if isinstance(part, str):
string_chunks.append(part)
elif isinstance(part, dict):
part_type = part.get("content_type", "")
if part_type == "text":
txt = part.get("text", "") or ""
if txt:
string_chunks.append(txt)
elif "content" in part:
# Suspected o1/o3 reasoning subpart. Defensive: preserve as text
# block (matches current behavior). v0.4.1 reclassifies once
# the real shape is captured live.
content_val = part.get("content", "") or ""
if content_val:
string_chunks.append(content_val)
elif part_type:
# Non-text dict part inside a text content_type — surface it.
logger.warning(
"[chatgpt] Unexpected %s part inside text content_type "
"in conversation %s message %s — rendering as unknown block",
part_type,
conv_id[:8],
node_id[:8],
)
report.record_unknown(part_type)
# Inline mark in the joined text so order is preserved.
string_chunks.append(
f"\n[Unknown part: type={part_type}; "
f"keys={list(part.keys())[:10]}]\n"
)
joined = "\n".join(c for c in string_chunks if c)
block = make_text_block(joined)
return [block] if block else []
def _join_string_parts(content_obj: dict) -> str:
"""Helper: join all string parts in ``parts`` with newlines."""
parts = content_obj.get("parts", []) or []
return "\n".join(p for p in parts if isinstance(p, str) and p)
def _extract_multimodal_blocks(
content_obj: dict, role: str, conv_id: str, node_id: str, report: LossReport
) -> list[dict]:
"""Extract blocks from a ``multimodal_text`` content object.
Walks ``parts`` in array order — order varies between user and assistant
turns, and the extractor preserves source ordering. Emits text +
image_placeholder + file_placeholder blocks per part.
"""
parts = content_obj.get("parts", []) or []
blocks: list[dict] = []
for part in parts:
if isinstance(part, str):
block = make_text_block(part)
if block:
blocks.append(block)
continue
if not isinstance(part, dict):
continue
part_type = part.get("content_type", "")
if part_type == "audio_transcription":
txt = part.get("text", "") or ""
block = make_text_block(txt)
if block:
blocks.append(block)
elif "text" not in part:
logger.warning(
"[chatgpt] audio_transcription part missing 'text' key "
"in conversation %s message %s",
conv_id[:8],
node_id[:8],
)
report.record_extraction_failure("audio_transcription")
blocks.append(
make_unknown_block(
raw_type="audio_transcription",
observed_keys=list(part.keys()),
reason=UNKNOWN_REASON_EXTRACTION_FAILED,
summary="expected key 'text' not found",
)
)
continue
if part_type == "image_asset_pointer":
ref = part.get("asset_pointer", "")
source = "user_upload" if role == "user" else "model_generated"
mime = None
blocks.append(make_image_placeholder(ref=ref, source=source, mime=mime))
continue
if part_type == "audio_asset_pointer":
blocks.append(_audio_asset_placeholder(part))
continue
if part_type == "real_time_user_audio_video_asset_pointer":
# Wrapper carrying a nested audio_asset_pointer + optional video frames.
nested_audio = part.get("audio_asset_pointer")
if isinstance(nested_audio, dict):
blocks.append(_audio_asset_placeholder(nested_audio))
else:
logger.warning(
"[chatgpt] real_time_user_audio_video_asset_pointer missing "
"nested audio_asset_pointer in conversation %s message %s",
conv_id[:8],
node_id[:8],
)
report.record_extraction_failure(
"real_time_user_audio_video_asset_pointer"
)
blocks.append(
make_unknown_block(
raw_type="real_time_user_audio_video_asset_pointer",
observed_keys=list(part.keys()),
reason=UNKNOWN_REASON_EXTRACTION_FAILED,
summary="expected nested 'audio_asset_pointer' not found",
)
)
frames = part.get("frames_asset_pointers") or []
if frames:
# Defensive: empty in all observed cases, but if non-empty
# surface as a separate file placeholder.
video_ref = part.get("video_container_asset_pointer") or "(video frames)"
blocks.append(
make_file_placeholder(
ref=str(video_ref),
mime="video/unknown",
)
)
continue
# Anything else inside multimodal_text — visible unknown block
logger.warning(
"[chatgpt] Unknown multimodal_text part type %r in conversation %s message %s",
part_type,
conv_id[:8],
node_id[:8],
)
report.record_unknown(part_type or "?")
blocks.append(
make_unknown_block(
raw_type=part_type or "?",
observed_keys=list(part.keys()),
reason=UNKNOWN_REASON_UNKNOWN_TYPE,
)
)
return blocks
def _audio_asset_placeholder(audio_part: dict) -> dict:
"""Build a file_placeholder for an audio_asset_pointer dict.
Handles missing/zero metadata defensively.
"""
ref = audio_part.get("asset_pointer", "") or ""
fmt = audio_part.get("format") or "unknown"
size_bytes = audio_part.get("size_bytes")
if not isinstance(size_bytes, int) or size_bytes <= 0:
size_bytes = None
metadata = audio_part.get("metadata") or {}
start = metadata.get("start") if isinstance(metadata, dict) else None
end = metadata.get("end") if isinstance(metadata, dict) else None
duration: float | None = None
if isinstance(start, (int, float)) and isinstance(end, (int, float)):
diff = float(end) - float(start)
if diff > 0:
duration = diff
return make_file_placeholder(
ref=ref,
mime=f"audio/{fmt}" if fmt else "audio/unknown",
size_bytes=size_bytes,
duration_seconds=duration,
)
def _extract_editable_context_blocks(
content_type: str, content_obj: dict, conv_id: str, node_id: str, report: LossReport
) -> list[dict]:
"""Extract blocks from user_editable_context / model_editable_context messages.
These have no ``parts`` field — they carry direct keys. Read all known
fields, emit one labeled fenced block per non-null known field, and emit an
``unknown`` block for any unrecognised non-null direct field (no-silent-drop
rule).
"""
if content_type == "user_editable_context":
known_fields: tuple[str, ...] = _USER_EDITABLE_CONTEXT_KNOWN_FIELDS
elif content_type == "model_editable_context":
known_fields = _MODEL_EDITABLE_CONTEXT_KNOWN_FIELDS
else:
known_fields = ()
blocks: list[dict] = []
label_kind = "Custom Instructions" if content_type == "user_editable_context" else "Model Context"
for field in known_fields:
value = content_obj.get(field)
if value is None or (isinstance(value, str) and not value.strip()):
continue
        if isinstance(value, (dict, list)):
            # Render dict/list values as pretty-printed JSON.
            import json as _json

            rendered = _json.dumps(value, indent=2, default=str, ensure_ascii=False)
        else:
            rendered = str(value)
        label = f"**{label_kind} {field}:**"
        # "Labeled fenced block" pattern: emit the bold header as a text
        # block, then the raw value as a code block. make_code_block calls
        # _safe_fence internally, so the value is fenced without any
        # language-hint corruption risk.
blocks.append(make_text_block(label))
code_block = make_code_block(rendered, language="")
if code_block:
blocks.append(code_block)
# Catch unknown non-null direct fields (no-silent-drop rule).
structural_keys = {"content_type", "parts"}
for key, value in content_obj.items():
if key in structural_keys or key in known_fields:
continue
if value is None:
continue
# Reject null/empty containers.
if isinstance(value, (str, list, dict)) and not value:
continue
logger.warning(
"[chatgpt] Unknown non-null field %r in %s message %s/%s",
key,
content_type,
conv_id[:8],
node_id[:8],
)
report.record_unknown(f"{content_type}.{key}")
blocks.append(
make_unknown_block(
raw_type=f"{content_type}.{key}",
observed_keys=list(content_obj.keys()),
reason=UNKNOWN_REASON_UNKNOWN_FIELD_IN_KNOWN_TYPE,
summary=f"unknown non-null field '{key}' in {content_type}",
)
)
return blocks
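The known/unknown field split above (the no-silent-drop rule) can be sketched in isolation. The `_KNOWN` tuple here is an assumption standing in for `_USER_EDITABLE_CONTEXT_KNOWN_FIELDS`; the function returns plain pairs instead of real blocks:

```python
import json

_KNOWN = ("user_profile", "user_instructions")  # assumed known-field list


def split_editable_context(content: dict):
    """Split a direct-keyed content dict into rendered known fields and
    a list of unrecognised non-null field names (nothing silently dropped)."""
    rendered, unknown = [], []
    for field in _KNOWN:
        value = content.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            continue
        if isinstance(value, (dict, list)):
            rendered.append((field, json.dumps(value, indent=2, ensure_ascii=False)))
        else:
            rendered.append((field, str(value)))
    for key, value in content.items():
        if key in _KNOWN or key in ("content_type", "parts"):
            continue
        if value is None or (isinstance(value, (str, list, dict)) and not value):
            continue
        unknown.append(key)  # would become an `unknown` block + WARNING
    return rendered, unknown
```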
def _extract_execution_output_blocks(
content_obj: dict,
author_name: str | None,
msg_metadata: dict,
conv_id: str,
node_id: str,
) -> list[dict]:
"""Map a ChatGPT ``execution_output`` content (Code Interpreter / container.exec
/ python tool output) onto a ``tool_result`` block.
Locked shape (captured live during planning v0.4.1):
content.text → output
author.name → tool_name
metadata.aggregate_result.status → "error" → is_error=True
metadata.reasoning_title → summary
Empty ``content.text`` → skip (DEBUG log) — a tool that emits no output is
a transient artifact, not archival content.
"""
text = content_obj.get("text") or ""
if not text.strip():
logger.debug(
"[chatgpt] Skipping empty execution_output in conversation %s message %s",
conv_id[:8],
node_id[:8],
)
return []
aggregate = msg_metadata.get("aggregate_result") or {}
status = aggregate.get("status") if isinstance(aggregate, dict) else None
is_error = isinstance(status, str) and status.lower() == "error"
summary = msg_metadata.get("reasoning_title") or None
return [
make_tool_result_block(
output=text,
tool_name=author_name,
is_error=is_error,
summary=summary,
)
]
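The locked `execution_output` mapping can be exercised standalone. `make_tool_result_block` is stubbed here as a plain dict builder — the field names in the dict are illustrative, not the real block schema:

```python
def map_execution_output(content: dict, author_name, msg_metadata: dict) -> list[dict]:
    """Sketch of the execution_output → tool_result mapping."""
    text = content.get("text") or ""
    if not text.strip():
        return []  # empty tool output is transient, not archival content
    aggregate = msg_metadata.get("aggregate_result") or {}
    status = aggregate.get("status") if isinstance(aggregate, dict) else None
    return [{
        "type": "tool_result",
        "output": text,                          # content.text
        "tool_name": author_name,                # author.name
        "is_error": isinstance(status, str) and status.lower() == "error",
        "summary": msg_metadata.get("reasoning_title") or None,
    }]
```

Fed the `node-exec-output` fixture shape, this yields one block naming `container.exec`; fed the empty-text fixture, it yields nothing.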
def _extract_system_error_blocks(
content_obj: dict,
author_name: str | None,
) -> list[dict]:
"""Map a ChatGPT ``system_error`` content onto an error ``tool_result`` block.
Captured shape: ``{content_type, name, text}`` where ``text`` is the error
message (e.g. ``"Error: Error from browse service: 503"``). ``author.name``
identifies the failing tool (e.g. ``"web"``).
"""
text = content_obj.get("text") or ""
if not text:
text = "(error with no message)"
return [
make_tool_result_block(
output=text,
tool_name=author_name,
is_error=True,
)
]
def _extract_tether_browsing_display_blocks(
content_obj: dict,
author_name: str | None,
conv_id: str,
node_id: str,
) -> list[dict]:
"""Handle ChatGPT's ``tether_browsing_display`` content.
Captured live: most instances are **spinner placeholders** (transient UI
state — empty fields, ``metadata.command == "spinner"``). The actual
retrieval content arrives as a sibling/child ``multimodal_text`` message
that already extracts cleanly via the existing handler.
Locked behavior:
- If ``result`` AND ``summary`` are both empty → skip silently (DEBUG).
These are spinners; the real content is elsewhere.
- Otherwise (defensive: never observed populated in real data) → render
as a ``tool_result`` block carrying ``result`` as output and
``summary`` as the optional summary line.
"""
result = content_obj.get("result") or ""
summary = content_obj.get("summary") or ""
if not result.strip() and not summary.strip():
logger.debug(
"[chatgpt] Skipping tether_browsing_display spinner in "
"conversation %s message %s (empty result/summary)",
conv_id[:8],
node_id[:8],
)
return []
return [
make_tool_result_block(
output=result or summary,
tool_name=author_name,
is_error=False,
summary=summary if result and summary else None,
)
]
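The spinner test above reduces to one predicate — shown here standalone so the skip condition is easy to verify against the fixture:

```python
def is_tether_spinner(content: dict) -> bool:
    """True when a tether_browsing_display message is a transient spinner:
    both result and summary are empty, so the real content lives elsewhere."""
    result = content.get("result") or ""
    summary = content.get("summary") or ""
    return not result.strip() and not summary.strip()
```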


@@ -5,6 +5,17 @@ import os
 from curl_cffi import requests as curl_requests
+from src.blocks import (
+    UNKNOWN_REASON_EXTRACTION_FAILED,
+    UNKNOWN_REASON_UNKNOWN_TYPE,
+    make_image_placeholder,
+    make_text_block,
+    make_thinking_block,
+    make_tool_result_block,
+    make_tool_use_block,
+    make_unknown_block,
+)
+from src.loss_report import LossReport
 from src.providers.base import BaseProvider, ProviderError

 logger = logging.getLogger(__name__)
@@ -39,7 +50,7 @@ class ClaudeProvider(BaseProvider):
                 "init",
                 RuntimeError(
                     "CLAUDE_SESSION_KEY is not set. "
-                    "Run 'python -m src.main auth' to configure it."
+                    "Run 'ai-chat-exporter auth' to configure it."
                 ),
             )
         # Set sessionKey in the cookie jar
@@ -60,7 +71,7 @@ class ClaudeProvider(BaseProvider):
                 "Note: Claude session keys are opaque — a 401 is the only expiry signal. "
                 "To refresh: open claude.ai in Chrome → F12 → Application → Cookies "
                 "→ find 'sessionKey' → copy the value. "
-                "Then run 'python -m src.main auth' or update CLAUDE_SESSION_KEY in .env."
+                "Then run 'ai-chat-exporter auth' or update CLAUDE_SESSION_KEY in .env."
             )
             logger.error(msg)
             raise ProviderError(
@@ -161,8 +172,9 @@ class ClaudeProvider(BaseProvider):
         return data

-    def normalize_conversation(self, raw: dict) -> dict:
+    def normalize_conversation(self, raw: dict, loss_report: LossReport | None = None) -> dict:
         """Transform Claude raw schema to the common normalized schema."""
+        report = loss_report if loss_report is not None else LossReport()
         conv_id = raw.get("uuid") or raw.get("id", "")
         title = raw.get("name") or raw.get("title") or "Untitled"
         created_at = raw.get("created_at") or raw.get("create_time") or ""
@@ -178,40 +190,37 @@ class ClaudeProvider(BaseProvider):
         # Messages
         raw_messages = raw.get("chat_messages") or raw.get("messages") or []
-        messages = []
+        messages: list[dict] = []
         for msg in raw_messages:
             role = _map_role(msg.get("sender") or msg.get("role", ""))
             if not role:
                 continue
-            # Content can be a string or a list of content blocks
-            content_raw = msg.get("content") or msg.get("text") or ""
-            content, skipped_types = _extract_claude_text(content_raw, conv_id)
-            for ctype in skipped_types:
-                logger.warning(
-                    "[claude] Skipping %s content in conversation %s "
-                    "— rich content not yet supported (see FUTURE.md)",
-                    ctype,
-                    conv_id[:8],
-                )
+            content_raw = msg.get("content") if "content" in msg else msg.get("text", "")
+            blocks = _extract_claude_blocks(content_raw, conv_id, report)
             timestamp = msg.get("created_at") or msg.get("timestamp") or None
-            if content is None:
+            if not blocks:
                 logger.debug("[claude] Skipping empty message in conversation %s", conv_id[:8])
                 continue
+            content_type = "text"
             messages.append(
                 {
                     "role": role,
-                    "content": content,
-                    "content_type": "text",
+                    "content_type": content_type,
                     "timestamp": timestamp,
+                    "blocks": blocks,
                 }
             )
+        for _ in messages:
+            report.record_message()
+        report.record_conversation()
         return {
             "id": conv_id,
             "title": title,
@@ -242,43 +251,134 @@ def _map_role(sender: str) -> str | None:
     return mapping.get(sender.lower()) if sender else None

-def _extract_claude_text(
-    content: str | list | dict, conv_id: str
-) -> tuple[str | None, list[str]]:
-    """Extract plain text from a Claude content field.
-
-    Returns:
-        (text_or_None, list_of_skipped_content_types)
-    """
-    skipped: list[str] = []
-    if isinstance(content, str):
-        text = content.strip()
-        return (text if text else None), skipped
-    if isinstance(content, list):
-        parts: list[str] = []
-        for block in content:
-            if isinstance(block, str):
-                parts.append(block)
-            elif isinstance(block, dict):
-                btype = block.get("type", "text")
-                if btype == "text":
-                    t = block.get("text", "").strip()
-                    if t:
-                        parts.append(t)
-                else:
-                    skipped.append(btype)
-        text = "\n".join(parts).strip()
-        return (text if text else None), skipped
-    if isinstance(content, dict):
-        btype = content.get("type", "text")
-        if btype == "text":
-            text = content.get("text", "").strip()
-            return (text if text else None), skipped
-        else:
-            skipped.append(btype)
-            return None, skipped
-    return None, skipped
+def _extract_claude_blocks(
+    content: str | list | dict | None, conv_id: str, report: LossReport
+) -> list[dict]:
+    """Extract typed blocks from a Claude content field.
+
+    Defensive dispatch — zero observed cases of rich Claude content in the
+    user's archive at planning time, so this is theory-only. Real shapes
+    will be locked in v0.4.1 once captured. Any unrecognised block type
+    surfaces as an `unknown` block + WARNING + tally.
+    """
+    if content is None:
+        return []
+    if isinstance(content, str):
+        block = make_text_block(content)
+        return [block] if block else []
+    if isinstance(content, list):
+        blocks: list[dict] = []
+        for item in content:
+            if isinstance(item, str):
+                block = make_text_block(item)
+                if block:
+                    blocks.append(block)
+            elif isinstance(item, dict):
+                blocks.extend(_dispatch_claude_block(item, conv_id, report))
+        return blocks
+    if isinstance(content, dict):
+        return _dispatch_claude_block(content, conv_id, report)
+    return []
+
+
+def _dispatch_claude_block(block: dict, conv_id: str, report: LossReport) -> list[dict]:
+    """Translate one raw Claude content block into normalized blocks."""
+    btype = block.get("type", "text")
+    if btype == "text":
+        block_obj = make_text_block(block.get("text", "") or "")
+        return [block_obj] if block_obj else []
+    if btype == "thinking":
+        # Claude extended-thinking blocks may use 'thinking' or 'text' field.
+        text = block.get("thinking") or block.get("text") or ""
+        block_obj = make_thinking_block(text)
+        return [block_obj] if block_obj else []
+    if btype == "tool_use":
+        return [
+            make_tool_use_block(
+                name=block.get("name", "") or "",
+                input_data=block.get("input"),
+                tool_id=block.get("id"),
+            )
+        ]
+    if btype == "tool_result":
+        # ``content`` may be a string or a list of nested blocks (recursive).
+        nested = block.get("content")
+        output = _flatten_tool_result_content(nested, conv_id, report)
+        return [
+            make_tool_result_block(
+                output=output,
+                tool_name=None,
+                is_error=bool(block.get("is_error")),
+            )
+        ]
+    if btype == "image":
+        # Source shape is unverified; try the most likely fields.
+        source = block.get("source") or {}
+        ref = ""
+        if isinstance(source, dict):
+            ref = (
+                source.get("file_uuid")
+                or source.get("media_type")
+                or source.get("url")
+                or ""
+            )
+        return [make_image_placeholder(ref=ref or "(unknown)", source="user_upload")]
+    # Unknown block type
+    keys = list(block.keys())
+    logger.warning(
+        "[claude] Unknown block type %r in conversation %s "
+        "— see plan §Data-loss visibility (rendering as unknown block)",
+        btype,
+        conv_id[:8],
+    )
+    report.record_unknown(btype or "?")
+    return [
+        make_unknown_block(
+            raw_type=btype or "?",
+            observed_keys=keys,
+            reason=UNKNOWN_REASON_UNKNOWN_TYPE,
+        )
+    ]
+
+
+def _flatten_tool_result_content(
+    nested: object, conv_id: str, report: LossReport
+) -> str:
+    """Flatten Claude tool_result content (string OR list of nested blocks) to text.
+
+    Recurses into nested text blocks; any non-text nested block becomes a
+    visible inline marker so non-text content isn't silently dropped.
+    """
+    if nested is None:
+        return ""
+    if isinstance(nested, str):
+        return nested
+    if isinstance(nested, list):
+        chunks: list[str] = []
+        for item in nested:
+            if isinstance(item, str):
+                chunks.append(item)
+            elif isinstance(item, dict):
+                btype = item.get("type", "text")
+                if btype == "text":
+                    chunks.append(item.get("text", "") or "")
+                else:
+                    keys = list(item.keys())[:10]
+                    report.record_extraction_failure(f"tool_result.{btype}")
+                    chunks.append(
+                        f"[Unsupported nested {btype} block; keys={keys}]"
+                    )
+        return "\n".join(c for c in chunks if c)
+    if isinstance(nested, dict):
+        return _flatten_tool_result_content([nested], conv_id, report)
+    return str(nested)
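The recursive flattening added here can be checked in isolation. This is a standalone restatement with the `report` tally dropped, so the recursion and the inline-marker fallback can be exercised directly:

```python
def flatten_tool_result(nested) -> str:
    """Flatten tool_result content (string OR list of nested blocks) to text.

    Nested text blocks are joined with newlines; any non-text nested block
    becomes a visible inline marker instead of being silently dropped.
    """
    if nested is None:
        return ""
    if isinstance(nested, str):
        return nested
    if isinstance(nested, list):
        chunks = []
        for item in nested:
            if isinstance(item, str):
                chunks.append(item)
            elif isinstance(item, dict):
                btype = item.get("type", "text")
                if btype == "text":
                    chunks.append(item.get("text", "") or "")
                else:
                    keys = list(item.keys())[:10]
                    chunks.append(f"[Unsupported nested {btype} block; keys={keys}]")
        return "\n".join(c for c in chunks if c)
    if isinstance(nested, dict):
        return flatten_tool_result([nested])  # bare dict → one-element list
    return str(nested)
```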


@@ -50,7 +50,7 @@ def build_export_path(
         created_at: ISO8601 creation timestamp (used for year folder).
         filename: Already-generated filename from generate_filename().
         structure: OUTPUT_STRUCTURE value. One of:
-            "provider/project/year" (default)
+            "provider/project/year" (default) — project and year combined, e.g. no-project.2025/
            "provider/project"
            "provider/year"
@@ -64,14 +64,14 @@ def build_export_path(
     parts: list[str] = [provider]
     if structure == "provider/project/year":
-        parts += [project_slug, year]
+        parts += [f"{project_slug}.{year}"]
     elif structure == "provider/project":
         parts += [project_slug]
     elif structure == "provider/year":
         parts += [year]
     else:
         # Unknown structure — fall back to default
-        parts += [project_slug, year]
+        parts += [f"{project_slug}.{year}"]
     return base_dir.joinpath(*parts) / filename
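The v0.5.0 layout change — collapsing provider/project/year into a flat `provider/project.year` folder — can be sketched standalone. The function name and parameter order here are illustrative, not the real signature:

```python
from pathlib import Path


def build_export_path_sketch(
    base_dir: Path,
    provider: str,
    project_slug: str,
    year: str,
    filename: str,
    structure: str = "provider/project/year",
) -> Path:
    """Compose an export path; the default structure now yields one
    <project>.<year> directory instead of nested project/year folders."""
    parts = [provider]
    if structure == "provider/project/year":
        parts.append(f"{project_slug}.{year}")
    elif structure == "provider/project":
        parts.append(project_slug)
    elif structure == "provider/year":
        parts.append(year)
    else:
        parts.append(f"{project_slug}.{year}")  # unknown value → default layout
    return base_dir.joinpath(*parts) / filename
```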


@@ -8,12 +8,30 @@
     "node-root": {
       "id": "node-root",
       "parent": null,
-      "children": ["node-1"],
+      "children": ["node-uec"],
       "message": null
     },
+    "node-uec": {
+      "id": "node-uec",
+      "parent": "node-root",
+      "children": ["node-1"],
+      "message": {
+        "id": "node-uec",
+        "author": {"role": "user"},
+        "create_time": null,
+        "content": {
+          "content_type": "user_editable_context",
+          "user_profile": "Preferred name: Jesse",
+          "user_instructions": "The user provided the additional info about how they would like you to respond:\n```Always cite sources.```"
+        },
+        "metadata": {
+          "is_visually_hidden_from_conversation": true
+        }
+      }
+    },
     "node-1": {
       "id": "node-1",
-      "parent": "node-root",
+      "parent": "node-uec",
       "children": ["node-2"],
       "message": {
         "id": "node-1",
@@ -28,7 +46,7 @@
     "node-2": {
       "id": "node-2",
       "parent": "node-1",
-      "children": ["node-3"],
+      "children": ["node-mm-user"],
       "message": {
         "id": "node-2",
         "author": {"role": "assistant"},
@@ -39,18 +57,139 @@
         }
       }
     },
-    "node-3": {
-      "id": "node-3",
-      "parent": "node-2",
-      "children": [],
-      "message": {
-        "id": "node-3",
-        "author": {"role": "user"},
-        "create_time": 1704067300.0,
-        "content": {
-          "content_type": "image_asset_pointer",
-          "parts": [{"content_type": "image_asset_pointer", "asset_pointer": "file://some-image"}]
-        }
-      }
-    }
+    "node-mm-user": {
+      "id": "node-mm-user",
+      "parent": "node-2",
+      "children": ["node-mm-assistant"],
+      "message": {
+        "id": "node-mm-user",
+        "author": {"role": "user"},
+        "create_time": 1704067300.0,
+        "content": {
+          "content_type": "multimodal_text",
+          "parts": [
+            {"content_type": "audio_transcription", "text": "What is the capital of France?", "direction": "in", "decoding_id": null},
+            {"content_type": "real_time_user_audio_video_asset_pointer", "frames_asset_pointers": [], "video_container_asset_pointer": null, "audio_asset_pointer": {"content_type": "audio_asset_pointer", "asset_pointer": "sediment://file_user001", "size_bytes": 50000, "format": "wav", "metadata": {"start": 0.0, "end": 2.5}}, "audio_start_timestamp": 1.0}
+          ]
+        },
+        "metadata": {"voice_mode_message": true}
+      }
+    },
+    "node-mm-assistant": {
+      "id": "node-mm-assistant",
+      "parent": "node-mm-user",
+      "children": ["node-mm-user-rev"],
+      "message": {
+        "id": "node-mm-assistant",
+        "author": {"role": "assistant"},
+        "create_time": 1704067305.0,
+        "content": {
+          "content_type": "multimodal_text",
+          "parts": [
+            {"content_type": "audio_transcription", "text": "The capital of France is Paris.", "direction": "out", "decoding_id": null},
+            {"content_type": "audio_asset_pointer", "asset_pointer": "sediment://file_assistant001", "size_bytes": 80000, "format": "wav", "metadata": {"start": 0.0, "end": 3.2}}
+          ]
+        }
+      }
+    },
+    "node-mm-user-rev": {
+      "id": "node-mm-user-rev",
+      "parent": "node-mm-assistant",
+      "children": ["node-image-only"],
+      "message": {
+        "id": "node-mm-user-rev",
+        "author": {"role": "user"},
+        "create_time": 1704067400.0,
+        "content": {
+          "content_type": "multimodal_text",
+          "parts": [
+            {"content_type": "real_time_user_audio_video_asset_pointer", "frames_asset_pointers": [], "video_container_asset_pointer": null, "audio_asset_pointer": {"content_type": "audio_asset_pointer", "asset_pointer": "sediment://file_user002", "size_bytes": 30000, "format": "wav", "metadata": {"start": 0.0, "end": 1.5}}, "audio_start_timestamp": 5.0},
+            {"content_type": "audio_transcription", "text": "Tell me more please.", "direction": "in", "decoding_id": null}
+          ]
+        }
+      }
+    },
+    "node-image-only": {
+      "id": "node-image-only",
+      "parent": "node-mm-user-rev",
+      "children": ["node-exec-output"],
+      "message": {
+        "id": "node-image-only",
+        "author": {"role": "user"},
+        "create_time": 1704067500.0,
+        "content": {
+          "content_type": "multimodal_text",
+          "parts": [
+            {"content_type": "image_asset_pointer", "asset_pointer": "file-service://image001"}
+          ]
+        }
+      }
+    },
+    "node-exec-output": {
+      "id": "node-exec-output",
+      "parent": "node-image-only",
+      "children": ["node-exec-output-empty"],
+      "message": {
+        "id": "node-exec-output",
+        "author": {"role": "tool", "name": "container.exec", "metadata": {}},
+        "create_time": 1704067600.0,
+        "content": {
+          "content_type": "execution_output",
+          "text": "Hello from container.exec\nLine 2 of output"
+        },
+        "metadata": {
+          "aggregate_result": {"status": "success", "messages": []},
+          "reasoning_title": "Reading skill documentation"
+        }
+      }
+    },
+    "node-exec-output-empty": {
+      "id": "node-exec-output-empty",
+      "parent": "node-exec-output",
+      "children": ["node-system-error"],
+      "message": {
+        "id": "node-exec-output-empty",
+        "author": {"role": "tool", "name": "python", "metadata": {}},
+        "create_time": 1704067610.0,
+        "content": {
+          "content_type": "execution_output",
+          "text": ""
+        },
+        "metadata": {}
+      }
+    },
+    "node-system-error": {
+      "id": "node-system-error",
+      "parent": "node-exec-output-empty",
+      "children": ["node-tether-spinner"],
+      "message": {
+        "id": "node-system-error",
+        "author": {"role": "tool", "name": "web", "metadata": {}},
+        "create_time": 1704067620.0,
+        "content": {
+          "content_type": "system_error",
+          "name": "tool_error",
+          "text": "Error: Error from browse service: Error calling browse service: 503"
+        },
+        "metadata": {}
+      }
+    },
+    "node-tether-spinner": {
+      "id": "node-tether-spinner",
+      "parent": "node-system-error",
+      "children": [],
+      "message": {
+        "id": "node-tether-spinner",
+        "author": {"role": "tool", "name": "file_search", "metadata": {}},
+        "create_time": 1704067630.0,
+        "content": {
+          "content_type": "tether_browsing_display",
+          "result": "",
+          "summary": "",
+          "assets": null,
+          "tether_id": null
+        },
+        "metadata": {"command": "spinner", "status": "running"}
+      }
+    }
   }
 }


@@ -30,6 +30,15 @@
       "sender": "human",
       "created_at": "2024-06-10T14:45:00.000Z",
       "content": "Thank you, that helped!"
+    },
+    {
+      "uuid": "msg-004",
+      "sender": "human",
+      "created_at": "2024-06-10T14:50:00.000Z",
+      "content": [
+        {"type": "text", "text": "What about this image?"},
+        {"type": "image", "source": {"file_uuid": "claude-image-uuid-1", "media_type": "image/png"}}
+      ]
     }
   ]
 }

tests/test_cli.py (new file)

@@ -0,0 +1,176 @@
"""CLI-level tests using Click's CliRunner — no live API calls required."""
import pytest
from click.testing import CliRunner
from src.cache import Cache
from src.main import _filter_by_project, cli
# ---------------------------------------------------------------------------
# _filter_by_project (T-27)
# ---------------------------------------------------------------------------
class TestFilterByProject:
"""Unit tests for the project filter logic used by export/list/joplin."""
# ChatGPT conversations use the _project_name annotation key
def _chatgpt(self, conv_id, project_name):
return {"id": conv_id, "_project_name": project_name}
# Claude conversations use the project dict key
def _claude(self, conv_id, project_name):
proj = {"name": project_name} if project_name else None
return {"id": conv_id, "project": proj}
def test_none_filter_keeps_no_project_chatgpt(self):
convs = [self._chatgpt("a", None), self._chatgpt("b", "Python Course")]
result = _filter_by_project(convs, "none")
assert len(result) == 1
assert result[0]["id"] == "a"
def test_none_filter_keeps_no_project_claude(self):
convs = [self._claude("a", None), self._claude("b", "Python Course")]
result = _filter_by_project(convs, "none")
assert len(result) == 1
assert result[0]["id"] == "a"
def test_name_filter_case_insensitive(self):
convs = [
self._chatgpt("a", "Python Course"),
self._chatgpt("b", "Java Course"),
self._chatgpt("c", None),
]
result = _filter_by_project(convs, "PYTHON")
assert len(result) == 1
assert result[0]["id"] == "a"
def test_name_filter_substring_match(self):
convs = [
self._chatgpt("a", "Python Advanced Course"),
self._chatgpt("b", "Python Basics"),
self._chatgpt("c", "JavaScript"),
]
result = _filter_by_project(convs, "python")
assert len(result) == 2
assert {c["id"] for c in result} == {"a", "b"}
def test_no_matches_returns_empty(self):
convs = [self._chatgpt("a", "Python Course"), self._chatgpt("b", None)]
result = _filter_by_project(convs, "ruby")
assert result == []
def test_none_filter_excludes_all_with_projects(self):
convs = [self._chatgpt("a", "Project A"), self._chatgpt("b", "Project B")]
result = _filter_by_project(convs, "none")
assert result == []
def test_empty_string_project_treated_as_no_project(self):
convs = [{"id": "a", "_project_name": ""}, {"id": "b", "_project_name": "Real"}]
result = _filter_by_project(convs, "none")
assert len(result) == 1
assert result[0]["id"] == "a"
def test_claude_project_string_matched(self):
# Claude can also have project as a plain string
convs = [{"id": "a", "project": "python-course"}, {"id": "b", "project": None}]
result = _filter_by_project(convs, "python")
assert len(result) == 1
assert result[0]["id"] == "a"
# ---------------------------------------------------------------------------
# export --since validation (T-25)
# ---------------------------------------------------------------------------
class TestExportSinceValidation:
"""Test that --since with an invalid date exits cleanly with an error message."""
def _pre_populated_cache(self, tmp_path) -> Cache:
"""Create a cache that passes the ToS gate and first-run doctor check."""
cache = Cache(tmp_path)
cache.acknowledge_tos()
cache.mark_exported("chatgpt", "dummy-conv", {"updated_at": "2024-01-01T00:00:00Z"})
return cache
def test_invalid_since_date_exits_with_error(self, tmp_path):
self._pre_populated_cache(tmp_path)
runner = CliRunner(mix_stderr=True)
result = runner.invoke(
cli,
["--no-log-file", "export", "--since", "notadate"],
env={
"CHATGPT_SESSION_TOKEN": "eyJtesttoken",
"CACHE_DIR": str(tmp_path),
"EXPORT_DIR": str(tmp_path / "exports"),
},
)
assert result.exit_code == 1
assert "Invalid --since date" in result.output
assert "YYYY-MM-DD" in result.output
def test_valid_since_date_does_not_error(self, tmp_path):
"""A valid date should not produce the invalid-date error (may fail later on API)."""
self._pre_populated_cache(tmp_path)
runner = CliRunner(mix_stderr=True)
result = runner.invoke(
cli,
["--no-log-file", "export", "--since", "2024-01-01"],
env={
"CHATGPT_SESSION_TOKEN": "eyJtesttoken",
"CACHE_DIR": str(tmp_path),
"EXPORT_DIR": str(tmp_path / "exports"),
},
)
assert "Invalid --since date" not in result.output
# ---------------------------------------------------------------------------
# LossReport summary
# ---------------------------------------------------------------------------
class TestLossReportSummary:
"""The LossReport's format_summary() pinned format covers zero, top-5, and overflow cases."""
def test_zero_summary_uses_none_sentinel(self):
from src.loss_report import LossReport
report = LossReport()
out = report.format_summary()
assert "[export] Run summary:" in out
assert "conversations: 0" in out
assert "messages rendered: 0" in out
# Both "(none)" sentinels present — never empty parens
assert out.count("(none)") == 2
def test_top_5_breakdown(self):
from src.loss_report import LossReport
report = LossReport()
for raw_type in ("a", "b", "c", "d", "e", "f", "g"):
report.record_unknown(raw_type)
if raw_type == "a":
# Make 'a' the most common
for _ in range(4):
report.record_unknown("a")
out = report.format_summary()
# Top entry shown
assert "a=5" in out
# Overflow line present (7 types, top 5 + 2 more)
assert "+ 2 more types" in out
def test_messages_and_conversations_recorded(self):
from src.loss_report import LossReport
report = LossReport()
report.record_conversation()
report.record_message()
report.record_message()
out = report.format_summary()
assert "conversations: 1" in out
assert "messages rendered: 2" in out
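The pinned `format_summary()` contract in `TestLossReportSummary` can be satisfied by a minimal sketch. Only the asserted substrings come from the tests; the class name, exact line layout, and the two breakdown categories are assumptions:

```python
from collections import Counter


class LossReportSketch:
    """Minimal LossReport behavior matching the pinned summary format."""

    def __init__(self) -> None:
        self.conversations = 0
        self.messages = 0
        self.unknown: Counter = Counter()
        self.failures: Counter = Counter()

    def record_conversation(self) -> None:
        self.conversations += 1

    def record_message(self) -> None:
        self.messages += 1

    def record_unknown(self, raw_type: str) -> None:
        self.unknown[raw_type] += 1

    def record_extraction_failure(self, raw_type: str) -> None:
        self.failures[raw_type] += 1

    @staticmethod
    def _breakdown(counter: Counter) -> str:
        if not counter:
            return "(none)"  # sentinel — never empty parens
        line = ", ".join(f"{t}={n}" for t, n in counter.most_common(5))
        if len(counter) > 5:
            line += f" + {len(counter) - 5} more types"
        return line

    def format_summary(self) -> str:
        return (
            "[export] Run summary:\n"
            f"  conversations: {self.conversations}\n"
            f"  messages rendered: {self.messages}\n"
            f"  unknown blocks: {self._breakdown(self.unknown)}\n"
            f"  extraction failures: {self._breakdown(self.failures)}"
        )
```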

tests/test_config.py (new file)

@@ -0,0 +1,56 @@
"""Tests for src/config.py — token validation logic (T-14)."""
import logging
import time
import jwt
import pytest
from src.config import _validate_chatgpt_token
class TestValidateChatGPTToken:
def test_expired_token_logs_warning(self, caplog):
# T-14: expired JWT must produce a clear warning
payload = {"exp": int(time.time()) - 3600} # expired 1 hour ago
token = jwt.encode(payload, "secret", algorithm="HS256")
with caplog.at_level(logging.WARNING, logger="src.config"):
result = _validate_chatgpt_token(token)
assert any("expired" in r.message.lower() for r in caplog.records)
assert result is not None # still returns the expiry datetime
def test_expiring_within_24h_logs_warning(self, caplog):
payload = {"exp": int(time.time()) + 3600} # expires in 1 hour
token = jwt.encode(payload, "secret", algorithm="HS256")
with caplog.at_level(logging.WARNING, logger="src.config"):
_validate_chatgpt_token(token)
assert any("less than 24 hours" in r.message for r in caplog.records)
def test_valid_token_no_expiry_warning(self, caplog):
payload = {"exp": int(time.time()) + 86400 * 5} # valid for 5 days
token = jwt.encode(payload, "secret", algorithm="HS256")
with caplog.at_level(logging.WARNING, logger="src.config"):
result = _validate_chatgpt_token(token)
assert not any("expired" in r.message.lower() for r in caplog.records)
assert result is not None
def test_token_without_exp_claim_logs_warning(self, caplog):
payload = {"sub": "user123"} # no exp
token = jwt.encode(payload, "secret", algorithm="HS256")
with caplog.at_level(logging.WARNING, logger="src.config"):
result = _validate_chatgpt_token(token)
assert any("'exp'" in r.message or "no 'exp'" in r.message for r in caplog.records)
assert result is None
def test_jwe_encrypted_token_returns_none(self, caplog):
# JWE tokens (alg=dir) cannot be decoded client-side — this is normal for ChatGPT
jwe_like = "eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIn0.fake.token.data.here"
with caplog.at_level(logging.DEBUG, logger="src.config"):
result = _validate_chatgpt_token(jwe_like)
assert result is None # cannot decode, but not an error
def test_non_jwt_string_logs_warning(self, caplog):
with caplog.at_level(logging.WARNING, logger="src.config"):
result = _validate_chatgpt_token("notajwttoken")
assert any("does not look like a JWT" in r.message for r in caplog.records)
assert result is None


@@ -1,4 +1,4 @@
"""Unit tests for src/exporters/.""" """Unit tests for src/exporters/ and src/blocks.py."""
import json import json
import os import os
@@ -7,6 +7,23 @@ from pathlib import Path
import pytest import pytest
from src.blocks import (
BLOCK_TYPE_TEXT,
UNKNOWN_REASON_EXTRACTION_FAILED,
UNKNOWN_REASON_UNKNOWN_TYPE,
_blockquote_prefix,
_safe_fence,
make_code_block,
make_file_placeholder,
make_hidden_context_marker,
make_image_placeholder,
make_text_block,
make_thinking_block,
make_tool_result_block,
make_tool_use_block,
make_unknown_block,
render_blocks_to_markdown,
)
from src.exporters.markdown import MarkdownExporter, _yaml_escape, _format_timestamp
from src.exporters.json_export import JSONExporter
@@ -122,7 +139,7 @@ class TestMarkdownFilenameGeneration:
def test_year_in_path(self, tmp_path):
exp = MarkdownExporter(tmp_path)
path = exp.export(SAMPLE_CONV)
assert ".2024/" in str(path)
def test_output_structure_provider_project(self, tmp_path):
exp = MarkdownExporter(tmp_path, output_structure="provider/project")
@@ -199,6 +216,34 @@ class TestJSONExporter:
assert " " in raw assert " " in raw
class TestBothFormats:
"""T-38: Markdown and JSON exporters produce matching filenames for the same conversation."""
def test_both_formats_produce_files(self, tmp_path):
md_exp = MarkdownExporter(tmp_path)
json_exp = JSONExporter(tmp_path)
md_path = md_exp.export(SAMPLE_CONV)
json_path = json_exp.export(SAMPLE_CONV)
assert md_path.exists()
assert json_path.exists()
def test_both_formats_have_matching_stems(self, tmp_path):
md_exp = MarkdownExporter(tmp_path)
json_exp = JSONExporter(tmp_path)
md_path = md_exp.export(SAMPLE_CONV)
json_path = json_exp.export(SAMPLE_CONV)
assert md_path.suffix == ".md"
assert json_path.suffix == ".json"
assert md_path.stem == json_path.stem
def test_both_formats_same_directory(self, tmp_path):
md_exp = MarkdownExporter(tmp_path)
json_exp = JSONExporter(tmp_path)
md_path = md_exp.export(SAMPLE_CONV)
json_path = json_exp.export(SAMPLE_CONV)
assert md_path.parent == json_path.parent
class TestYamlEscape:
def test_escapes_double_quotes(self):
assert _yaml_escape('Say "hello"') == 'Say \\"hello\\"'
@@ -222,3 +267,271 @@ class TestFormatTimestamp:
def test_empty_string(self):
assert _format_timestamp("") == ""
# ---------------------------------------------------------------------------
# Block helpers and rendering
# ---------------------------------------------------------------------------
class TestSafeFence:
def test_minimum_three_backticks(self):
assert _safe_fence("plain text") == "```"
def test_four_backticks_when_three_in_content(self):
assert _safe_fence("here ``` is a fence") == "````"
def test_five_backticks_when_four_in_content(self):
assert _safe_fence("here ```` is four") == "`````"
def test_handles_empty_string(self):
assert _safe_fence("") == "```"
def test_handles_run_at_end(self):
# Trailing run still counted
assert _safe_fence("text ending in ```") == "````"
class TestBlockquotePrefix:
def test_single_line(self):
assert _blockquote_prefix("hello") == "> hello"
def test_multi_line(self):
assert _blockquote_prefix("a\nb\nc") == "> a\n> b\n> c"
def test_empty_lines_become_naked_quote_marker(self):
assert _blockquote_prefix("a\n\nb") == "> a\n>\n> b"
def test_empty_string(self):
assert _blockquote_prefix("") == ">"
class TestBlockConstructors:
def test_make_text_block_returns_none_for_empty(self):
assert make_text_block("") is None
assert make_text_block(" ") is None
def test_make_text_block_returns_dict(self):
b = make_text_block("hello")
assert b == {"type": "text", "text": "hello"}
def test_make_code_block_returns_none_for_empty(self):
assert make_code_block("") is None
def test_make_thinking_block_returns_none_for_empty(self):
assert make_thinking_block("") is None
class TestRenderBlocks:
def test_text_block_renders_as_paragraph(self):
out = render_blocks_to_markdown([make_text_block("Hello world")])
assert out == "Hello world"
def test_blocks_separated_by_blank_line(self):
out = render_blocks_to_markdown(
[make_text_block("first"), make_text_block("second")]
)
assert out == "first\n\nsecond"
def test_code_block_with_language(self):
out = render_blocks_to_markdown([make_code_block("print(1)", language="python")])
assert "```python" in out
assert "print(1)" in out
def test_thinking_block_uses_blockquote(self):
out = render_blocks_to_markdown([make_thinking_block("step 1\nstep 2")])
assert "**💭 Reasoning**" in out
assert "> step 1" in out
assert "> step 2" in out
def test_tool_use_renders_as_blockquote_with_safe_fence(self):
out = render_blocks_to_markdown(
[make_tool_use_block("search", {"query": "test"})]
)
assert "> 🔧 **Tool: search**" in out
# Every line of the body is blockquote-prefixed
assert "> ```json" in out
assert "> }" in out
def test_tool_use_with_multiline_input(self):
out = render_blocks_to_markdown(
[make_tool_use_block("complex", {"a": 1, "b": [{"x": "y"}]})]
)
# Prefix every line of multi-line JSON
for line in out.split("\n"):
assert line.startswith(">") or line == ""
def test_tool_result_success_uses_outbox_icon(self):
out = render_blocks_to_markdown([make_tool_result_block("OK")])
assert "📤 **Result**" in out
assert "" not in out
def test_tool_result_error_uses_x_icon(self):
out = render_blocks_to_markdown([make_tool_result_block("oops", is_error=True)])
assert "❌ **Result (error)**" in out
assert "📤" not in out
def test_tool_result_with_tool_name_in_header(self):
out = render_blocks_to_markdown(
[make_tool_result_block("done", tool_name="container.exec")]
)
assert "📤 **Result: container.exec**" in out
def test_tool_result_error_with_tool_name(self):
out = render_blocks_to_markdown(
[make_tool_result_block("503", tool_name="web", is_error=True)]
)
assert "❌ **Result (error): web**" in out
def test_tool_result_summary_renders_as_italic_line(self):
out = render_blocks_to_markdown(
[
make_tool_result_block(
"output",
tool_name="container.exec",
summary="Reading skill documentation",
)
]
)
# Summary line is italic, lives between header and fence,
# all inside the blockquote prefix.
assert "> *Reading skill documentation*" in out
# Order: header before summary before fence
header_idx = out.index("Result: container.exec")
summary_idx = out.index("Reading skill documentation")
fence_idx = out.index("output")
assert header_idx < summary_idx < fence_idx
def test_image_placeholder_rendering(self):
out = render_blocks_to_markdown(
[make_image_placeholder(ref="file-123", source="user_upload")]
)
assert "🖼️ **Image attached**" in out
assert "`file-123`" in out
assert "user_upload" in out
assert "content not preserved" in out
def test_file_placeholder_with_metadata(self):
out = render_blocks_to_markdown(
[make_file_placeholder(ref="sediment://x", mime="audio/wav", size_bytes=10240, duration_seconds=2.5)]
)
assert "📎 **File attached**" in out
assert "audio/wav" in out
assert "KB" in out
assert "2.50s" in out
def test_unknown_block_renders_with_keys(self):
out = render_blocks_to_markdown(
[
make_unknown_block(
raw_type="future_x",
observed_keys=["foo", "bar"],
reason=UNKNOWN_REASON_UNKNOWN_TYPE,
)
]
)
assert "⚠️ **Unsupported content**" in out
assert "future_x" in out
assert "`foo`" in out
assert "`bar`" in out
def test_unknown_extraction_failed_includes_summary(self):
out = render_blocks_to_markdown(
[
make_unknown_block(
raw_type="audio_transcription",
observed_keys=["asset_pointer"],
reason=UNKNOWN_REASON_EXTRACTION_FAILED,
summary="expected key 'text' not found",
)
]
)
assert "extraction_failed" in out
assert "expected key 'text' not found" in out
def test_hidden_context_marker(self):
out = render_blocks_to_markdown(
[make_hidden_context_marker("user_editable_context")]
)
assert " **Hidden context**" in out
assert "`user_editable_context`" in out
def test_safe_fence_prevents_runaway_code_block(self):
# Content contains an unbalanced opening fence — without _safe_fence
# this would corrupt downstream rendering.
evil_content = "before\n```Follow\ntext\nraw is: \"```"
block = make_code_block(evil_content)
out = render_blocks_to_markdown([block, make_text_block("after")])
# The 4-backtick wrap should be present
assert "````" in out
# The "after" text should appear OUTSIDE any code block — it follows
# the closing ```` fence.
assert out.endswith("after")
def test_block_order_preserved(self):
blocks = [
make_text_block("a"),
make_image_placeholder(ref="r1", source="user_upload"),
make_text_block("b"),
]
out = render_blocks_to_markdown(blocks)
assert out.index("a") < out.index("Image attached")
assert out.index("Image attached") < out.index("b")
# ---------------------------------------------------------------------------
# Markdown exporter with blocks
# ---------------------------------------------------------------------------
SAMPLE_CONV_BLOCKS = {
"id": "blocks12345",
"title": "Blocks Conversation",
"provider": "claude",
"project": None,
"created_at": "2024-06-10T14:32:00Z",
"updated_at": "2024-06-10T15:00:00Z",
"message_count": 1,
"messages": [
{
"role": "assistant",
"content_type": "text",
"timestamp": None,
"blocks": [
{"type": "text", "text": "Here is the answer."},
{"type": "tool_use", "name": "search", "input": {"q": "x"}, "tool_id": "t1"},
],
}
],
}
class TestMarkdownExporterWithBlocks:
def test_renders_blocks(self, tmp_path):
exp = MarkdownExporter(tmp_path)
path = exp.export(SAMPLE_CONV_BLOCKS)
body = path.read_text()
assert "Here is the answer." in body
assert "🔧 **Tool: search**" in body
def test_falls_back_to_content_when_blocks_missing(self, tmp_path):
# Backward-compat: messages with `content` only (no `blocks`) still render.
exp = MarkdownExporter(tmp_path)
path = exp.export(SAMPLE_CONV) # SAMPLE_CONV has content only, no blocks
body = path.read_text()
assert "Hello, how are you?" in body
def test_skips_messages_with_neither_blocks_nor_content(self, tmp_path):
conv = {
**SAMPLE_CONV_BLOCKS,
"messages": [
{"role": "user", "content_type": "text", "timestamp": None, "blocks": []},
{"role": "assistant", "content_type": "text", "timestamp": None, "blocks": [
{"type": "text", "text": "I am here."}
]},
],
}
exp = MarkdownExporter(tmp_path)
path = exp.export(conv)
body = path.read_text()
assert "I am here." in body


@@ -5,7 +5,7 @@ from unittest.mock import MagicMock, patch
import pytest
import requests
from src.joplin import JoplinClient, JoplinError, _http_error_message, _timeout_message, notebook_path
# ---------------------------------------------------------------------------
@@ -31,25 +31,29 @@ def _mock_response(json_data=None, text="", status_code=200):
# ---------------------------------------------------------------------------
# notebook_path helper
# ---------------------------------------------------------------------------
class TestNotebookPath:
def test_no_project(self):
assert notebook_path("chatgpt", None) == ("AI-ChatGPT", "No Project")
def test_no_project_string(self):
assert notebook_path("chatgpt", "no-project") == ("AI-ChatGPT", "No Project")
def test_project_with_hyphens(self):
assert notebook_path("chatgpt", "my-project") == ("AI-ChatGPT", "My Project")
def test_claude_provider(self):
assert notebook_path("claude", "budget-tracker") == ("AI-Claude", "Budget Tracker")
def test_multi_word_project(self):
assert notebook_path("claude", "ai-research-notes") == ("AI-Claude", "Ai Research Notes")
def test_returns_tuple(self):
result = notebook_path("chatgpt", "some-project")
assert isinstance(result, tuple) and len(result) == 2
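Taken together, the assertions describe a mapping that could be sketched like this — an illustration of the contract, not the shipped `notebook_path`; the provider display names are read off the expected tuples above:

```python
from typing import Optional, Tuple

_ROOTS = {"chatgpt": "AI-ChatGPT", "claude": "AI-Claude"}

def notebook_path(provider: str, project: Optional[str]) -> Tuple[str, str]:
    root = _ROOTS[provider]
    # None and the "no-project" sentinel both land in a "No Project" child.
    if not project or project == "no-project":
        return (root, "No Project")
    # Hyphenated slug -> capitalized words ("ai-research-notes" -> "Ai Research Notes").
    return (root, " ".join(w.capitalize() for w in project.split("-")))
```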
# ---------------------------------------------------------------------------
@@ -236,18 +240,30 @@ class TestListNotebooks:
class TestGetOrCreateNotebook:
def test_returns_existing_root_notebook_id(self):
client = _make_client()
with patch("requests.get") as mock_get:
mock_get.return_value = _mock_response(
json_data={
"items": [{"id": "nb-existing", "title": "AI-ChatGPT", "parent_id": ""}],
"has_more": False,
}
)
nb_id = client.get_or_create_notebook("AI-ChatGPT")
assert nb_id == "nb-existing"
def test_returns_existing_child_notebook_id(self):
client = _make_client()
with patch("requests.get") as mock_get:
mock_get.return_value = _mock_response(
json_data={
"items": [{"id": "nb-child", "title": "No Project", "parent_id": "nb-parent"}],
"has_more": False,
}
)
nb_id = client.get_or_create_notebook("No Project", parent_id="nb-parent")
assert nb_id == "nb-child"
def test_creates_new_notebook_when_not_found(self):
client = _make_client()
with patch("requests.get") as mock_get, patch("requests.post") as mock_post:
@@ -255,26 +271,103 @@ class TestGetOrCreateNotebook:
json_data={"items": [], "has_more": False} json_data={"items": [], "has_more": False}
) )
mock_post.return_value = _mock_response( mock_post.return_value = _mock_response(
json_data={"id": "nb-new", "title": "ChatGPT - New Project"} json_data={"id": "nb-new", "title": "AI-ChatGPT"}
) )
nb_id = client.get_or_create_notebook("ChatGPT - New Project") nb_id = client.get_or_create_notebook("AI-ChatGPT")
assert nb_id == "nb-new" assert nb_id == "nb-new"
mock_post.assert_called_once() mock_post.assert_called_once()
def test_creates_child_notebook_with_parent_id(self):
client = _make_client()
with patch("requests.get") as mock_get, patch("requests.post") as mock_post:
mock_get.return_value = _mock_response(
json_data={"items": [], "has_more": False}
)
mock_post.return_value = _mock_response(
json_data={"id": "nb-child", "title": "My Project"}
)
nb_id = client.get_or_create_notebook("My Project", parent_id="nb-parent")
assert nb_id == "nb-child"
_, kwargs = mock_post.call_args
assert kwargs["json"]["parent_id"] == "nb-parent"
def test_does_not_include_parent_id_for_root(self):
client = _make_client()
with patch("requests.get") as mock_get, patch("requests.post") as mock_post:
mock_get.return_value = _mock_response(json_data={"items": [], "has_more": False})
mock_post.return_value = _mock_response(json_data={"id": "nb-root", "title": "AI-Claude"})
client.get_or_create_notebook("AI-Claude")
_, kwargs = mock_post.call_args
assert "parent_id" not in kwargs["json"]
def test_caches_notebook_after_first_load(self):
client = _make_client()
with patch("requests.get") as mock_get:
mock_get.return_value = _mock_response(
json_data={
"items": [{"id": "nb1", "title": "AI-Claude", "parent_id": ""}],
"has_more": False,
}
)
# Call twice — GET /folders should only happen once
client.get_or_create_notebook("AI-Claude")
client.get_or_create_notebook("AI-Claude")
assert mock_get.call_count == 1
def test_different_parent_ids_are_distinct_cache_entries(self):
"""Same title under different parents are different notebooks."""
client = _make_client()
with patch("requests.get") as mock_get:
mock_get.return_value = _mock_response(
json_data={
"items": [
{"id": "nb-a", "title": "No Project", "parent_id": "parent-chatgpt"},
{"id": "nb-b", "title": "No Project", "parent_id": "parent-claude"},
],
"has_more": False,
}
)
id_a = client.get_or_create_notebook("No Project", parent_id="parent-chatgpt")
id_b = client.get_or_create_notebook("No Project", parent_id="parent-claude")
assert id_a == "nb-a"
assert id_b == "nb-b"
class TestGetOrCreateNotebookPath:
def test_creates_two_level_path(self):
client = _make_client()
with patch("requests.get") as mock_get, patch("requests.post") as mock_post:
mock_get.return_value = _mock_response(json_data={"items": [], "has_more": False})
mock_post.side_effect = [
_mock_response(json_data={"id": "nb-parent", "title": "AI-ChatGPT"}),
_mock_response(json_data={"id": "nb-child", "title": "No Project"}),
]
leaf_id = client.get_or_create_notebook_path(["AI-ChatGPT", "No Project"])
assert leaf_id == "nb-child"
assert mock_post.call_count == 2
# Second POST should use the parent's ID
_, kwargs = mock_post.call_args_list[1]
assert kwargs["json"]["parent_id"] == "nb-parent"
def test_reuses_existing_parent_for_new_child(self):
client = _make_client()
with patch("requests.get") as mock_get, patch("requests.post") as mock_post:
mock_get.return_value = _mock_response(
json_data={
"items": [{"id": "nb-parent", "title": "AI-Claude", "parent_id": ""}],
"has_more": False,
}
)
mock_post.return_value = _mock_response(
json_data={"id": "nb-child", "title": "Budget Tracker"}
)
leaf_id = client.get_or_create_notebook_path(["AI-Claude", "Budget Tracker"])
assert leaf_id == "nb-child"
# Only one POST — the parent already existed
assert mock_post.call_count == 1
_, kwargs = mock_post.call_args
assert kwargs["json"]["parent_id"] == "nb-parent"
# ---------------------------------------------------------------------------
# create_note


@@ -1,19 +1,53 @@
"""Unit tests for src/providers/ using fixture files.""" """Unit tests for src/providers/ using fixture files."""
import json import json
import logging
from pathlib import Path
import pytest
from src.blocks import (
BLOCK_TYPE_FILE_PLACEHOLDER,
BLOCK_TYPE_HIDDEN_CONTEXT_MARKER,
BLOCK_TYPE_IMAGE_PLACEHOLDER,
BLOCK_TYPE_TEXT,
BLOCK_TYPE_THINKING,
BLOCK_TYPE_TOOL_RESULT,
BLOCK_TYPE_TOOL_USE,
BLOCK_TYPE_UNKNOWN,
render_blocks_to_markdown,
)
from src.loss_report import LossReport
FIXTURES = Path(__file__).parent / "fixtures"
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _block_types(message: dict) -> list[str]:
return [b.get("type") for b in (message.get("blocks") or [])]
def _first_block(message: dict, block_type: str) -> dict | None:
for b in message.get("blocks") or []:
if b.get("type") == block_type:
return b
return None
# ---------------------------------------------------------------------------
# ChatGPT
# ---------------------------------------------------------------------------
class TestChatGPTNormalization:
"""ChatGPT normalize_conversation block-extraction behavior."""
def _get_provider(self):
from src.providers.chatgpt import ChatGPTProvider
# Bypass __init__ token check
p = ChatGPTProvider.__new__(ChatGPTProvider)
import requests
p._session = requests.Session()
@@ -31,7 +65,6 @@ class TestChatGPTNormalization:
assert result["id"] == "chatgpt-conv-001" assert result["id"] == "chatgpt-conv-001"
assert result["title"] == "Python Async Tutorial" assert result["title"] == "Python Async Tutorial"
assert result["provider"] == "chatgpt" assert result["provider"] == "chatgpt"
# No entry in _project_map → project is None
assert result["project"] is None assert result["project"] is None
assert result["created_at"] != "" assert result["created_at"] != ""
assert result["updated_at"] != "" assert result["updated_at"] != ""
@@ -46,7 +79,6 @@ class TestChatGPTNormalization:
assert result["id"] == "chatgpt-conv-002" assert result["id"] == "chatgpt-conv-002"
def test_normalizes_with_project_from_map(self): def test_normalizes_with_project_from_map(self):
"""Project name from _project_map (populated by fetch_all_conversations) flows through."""
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
p = self._get_provider()
p._project_map["chatgpt-conv-001"] = "My Research Project"
@@ -54,33 +86,134 @@ class TestChatGPTNormalization:
assert result["project"] == "My Research Project" assert result["project"] == "My Research Project"
def test_extracts_text_messages(self): def test_text_message_emits_text_block(self):
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text()) raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
p = self._get_provider() p = self._get_provider()
result = p.normalize_conversation(raw) result = p.normalize_conversation(raw)
assert len(result["messages"]) >= 2
user_msgs = [m for m in result["messages"] if m["role"] == "user"] user_msgs = [m for m in result["messages"] if m["role"] == "user"]
assert any("async" in m["content"].lower() for m in user_msgs) # The "How does async/await..." message
async_msgs = [
m for m in user_msgs
if any(
"async" in (b.get("text") or "").lower()
for b in (m.get("blocks") or [])
)
]
assert async_msgs, "expected a user message about async/await"
assert _block_types(async_msgs[0]) == [BLOCK_TYPE_TEXT]
def test_code_block_preserved_with_language(self):
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
p = self._get_provider()
result = p.normalize_conversation(raw)
assistant_msgs = [m for m in result["messages"] if m["role"] == "assistant"]
# The first assistant message is the async/await answer with a python fence
text_block = _first_block(assistant_msgs[0], BLOCK_TYPE_TEXT)
assert text_block is not None
assert "```python" in text_block["text"]
def test_multimodal_voice_user_message(self):
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
p = self._get_provider()
result = p.normalize_conversation(raw)
# node-mm-user: audio_transcription "What is the capital of France?"
# + real_time_user_audio_video_asset_pointer wrapping a sediment:// URL
capital_msgs = [
m for m in result["messages"]
if any(
"capital of france" in (b.get("text") or "").lower()
for b in (m.get("blocks") or [])
)
]
assert capital_msgs, "expected the audio_transcription text to surface"
types = _block_types(capital_msgs[0])
assert BLOCK_TYPE_TEXT in types
assert BLOCK_TYPE_FILE_PLACEHOLDER in types
file_block = _first_block(capital_msgs[0], BLOCK_TYPE_FILE_PLACEHOLDER)
assert file_block["ref"].startswith("sediment://")
assert file_block["mime"] == "audio/wav"
assert file_block["size_bytes"] == 50000
assert file_block["duration_seconds"] == pytest.approx(2.5)
def test_multimodal_voice_reverse_order_preserved(self):
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
p = self._get_provider()
result = p.normalize_conversation(raw)
# node-mm-user-rev has parts in REVERSE order: asset first, transcription second.
rev_msgs = [
m for m in result["messages"]
if any(
"tell me more" in (b.get("text") or "").lower()
for b in (m.get("blocks") or [])
)
]
assert rev_msgs, "expected the reverse-order voice message"
types = _block_types(rev_msgs[0])
# Order preserved: file_placeholder before text
assert types == [BLOCK_TYPE_FILE_PLACEHOLDER, BLOCK_TYPE_TEXT]
def test_image_only_user_message_renders(self):
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
p = self._get_provider()
result = p.normalize_conversation(raw)
image_msgs = [
m for m in result["messages"]
if any(b.get("type") == BLOCK_TYPE_IMAGE_PLACEHOLDER for b in (m.get("blocks") or []))
]
assert image_msgs, "image-only user message should now render"
def test_user_editable_context_emits_blocks(self):
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
p = self._get_provider()
result = p.normalize_conversation(raw)
# The user_editable_context message has user_profile + user_instructions.
# It should now appear (was silently dropped pre-v0.4.0).
uec_msgs = [
m for m in result["messages"]
if any(
"Custom Instructions" in (b.get("text") or "")
for b in (m.get("blocks") or [])
)
]
assert uec_msgs, "user_editable_context should be visible in output"
# Hidden context marker should be prepended.
assert uec_msgs[0]["blocks"][0]["type"] == BLOCK_TYPE_HIDDEN_CONTEXT_MARKER
def test_user_editable_context_uses_safe_fence(self):
"""The user_instructions value contains embedded triple-backticks; the rendered
Markdown must use a fence longer than 3 backticks so embedded fences are inert.
"""
from src.blocks import render_blocks_to_markdown
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
p = self._get_provider()
result = p.normalize_conversation(raw)
uec_msgs = [
m for m in result["messages"]
if any(
"Custom Instructions" in (b.get("text") or "")
for b in (m.get("blocks") or [])
)
]
assert uec_msgs
rendered = render_blocks_to_markdown(uec_msgs[0]["blocks"])
# Content has ``` inside, so the wrap fence must be at least 4 backticks.
assert "````" in rendered, "expected a 4+ backtick safe-fence wrap"
def test_message_roles_are_valid(self):
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
p = self._get_provider()
result = p.normalize_conversation(raw)
for msg in result["messages"]:
assert msg["role"] in ("user", "assistant", "system", "tool")
def test_message_count_matches(self):
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
@@ -88,16 +221,82 @@ class TestChatGPTNormalization:
result = p.normalize_conversation(raw)
assert result["message_count"] == len(result["messages"])
def test_loss_report_records_messages(self):
raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
p = self._get_provider()
report = LossReport()
result = p.normalize_conversation(raw, report)
assert report.messages_rendered == len(result["messages"])
assert report.conversations == 1
class TestChatGPTUnknownContent:
"""Unrecognised content types should produce visible unknown blocks + WARNING + tally."""
def _get_provider(self):
from src.providers.chatgpt import ChatGPTProvider
p = ChatGPTProvider.__new__(ChatGPTProvider)
import requests
p._session = requests.Session()
p._org_id = None
p._project_ids = []
p._project_map = {}
p._project_name_cache = {}
return p
def _make_unknown_conv(self):
return {
"id": "test-unknown",
"title": "Test",
"create_time": 1700000000.0,
"update_time": 1700000001.0,
"mapping": {
"root": {"id": "root", "message": None, "parent": None, "children": ["msg1"]},
"msg1": {
"id": "msg1",
"message": {
"id": "msg1",
"author": {"role": "user"},
"content": {
"content_type": "future_unknown_type_xyz",
"some_field": "value",
},
},
"parent": "root",
"children": [],
},
},
}
def test_unknown_content_type_produces_unknown_block(self):
p = self._get_provider()
result = p.normalize_conversation(self._make_unknown_conv())
assert any(
b.get("type") == BLOCK_TYPE_UNKNOWN
for m in result["messages"]
for b in (m.get("blocks") or [])
)
def test_unknown_content_type_logs_warning(self, caplog):
p = self._get_provider()
with caplog.at_level(logging.WARNING):
p.normalize_conversation(self._make_unknown_conv())
assert any("future_unknown_type_xyz" in r.message for r in caplog.records)
def test_unknown_content_type_increments_loss_report(self):
p = self._get_provider()
report = LossReport()
p.normalize_conversation(self._make_unknown_conv(), report)
assert report.unknown_blocks["future_unknown_type_xyz"] == 1
# ---------------------------------------------------------------------------
# Claude
# ---------------------------------------------------------------------------
class TestClaudeNormalization:
"""Claude normalize_conversation block-extraction behavior."""
def _get_provider(self):
from src.providers.claude import ClaudeProvider
@@ -117,55 +316,138 @@ class TestClaudeNormalization:
assert result["provider"] == "claude" assert result["provider"] == "claude"
assert result["project"] == "StarTOS Packaging" assert result["project"] == "StarTOS Packaging"
assert result["created_at"] == "2024-06-10T14:32:00.000Z" assert result["created_at"] == "2024-06-10T14:32:00.000Z"
assert isinstance(result["messages"], list)
def test_normalizes_without_project(self):
raw = json.loads((FIXTURES / "claude_no_project.json").read_text())
p = self._get_provider()
result = p.normalize_conversation(raw)
assert result["project"] is None
assert result["id"] == "claude-conv-002"
def test_string_content_emits_text_block(self):
raw = json.loads((FIXTURES / "claude_conversation.json").read_text())
p = self._get_provider()
result = p.normalize_conversation(raw)
thanks_msgs = [
m for m in result["messages"]
if any(
"thank you" in (b.get("text") or "").lower()
for b in (m.get("blocks") or [])
)
]
assert thanks_msgs
    def test_list_content_emits_blocks_in_order(self):
        raw = json.loads((FIXTURES / "claude_conversation.json").read_text())
        p = self._get_provider()
        result = p.normalize_conversation(raw)
        assistant_msgs = [m for m in result["messages"] if m["role"] == "assistant"]
        # msg-002 has text + tool_use, in that order.
        assert assistant_msgs
        types = _block_types(assistant_msgs[0])
        assert BLOCK_TYPE_TEXT in types
        assert BLOCK_TYPE_TOOL_USE in types
        # Order preserved
        assert types.index(BLOCK_TYPE_TEXT) < types.index(BLOCK_TYPE_TOOL_USE)
    def test_tool_use_block_fields(self):
        raw = json.loads((FIXTURES / "claude_conversation.json").read_text())
        p = self._get_provider()
        result = p.normalize_conversation(raw)
        assistant_msgs = [m for m in result["messages"] if m["role"] == "assistant"]
        tool_block = _first_block(assistant_msgs[0], BLOCK_TYPE_TOOL_USE)
        assert tool_block["name"] == "search"
        assert tool_block["input"] == {"query": "startOS docs"}
        assert tool_block["tool_id"] == "tool-001"
    def test_image_block_emits_image_placeholder(self):
        raw = json.loads((FIXTURES / "claude_conversation.json").read_text())
        p = self._get_provider()
        result = p.normalize_conversation(raw)
        msg004 = [
            m for m in result["messages"]
            if any(b.get("type") == BLOCK_TYPE_IMAGE_PLACEHOLDER for b in (m.get("blocks") or []))
        ]
        assert msg004
        img = _first_block(msg004[0], BLOCK_TYPE_IMAGE_PLACEHOLDER)
        assert img["ref"] == "claude-image-uuid-1"
    def test_unknown_block_type_records_loss(self):
        from src.blocks import BLOCK_TYPE_UNKNOWN as _UNK

        raw = {
            "uuid": "test-unknown",
            "name": "T",
            "chat_messages": [
                {
                    "uuid": "m1",
                    "sender": "human",
                    "content": [{"type": "future_block_xyz", "data": "..."}],
                }
            ],
        }
        p = self._get_provider()
        report = LossReport()
        result = p.normalize_conversation(raw, report)
        assert any(
            b.get("type") == _UNK
            for m in result["messages"]
            for b in (m.get("blocks") or [])
        )
        assert report.unknown_blocks["future_block_xyz"] == 1
    def test_thinking_block(self):
        raw = {
            "uuid": "thinking-test",
            "name": "T",
            "chat_messages": [
                {
                    "uuid": "m1",
                    "sender": "assistant",
                    "content": [
                        {"type": "thinking", "thinking": "Let me reason about this."},
                        {"type": "text", "text": "Here's the answer."},
                    ],
                }
            ],
        }
        p = self._get_provider()
        result = p.normalize_conversation(raw)
        types = _block_types(result["messages"][0])
        assert BLOCK_TYPE_THINKING in types
        assert BLOCK_TYPE_TEXT in types
    def test_tool_result_with_nested_text_blocks(self):
        raw = {
            "uuid": "tool-result-test",
            "name": "T",
            "chat_messages": [
                {
                    "uuid": "m1",
                    "sender": "assistant",
                    "content": [
                        {
                            "type": "tool_result",
                            "tool_use_id": "tool-001",
                            "content": [
                                {"type": "text", "text": "search hit 1"},
                                {"type": "text", "text": "search hit 2"},
                            ],
                            "is_error": False,
                        }
                    ],
                }
            ],
        }
        p = self._get_provider()
        result = p.normalize_conversation(raw)
        tool_result = _first_block(result["messages"][0], BLOCK_TYPE_TOOL_RESULT)
        assert tool_result is not None
        assert "search hit 1" in tool_result["output"]
        assert "search hit 2" in tool_result["output"]
        assert tool_result["is_error"] is False
    def test_human_sender_maps_to_user(self):
        raw = json.loads((FIXTURES / "claude_conversation.json").read_text())
@@ -174,3 +456,188 @@ class TestClaudeNormalization:
        roles = {m["role"] for m in result["messages"]}
        assert "user" in roles
        assert "human" not in roles
    def test_loss_report_messages_recorded(self):
        raw = json.loads((FIXTURES / "claude_conversation.json").read_text())
        p = self._get_provider()
        report = LossReport()
        result = p.normalize_conversation(raw, report)
        assert report.messages_rendered == len(result["messages"])
# ---------------------------------------------------------------------------
# v0.4.1 — execution_output, system_error, tether_browsing_display, conv_id
# ---------------------------------------------------------------------------
class TestChatGPTToolOutputs:
    """v0.4.1 ChatGPT tool-role content_types map onto tool_result blocks."""

    def _get_provider(self):
        from src.providers.chatgpt import ChatGPTProvider
        import requests

        p = ChatGPTProvider.__new__(ChatGPTProvider)
        p._session = requests.Session()
        p._org_id = None
        p._project_ids = []
        p._project_map = {}
        p._project_name_cache = {}
        return p
    def test_execution_output_emits_tool_result_with_metadata(self):
        raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
        p = self._get_provider()
        result = p.normalize_conversation(raw)
        exec_msgs = [
            m for m in result["messages"]
            if any(
                b.get("type") == BLOCK_TYPE_TOOL_RESULT
                and b.get("tool_name") == "container.exec"
                for b in (m.get("blocks") or [])
            )
        ]
        assert exec_msgs, "expected execution_output to render as tool_result"
        block = next(
            b for b in exec_msgs[0]["blocks"] if b.get("type") == BLOCK_TYPE_TOOL_RESULT
        )
        assert block["output"].startswith("Hello from container.exec")
        assert block["is_error"] is False
        assert block["summary"] == "Reading skill documentation"
    def test_execution_output_message_role_is_tool(self):
        raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
        p = self._get_provider()
        result = p.normalize_conversation(raw)
        tool_msgs = [m for m in result["messages"] if m["role"] == "tool"]
        assert tool_msgs, "tool-role messages must pass through (filter lifted in v0.4.0)"
    def test_empty_execution_output_skipped(self, caplog):
        raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
        p = self._get_provider()
        with caplog.at_level(logging.DEBUG, logger="src.providers.chatgpt"):
            result = p.normalize_conversation(raw)
        # The empty execution_output (author.name="python") must NOT appear.
        python_msgs = [
            m for m in result["messages"]
            if any(
                b.get("type") == BLOCK_TYPE_TOOL_RESULT and b.get("tool_name") == "python"
                for b in (m.get("blocks") or [])
            )
        ]
        assert not python_msgs, "empty execution_output should be skipped"
        assert any("Skipping empty execution_output" in r.message for r in caplog.records)
    def test_system_error_emits_error_tool_result(self):
        raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
        p = self._get_provider()
        result = p.normalize_conversation(raw)
        web_err = [
            m for m in result["messages"]
            if any(
                b.get("type") == BLOCK_TYPE_TOOL_RESULT
                and b.get("tool_name") == "web"
                and b.get("is_error") is True
                for b in (m.get("blocks") or [])
            )
        ]
        assert web_err, "system_error should render as tool_result with is_error=True"
        block = next(b for b in web_err[0]["blocks"] if b.get("tool_name") == "web")
        assert "503" in block["output"]
    def test_tether_browsing_display_spinner_skipped(self, caplog):
        raw = json.loads((FIXTURES / "chatgpt_conversation.json").read_text())
        p = self._get_provider()
        with caplog.at_level(logging.DEBUG, logger="src.providers.chatgpt"):
            result = p.normalize_conversation(raw)
        spinner_msgs = [
            m for m in result["messages"]
            if any(
                b.get("type") == BLOCK_TYPE_TOOL_RESULT and b.get("tool_name") == "file_search"
                for b in (m.get("blocks") or [])
            )
        ]
        assert not spinner_msgs, "spinner tether_browsing_display should be skipped"
        assert any("tether_browsing_display spinner" in r.message for r in caplog.records)
    def test_tether_browsing_display_populated_renders_defensively(self):
        """Defensive case (never observed in real data) — populated browse renders."""
        conv = {
            "id": "test-tether",
            "title": "T",
            "create_time": 1700000000.0,
            "update_time": 1700000001.0,
            "mapping": {
                "root": {"id": "root", "message": None, "parent": None, "children": ["m1"]},
                "m1": {
                    "id": "m1",
                    "parent": "root",
                    "children": [],
                    "message": {
                        "id": "m1",
                        "author": {"role": "tool", "name": "browser"},
                        "content": {
                            "content_type": "tether_browsing_display",
                            "result": "Found 3 results about kubernetes ingress.",
                            "summary": "ingress search",
                            "assets": None,
                            "tether_id": None,
                        },
                    },
                },
            },
        }
        p = self._get_provider()
        result = p.normalize_conversation(conv)
        assert any(
            b.get("type") == BLOCK_TYPE_TOOL_RESULT and b.get("tool_name") == "browser"
            for m in result["messages"]
            for b in (m.get("blocks") or [])
        )
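For orientation, the execution_output-to-tool_result mapping these tests pin down can be sketched as follows. This is a minimal sketch based on the v0.4.1 notes; `execution_output_to_block` and the exact dict shape are illustrative, not the provider's real internals:

```python
def execution_output_to_block(message: dict):
    """Sketch: map a ChatGPT execution_output message onto a tool_result block."""
    content = message.get("content") or {}
    text = content.get("text") or ""
    if not text.strip():
        return None  # empty spinner/placeholder output: caller skips with a DEBUG log
    meta = message.get("metadata") or {}
    aggregate = meta.get("aggregate_result") or {}
    return {
        "type": "tool_result",
        "output": text,                                      # content.text
        "tool_name": (message.get("author") or {}).get("name"),  # author.name
        "is_error": aggregate.get("status") not in (None, "success"),
        "summary": meta.get("reasoning_title"),              # optional italic line
    }
```

The same skeleton covers system_error (force `is_error=True`) and the defensive populated tether_browsing_display branch.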
class TestChatGPTConvIdFallback:
    """v0.4.1: live ChatGPT detail responses use conversation_id, not id."""

    def _get_provider(self):
        from src.providers.chatgpt import ChatGPTProvider
        import requests

        p = ChatGPTProvider.__new__(ChatGPTProvider)
        p._session = requests.Session()
        p._org_id = None
        p._project_ids = []
        p._project_map = {}
        p._project_name_cache = {}
        return p

    def test_falls_back_to_conversation_id(self):
        raw = {
            "conversation_id": "live-chatgpt-uuid",
            "title": "T",
            "create_time": 1700000000.0,
            "update_time": 1700000001.0,
            "mapping": {
                "root": {"id": "root", "message": None, "parent": None, "children": []},
            },
        }
        p = self._get_provider()
        result = p.normalize_conversation(raw)
        assert result["id"] == "live-chatgpt-uuid"

    def test_id_takes_precedence_when_both_present(self):
        raw = {
            "id": "from-id",
            "conversation_id": "from-conversation-id",
            "title": "T",
            "create_time": 1700000000.0,
            "update_time": 1700000001.0,
            "mapping": {
                "root": {"id": "root", "message": None, "parent": None, "children": []},
            },
        }
        p = self._get_provider()
        result = p.normalize_conversation(raw)
        assert result["id"] == "from-id"
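The fallback these two tests specify reduces to a one-liner; the helper name here is illustrative (the real code reads the keys inline in `chatgpt.normalize_conversation`):

```python
def resolve_conversation_id(raw: dict):
    # Prefer "id" (fixtures/exports); fall back to "conversation_id" (live API).
    return raw.get("id") or raw.get("conversation_id")
```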

tests/test_utils.py (new file, +147 lines)

@@ -0,0 +1,147 @@
"""Tests for src/utils.py — filename generation, path building, redaction."""

from pathlib import Path

import pytest

from src.utils import (
    build_export_path,
    format_token_status,
    generate_filename,
    redact_secrets,
)
class TestGenerateFilename:
    def test_basic_format(self):
        name = generate_filename("Hello World", "abc12345def", "2024-06-10T14:00:00Z")
        assert name == "2024-06-10_hello-world_abc12345.md"

    def test_special_chars_slugified(self):
        # T-36: titles with punctuation must produce safe, OS-compatible filenames
        name = generate_filename("What's this?! A test.", "abc12345", "2024-06-01T00:00:00Z")
        assert "?" not in name
        assert "!" not in name
        assert "'" not in name
        assert " " not in name
        assert name.startswith("2024-06-01_")
        assert name.endswith("_abc12345.md")

    def test_unicode_chars_handled(self):
        name = generate_filename("Héllo Wörld", "abc12345", "2024-06-01T00:00:00Z")
        assert " " not in name
        assert name.endswith("_abc12345.md")

    def test_empty_title_becomes_untitled(self):
        name = generate_filename("", "abc12345", "2024-06-01T00:00:00Z")
        assert "untitled" in name

    def test_id_truncated_to_8_chars(self):
        name = generate_filename("Test", "abcdefghijklmnop", "2024-06-01T00:00:00Z")
        assert name.endswith("_abcdefgh.md")

    def test_long_title_truncated(self):
        long_title = "a" * 200
        name = generate_filename(long_title, "abc12345", "2024-06-01T00:00:00Z")
        # Slug is capped at 60 chars by max_length
        slug_part = name.split("_")[1]
        assert len(slug_part) <= 60

    def test_date_comes_from_created_at(self):
        name = generate_filename("Test", "abc12345", "2023-11-25T00:00:00Z")
        assert name.startswith("2023-11-25_")
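The `date_slug_id8.md` contract these tests pin down can be sketched as below. This is an assumption-level sketch, not the real `src/utils.py` implementation; the NFKD-transliteration step in particular is one plausible way to satisfy the unicode test:

```python
import re
import unicodedata


def generate_filename(title: str, conv_id: str, created_at: str, max_length: int = 60) -> str:
    """Sketch: YYYY-MM-DD prefix, slugified title capped at max_length, 8-char id."""
    # Strip accents to ASCII, lowercase, collapse non-alphanumerics to "-".
    ascii_title = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode()
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")
    slug = slug[:max_length] or "untitled"
    return f"{created_at[:10]}_{slug}_{conv_id[:8]}.md"
```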
class TestBuildExportPath:
    def test_default_structure_provider_project_year(self):
        path = build_export_path(
            Path("/exports"), "claude", "my-project", "2024-06-01T00:00:00Z", "file.md"
        )
        assert str(path) == "/exports/claude/my-project.2024/file.md"

    def test_no_project_uses_no_project_slug(self):
        path = build_export_path(
            Path("/exports"), "chatgpt", None, "2024-06-01T00:00:00Z", "file.md"
        )
        assert "no-project.2024" in str(path)

    def test_provider_project_structure_omits_year(self):
        path = build_export_path(
            Path("/exports"), "claude", "proj", "2024-06-01T00:00:00Z", "file.md",
            structure="provider/project",
        )
        assert "2024" not in str(path)
        assert "proj" in str(path)

    def test_provider_year_structure_omits_project(self):
        path = build_export_path(
            Path("/exports"), "claude", "proj", "2024-06-01T00:00:00Z", "file.md",
            structure="provider/year",
        )
        assert "proj" not in str(path)
        assert "2024" in str(path)

    def test_project_name_with_spaces_is_slugified(self):
        path = build_export_path(
            Path("/exports"), "claude", "My Project Name!", "2024-06-01T00:00:00Z", "file.md"
        )
        assert " " not in str(path)
        assert "!" not in str(path)
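The v0.5.0 flat provider/project.year layout these tests describe could look roughly like this; signature and structure-string values are taken from the tests, the slug helper is an inline assumption:

```python
import re
from pathlib import Path


def build_export_path(root: Path, provider: str, project, created_at: str,
                      filename: str, structure: str = "provider/project.year") -> Path:
    """Sketch: v0.5.0 collapses provider/project/year into provider/project.year."""
    year = created_at[:4]
    slug = re.sub(r"[^a-z0-9]+", "-", (project or "no-project").lower()).strip("-")
    if structure == "provider/project":
        return root / provider / slug / filename
    if structure == "provider/year":
        return root / provider / year / filename
    # Default flat layout: provider/<project-slug>.<year>/file.md
    return root / provider / f"{slug}.{year}" / filename
```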
class TestRedactSecrets:
    def test_token_value_redacted(self):
        data = {"token": "supersecret"}
        result = redact_secrets(data)
        assert result["token"] == "[REDACTED]"

    def test_session_key_redacted(self):
        result = redact_secrets({"sessionKey": "abc123"})
        assert result["sessionKey"] == "[REDACTED]"

    def test_non_sensitive_key_unchanged(self):
        result = redact_secrets({"title": "My Chat", "id": "abc123"})
        assert result["title"] == "My Chat"
        assert result["id"] == "abc123"

    def test_nested_dict_redacted(self):
        data = {"user": {"token": "secret", "name": "Alice"}}
        result = redact_secrets(data)
        assert result["user"]["token"] == "[REDACTED]"
        assert result["user"]["name"] == "Alice"

    def test_list_of_dicts(self):
        data = [{"password": "p@ss"}, {"title": "chat"}]
        result = redact_secrets(data)
        assert result[0]["password"] == "[REDACTED]"
        assert result[1]["title"] == "chat"
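A recursive walk is the natural shape for the behavior above. Sketch only: the exact sensitive-key set in `src/utils.py` is not shown in this diff, so the list here is an assumption covering the keys the tests use:

```python
SENSITIVE_KEYS = {"token", "sessionkey", "password", "secret", "apikey", "api_key"}


def redact_secrets(obj):
    """Sketch: recursively replace values of sensitive-looking keys, case-insensitively."""
    if isinstance(obj, dict):
        return {
            k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else redact_secrets(v)
            for k, v in obj.items()
        }
    if isinstance(obj, list):
        return [redact_secrets(v) for v in obj]
    return obj  # scalars pass through unchanged
```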
class TestFormatTokenStatus:
    def test_none_token_returns_not_set(self):
        assert format_token_status(None) == "[NOT SET]"

    def test_empty_token_returns_not_set(self):
        assert format_token_status("") == "[NOT SET]"

    def test_set_token_no_expiry(self):
        assert format_token_status("sometoken") == "[SET]"

    def test_expired_token(self):
        from datetime import datetime, timezone, timedelta

        expiry = datetime.now(tz=timezone.utc) - timedelta(days=1)
        result = format_token_status("tok", expiry)
        assert "EXPIRED" in result

    def test_expiring_today_shows_hours(self):
        from datetime import datetime, timezone, timedelta

        expiry = datetime.now(tz=timezone.utc) + timedelta(hours=3)
        result = format_token_status("tok", expiry)
        assert "expires in" in result
        assert "h" in result

    def test_expiring_in_days(self):
        from datetime import datetime, timezone, timedelta

        expiry = datetime.now(tz=timezone.utc) + timedelta(days=10, hours=12)
        result = format_token_status("tok", expiry)
        assert "10 days" in result
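The status strings these tests assert on suggest an implementation along these lines; the exact bracket wording is an assumption, only the substrings the tests check ("[NOT SET]", "[SET]", "EXPIRED", "expires in", "h", "days") are pinned down:

```python
from datetime import datetime, timezone


def format_token_status(token, expiry=None) -> str:
    """Sketch: human-readable token status with optional expiry countdown."""
    if not token:
        return "[NOT SET]"
    if expiry is None:
        return "[SET]"
    delta = expiry - datetime.now(tz=timezone.utc)
    if delta.total_seconds() <= 0:
        return "[SET, EXPIRED]"
    if delta.days < 1:
        # Same-day expiry: report whole hours remaining.
        return f"[SET, expires in {int(delta.total_seconds() // 3600)}h]"
    return f"[SET, expires in {delta.days} days]"
```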