Planned Future Work
Items completed in each release are moved to the changelog. The items below have been designed for but are not yet implemented; the codebase is structured to make each of these additions straightforward.
Completed:
- v0.1.0 — Core export: ChatGPT + Claude, incremental sync, Markdown + JSON output
- v0.2.0 — Joplin import automation (joplin command, create/update notes, notebook auto-creation)
- v0.4.0 — Rich content support: typed message blocks (text, code, thinking, tool_use, tool_result, image_placeholder, file_placeholder, unknown); ChatGPT voice transcripts as text + audio placeholders; Custom Instructions extraction; data-loss visibility via LossReport summary and visible unknown blocks
Export --force Flag (v0.2.x)
Add --force to the export command to re-export already-cached conversations
without permanently clearing the entire manifest. Useful for re-generating files
after changing the Markdown template or output structure.
Implementation: pass a force=True flag to cache.get_new_or_updated(), which
returns all conversations regardless of cache state when force is True.
Current workaround: python -m src.main cache --clear then re-run export.
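Sketched below, assuming the manifest maps conversation IDs to last-seen update timestamps (the dict shape and field names are illustrative, not the real manifest schema):

```python
class Cache:
    def __init__(self, manifest=None):
        self._manifest = manifest or {}  # assumed shape: id -> last-seen updated_at

    def get_new_or_updated(self, conversations, force=False):
        if force:
            # --force: return everything, ignoring cache state entirely
            return list(conversations)
        return [
            c for c in conversations
            if self._manifest.get(c["id"]) != c["updated_at"]
        ]
```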
Joplin --force Flag (v0.2.x)
Similarly, add --force to the joplin command to re-sync all cached
conversations to Joplin regardless of whether they've been synced before.
Useful after making formatting changes to the Markdown exporter.
Implementation: in get_joplin_pending(), return all entries that have a
file_path when force=True, ignoring joplin_synced_at.
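A sketch under the same caveat; the entry fields come from the description above, the surrounding structure is assumed:

```python
def get_joplin_pending(manifest, force=False):
    entries = manifest.values()
    if force:
        # --force: everything already exported to disk is eligible
        return [e for e in entries if e.get("file_path")]
    return [
        e for e in entries
        if e.get("file_path") and not e.get("joplin_synced_at")
    ]
```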
Per-Conversation Cache Reset (v0.2.x)
Add cache --reset --conversation <id> to force re-export or re-sync of a
single conversation without clearing the entire provider cache.
Current workaround: manually edit ~/.ai-chat-exporter/manifest.json and
delete the entry, then re-run export.
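This is essentially the manual workaround automated. A sketch, assuming the manifest is a flat JSON object keyed by conversation ID (the real schema may nest entries per provider):

```python
import json
from pathlib import Path

MANIFEST = Path.home() / ".ai-chat-exporter" / "manifest.json"

def reset_conversation(conversation_id):
    data = json.loads(MANIFEST.read_text())
    if data.pop(conversation_id, None) is None:
        return False  # nothing cached under that ID
    MANIFEST.write_text(json.dumps(data, indent=2))
    return True
```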
Official API Fallback (v0.3.0)
If the unofficial internal web API approach breaks, migrate to official export file parsing as a fallback:
- ChatGPT: parse conversations.json from Settings → Export Data
- Claude: parse conversations.json from Settings → Privacy → Export Data
The BaseProvider abstract class is intentionally designed so that a
FileProvider subclass can implement the same interface
(list_conversations, get_conversation, normalize_conversation)
without any changes to cache, exporters, or CLI code.
To add this: implement src/providers/file_chatgpt.py and
src/providers/file_claude.py, then add --input-file flag to the
export command to accept a pre-downloaded export ZIP or JSON.
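A rough shape for the ChatGPT side, assuming the official export's conversations.json is a JSON array of conversation objects; the field names are illustrative:

```python
import json

class FileChatGPTProvider:
    """Hypothetical file-backed provider implementing the BaseProvider
    interface methods named above."""

    def __init__(self, input_file):
        with open(input_file, encoding="utf-8") as f:
            self._raw = json.load(f)

    def list_conversations(self):
        return [{"id": c.get("id"), "title": c.get("title")} for c in self._raw]

    def get_conversation(self, conversation_id):
        return next((c for c in self._raw if c.get("id") == conversation_id), None)

    def normalize_conversation(self, raw):
        raise NotImplementedError  # map the export schema onto the internal model
```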
Binary Content Downloads (v0.5.0)
v0.4.0 ships placeholders for images and audio assets but does not download
the binary content. The _safe_fence-wrapped placeholders include the asset
reference (sediment://... or file-service://...), MIME type, size, and
duration where available; the actual bytes are not preserved.
Next steps:
- Download attached images alongside the Markdown export, save under a media/ sibling directory with a stable filename derived from the asset reference.
- Replace image_placeholder rendering with an inline Markdown image reference once the file is on disk.
- Joplin integration: upload binaries as Joplin resources via POST /resources, rewrite the rendered Markdown to use :/resourceId references, and track the resource ID in the cache manifest so re-syncs stay idempotent.
- DALL-E images on the assistant side: not observed in this user's data; the code path exists (source = "model_generated") but is untested.
The block-level schema is already in place — only the file-fetch + rewrite
layer needs to be added. See the image_placeholder and file_placeholder
block definitions in src/blocks.py.
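For the Joplin leg, the upload is a single multipart request to the Data API's POST /resources endpoint. A sketch using requests; base_url and token handling are assumed to match the existing client:

```python
import json
import requests

def upload_resource(base_url, token, path, title):
    with open(path, "rb") as f:
        resp = requests.post(
            f"{base_url}/resources",
            params={"token": token},
            files={
                "data": (title, f),
                "props": (None, json.dumps({"title": title})),
            },
        )
    resp.raise_for_status()
    return resp.json()["id"]  # becomes the :/resourceId used in the Markdown
```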
Reclassify o1/o3 Reasoning Subparts (v0.4.1)
v0.4.0 leaves dict parts inside text content_type messages with shape
{"summary": ..., "content": ...} rendered as plain text (defensive — the
shape was inferred from a code comment, not captured live). Once a real
reasoning conversation is captured, reclassify these as thinking blocks.
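The change would likely be a one-branch addition to the part dispatch. A sketch; the returned dicts are illustrative stand-ins for the real constructors in src/blocks.py, and the shape check still needs live confirmation:

```python
def classify_text_part(part):
    if isinstance(part, dict) and {"summary", "content"} <= part.keys():
        # inferred o1/o3 reasoning shape -> thinking block
        return {"type": "thinking",
                "text": str(part.get("content", "")),
                "summary": part.get("summary")}
    return {"type": "text", "text": str(part)}
```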
Suppress Hidden Context (v0.4.x)
If Custom Instructions duplication across conversations becomes a storage
problem, add an EXPORTER_INCLUDE_HIDDEN_CONTEXT=false env var. The toggle is
a single os.getenv() check at the start of
_extract_editable_context_blocks in src/providers/chatgpt.py — return
empty list if disabled.
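The whole toggle, sketched with the existing extraction logic elided:

```python
import os

def _extract_editable_context_blocks(message):
    if os.getenv("EXPORTER_INCLUDE_HIDDEN_CONTEXT", "true").lower() == "false":
        return []  # suppress Custom Instructions / hidden context entirely
    ...  # existing extraction logic unchanged
```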
Scheduled / Watch Mode (v0.5.0)
Add a watch command (or cron integration helper) to run exports automatically
on a schedule:
python -m src.main watch --interval 6h # poll every 6 hours
This would run export + joplin in sequence, then sleep. Alternatively,
provide a cron command that prints the correct crontab line for the user's
setup.
Implementation: simple loop with time.sleep(), or emit a crontab entry
string that calls the export and joplin commands in sequence. A --once
flag would do a single run then exit (useful for cron itself).
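A sketch of the loop; run_export and run_joplin stand in for the existing command entry points, and interval parsing is left out:

```python
import time

def watch(interval_seconds, run_export, run_joplin, once=False):
    while True:
        run_export()
        run_joplin()
        if once:
            return  # --once: single pass, then exit (cron calls this)
        time.sleep(interval_seconds)
```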
Obsidian Vault Output (v0.5.0)
Add an obsidian command (or --target obsidian flag) to sync exported
conversations into an Obsidian vault directory. The current Markdown format
is already largely compatible; the main differences are:
- Obsidian uses YAML frontmatter properties (same format, already supported)
- Tags should use #tag inline or a tags: list in frontmatter (already done)
- Wikilinks ([[Title]]) instead of Markdown links — optional, Obsidian supports both
Implementation: the existing MarkdownExporter output is already valid in
Obsidian. An ObsidianSyncer class (mirroring JoplinClient) would simply
copy files to the vault directory and maintain a flat or nested folder
structure matching the user's Obsidian setup. No API needed — just file I/O.
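A sketch of that syncer, assuming the vault should mirror the EXPORT_DIR folder layout:

```python
import shutil
from pathlib import Path

class ObsidianSyncer:
    def __init__(self, vault_dir):
        self.vault_dir = Path(vault_dir)

    def sync(self, export_dir):
        for md_file in Path(export_dir).rglob("*.md"):
            dest = self.vault_dir / md_file.relative_to(export_dir)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(md_file, dest)  # preserves mtimes for change detection
```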
Joplin Nested Notebooks (future)
Currently notebooks are flat: ChatGPT - My Project. Joplin supports nested
notebooks via parent_id. A future option (JOPLIN_NESTED_NOTEBOOKS=true)
could create a two-level hierarchy:
ChatGPT/
  My Project/
  No Project/
Claude/
  Budget Tracker/
Implementation: get_or_create_notebook would first find/create the provider
notebook, then find/create the project notebook as a child.
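Sketched against the existing client; find_notebook and create_notebook are assumed helper names wrapping the Joplin folder endpoints:

```python
class JoplinClient:
    def get_or_create_notebook(self, provider_name, project_name):
        # Two-level lookup: provider notebook first, then the project
        # notebook as its child.
        parent = (self.find_notebook(provider_name)
                  or self.create_notebook(provider_name))
        title = project_name or "No Project"
        child = self.find_notebook(title, parent_id=parent["id"])
        return child or self.create_notebook(title, parent_id=parent["id"])
```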
Token Expiry Notifications (future)
Proactively warn when a token is close to expiry (within 48h for ChatGPT), rather than only surfacing the warning at startup. Options:
- Add an expiry subcommand that prints token status and exits non-zero if any token is expired or expiring soon (useful in scripts/cron; see the sketch after this list)
- Send a desktop notification via notify-send (Linux) or osascript (macOS) when a token is within 24h of expiry
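A sketch of the subcommand's core, assuming expiry timestamps are already tracked per provider (the mapping shape is illustrative):

```python
import sys
import time

def expiry_status(expiries, threshold_hours=48):
    """expiries: assumed mapping of provider name -> Unix expiry timestamp."""
    now = time.time()
    expiring = {n: ts for n, ts in expiries.items()
                if ts - now < threshold_hours * 3600}
    for name, ts in expiring.items():
        print(f"{name}: token expires in {max(0, ts - now) / 3600:.0f}h")
    sys.exit(1 if expiring else 0)
```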
Search Command (future)
Add a search command to full-text search across all exported Markdown files:
python -m src.main search "kubernetes ingress"
python -m src.main search "kubernetes ingress" --provider claude --project devops
Implementation: grep/ripgrep over EXPORT_DIR, display results with
conversation title, date, and a snippet. No index needed — Markdown files are
small enough to grep directly.
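A sketch that shells out to ripgrep, assuming exports land under provider/project subdirectories of EXPORT_DIR (both the layout and the flag mapping are illustrative):

```python
import subprocess

def search(export_dir, query, provider=None, project=None):
    path = export_dir
    if provider:
        path = f"{path}/{provider}"
    if project:
        path = f"{path}/{project}"
    # --heading groups matches per file, -n adds line numbers, -i ignores case
    subprocess.run(["rg", "--heading", "-n", "-i", query, path])
```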