feat: v0.4.0 — rich content support with typed blocks and loss visibility

Extracts per-message content into a typed `blocks` list (text, code, thinking, tool_use, tool_result, image_placeholder, file_placeholder, unknown) and renders them at exporter write time. Voice transcripts, Custom Instructions, and image references now appear in exports instead of being silently dropped.

Foundation:
- src/blocks.py: pure block constructors, _safe_fence (fence-corruption defense, verified live in Joplin), _blockquote_prefix, render
- src/loss_report.py: per-run tally surfaced as INFO summary at end of export so silently-dropped data becomes visible

Providers:
- ChatGPT: dispatch on content_type produces typed blocks; voice shapes (audio_transcription, audio_asset_pointer, real_time_user_audio_video_asset_pointer) locked from live DevTools capture; Custom Instructions bug fix (parts-vs-direct-fields); role filter lifted; hidden-context marker driven by is_visually_hidden_from_conversation flag
- Claude: defensive dispatch for text/thinking/tool_use/tool_result/image with recursive nested-block flattening (untested against real rich-content data; fix-forward in v0.4.1)

Exporter:
- Markdown renders from blocks at write time via render_blocks_to_markdown; backward-compat fallback to content for any pre-v0.4.0 cached data

Tests:
- 27 new tests across providers, exporters, CLI; fixtures rebuilt with real-shape ChatGPT voice + Custom Instructions cases
- 181/181 pass

Behavior changes (intentional):
- JSON output omits content; consumers should read blocks
- Per-conversation message counts increase (Custom Instructions, image-only, tool-only messages now appear)
- Existing exports not auto-re-rendered; users wanting fresh output run cache --clear then export

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
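
For readers unfamiliar with the fence-corruption problem that `_safe_fence` defends against: content that itself contains a run of backticks can prematurely close a Markdown code fence. The following is a minimal sketch of one common defense, not the code in `src/blocks.py` (which this commit page does not show). The name `safe_fence` and its signature are assumptions made for illustration. It relies on the CommonMark rule that a closing fence must be at least as long as the opening one.

```python
import re


def safe_fence(content: str, lang: str = "") -> str:
    """Wrap content in a backtick fence longer than any backtick run inside it.

    Hypothetical illustration; the real _safe_fence may differ.
    """
    # Longest consecutive run of backticks anywhere in the content (0 if none).
    longest = max((len(run) for run in re.findall(r"`+", content)), default=0)
    # A fence needs at least 3 backticks and must out-length every inner run.
    fence = "`" * max(3, longest + 1)
    return f"{fence}{lang}\n{content}\n{fence}"
```

With this approach, content containing a three-backtick line is wrapped in a four-backtick fence, so the inner run renders as literal text instead of terminating the block.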
2026-05-04 23:17:18 -04:00
parent 4798edcea7
commit 473d02f71a
16 changed files with 1786 additions and 232 deletions
--- a/FUTURE.md
+++ b/FUTURE.md
@@ -7,6 +7,7 @@ of these additions straightforward.
 **Completed:**
 - v0.1.0 — Core export: ChatGPT + Claude, incremental sync, Markdown + JSON output
 - v0.2.0 — Joplin import automation (`joplin` command, create/update notes, notebook auto-creation)
+- v0.4.0 — Rich content support: typed message blocks (text, code, thinking, tool_use, tool_result, image_placeholder, file_placeholder, unknown); ChatGPT voice transcripts as text + audio placeholders; Custom Instructions extraction; data-loss visibility via `LossReport` summary and visible `unknown` blocks

 ---

@@ -58,26 +59,43 @@ export command to accept a pre-downloaded export ZIP or JSON.

 ---

-## Rich Content Support (v0.4.0)
+## Binary Content Downloads (v0.5.0)

-Currently only text content is exported. Future versions should handle:
+v0.4.0 ships placeholders for images and audio assets but does not download
+the binary content. The `_safe_fence`-wrapped placeholders include the asset
+reference (`sediment://...` or `file-service://...`), MIME type, size, and
+duration where available; the actual bytes are not preserved.

-### Claude
- Artifacts (code, documents, HTML) — export as separate files, link from Markdown
- Uploaded images — download and embed or link
- Extended thinking/reasoning blocks — include as collapsible sections
- Tool call results and web search citations — include as footnotes or appendices
+Next steps:
+- Download attached images alongside the Markdown export, save under a
+  `media/` sibling directory with a stable filename derived from the asset
+  reference.
+- Replace `image_placeholder` rendering with an inline `![](relative/path)`
+  reference once the file is on disk.
+- Joplin integration: upload binaries as Joplin resources via `POST /resources`,
+  rewrite the rendered Markdown to use `:/resourceId` references, and track
+  the resource ID in the cache manifest so re-syncs stay idempotent.
+- DALL-E images on the assistant side: not observed in this user's data; the
+  code path exists (`source = "model_generated"`) but is untested.

-### ChatGPT
- DALL-E generated images — download and embed or link
- Code Interpreter outputs — export code and results
- File attachments — download and reference
- Voice transcripts — include as text
+The block-level schema is already in place — only the file-fetch + rewrite
+layer needs to be added. See the `image_placeholder` and `file_placeholder`
+block definitions in `src/blocks.py`.

-Implementation note: the normalized message schema already includes a
-`content_type` field placeholder. When this work begins, extend the schema
-rather than replacing it. Non-text content already logs a WARNING when
-encountered so users can see what was skipped.
+## Reclassify o1/o3 Reasoning Subparts (v0.4.1)
+
+v0.4.0 leaves dict parts inside `text` content_type messages with shape
+`{"summary": ..., "content": ...}` rendered as plain text (defensive — the
+shape was inferred from a code comment, not captured live). Once a real
+reasoning conversation is captured, reclassify these as `thinking` blocks.
+
+## Suppress Hidden Context (v0.4.x)
+
+If Custom Instructions duplication across conversations becomes a storage
+problem, add `EXPORTER_INCLUDE_HIDDEN_CONTEXT=false` env var. The toggle is
+a single `os.getenv()` check at the start of
+`_extract_editable_context_blocks` in `src/providers/chatgpt.py` — return
+empty list if disabled.

 ---
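
The hidden-context toggle described in the FUTURE.md hunk above can be sketched as follows. This is an illustration under stated assumptions, not the code in `src/providers/chatgpt.py`: the function body, the set of accepted false-like values, and the block dict shape (`{"type": ..., "text": ...}`) are hypothetical. Only the env var name and the "return empty list if disabled" behavior come from the note.

```python
import os


def include_hidden_context() -> bool:
    """Return False only when the env var is explicitly set to a false-like value."""
    raw = os.getenv("EXPORTER_INCLUDE_HIDDEN_CONTEXT", "true")
    # The accepted spellings are an assumption; the note only mentions "false".
    return raw.strip().lower() not in {"false", "0", "no", "off"}


def extract_editable_context_blocks(message: dict) -> list[dict]:
    """Hypothetical stand-in for _extract_editable_context_blocks."""
    # Single check at the start of the function, as the note proposes.
    if not include_hidden_context():
        return []
    # Placeholder extraction; the real function's logic is not shown here.
    return [{"type": "text", "text": message.get("text", "")}]
```

Defaulting to "true" when the variable is unset keeps the v0.4.0 behavior (hidden context included) unless a user opts out.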