feat: v0.4.0 — rich content support with typed blocks and loss visibility

Extracts per-message content into a typed `blocks` list (text, code,
thinking, tool_use, tool_result, image_placeholder, file_placeholder,
unknown) and renders them at exporter write time. Voice transcripts,
Custom Instructions, and image references now appear in exports
instead of being silently dropped.

Foundation:
- src/blocks.py: pure block constructors, _safe_fence (fence-corruption
  defense, verified live in Joplin), _blockquote_prefix, render
- src/loss_report.py: per-run tally surfaced as INFO summary at end of
  export so silently-dropped data becomes visible

Providers:
- ChatGPT: dispatch on content_type produces typed blocks; voice shapes
  (audio_transcription, audio_asset_pointer,
  real_time_user_audio_video_asset_pointer) locked from live DevTools
  capture; Custom Instructions bug fix (parts-vs-direct-fields); role
  filter lifted; hidden-context marker driven by
  is_visually_hidden_from_conversation flag
- Claude: defensive dispatch for text/thinking/tool_use/tool_result/image
  with recursive nested-block flattening; untested against real rich-
  content data — fix-forward in v0.4.1

Exporter:
- Markdown renders from blocks at write time via render_blocks_to_markdown;
  backward-compat fallback to content for any pre-v0.4.0 cached data

Tests:
- 27 new tests across providers, exporters, CLI; fixtures rebuilt with
  real-shape ChatGPT voice + Custom Instructions cases
- 181/181 pass

Behavior changes (intentional):
- JSON output omits content; consumers should read blocks
- Per-conversation message counts increase (Custom Instructions, image-
  only, tool-only messages now appear)
- Existing exports not auto-re-rendered; users wanting fresh output run
  cache --clear then export

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
JesseMarkowitz
2026-05-04 23:17:18 -04:00
parent 4798edcea7
commit 473d02f71a
16 changed files with 1786 additions and 232 deletions

@@ -7,6 +7,7 @@ of these additions straightforward.
**Completed:**
- v0.1.0 — Core export: ChatGPT + Claude, incremental sync, Markdown + JSON output
- v0.2.0 — Joplin import automation (`joplin` command, create/update notes, notebook auto-creation)
- v0.4.0 — Rich content support: typed message blocks (text, code, thinking, tool_use, tool_result, image_placeholder, file_placeholder, unknown); ChatGPT voice transcripts as text + audio placeholders; Custom Instructions extraction; data-loss visibility via `LossReport` summary and visible `unknown` blocks
---
@@ -58,26 +59,43 @@ export command to accept a pre-downloaded export ZIP or JSON.
---
## Rich Content Support (v0.4.0)
## Binary Content Downloads (v0.5.0)
Currently only text content is exported. Future versions should handle:
v0.4.0 ships placeholders for images and audio assets but does not download
the binary content. The `_safe_fence`-wrapped placeholders include the asset
reference (`sediment://...` or `file-service://...`), MIME type, size, and
duration where available; the actual bytes are not preserved.
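The fence-corruption defense can be sketched as follows. This is not the `_safe_fence` implementation from `src/blocks.py` (which is not shown here); it is a minimal illustration of one common approach, assuming the goal is a backtick fence longer than any backtick run in the wrapped content. The names `safe_fence` and `wrap_in_fence` are hypothetical.

```python
import re


def safe_fence(content: str, min_len: int = 3) -> str:
    """Return a backtick fence longer than any backtick run in content.

    Hypothetical helper: a fence that is strictly longer than every
    backtick run in the body cannot be closed early by the body itself.
    """
    longest = max((len(run) for run in re.findall(r"`+", content)), default=0)
    return "`" * max(min_len, longest + 1)


def wrap_in_fence(content: str, info: str = "") -> str:
    """Wrap content in a fenced block using a fence that cannot collide."""
    fence = safe_fence(content)
    return f"{fence}{info}\n{content}\n{fence}"
```

With this approach, content that itself contains a three-backtick fence is wrapped in four backticks, so a Markdown renderer does not terminate the outer block early.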
### Claude
- Artifacts (code, documents, HTML) — export as separate files, link from Markdown
- Uploaded images — download and embed or link
- Extended thinking/reasoning blocks — include as collapsible sections
- Tool call results and web search citations — include as footnotes or appendices
Next steps:
- Download attached images alongside the Markdown export, save under a
`media/` sibling directory with a stable filename derived from the asset
reference.
- Replace `image_placeholder` rendering with an inline `![](relative/path)`
reference once the file is on disk.
- Joplin integration: upload binaries as Joplin resources via `POST /resources`,
rewrite the rendered Markdown to use `:/resourceId` references, and track
the resource ID in the cache manifest so re-syncs stay idempotent.
- DALL-E images on the assistant side: not observed in this user's data; the
code path exists (`source = "model_generated"`) but is untested.
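One way to derive a stable filename from an asset reference is to hash the reference string. The sketch below is an assumption about how this could look, not existing code: `media_relpath`, `image_markdown`, the MIME-to-extension table, and the 16-character digest length are all hypothetical choices.

```python
import hashlib
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical mapping; extend as more asset types are observed.
_EXT_BY_MIME = {
    "image/png": ".png",
    "image/jpeg": ".jpg",
    "image/webp": ".webp",
}


def media_relpath(asset_ref: str, mime_type: Optional[str] = None) -> str:
    """Map an asset reference to a deterministic path under media/.

    The same reference always yields the same filename, so re-running
    an export does not create duplicate files.
    """
    digest = hashlib.sha256(asset_ref.encode("utf-8")).hexdigest()[:16]
    ext = _EXT_BY_MIME.get(mime_type or "", ".bin")
    return str(PurePosixPath("media") / f"{digest}{ext}")


def image_markdown(asset_ref: str, mime_type: Optional[str] = None) -> str:
    """Render the inline image reference once the file is on disk."""
    return f"![]({media_relpath(asset_ref, mime_type)})"
```

Hashing the full reference avoids having to sanitize scheme prefixes or path separators, at the cost of filenames that are not human-readable.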
### ChatGPT
- DALL-E generated images — download and embed or link
- Code Interpreter outputs — export code and results
- File attachments — download and reference
- Voice transcripts — include as text
The block-level schema is already in place — only the file-fetch + rewrite
layer needs to be added. See the `image_placeholder` and `file_placeholder`
block definitions in `src/blocks.py`.
Implementation note: the normalized message schema already includes a
`content_type` field placeholder. When this work begins, extend the schema
rather than replacing it. Non-text content already logs a WARNING when
encountered so users can see what was skipped.
## Reclassify o1/o3 Reasoning Subparts (v0.4.1)
v0.4.0 renders dict parts of shape `{"summary": ..., "content": ...}` found
inside `text` content_type messages as plain text. This is a defensive
choice: the shape was inferred from a code comment, not captured live. Once
a real reasoning conversation is captured, reclassify these parts as
`thinking` blocks.
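The reclassification could look roughly like the sketch below. It assumes blocks are plain dicts with a `type` key and that the reasoning shape is exactly the `summary`/`content` pair described above; neither assumption has been checked against live data, and `classify_part` is a hypothetical name.

```python
def classify_part(part):
    """Turn one message part into a block dict (hypothetical block shape)."""
    if isinstance(part, str):
        return {"type": "text", "text": part}
    # Assumed reasoning shape: a dict carrying both "summary" and "content".
    if isinstance(part, dict) and {"summary", "content"} <= part.keys():
        summary = str(part.get("summary") or "").strip()
        body = str(part.get("content") or "").strip()
        text = "\n\n".join(piece for piece in (summary, body) if piece)
        return {"type": "thinking", "text": text}
    # Anything else stays visible rather than being silently dropped.
    return {"type": "unknown", "raw": part}
```

Keeping the `unknown` fallback means a wrong guess about the shape shows up in the export instead of disappearing.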
## Suppress Hidden Context (v0.4.x)
If Custom Instructions duplication across conversations becomes a storage
problem, add an `EXPORTER_INCLUDE_HIDDEN_CONTEXT` environment variable that
can be set to `false`. The toggle is a single `os.getenv()` check at the
start of `_extract_editable_context_blocks` in `src/providers/chatgpt.py`
that returns an empty list when the variable disables hidden context.
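A minimal sketch of that toggle is shown below. The real `_extract_editable_context_blocks` signature and body are not shown in this diff, so the function here is a stand-in; the accepted "off" spellings and the injectable `environ` parameter (used so the check can be exercised without mutating the process environment) are assumptions.

```python
import os

_ENV_VAR = "EXPORTER_INCLUDE_HIDDEN_CONTEXT"
_FALSE_VALUES = {"false", "0", "no", "off"}  # assumed spellings for "disabled"


def hidden_context_enabled(environ=None) -> bool:
    """Return False only when the variable is explicitly set to an off value."""
    env = os.environ if environ is None else environ
    value = env.get(_ENV_VAR, "true")
    return value.strip().lower() not in _FALSE_VALUES


def extract_editable_context_blocks(texts, environ=None):
    """Stand-in for the real extractor: one text block per non-empty string."""
    if not hidden_context_enabled(environ):
        return []
    return [{"type": "text", "text": t} for t in texts if t]
```

Defaulting to enabled keeps current v0.4.0 behavior for anyone who never sets the variable.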
---