feat: v0.4.0 — rich content support with typed blocks and loss visibility
Extracts per-message content into a typed `blocks` list (text, code,
thinking, tool_use, tool_result, image_placeholder, file_placeholder,
unknown) and renders them at exporter write time. Voice transcripts,
Custom Instructions, and image references now appear in exports instead
of being silently dropped.

Foundation:
- src/blocks.py: pure block constructors, _safe_fence (fence-corruption
  defense, verified live in Joplin), _blockquote_prefix, render
- src/loss_report.py: per-run tally surfaced as INFO summary at end of
  export so silently-dropped data becomes visible

Providers:
- ChatGPT: dispatch on content_type produces typed blocks; voice shapes
  (audio_transcription, audio_asset_pointer,
  real_time_user_audio_video_asset_pointer) locked from live DevTools
  capture; Custom Instructions bug fix (parts-vs-direct-fields); role
  filter lifted; hidden-context marker driven by
  is_visually_hidden_from_conversation flag
- Claude: defensive dispatch for text/thinking/tool_use/tool_result/image
  with recursive nested-block flattening; untested against real
  rich-content data — fix-forward in v0.4.1

Exporter:
- Markdown renders from blocks at write time via
  render_blocks_to_markdown; backward-compat fallback to content for any
  pre-v0.4.0 cached data

Tests:
- 27 new tests across providers, exporters, CLI; fixtures rebuilt with
  real-shape ChatGPT voice + Custom Instructions cases
- 181/181 pass

Behavior changes (intentional):
- JSON output omits content; consumers should read blocks
- Per-conversation message counts increase (Custom Instructions,
  image-only, tool-only messages now appear)
- Existing exports not auto-re-rendered; users wanting fresh output run
  cache --clear then export

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
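The per-run tally described for `src/loss_report.py` could look roughly like the sketch below. The class name `LossReport` appears in FUTURE.md, but the `record`, `summary`, and `log_summary` methods, the logger name, and the summary wording are all assumptions made for illustration; the real module may be shaped differently.

```python
import logging
from collections import Counter

# Hypothetical logger name; the project's real logger is not shown in the commit.
logger = logging.getLogger("exporter.loss_report")


class LossReport:
    """Illustrative per-run tally of content that could not be fully exported."""

    def __init__(self) -> None:
        self._counts = Counter()

    def record(self, kind: str, count: int = 1) -> None:
        # `kind` is a free-form label such as "unknown" or "image_placeholder".
        self._counts[kind] += count

    def summary(self) -> str:
        if not self._counts:
            return "Loss report: nothing dropped"
        parts = ", ".join(f"{kind}={n}" for kind, n in sorted(self._counts.items()))
        return f"Loss report: {parts}"

    def log_summary(self) -> None:
        # Emitted once at the end of an export run, at INFO level.
        logger.info(self.summary())
```

A provider would call `record(...)` whenever it emits a placeholder or an `unknown` block, and the CLI would call `log_summary()` once after the export finishes.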
50
FUTURE.md
@@ -7,6 +7,7 @@ of these additions straightforward.
 **Completed:**
 - v0.1.0 — Core export: ChatGPT + Claude, incremental sync, Markdown + JSON output
 - v0.2.0 — Joplin import automation (`joplin` command, create/update notes, notebook auto-creation)
+- v0.4.0 — Rich content support: typed message blocks (text, code, thinking, tool_use, tool_result, image_placeholder, file_placeholder, unknown); ChatGPT voice transcripts as text + audio placeholders; Custom Instructions extraction; data-loss visibility via `LossReport` summary and visible `unknown` blocks

 ---

@@ -58,26 +59,43 @@ export command to accept a pre-downloaded export ZIP or JSON.

 ---

-## Rich Content Support (v0.4.0)
+## Binary Content Downloads (v0.5.0)

-Currently only text content is exported. Future versions should handle:
+v0.4.0 ships placeholders for images and audio assets but does not download
+the binary content. The `_safe_fence`-wrapped placeholders include the asset
+reference (`sediment://...` or `file-service://...`), MIME type, size, and
+duration where available; the actual bytes are not preserved.
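For readers unfamiliar with the fence-corruption problem, here is a minimal sketch of how a placeholder can be wrapped so that backticks inside the metadata cannot close the fence early. It assumes `_safe_fence` works by choosing a fence longer than any backtick run in the body, which is a common technique but is not confirmed by the commit; `safe_fence`, `render_placeholder`, and the field names are hypothetical stand-ins.

```python
import re


def safe_fence(text: str, minimum: int = 3) -> str:
    """Return a backtick fence one longer than the longest backtick run in text."""
    longest = max((len(run) for run in re.findall(r"`+", text)), default=0)
    return "`" * max(minimum, longest + 1)


def render_placeholder(kind: str, fields: dict) -> str:
    """Render placeholder metadata inside a fence that the body cannot break."""
    # Skip metadata that is unavailable (for example, duration on an image).
    lines = [f"{key}: {value}" for key, value in fields.items() if value is not None]
    body = "\n".join([f"[{kind}]"] + lines)
    fence = safe_fence(body)
    return f"{fence}\n{body}\n{fence}"
```

A body containing a run of three backticks is wrapped in a four-backtick fence, so the Markdown renderer treats the inner run as literal text.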

-### Claude
-- Artifacts (code, documents, HTML) — export as separate files, link from Markdown
-- Uploaded images — download and embed or link
-- Extended thinking/reasoning blocks — include as collapsible sections
-- Tool call results and web search citations — include as footnotes or appendices
+Next steps:
+- Download attached images alongside the Markdown export, save under a
+  `media/` sibling directory with a stable filename derived from the asset
+  reference.
+- Replace `image_placeholder` rendering with an inline Markdown image
+  reference once the file is on disk.
+- Joplin integration: upload binaries as Joplin resources via `POST /resources`,
+  rewrite the rendered Markdown to use `:/resourceId` references, and track
+  the resource ID in the cache manifest so re-syncs stay idempotent.
+- DALL-E images on the assistant side: not observed in this user's data; the
+  code path exists (`source = "model_generated"`) but is untested.
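One way to get the "stable filename derived from the asset reference" mentioned in the first step is to hash the reference, so the same asset always maps to the same file across re-syncs. This is only a sketch: the `media_path` helper, the 16-character digest length, and the small MIME-to-extension table are assumptions, not part of the planned design.

```python
import hashlib
from pathlib import Path
from typing import Optional

# Illustrative mapping only; a real implementation would likely cover more types.
_EXTENSIONS = {
    "image/png": ".png",
    "image/jpeg": ".jpg",
    "image/webp": ".webp",
    "audio/wav": ".wav",
}


def media_path(export_dir: Path, asset_ref: str, mime_type: Optional[str] = None) -> Path:
    """Map an asset reference to a deterministic path under <export_dir>/media/."""
    # Hashing avoids putting URL-like characters (":", "/") into the filename.
    digest = hashlib.sha256(asset_ref.encode("utf-8")).hexdigest()[:16]
    extension = _EXTENSIONS.get(mime_type or "", ".bin")
    return export_dir / "media" / f"{digest}{extension}"
```

Because the name depends only on the reference, a re-run can check whether the file already exists before downloading again.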

-### ChatGPT
-- DALL-E generated images — download and embed or link
-- Code Interpreter outputs — export code and results
-- File attachments — download and reference
-- Voice transcripts — include as text
+The block-level schema is already in place — only the file-fetch + rewrite
+layer needs to be added. See the `image_placeholder` and `file_placeholder`
+block definitions in `src/blocks.py`.

-Implementation note: the normalized message schema already includes a
-`content_type` field placeholder. When this work begins, extend the schema
-rather than replacing it. Non-text content already logs a WARNING when
-encountered so users can see what was skipped.
+## Reclassify o1/o3 Reasoning Subparts (v0.4.1)
+
+v0.4.0 leaves dict parts inside `text` content_type messages with shape
+`{"summary": ..., "content": ...}` rendered as plain text (defensive — the
+shape was inferred from a code comment, not captured live). Once a real
+reasoning conversation is captured, reclassify these as `thinking` blocks.
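The reclassification described above might be sketched as follows. The block dictionaries (`{"type": ..., "text": ...}`), the helper name `blocks_from_text_parts`, and the way summary and content are joined are all assumptions; the part shape itself is, as the note says, still unverified against live data.

```python
from __future__ import annotations

from typing import Any


def blocks_from_text_parts(parts: list[Any]) -> list[dict[str, str]]:
    """Turn the `parts` of a `text` message into typed blocks (illustrative)."""
    blocks: list[dict[str, str]] = []
    for part in parts:
        if isinstance(part, str):
            if part:
                blocks.append({"type": "text", "text": part})
        elif isinstance(part, dict) and {"summary", "content"} <= part.keys():
            # Assumed reasoning subpart: keep both fields, summary first.
            summary = str(part["summary"] or "").strip()
            content = str(part["content"] or "").strip()
            text = "\n\n".join(piece for piece in (summary, content) if piece)
            blocks.append({"type": "thinking", "text": text})
        else:
            # Keep unrecognized shapes visible rather than dropping them.
            blocks.append({"type": "unknown", "text": repr(part)})
    return blocks
```

Matching on the exact key set keeps the change narrow: any dict part that does not carry both keys still surfaces as an `unknown` block.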
+
+## Suppress Hidden Context (v0.4.x)
+
+If Custom Instructions duplication across conversations becomes a storage
+problem, add `EXPORTER_INCLUDE_HIDDEN_CONTEXT=false` env var. The toggle is
+a single `os.getenv()` check at the start of
+`_extract_editable_context_blocks` in `src/providers/chatgpt.py` — return
+empty list if disabled.

 ---
