Supersedes #34727Closes#30703
Related:
* langchain-ai/langchain-google#1460
* langchain-ai/langchain-google#1501
Fixing this at the `langchain-core` callback layer instead of
normalizing inside individual provider integrations, so structured
streaming content is preserved consistently.
---
Models are increasingly streaming structured content blocks instead of
plain text tokens. For example, Gemini 3 can stream text as
content-block lists, and Anthropic/tool-use flows can also produce
non-text message content. Today those values already reach
`on_llm_new_token`, but the callback API still advertises `token: str`,
which makes custom callbacks, tracers, and streaming helpers assume
every streamed value is text.
User story: as a LangChain user building a streaming callback for chat
models with tool calls, reasoning/thinking blocks, or provider-specific
structured content, I need `on_llm_new_token` to accept the same content
shape that chat model chunks can actually emit, so my callback can
observe the stream without providers flattening or dropping non-text
data.
Fixing this in `langchain-core` makes the existing runtime behavior
explicit at the shared callback boundary. Normalizing content blocks
inside each provider would duplicate logic, produce inconsistent
behavior across integrations, and in some cases lose required provider
metadata such as Gemini thought signatures.
## Changes
- Update the callback contract so streamed tokens can be either plain
text or structured content blocks
- Carry structured streamed content through tracing and event/log
streaming paths without forcing provider data into text too early
- Keep built-in text-oriented streaming callbacks working by converting
structured tokens only at the display/queue boundary
- Drop the now-incorrect `cast("str", ...)` on streamed content in
`BaseChatModel` so the producer side matches the widened callback
signature instead of asserting a string it doesn't always have (no
runtime change — `cast` is erased)
- Align Anthropic and Mistral content typing with the structured content
shapes already used by chat model messages
- Update callback tests to reflect that not every streamed value is text
## Compatibility
No runtime behavior change: no producer emits anything it wasn't already
emitting, and widening a parameter type is safe for existing callers and
handlers that pass or receive `str`. The one caveat is downstream code
that subclasses a callback handler or tracer and overrides
`on_llm_new_token` with a `token: str` annotation — under strict type
checking that override is now narrower than the base and will be flagged
as incompatible with the supertype. Such code still runs unchanged; the
fix is to widen the annotation to match.
Fix CI to trigger benchmarks on `run-codspeed-benchmarks` label addition
Reduce scope of async benchmark to save time on CI
Waiting to merge this PR until we figure out how to use walltime on
local runners.
The first in a sequence of PRs focusing on improving performance in
core. We're starting with reducing import times for common structures,
hence the benchmarks here.
The benchmark looks a little bit complicated - we have to use a process
so that we don't suffer from Python's import caching system. I tried
doing manual modification of `sys.modules` between runs, but that's
pretty tricky / hacky to get right, hence the subprocess approach.
Motivated by extremely slow baseline for common imports (we're talking
2-5 seconds):
<img width="633" alt="Screenshot 2025-04-09 at 12 48 12 PM"
src="https://github.com/user-attachments/assets/994616fe-1798-404d-bcbe-48ad0eb8a9a0"
/>
Also added a `make benchmark` command to make local runs easy :).
Currently using walltimes so that we can track total time despite using
a manual proces.