Docs: conceptual docs batch 1 (#27173)
Re-organizing some of the content involving runnables/LCEL/streaming into conceptual guides.

Conceptual guides added:
- [x] Runnables
- [x] LCEL
- [x] Chat Models
- [x] LLM
- [x] async
- [x] Messages
- [x] Chat History
- [x] Multimodality
- [x] Tokenization

Outstanding:
- [ ] Callbacks/Tracers
- [ ] Streaming
- [ ] Tool Creation
- [ ] Document Loading

Other conceptual guides are placeholders to make sure that no existing links break.

Some high-level re-organization:
* Introduce the Runnable interface prior to LCEL (since those are two distinct concepts)
* Cross-link as much related content as possible (including how-to guides)
This commit is contained in:
parent
046f6a5544
commit
127ac819fc
83
docs/docs/concepts/async.mdx
Normal file
@ -0,0 +1,83 @@
# Async Programming with LangChain

:::info Prerequisites
* [Runnable Interface](/docs/concepts/runnables)
* [asyncio documentation](https://docs.python.org/3/library/asyncio.html)
:::

## Overview

LLM-based applications often involve a lot of I/O-bound operations, such as making API calls to language models, databases, or other services. Asynchronous programming (or async programming) is a paradigm that allows a program to perform multiple tasks concurrently without blocking the execution of other tasks, improving efficiency and responsiveness, particularly in I/O-bound operations.

:::note
You are expected to be familiar with asynchronous programming in Python before reading this guide. If you are not, please find appropriate resources online to learn how to program asynchronously in Python.

This guide specifically focuses on what you need to know to work with LangChain in an asynchronous context, assuming that you are already familiar with asynchronous programming.
:::

## LangChain Asynchronous APIs

Many LangChain APIs are designed to be asynchronous, allowing you to build efficient and responsive applications.

Typically, any method that may perform I/O operations (e.g., making API calls, reading files) will have an asynchronous counterpart.

In LangChain, async implementations are located in the same classes as their synchronous counterparts, with the asynchronous methods having an "a" prefix. For example, the synchronous `invoke` method has an asynchronous counterpart called `ainvoke`.

Many components of LangChain implement the [Runnable Interface](/docs/concepts/runnables), which includes support for asynchronous execution. This means that you can run Runnables asynchronously using the `await` keyword in Python.

```python
await some_runnable.ainvoke(some_input)
```

Other components like [Embedding Models](/docs/concepts/embedding_models) and [VectorStore](/docs/concepts/vectorstores) that do not implement the [Runnable Interface](/docs/concepts/runnables) usually still follow the same rule and include the asynchronous version of the method in the same class with an "a" prefix.

For example,

```python
await some_vectorstore.aadd_documents(documents)
```

Runnables created using the [LangChain Expression Language (LCEL)](/docs/concepts/lcel) can also be run asynchronously as they implement the full [Runnable Interface](/docs/concepts/runnables).

For more information, please review the [API reference](https://python.langchain.com/api_reference/) for the specific component you are using.

## Delegation to Sync Methods

Most popular LangChain integrations implement asynchronous support of their APIs. For example, the `ainvoke` method of many ChatModel implementations uses the `httpx.AsyncClient` to make asynchronous HTTP requests to the model provider's API.

When an asynchronous implementation is not available, LangChain tries to provide a default implementation, even if it incurs a **slight** overhead.

By default, LangChain will delegate the execution of unimplemented asynchronous methods to their synchronous counterparts. LangChain almost always assumes that the synchronous method should be treated as a blocking operation and should be run in a separate thread.
This is done using [asyncio.loop.run_in_executor](https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.run_in_executor) functionality provided by the `asyncio` library. LangChain uses the default executor provided by the `asyncio` library, which lazily initializes a thread pool executor with a default number of threads that is reused in the given event loop. While this strategy incurs a slight overhead due to context switching between threads, it guarantees that every asynchronous method has a default implementation that works out of the box.
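Conceptually, the fallback behaves roughly like the following sketch (a simplified illustration, not LangChain's actual source; `MyComponent` is a hypothetical class):

```python
import asyncio
from functools import partial


class MyComponent:
    def invoke(self, some_input: str) -> str:
        # Synchronous (potentially blocking) implementation.
        return some_input.upper()

    async def ainvoke(self, some_input: str) -> str:
        # Fallback: run the blocking sync method in the default thread pool
        # executor so the event loop itself is never blocked.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, partial(self.invoke, some_input))


async def main() -> None:
    print(await MyComponent().ainvoke("hello"))


asyncio.run(main())
```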
## Performance

Async code in LangChain should generally perform relatively well with minimal overhead out of the box, and is unlikely to be a bottleneck in most applications.

The two main sources of overhead are:

1. Cost of context switching between threads when [delegating to synchronous methods](#delegation-to-sync-methods). This can be addressed by providing a native asynchronous implementation.
2. In [LCEL](/docs/concepts/lcel), any "cheap functions" that appear as part of the chain will be either scheduled as tasks on the event loop (if they are async) or run in a separate thread (if they are sync), rather than just being run inline.

The latency overhead you should expect from these is between tens of microseconds and a few milliseconds.

A more common source of performance issues arises from users accidentally blocking the event loop by calling synchronous code in an async context (e.g., calling `invoke` rather than `ainvoke`).

## Compatibility

LangChain is only compatible with the `asyncio` library, which is distributed as part of the Python standard library. It will not work with other async libraries like `trio` or `curio`.

In Python 3.9 and 3.10, [asyncio's tasks](https://docs.python.org/3/library/asyncio-task.html#asyncio.create_task) did not accept a `context` parameter. Due to this limitation, LangChain cannot automatically propagate the `RunnableConfig` down the call chain in certain scenarios.

If you are experiencing issues with streaming, callbacks or tracing in async code and are using Python 3.9 or 3.10, this is a likely cause.

Please read [Propagating RunnableConfig](/docs/concepts/runnables#propagation-runnableconfig) for more details to learn how to propagate the `RunnableConfig` down the call chain manually (or upgrade to Python 3.11 where this is no longer an issue).
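Manual propagation boils down to accepting the `config` argument in your custom function and passing it explicitly to any child Runnable you call. A minimal sketch, with illustrative names (`child_runnable`, `my_step`):

```python
from langchain_core.runnables import RunnableConfig, RunnableLambda

# `child_runnable` stands in for any Runnable you call from inside a custom
# function (e.g., a chat model or another chain).
child_runnable = RunnableLambda(lambda x: x + 1)


async def my_step(value: int, config: RunnableConfig) -> int:
    # On Python 3.9/3.10, pass the config through explicitly so callbacks
    # and tracing keep working down the call chain.
    return await child_runnable.ainvoke(value, config=config)


chain = RunnableLambda(my_step)
# Usage: await chain.ainvoke(1)
```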
## How to use in IPython and Jupyter Notebooks

As of IPython 7.0, IPython supports asynchronous REPLs. This means that you can use the `await` keyword in the IPython REPL and Jupyter Notebooks without any additional setup. For more information, see the [IPython blog post](https://blog.jupyter.org/ipython-7-0-async-repl-a35ce050f7f7).
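For example, the following can be pasted directly into a notebook cell (the `RunnableLambda` here is only a stand-in for any Runnable); a regular Python script would instead need to wrap the call in `asyncio.run(...)`:

```python
from langchain_core.runnables import RunnableLambda

some_runnable = RunnableLambda(lambda x: x * 2)

# Top-level await works out of the box in an IPython / Jupyter cell.
await some_runnable.ainvoke(21)
```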
@ -1,5 +1,9 @@
# Callbacks

:::note Pre-requisites
- [Runnable interface](/docs/concepts/#runnable-interface)
:::

The lowest level way to stream outputs from LLMs in LangChain is via the [callbacks](/docs/concepts/#callbacks) system. You can pass a
callback handler that handles the [`on_llm_new_token`](https://python.langchain.com/api_reference/langchain/callbacks/langchain.callbacks.streaming_aiter.AsyncIteratorCallbackHandler.html#langchain.callbacks.streaming_aiter.AsyncIteratorCallbackHandler.on_llm_new_token) event into LangChain components. When that component is invoked, any
[LLM](/docs/concepts/#llms) or [chat model](/docs/concepts/#chat-models) contained in the component calls
46
docs/docs/concepts/chat_history.mdx
Normal file
@ -0,0 +1,46 @@
# Chat History

:::info Prerequisites

- [Messages](/docs/concepts/messages)
- [Chat Models](/docs/concepts/chat_models)
- [Tool Calling](/docs/concepts/tool_calling)
:::

## Overview

Chat history is a record of the conversation between the user and the chat model. It is used to maintain context and state throughout the conversation. The chat history is a sequence of [messages](/docs/concepts/messages), each of which is associated with a specific [role](/docs/concepts/messages#role), such as "user", "assistant", "system", or "tool".

## Conversation Patterns

Most conversations start with a **system message** that sets the context for the conversation. This is followed by a **user message** containing the user's input, and then an **assistant message** containing the model's response.

The **assistant** may respond directly to the user or, if configured with tools, request that a [tool](/docs/concepts/tool_calling) be invoked to perform a specific task.

So a full conversation often involves a combination of two patterns of alternating messages:

1. The **user** and the **assistant** representing a back-and-forth conversation.
2. The **assistant** and **tool messages** representing an ["agentic" workflow](/docs/concepts/agents) where the assistant is invoking tools to perform specific tasks.

## Managing Chat History

Since chat models have a maximum limit on input size, it's important to manage chat history and trim it as needed to avoid exceeding the [context window](/docs/concepts/chat_models#context_window).
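One way to do this is with `trim_messages` from `langchain_core`. The sketch below keeps only the most recent messages under a small budget; the exact arguments you need (in particular the `token_counter`) depend on your model and setup, and `token_counter=len` here simply counts each message as one "token" for illustration:

```python
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage, trim_messages

chat_history = [
    SystemMessage("You are a helpful assistant."),
    HumanMessage("Hi!"),
    AIMessage("Hello! How can I help you today?"),
    HumanMessage("What's the capital of France?"),
    AIMessage("Paris."),
    HumanMessage("And of Germany?"),
]

# Keep the system message plus the most recent messages that fit the budget,
# making sure the trimmed history still starts on a "human" message.
trimmed = trim_messages(
    chat_history,
    max_tokens=4,
    strategy="last",
    token_counter=len,
    include_system=True,
    start_on="human",
)
```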
While processing chat history, it's essential to preserve a correct conversation structure.

Key guidelines for managing chat history:

- The conversation should follow one of these structures:
  - The first message is either a "user" message or a "system" message, followed by a "user" and then an "assistant" message.
  - The last message should be either a "user" message or a "tool" message containing the result of a tool call.
- When using [tool calling](/docs/concepts/tool_calling), a "tool" message should only follow an "assistant" message that requested the tool invocation.

:::tip
Understanding correct conversation structure is essential for being able to properly implement
[memory](https://langchain-ai.github.io/langgraph/concepts/memory/) in chat models.
:::

## Related Resources

- [How to Trim Messages](https://python.langchain.com/docs/how_to/trim_messages/)
- [Memory Guide](https://langchain-ai.github.io/langgraph/concepts/memory/) for information on implementing short-term and long-term memory in chat models using [LangGraph](https://langchain-ai.github.io/langgraph/).
@ -1,36 +1,162 @@
# Chat Models

<span data-heading-keywords="chat model,chat models"></span>

## Overview

Large Language Models (LLMs) are advanced machine learning models that excel in a wide range of language-related tasks such as text generation, translation, summarization, question answering, and more, without needing task-specific tuning for every scenario.

Modern LLMs are typically accessed through a chat model interface that takes [messages](/docs/concepts/messages) as input and returns [messages](/docs/concepts/messages) as output.

The newest generation of chat models offer additional capabilities:

* [Tool Calling](/docs/concepts#tool-calling): Many popular chat models offer a native [tool calling](/docs/concepts#tool-calling) API. This API allows developers to build rich applications that enable AI to interact with external services, APIs, and databases. Tool calling can also be used to extract structured information from unstructured data and perform various other tasks.
* [Multimodality](/docs/concepts/multimodality): The ability to work with data other than text; for example, images, audio, and video.

## Features

LangChain provides a consistent interface for working with chat models from different providers while offering additional features for monitoring, debugging, and optimizing the performance of applications that use LLMs.

* Integrations with many chat model providers (e.g., Anthropic, OpenAI, Ollama, Cohere, Hugging Face, Groq, Microsoft Azure, Google Vertex, Amazon Bedrock). Please see [chat model integrations](/docs/integrations/chat/) for an up-to-date list of supported models.
* Use either LangChain's [messages](/docs/concepts/messages) format or OpenAI format.
* Standard [tool calling API](/docs/concepts#tool-calling): standard interface for binding tools to models, accessing tool call requests made by models, and sending tool results back to the model.
* Standard API for [structured outputs](/docs/concepts/structured_outputs) via the `with_structured_output` method.
* Provides support for [async programming](/docs/concepts/async), [efficient batching](/docs/concepts/runnables#batch), and [a rich streaming API](/docs/concepts/streaming).
* Integration with [LangSmith](https://docs.smith.langchain.com) for monitoring and debugging production-grade applications based on LLMs.
* Additional features like standardized [token usage](/docs/concepts/messages#token_usage), [rate limiting](#rate-limiting), [caching](#caching), and more.

## Available Integrations

LangChain has many chat model integrations that allow you to use a wide variety of models from different providers.

These integrations are one of two types:

1. **Official Models**: These are models that are officially supported by LangChain and/or the model provider. You can find these models in the `langchain-<provider>` packages.
2. **Community Models**: These are models that are mostly contributed and supported by the community. You can find these models in the `langchain-community` package.

LangChain chat models are named with a convention that prefixes "Chat" to their class names (e.g., `ChatOllama`, `ChatAnthropic`, `ChatOpenAI`, etc.).

Please review the [chat model integrations](/docs/integrations/chat/) for a list of supported models.

:::note
Models that do **not** include the prefix "Chat" in their name or include "LLM" as a suffix in their name typically refer to older models that do not follow the chat model interface and instead use an interface that takes a string as input and returns a string as output.
:::

## Interface

LangChain chat models implement the [BaseChatModel](https://python.langchain.com/api_reference/core/language_models/langchain_core.language_models.chat_models.BaseChatModel.html) interface. Because `BaseChatModel` also implements the [Runnable Interface](/docs/concepts/runnables), chat models support a [standard streaming interface](/docs/concepts/streaming), [async programming](/docs/concepts/async), optimized [batching](/docs/concepts/runnables#batch), and more. Please see the [Runnable Interface](/docs/concepts/runnables) for more details.

Many of the key methods of chat models operate on [messages](/docs/concepts/messages) as input and return messages as output.

Chat models offer a standard set of parameters that can be used to configure the model. These parameters are typically used to control the behavior of the model, such as the temperature of the output, the maximum number of tokens in the response, and the maximum time to wait for a response. Please see the [standard parameters](#standard-parameters) section for more details.

### Key Methods

The key methods of a chat model are listed below, with a usage sketch after the list:

1. **invoke**: The primary method for interacting with a chat model. It takes a list of [messages](/docs/concepts/messages) as input and returns a message as output.
2. **stream**: A method that allows you to stream the output of a chat model as it is generated.
3. **batch**: A method that allows you to batch multiple requests to a chat model together for more efficient processing.
4. **bind_tools**: A method that allows you to bind a tool to a chat model for use in the model's execution context.
5. **with_structured_output**: A wrapper around the `invoke` method for models that natively support [structured output](/docs/concepts#structured_output).
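For illustration, a minimal usage sketch. It assumes the `langchain-openai` package is installed and an OpenAI API key is configured; any other chat model integration works the same way:

```python
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

messages = [
    ("system", "You are a helpful assistant."),
    ("human", "Write a haiku about autumn."),
]

# invoke: send the full message list, get a single AIMessage back.
response = model.invoke(messages)
print(response.content)

# stream: iterate over chunks of the response as they are generated.
for chunk in model.stream(messages):
    print(chunk.content, end="", flush=True)

# batch: run several independent inputs more efficiently.
responses = model.batch([messages, [("human", "And one about spring.")]])
```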
Other important methods can be found in the [BaseChatModel API Reference](https://python.langchain.com/api_reference/core/language_models/langchain_core.language_models.chat_models.BaseChatModel.html).

### Inputs and Outputs

Modern LLMs are typically accessed through a chat model interface that takes [messages](/docs/concepts/messages) as input and returns [messages](/docs/concepts/messages) as output. Messages are typically associated with a role (e.g., "system", "human", "assistant") and one or more content blocks that contain text or potentially multimodal data (e.g., images, audio, video).

LangChain supports two message formats to interact with chat models:

1. **LangChain Message Format**: LangChain's own message format, which is used by default and is used internally by LangChain.
2. **OpenAI's Message Format**: OpenAI's message format.

### Standard Parameters

Many chat models have standardized parameters that can be used to configure the model:

| Parameter      | Description |
|----------------|-------------|
| `model`        | The name or identifier of the specific AI model you want to use (e.g., `"gpt-3.5-turbo"` or `"gpt-4"`). |
| `temperature`  | Controls the randomness of the model's output. A higher value (e.g., 1.0) makes responses more creative, while a lower value (e.g., 0.1) makes them more deterministic and focused. |
| `timeout`      | The maximum time (in seconds) to wait for a response from the model before canceling the request. Ensures the request doesn't hang indefinitely. |
| `max_tokens`   | Limits the total number of tokens (words and punctuation) in the response. This controls how long the output can be. |
| `stop`         | Specifies stop sequences that indicate when the model should stop generating tokens. For example, you might use specific strings to signal the end of a response. |
| `max_retries`  | The maximum number of attempts the system will make to resend a request if it fails due to issues like network timeouts or rate limits. |
| `api_key`      | The API key required for authenticating with the model provider. This is usually issued when you sign up for access to the model. |
| `base_url`     | The URL of the API endpoint where requests are sent. This is typically provided by the model's provider and is necessary for directing your requests. |
| `rate_limiter` | An optional [BaseRateLimiter](https://python.langchain.com/api_reference/core/rate_limiters/langchain_core.rate_limiters.BaseRateLimiter.html#langchain_core.rate_limiters.BaseRateLimiter) to space out requests to avoid exceeding rate limits. See [rate-limiting](#rate-limiting) below for more details. |

Some important things to note:

- Standard parameters only apply to model providers that expose parameters with the intended functionality. For example, some providers do not expose a configuration for maximum output tokens, so `max_tokens` can't be supported on these.
- Standard parameters are currently only enforced on integrations that have their own integration packages (e.g. `langchain-openai`, `langchain-anthropic`, etc.); they're not enforced on models in `langchain-community`.

ChatModels also accept other parameters that are specific to that integration. To find all the parameters supported by a ChatModel, head to the [API reference](https://python.langchain.com/api_reference/) for that model.
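As an illustration, standard parameters are passed at construction time. The example below uses `ChatOpenAI` purely as an example; the same parameters apply to any other integration that supports them:

```python
from langchain_openai import ChatOpenAI

model = ChatOpenAI(
    model="gpt-4o-mini",  # which model to call
    temperature=0.2,      # low randomness for focused answers
    max_tokens=512,       # cap on the length of the response
    timeout=30,           # seconds to wait before canceling the request
    max_retries=2,        # retry transient failures a couple of times
    # api_key and base_url are usually picked up from environment variables
    # (e.g., OPENAI_API_KEY) but can also be passed explicitly here.
)
```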
## Tool Calling

Chat models can call [tools](/docs/concepts/tools) to perform tasks such as fetching data from a database, making API requests, or running custom code. Please
see the [tool calling](/docs/concepts#tool-calling) guide for more information.
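A brief sketch of what this looks like. The `@tool` decorator and `bind_tools` are the standard LangChain APIs; the weather function and the `ChatOpenAI` model are illustrative assumptions:

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI


@tool
def get_weather(city: str) -> str:
    """Return the current weather for a city."""
    return f"It is sunny in {city}."


model = ChatOpenAI(model="gpt-4o-mini")
model_with_tools = model.bind_tools([get_weather])

ai_msg = model_with_tools.invoke("What's the weather in Paris?")
# If the model decided to use the tool, the request shows up in tool_calls.
print(ai_msg.tool_calls)
```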
## Structured Outputs

Chat models can be requested to respond in a particular format (e.g., JSON or matching a particular schema). This feature is extremely
useful for information extraction tasks. Please read more about
the technique in the [structured outputs](/docs/concepts#structured_output) guide.
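For example, a minimal sketch using a Pydantic schema (again assuming an OpenAI chat model; the `Movie` schema is made up for illustration):

```python
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI


class Movie(BaseModel):
    """Basic facts about a movie."""

    title: str = Field(description="The movie title")
    year: int = Field(description="Year of release")


model = ChatOpenAI(model="gpt-4o-mini")
structured_model = model.with_structured_output(Movie)

movie = structured_model.invoke("Tell me about the original Blade Runner.")
# `movie` is a Movie instance rather than a raw message.
print(movie.title, movie.year)
```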
## Multimodality

Large Language Models (LLMs) are not limited to processing text. They can also be used to process other types of data, such as images, audio, and video. This is known as [multimodality](/docs/concepts/multimodality).

Currently, only some LLMs support multimodal inputs, and almost none support multimodal outputs. Please consult the specific model documentation for details.

## Context Window

A chat model's context window refers to the maximum size of the input sequence the model can process at one time. While the context windows of modern LLMs are quite large, they still present a limitation that developers must keep in mind when working with chat models.

If the input exceeds the context window, the model may not be able to process the entire input and could raise an error. In conversational applications, this is especially important because the context window determines how much information the model can "remember" throughout a conversation. Developers often need to manage the input within the context window to maintain a coherent dialogue without exceeding the limit. For more details on handling memory in conversations, refer to the [memory guide](https://langchain-ai.github.io/langgraph/concepts/memory/).

The size of the input is measured in [tokens](/docs/concepts/tokens), which are the unit of processing that the model uses.

## Advanced Topics

### Rate-limiting

Many chat model providers impose a limit on the number of requests that can be made in a given time period.

If you hit a rate limit, you will typically receive a rate limit error response from the provider, and will need to wait before making more requests.

You have a few options to deal with rate limits:

1. Try to avoid hitting rate limits by spacing out requests: Chat models accept a `rate_limiter` parameter that can be provided during initialization. This parameter is used to control the rate at which requests are made to the model provider. Spacing out the requests to a given model is a particularly useful strategy when benchmarking models to evaluate their performance. Please see the [how to handle rate limits](https://python.langchain.com/docs/how_to/chat_model_rate_limiting/) guide for more information on how to use this feature, and the sketch after this list for what it looks like.
2. Try to recover from rate limit errors: If you receive a rate limit error, you can wait a certain amount of time before retrying the request. The amount of time to wait can be increased with each subsequent rate limit error. Chat models have a `max_retries` parameter that can be used to control the number of retries. See the [standard parameters](#standard-parameters) section for more information.
3. Fall back to another chat model: If you hit a rate limit with one chat model, you can switch to another chat model that is not rate-limited.
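A sketch of option 1, using the in-memory rate limiter from `langchain_core` (the specific rate values and the `ChatOpenAI` model are illustrative):

```python
from langchain_core.rate_limiters import InMemoryRateLimiter
from langchain_openai import ChatOpenAI

# Allow roughly one request every 10 seconds, checking for an available
# slot every 100 ms, with no burst capacity beyond a single request.
rate_limiter = InMemoryRateLimiter(
    requests_per_second=0.1,
    check_every_n_seconds=0.1,
    max_bucket_size=1,
)

model = ChatOpenAI(model="gpt-4o-mini", rate_limiter=rate_limiter)
```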
### Caching

Chat model APIs can be slow, so a natural question is whether to cache the results of previous conversations. Theoretically, caching can help improve performance by reducing the number of requests made to the model provider. In practice, caching chat model responses is a complex problem and should be approached with caution.

The reason is that getting a cache hit is unlikely after the first or second interaction in a conversation if relying on caching the **exact** inputs into the model. For example, how likely do you think that multiple conversations start with the exact same message? What about the exact same three messages?

An alternative approach is to use semantic caching, where you cache responses based on the meaning of the input rather than the exact input itself. This can be effective in some situations, but not in others.

A semantic cache introduces a dependency on another model on the critical path of your application (e.g., the semantic cache may rely on an [embedding model](/docs/concepts/embedding_models) to convert text to a vector representation), and it's not guaranteed to capture the meaning of the input accurately.

However, there might be situations where caching chat model responses is beneficial. For example, if you have a chat model that is used to answer frequently asked questions, caching responses can help reduce the load on the model provider and improve response times.
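As a rough sketch, an exact-match in-memory cache can be enabled globally like this (a simplified illustration; production setups would typically use a persistent or semantic cache backend, and the `ChatOpenAI` model is an assumption):

```python
from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache
from langchain_openai import ChatOpenAI

# Cache responses keyed on the exact prompt; repeated identical calls are
# served from memory instead of hitting the provider again.
set_llm_cache(InMemoryCache())

model = ChatOpenAI(model="gpt-4o-mini")
model.invoke("What is LangChain?")  # first call hits the API
model.invoke("What is LangChain?")  # second identical call hits the cache
```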
Please see the [how to cache chat model responses](/docs/how_to/#chat-model-caching) guide for more details.

## Related Resources

* How-to guides on using chat models: [how-to guides](/docs/how_to/#chat-models).
* List of supported chat models: [chat model integrations](/docs/integrations/chat/).

### Conceptual guides

* [Messages](/docs/concepts/messages)
* [Tool calling](/docs/concepts#tool-calling)
* [Multimodality](/docs/concepts/multimodality)
* [Structured outputs](/docs/concepts#structured_output)
* [Tokens](/docs/concepts/tokens)
@ -399,25 +399,7 @@ TODO(concepts): Add URL fragment
#### Tokens

* Conceptual Guide: [Tokens](/docs/concepts/tokens)

The unit that most model providers use to measure input and output is called a **token**.
Tokens are the basic units that language models read and generate when processing or producing text.
The exact definition of a token can vary depending on the specific way the model was trained -
for instance, in English, a token could be a single word like "apple", or a part of a word like "app".

When you send a model a prompt, the words and characters in the prompt are encoded into tokens using a **tokenizer**.
The model then streams back generated output tokens, which the tokenizer decodes into human-readable text.
The below example shows how OpenAI models tokenize `LangChain is cool!`:



You can see that it gets split into 5 different tokens, and that the boundaries between tokens are not exactly the same as word boundaries.
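You can reproduce this with the `tiktoken` library (assuming it is installed and using the `cl100k_base` encoding that OpenAI's GPT-3.5/GPT-4 models use):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("LangChain is cool!")

# Decode each token id individually to see the token boundaries.
print([enc.decode([t]) for t in token_ids])
# Expected to print something like: ['Lang', 'Chain', ' is', ' cool', '!']
```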
The reason language models use tokens rather than something more immediately intuitive like "characters"
has to do with how they process and understand text. At a high-level, language models iteratively predict their next generated output based on
the initial input and their previous generations. Training the model using tokens allows language models to handle linguistic
units (like words or subwords) that carry meaning, rather than individual characters, which makes it easier for the model
to learn and understand the structure of the language, including grammar and context.
Furthermore, using tokens can also improve efficiency, since the model processes fewer units of text compared to character-level processing.

### Function/tool calling
4
docs/docs/concepts/langgraph.mdx
Normal file
@ -0,0 +1,4 @@
# LangGraph

PLACEHOLDER TO BE REPLACED BY ACTUAL DOCUMENTATION
USED TO MAKE SURE THAT WE DO NOT FORGET TO ADD LINKS LATER
4
docs/docs/concepts/langserve.md
Normal file
@ -0,0 +1,4 @@
# LangServe

PLACEHOLDER TO BE REPLACED BY ACTUAL DOCUMENTATION
USED TO MAKE SURE THAT WE DO NOT FORGET TO ADD LINKS LATER
@ -1,30 +1,216 @@
# LangChain Expression Language (LCEL)

<span data-heading-keywords="lcel"></span>

:::info Prerequisites
* [Runnable Interface](/docs/concepts/runnables)
:::

The **L**ang**C**hain **E**xpression **L**anguage (LCEL) takes a [declarative](https://en.wikipedia.org/wiki/Declarative_programming) approach to building new [Runnables](/docs/concepts/runnables) from existing Runnables.

This means that you describe what you want to happen, rather than how you want it to happen, allowing LangChain to optimize the run-time execution of the chains.

We often refer to a `Runnable` created using LCEL as a "chain". It's important to remember that a "chain" is a `Runnable` and it implements the full [Runnable Interface](/docs/concepts/runnables).

:::note
* The [LCEL cheatsheet](https://python.langchain.com/docs/how_to/lcel_cheatsheet/) shows common patterns that involve the Runnable interface and LCEL expressions.
* Please see the following list of [how-to guides](/docs/how_to/#langchain-expression-language-lcel) that cover common tasks with LCEL.
* A list of built-in `Runnables` can be found in the [LangChain Core API Reference](https://python.langchain.com/api_reference/core/runnables.html). Many of these Runnables are useful when composing custom "chains" in LangChain using LCEL.
:::

## Benefits of LCEL

LangChain optimizes the run-time execution of chains built with LCEL in a number of ways:

- **Optimize parallel execution**: Run Runnables in parallel using [RunnableParallel](#runnableparallel) or run multiple inputs through a given chain in parallel using the [Runnable Batch API](/docs/concepts/runnables#batch). Parallel execution can significantly reduce the latency as processing can be done in parallel instead of sequentially.
- **Guarantee Async support**: Any chain built with LCEL can be run asynchronously using the [Runnable Async API](/docs/concepts/runnables#async-api). This can be useful when running chains in a server environment where you want to handle a large number of requests concurrently.
- **Simplify streaming**: LCEL chains can be streamed, allowing for incremental output as the chain is executed. LangChain can optimize the streaming of the output to minimize the time-to-first-token (time elapsed until the first chunk of output from a [chat model](/docs/concepts/chat_models) or [llm](/docs/concepts/llms) comes out).

Other benefits include:

- [**Seamless LangSmith tracing**](https://docs.smith.langchain.com)
  As your chains get more and more complex, it becomes increasingly important to understand what exactly is happening at every step.
  With LCEL, **all** steps are automatically logged to [LangSmith](https://docs.smith.langchain.com/) for maximum observability and debuggability.
- **Standard API**: Because all chains are built using the Runnable interface, they can be used in the same way as any other Runnable.
- [**Deployable with LangServe**](/docs/concepts/langserve): Chains built with LCEL can be deployed using LangServe for production use.

## Should I use LCEL?

LCEL is an [orchestration solution](https://en.wikipedia.org/wiki/Orchestration_(computing)) -- it allows LangChain to handle run-time execution of chains in an optimized way.

While we have seen users run chains with hundreds of steps in production, we generally recommend using LCEL for simpler orchestration tasks. When the application requires complex state management, branching, cycles or multiple agents, we recommend that users take advantage of [LangGraph](/docs/concepts/langgraph).

In LangGraph, users define graphs that specify the flow of the application. This allows users to keep using LCEL within individual nodes when LCEL is needed, while making it easy to define complex orchestration logic that is more readable and maintainable.

Here are some guidelines:

* If you are making a single LLM call, you don't need LCEL; instead call the underlying [chat model](/docs/concepts/chat_models) directly.
* If you have a simple chain (e.g., prompt + llm + parser, simple retrieval set up, etc.), LCEL is a reasonable fit, if you're taking advantage of the LCEL benefits.
* If you're building a complex chain (e.g., with branching, cycles, multiple agents, etc.) use [LangGraph](/docs/concepts/langgraph) instead. Remember that you can always use LCEL within individual nodes in LangGraph.
## Composition Primitives

`LCEL` chains are built by composing existing `Runnables` together. The two main composition primitives are [RunnableSequence](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.RunnableSequence.html#langchain_core.runnables.base.RunnableSequence) and [RunnableParallel](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.RunnableParallel.html#langchain_core.runnables.base.RunnableParallel).

Many other composition primitives (e.g., [RunnableAssign](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.passthrough.RunnableAssign.html#langchain_core.runnables.passthrough.RunnableAssign)) can be thought of as variations of these two primitives.

:::note
You can find a list of all composition primitives in the [LangChain Core API Reference](https://python.langchain.com/api_reference/core/runnables.html).
:::

### RunnableSequence

`RunnableSequence` is a composition primitive that allows you to "chain" multiple runnables sequentially, with the output of one runnable serving as the input to the next.

```python
from langchain_core.runnables import RunnableSequence

chain = RunnableSequence(runnable1, runnable2)
```

Invoking the `chain` with some input:

```python
final_output = chain.invoke(some_input)
```

corresponds to the following:

```python
output1 = runnable1.invoke(some_input)
final_output = runnable2.invoke(output1)
```

:::note
`runnable1` and `runnable2` are placeholders for any `Runnable` that you want to chain together.
:::

### RunnableParallel

`RunnableParallel` is a composition primitive that allows you to run multiple runnables concurrently, with the same input provided to each.

```python
from langchain_core.runnables import RunnableParallel

chain = RunnableParallel({
    "key1": runnable1,
    "key2": runnable2,
})
```

Invoking the `chain` with some input:

```python
final_output = chain.invoke(some_input)
```

Will yield a `final_output` dictionary with the same keys as the mapping passed to `RunnableParallel`, but with the values replaced by the output of the corresponding runnable.

```python
{
    "key1": runnable1.invoke(some_input),
    "key2": runnable2.invoke(some_input),
}
```

Recall that the runnables are executed in parallel, so while the result is the same as the dictionary shown above, the execution time is much faster.

:::note
`RunnableParallel` supports both synchronous and asynchronous execution (as all `Runnables` do).

* For synchronous execution, `RunnableParallel` uses a [ThreadPoolExecutor](https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor) to run the runnables concurrently.
* For asynchronous execution, `RunnableParallel` uses [asyncio.gather](https://docs.python.org/3/library/asyncio.html#asyncio.gather) to run the runnables concurrently.
:::
## Composition Syntax

The usage of `RunnableSequence` and `RunnableParallel` is so common that we created a shorthand syntax for using them. This helps
to make the code more readable and concise.

### The `|` operator

We have [overloaded](https://docs.python.org/3/reference/datamodel.html#special-method-names) the `|` operator to create a `RunnableSequence` from two `Runnables`.

```python
chain = runnable1 | runnable2
```

is equivalent to:

```python
chain = RunnableSequence(runnable1, runnable2)
```

### The `.pipe` method

If you have moral qualms with operator overloading, you can use the `.pipe` method instead. This is equivalent to the `|` operator.

```python
chain = runnable1.pipe(runnable2)
```

### Coercion

LCEL applies automatic type coercion to make it easier to compose chains.

If you do not understand the type coercion, you can always use the `RunnableSequence` and `RunnableParallel` classes directly.

This will make the code more verbose, but it will also make it more explicit.

#### Dictionary to RunnableParallel

Inside an LCEL expression, a dictionary is automatically converted to a `RunnableParallel`.

For example, the following code:

```python
mapping = {
    "key1": runnable1,
    "key2": runnable2,
}

chain = mapping | runnable3
```

gets automatically converted to the following:

```python
chain = RunnableSequence(RunnableParallel(mapping), runnable3)
```

:::caution
You have to be careful because the `mapping` dictionary is not a `RunnableParallel` object, it is just a dictionary. This means that the following code will raise an `AttributeError`:

```python
mapping.invoke(some_input)
```
:::

#### Function to RunnableLambda

Inside an LCEL expression, a function is automatically converted to a `RunnableLambda`.

```python
def some_func(x):
    return x

chain = some_func | runnable1
```

It gets automatically converted to the following:

```python
chain = RunnableSequence(RunnableLambda(some_func), runnable1)
```

:::caution
You have to be careful because the lambda function is not a `RunnableLambda` object, it is just a function. This means that the following code will raise an `AttributeError`:

```python
(lambda x: x + 1).invoke(some_input)
```
:::

## Legacy Chains

LCEL aims to provide consistency around behavior and customization over legacy subclassed chains such as `LLMChain` and
`ConversationalRetrievalChain`. Many of these legacy chains hide important details like prompts, and as a wider variety
@ -32,4 +218,4 @@ of viable models emerge, customization has become more and more important.
If you are currently using one of these legacy chains, please see [this guide for guidance on how to migrate](/docs/versions/migrating_chains).

For guides on how to do specific tasks with LCEL, check out [the relevant how-to guides](/docs/how_to/#langchain-expression-language-lcel).
@ -1 +1,39 @@
# Large Language Models (LLMs)

Large Language Models (LLMs) are advanced machine learning models that excel in a wide range of language-related tasks such as
text generation, translation, summarization, question answering, and more, without needing task-specific tuning for every scenario.

## Chat Models

Modern LLMs are typically exposed to users via a [Chat Model interface](/docs/concepts/chat_models). These models process sequences of [messages](/docs/concepts/messages) as input and output messages.

Popular chat models support native [tool calling](/docs/concepts#tool-calling) capabilities, which allows building applications
that can interact with external services, APIs, and databases, extract structured information from unstructured text, and more.

Modern LLMs are not limited to processing natural language text. They can also process other types of data, such as images, audio, and video. This is known as [multimodality](/docs/concepts/multimodality). Please see the [Chat Model Concept Guide](/docs/concepts/chat_models) page for more information.

## Terminology

In documentation, we will often use the terms "LLM" and "Chat Model" interchangeably. This is because most modern LLMs are exposed to users via a chat model interface.

However, users should be aware that there are two distinct interfaces for LLMs in LangChain:

1. Modern LLMs implement the [BaseChatModel](https://python.langchain.com/api_reference/core/language_models/langchain_core.language_models.chat_models.BaseChatModel.html) interface. These are chat models that process sequences of messages as input and output messages. Such models will typically be named with a convention that prefixes "Chat" to their class names (e.g., `ChatOllama`, `ChatAnthropic`, `ChatOpenAI`, etc.).
2. Older LLMs implement the [BaseLLM](https://python.langchain.com/api_reference/core/language_models/langchain_core.language_models.llms.BaseLLM.html#langchain_core.language_models.llms.BaseLLM) interface. These are LLMs that take as input text strings and output text strings. Such models are typically named just using the provider's name (e.g., `Ollama`, `Anthropic`, `OpenAI`, etc.). Generally, users should not use these models.

## Related Resources

Modern LLMs (aka Chat Models):

* [Conceptual Guide about Chat Models](/docs/concepts/chat_models/)
* [Chat Model Integrations](/docs/integrations/chat/)
* How-to Guides: [LLMs](/docs/how_to/#chat_models)

Text-in, text-out LLMs (older or lower-level models):

:::caution
Unless you have a specific use case that requires using these models, you should use the chat models instead.
:::

* [LLM Integrations](/docs/integrations/llms/)
* How-to Guides: [LLMs](/docs/how_to/#llms)
@ -1,12 +1,244 @@
# Messages

:::info Prerequisites
- [Chat Models](/docs/concepts/chat_models)
:::

## Overview

Messages are the unit of communication in [chat models](/docs/concepts/chat_models). They are used to represent the input and output of a chat model, as well as any additional context or metadata that may be associated with a conversation.

Each message has a **role** (e.g., "user", "assistant"), **content** (e.g., text, multimodal data), and additional metadata that can vary depending on the chat model provider.

LangChain provides a unified message format that can be used across chat models, allowing users to work with different chat models without worrying about the specific details of the message format used by each model provider.

## What is inside a message?

A message typically consists of the following pieces of information:

- **Role**: The role of the message (e.g., "user", "assistant").
- **Content**: The content of the message (e.g., text, multimodal data).
- Additional metadata: id, name, [token usage](/docs/concepts/tokens) and other model-specific metadata.

### Role

Roles are used to distinguish between different types of messages in a conversation and help the chat model understand how to respond to a given sequence of messages.

| **Role**              | **Description** |
|-----------------------|-----------------|
| **system**            | Used to tell the chat model how to behave and provide additional context. Not supported by all chat model providers. |
| **user**              | Represents input from a user interacting with the model, usually in the form of text or other interactive input. |
| **assistant**         | Represents a response from the model, which can include text or a request to invoke tools. |
| **tool**              | A message used to pass the results of a tool invocation back to the model after external data or processing has been retrieved. Used with chat models that support [tool calling](/docs/concepts/tool_calling). |
| **function (legacy)** | This is a legacy role, corresponding to OpenAI's legacy function-calling API. The **tool** role should be used instead. |

### Content

The content of a message is text or a list of dictionaries representing [multimodal data](/docs/concepts/multimodality) (e.g., images, audio, video). The exact format of the content can vary between different chat model providers.

Currently, most chat models support text as the primary content type, with some models also supporting multimodal data. However, support for multimodal data is still limited across most chat model providers.

For more information see:
* [HumanMessage](#humanmessage) -- for content in the input from the user.
* [AIMessage](#aimessage) -- for content in the response from the model.
* [Multimodality](/docs/concepts/multimodality) -- for more information on multimodal content.

### Other Message Data

Depending on the chat model provider, messages can include other data such as:

- **ID**: An optional unique identifier for the message.
- **Name**: An optional `name` property which allows you to differentiate between different entities/speakers with the same role. Not all models support this!
- **Metadata**: Additional information about the message, such as timestamps, token usage, etc.
- **Tool Calls**: A request made by the model to call one or more tools. See [tool calling](/docs/concepts/tool_calling) for more information.

## Conversation Structure

The sequence of messages passed into a chat model should follow a specific structure to ensure that the chat model can generate a valid response.

For example, a typical conversation structure might look like this (a sketch of the same exchange as LangChain message objects follows the list):

1. **User Message**: "Hello, how are you?"
2. **Assistant Message**: "I'm doing well, thank you for asking."
3. **User Message**: "Can you tell me a joke?"
4. **Assistant Message**: "Sure! Why did the scarecrow win an award? Because he was outstanding in his field!"
|
||||||
|
|
||||||
|
Please read the [chat history](/docs/concepts/chat_history) guide for more information on managing chat history and ensuring that the conversation structure is correct.
|
||||||
|
|
||||||
|
## LangChain Messages
|
||||||
|
|
||||||
|
LangChain provides a unified message format that can be used across all chat models, allowing users to work with different chat models without worrying about the specific details of the message format used by each model provider.
|
||||||
|
|
||||||
|
LangChain messages are Python objects that subclass from a [BaseMessage](https://python.langchain.com/api_reference/core/messages/langchain_core.messages.base.BaseMessage.html).
|
||||||
|
|
||||||
|
The five main message types are:
|
||||||
|
|
||||||
|
- [SystemMessage](#systemmessage): corresponds to **system** role
|
||||||
|
- [HumanMessage](#humanmessage): corresponds to **user** role
|
||||||
|
- [AIMessage](#aimessage): corresponds to **assistant** role
|
||||||
|
- [AIMessageChunk](#aimessagechunk): corresponds to **assistant** role, used for [streaming](/docs/concepts/streaming) responses
|
||||||
|
- [ToolMessage](#toolmessage): corresponds to **tool** role
|
||||||
|
|
||||||
|
Other important messages include:
|
||||||
|
|
||||||
|
- [RemoveMessage](#removemessage) -- does not correspond to any role. This is an abstraction, mostly used in [LangGraph](/docs/concepts/langgraph) to manage chat history.
|
||||||
|
- **Legacy** [FunctionMessage](#legacy-functionmessage): corresponds to the **function** role in OpenAI's **legacy** function-calling API.
|
||||||
|
|
||||||
|
You can find more information about **messages** in the [API Reference](https://python.langchain.com/api_reference/core/messages.html).
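For example, a short conversation can be represented with these classes directly (a minimal sketch; `model` stands in for any LangChain chat model):

```python
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage

messages = [
    SystemMessage(content="You are a concise assistant."),
    HumanMessage(content="Hello, how are you?"),
    AIMessage(content="I'm doing well, thank you for asking."),
    HumanMessage(content="Can you tell me a joke?"),
]

# `model` is assumed to be any LangChain chat model, e.g. ChatOpenAI or ChatAnthropic.
model.invoke(messages)
```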
|
||||||
|
|
||||||
|
### SystemMessage
|
||||||
|
|
||||||
|
A `SystemMessage` is used to prime the behavior of the AI model and provide additional context, such as instructing the model to adopt a specific persona or setting the tone of the conversation (e.g., "This is a conversation about cooking").
|
||||||
|
|
||||||
|
Different chat providers may support system messages in one of the following ways:
|
||||||
|
|
||||||
|
* **Through a "system" message role**: In this case, a system message is included as part of the message sequence with the role explicitly set as "system."
|
||||||
|
* **Through a separate API parameter for system instructions**: Instead of being included as a message, system instructions are passed via a dedicated API parameter.
|
||||||
|
* **No support for system messages**: Some models do not support system messages at all.
|
||||||
|
|
||||||
|
Most major chat model providers support system instructions via either a chat message or a separate API parameter. LangChain will automatically adapt based on the provider’s capabilities. If the provider supports a separate API parameter for system instructions, LangChain will extract the content of a system message and pass it through that parameter.
|
||||||
|
|
||||||
|
If the provider does not support system messages, in most cases LangChain will attempt to incorporate the system message's content into a HumanMessage, or will raise an exception if that is not possible. However, this behavior is not yet consistently enforced across all implementations, so if you are using a less popular chat model implementation (e.g., one from the `langchain-community` package), it is recommended to check that model's documentation.
|
||||||
|
|
||||||
|
### HumanMessage
|
||||||
|
|
||||||
|
The `HumanMessage` corresponds to the **"user"** role. A human message represents input from a user interacting with the model.
|
||||||
|
|
||||||
|
#### Text Content
|
||||||
|
|
||||||
|
Most chat models expect the user input to be in the form of text.
|
||||||
|
|
||||||
|
```python
|
||||||
|
from langchain_core.messages import HumanMessage
|
||||||
|
|
||||||
|
model.invoke([HumanMessage(content="Hello, how are you?")])
|
||||||
|
```
|
||||||
|
|
||||||
|
:::tip
|
||||||
|
When invoking a chat model with a string as input, LangChain will automatically convert the string into a `HumanMessage` object. This is mostly useful for quick testing.
|
||||||
|
|
||||||
|
```python
|
||||||
|
model.invoke("Hello, how are you?")
|
||||||
|
```
|
||||||
|
:::
|
||||||
|
|
||||||
|
#### Multi-modal Content
|
||||||
|
|
||||||
|
Some chat models accept multimodal inputs, such as images, audio, video, or files like PDFs.
|
||||||
|
|
||||||
|
Please see the [multimodality](/docs/concepts/multimodality) guide for more information.
|
||||||
|
|
||||||
|
### AIMessage
|
||||||
|
|
||||||
|
`AIMessage` is used to represent a message with the role **"assistant"**. This is the response from the model, which can include text or a request to invoke tools. It could also include other media types like images, audio, or video -- though this is still uncommon at the moment.
|
||||||
|
|
||||||
|
```python
|
||||||
|
from langchain_core.messages import HumanMessage
|
||||||
|
ai_message = model.invoke([HumanMessage("Tell me a joke")])
|
||||||
|
ai_message # <-- AIMessage
|
||||||
|
```
|
||||||
|
|
||||||
|
An `AIMessage` has the following attributes. The attributes which are **standardized** are the ones that LangChain attempts to standardize across different chat model providers. **raw** fields are specific to the model provider and may vary.
|
||||||
|
|
||||||
|
| Attribute | Standardized/Raw | Description |
|
||||||
|
|----------------------|:-----------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||||
|
| `content` | Raw | Usually a string, but can be a list of content blocks. See [content](#content) for details. |
|
||||||
|
| `tool_calls` | Standardized | Tool calls associated with the message. See [tool calling](/docs/concepts/tool_calling) for details. |
|
||||||
|
| `invalid_tool_calls` | Standardized | Tool calls with parsing errors associated with the message. See [tool calling](/docs/concepts/tool_calling) for details. |
|
||||||
|
| `usage_metadata` | Standardized | Usage metadata for a message, such as [token counts](/docs/concepts/tokens). See [Usage Metadata API Reference](https://python.langchain.com/api_reference/core/messages/langchain_core.messages.ai.UsageMetadata.html) |
|
||||||
|
| `id` | Standardized | An optional unique identifier for the message, ideally provided by the provider/model that created the message. |
|
||||||
|
| `response_metadata` | Raw | Response metadata, e.g., response headers, logprobs, token counts. |
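For instance, the standardized fields can be read directly off the response (a sketch; `model` is any chat model, and the exact values depend on the provider):

```python
ai_message = model.invoke("Tell me a joke")

ai_message.content            # text of the response
ai_message.usage_metadata     # standardized token counts, if reported by the provider
ai_message.tool_calls         # empty list unless the model requested a tool call
ai_message.response_metadata  # raw, provider-specific details
```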
|
||||||
|
|
||||||
|
#### content
|
||||||
|
|
||||||
|
The **content** property of an `AIMessage` represents the response generated by the chat model.
|
||||||
|
|
||||||
|
The content is either:
|
||||||
|
|
||||||
|
- **text** -- the norm for virtually all chat models.
|
||||||
|
- A **list of dictionaries** -- Each dictionary represents a content block and is associated with a `type`.
|
||||||
|
* Used by Anthropic for surfacing agent thought process when doing [tool calling](/docs/concepts/tool_calling).
|
||||||
|
* Used by OpenAI for audio outputs. Please see [multi-modal content](/docs/concepts/multimodality) for more information.
|
||||||
|
|
||||||
|
:::important
|
||||||
|
The **content** property is **not** standardized across different chat model providers, mostly because there are
|
||||||
|
still few examples to generalize from.
|
||||||
|
:::
|
||||||
|
|
||||||
|
### AIMessageChunk
|
||||||
|
|
||||||
|
It is common to [stream](/docs/concepts/streaming) responses for the chat model as they are being generated, so the user can see the response in real-time instead of waiting for the entire response to be generated before displaying it.
|
||||||
|
|
||||||
|
An `AIMessageChunk` is returned from the `stream`, `astream` and `astream_events` methods of the chat model.
|
||||||
|
|
||||||
|
For example,
|
||||||
|
|
||||||
|
```python
|
||||||
|
for chunk in model.stream([HumanMessage("what color is the sky?")]):
|
||||||
|
print(chunk)
|
||||||
|
```
|
||||||
|
|
||||||
|
`AIMessageChunk` follows nearly the same structure as `AIMessage`, but uses a different [ToolCallChunk](https://python.langchain.com/api_reference/core/messages/langchain_core.messages.tool.ToolCallChunk.html#langchain_core.messages.tool.ToolCallChunk)
|
||||||
|
to be able to stream tool calling in a standardized manner.
|
||||||
|
|
||||||
|
|
||||||
|
#### Aggregating
|
||||||
|
|
||||||
|
`AIMessageChunks` support the `+` operator to merge them into a single `AIMessage`. This is useful when you want to display the final response to the user.
|
||||||
|
|
||||||
|
```python
|
||||||
|
ai_message = chunk1 + chunk2 + chunk3 + ...
|
||||||
|
```
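For instance, the chunks yielded while streaming can be merged as they arrive (a minimal sketch; `model` stands in for any chat model):

```python
from langchain_core.messages import HumanMessage

final_message = None
for chunk in model.stream([HumanMessage("what color is the sky?")]):
    # `+` merges the content (and any tool call chunks) of the two messages
    final_message = chunk if final_message is None else final_message + chunk

print(final_message.content)
```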
|
||||||
|
|
||||||
|
### ToolMessage
|
||||||
|
|
||||||
|
This represents a message with role "tool", which contains the result of [calling a tool](/docs/concepts/tool_calling). In addition to `role` and `content`, this message has:
|
||||||
|
|
||||||
|
- a `tool_call_id` field which conveys the id of the call to the tool that was called to produce this result.
|
||||||
|
- an `artifact` field which can be used to pass along arbitrary artifacts of the tool execution which are useful to track but which should not be sent to the model.
|
||||||
|
|
||||||
|
Please see [tool calling](/docs/concepts/tool_calling) for more information.
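A `ToolMessage` can be constructed directly; the values below are hypothetical and would normally come from executing a tool requested by the model:

```python
from langchain_core.messages import ToolMessage

ToolMessage(
    content="42",                    # result that is sent back to the model
    tool_call_id="call_abc123",      # id of the tool call this result answers
    artifact={"raw_rows": [40, 2]},  # optional extra data that is *not* sent to the model
)
```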
|
||||||
|
|
||||||
|
### RemoveMessage
|
||||||
|
|
||||||
|
This is a special message type that does not correspond to any role. It is used for managing chat history in [LangGraph](/docs/concepts/langgraph).
|
||||||
|
|
||||||
|
Please see the following for more information on how to use the `RemoveMessage`:
|
||||||
|
|
||||||
|
* [Memory conceptual guide](https://langchain-ai.github.io/langgraph/concepts/memory/)
|
||||||
|
* [How to delete messages](https://langchain-ai.github.io/langgraph/how-tos/memory/delete-messages/)
|
||||||
|
|
||||||
|
### (Legacy) FunctionMessage
|
||||||
|
|
||||||
|
This is a legacy message type, corresponding to OpenAI's legacy function-calling API. `ToolMessage` should be used instead to correspond to the updated tool-calling API.
|
||||||
|
|
||||||
|
## OpenAI Format
|
||||||
|
|
||||||
|
### Inputs
|
||||||
|
|
||||||
|
Chat models also accept messages in OpenAI's format as **inputs**:
|
||||||
|
|
||||||
|
```python
|
||||||
|
chat_model.invoke([
|
||||||
|
{
|
||||||
|
"role": "user",
|
||||||
|
"content": "Hello, how are you?",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"role": "assistant",
|
||||||
|
"content": "I'm doing well, thank you for asking.",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"role": "user",
|
||||||
|
"content": "Can you tell me a joke?",
|
||||||
|
}
|
||||||
|
])
|
||||||
|
```
|
||||||
|
|
||||||
|
### Outputs
|
||||||
|
|
||||||
|
At the moment, the output of the model will be in the form of LangChain messages, so you will need to convert it yourself if you need the output in OpenAI format as well.
|
||||||
|
|
||||||
|
The [convert_to_openai_messages](https://python.langchain.com/api_reference/core/messages/langchain_core.messages.utils.convert_to_openai_messages.html) utility function can be used to convert from LangChain messages to OpenAI format.
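A minimal sketch of the conversion (the exact dictionary layout is determined by the utility, so treat the commented output as illustrative):

```python
from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.messages.utils import convert_to_openai_messages

messages = [
    HumanMessage(content="Hello, how are you?"),
    AIMessage(content="I'm doing well, thank you for asking."),
]

convert_to_openai_messages(messages)
# [{'role': 'user', 'content': 'Hello, how are you?'},
#  {'role': 'assistant', 'content': "I'm doing well, thank you for asking."}]
```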
|
@ -1,11 +1,88 @@

# Multimodality

## Overview

**Multimodality** refers to the ability to work with data that comes in different forms, such as text, audio, images, and video. Multimodality can appear in various components, allowing models and systems to handle and process a mix of these data types seamlessly.

- **Chat Models**: These could, in theory, accept and generate multimodal inputs and outputs, handling a variety of data types like text, images, audio, and video.
- **Embedding Models**: Embedding Models can represent multimodal content, embedding various forms of data—such as text, images, and audio—into vector spaces.
- **Vector Stores**: Vector stores could search over embeddings that represent multimodal data, enabling retrieval across different types of information.

## Multimodality in Chat models
|
||||||
|
|
||||||
|
:::info Pre-requisites
|
||||||
|
* [Chat models](/docs/concepts/chat_models)
|
||||||
|
* [Messages](/docs/concepts/messages)
|
||||||
|
:::
|
||||||
|
|
||||||
|
Multimodal support is still relatively new and less common, and model providers have not yet standardized on the "best" way to define the API. As such, LangChain's multimodal abstractions are lightweight and flexible, designed to accommodate different model providers' APIs and interaction patterns, but they are **not** standardized across models.
|
||||||
|
|
||||||
|
### How to use multimodal models
|
||||||
|
|
||||||
|
* Use the [chat model integration table](/docs/integrations/chat/) to identify which models support multimodality.
|
||||||
|
* Reference the [relevant how-to guides](/docs/how_to/#multimodal) for specific examples of how to use multimodal models.
|
||||||
|
|
||||||
|
### What kind of multimodality is supported?
|
||||||
|
|
||||||
|
#### Inputs
|
||||||
|
|
||||||
|
Some models can accept multimodal inputs, such as images, audio, video, or files. The types of multimodal inputs supported depend on the model provider. For instance, [Google's Gemini](https://python.langchain.com/docs/integrations/chat/google_generative_ai/) supports documents like PDFs as inputs.
|
||||||
|
|
||||||
|
Most chat models that support **multimodal inputs** also accept those values in OpenAI's content blocks format. So far this is restricted to image inputs. For models like Gemini which support video and other bytes input, the APIs also support the native, model-specific representations.
|
||||||
|
|
||||||
|
The gist of passing multimodal inputs to a chat model is to use content blocks that specify a type and corresponding data. For example, to pass an image to a chat model:
|
||||||
|
|
||||||
|
```python
|
||||||
|
from langchain_core.messages import HumanMessage
|
||||||
|
|
||||||
|
message = HumanMessage(
|
||||||
|
content=[
|
||||||
|
{"type": "text", "text": "describe the weather in this image"},
|
||||||
|
{"type": "image_url", "image_url": {"url": image_url}},
|
||||||
|
],
|
||||||
|
)
|
||||||
|
response = model.invoke([message])
|
||||||
|
```
|
||||||
|
|
||||||
|
:::caution
|
||||||
|
The exact format of the content blocks may vary depending on the model provider. Please refer to the chat model's
|
||||||
|
integration documentation for the correct format. Find the integration in the [chat model integration table](/docs/integrations/chat/).
|
||||||
|
:::
|
||||||
|
|
||||||
|
#### Outputs
|
||||||
|
|
||||||
|
Virtually no popular chat models support multimodal outputs at the time of writing (October 2024).
|
||||||
|
|
||||||
|
The only exception is OpenAI's chat model ([gpt-4o-audio-preview](https://python.langchain.com/docs/integrations/chat/openai/)), which can generate audio outputs.
|
||||||
|
|
||||||
|
Multimodal outputs will appear as part of the [AIMessage](/docs/concepts/messages/#aimessage) response object.
|
||||||
|
|
||||||
|
Please see the [ChatOpenAI](/docs/integrations/chat/openai/) integration documentation for more information on how to use multimodal outputs.
|
||||||
|
|
||||||
|
#### Tools
|
||||||
|
|
||||||
|
Currently, no chat model is designed to work **directly** with multimodal data in a [tool call request](/docs/concepts/tool_calling) or [ToolMessage](/docs/concepts/tool_calling) result.
|
||||||
|
|
||||||
|
However, a chat model can easily interact with multimodal data by invoking tools with references (e.g., a URL) to the multimodal data, rather than the data itself. For example, any model capable of [tool calling](/docs/concepts/tool_calling) can be equipped with tools to download and process images, audio, or video.
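As a sketch of that pattern, a tool can accept a URL and do the multimodal processing itself; the `describe_image` helper below is hypothetical:

```python
from langchain_core.tools import tool


@tool
def describe_image(image_url: str) -> str:
    """Download the image at `image_url` and return a short description."""
    # Hypothetical: fetch the image and run it through an image-capable model or service.
    return f"A description of the image at {image_url}"


# Any tool-calling chat model can then decide when to invoke the tool.
model_with_tools = model.bind_tools([describe_image])
```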
|
||||||
|
|
||||||
|
## Multimodality in embedding models
|
||||||
|
|
||||||
|
:::info Prerequisites
|
||||||
|
* [Embedding Models](/docs/concepts/embedding_models)
|
||||||
|
:::
|
||||||
|
|
||||||
|
**Embeddings** are vector representations of data used for tasks like similarity search and retrieval.
|
||||||
|
|
||||||
|
The current [embedding interface](https://python.langchain.com/api_reference/core/embeddings/langchain_core.embeddings.embeddings.Embeddings.html#langchain_core.embeddings.embeddings.Embeddings) used in LangChain is optimized entirely for text-based data, and will **not** work with multimodal data.
|
||||||
|
|
||||||
|
As use cases involving multimodal search and retrieval tasks become more common, we expect to expand the embedding interface to accommodate other data types like images, audio, and video.
|
||||||
|
|
||||||
|
## Multimodality in vector stores
|
||||||
|
|
||||||
|
:::info Prerequisites
|
||||||
|
* [Vectorstores](/docs/concepts/vectorstores)
|
||||||
|
:::
|
||||||
|
|
||||||
|
Vector stores are databases for storing and retrieving embeddings, which are typically used in search and retrieval tasks. Similar to embeddings, vector stores are currently optimized for text-based data.
|
||||||
|
|
||||||
|
As use cases involving multimodal search and retrieval tasks become more common, we expect to expand the vector store interface to accommodate other data types like images, audio, and video.
|
||||||
|
@ -1,35 +1,352 @@

# Runnable Interface
<span data-heading-keywords="invoke,runnable"></span>

The Runnable interface is foundational for working with LangChain components, and it's implemented across many of them, such as [language models](/docs/concepts/chat_models), [output parsers](/docs/concepts/output_parsers), [retrievers](/docs/concepts/retrievers), [compiled LangGraph graphs](https://langchain-ai.github.io/langgraph/concepts/low_level/#compiling-your-graph) and more.

This guide covers the main concepts and methods of the Runnable interface, which allows developers to interact with various LangChain components in a consistent and predictable manner.

:::info Related Resources
* The ["Runnable" Interface API Reference](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.Runnable.html#langchain_core.runnables.base.Runnable) provides a detailed overview of the Runnable interface and its methods.
* A list of built-in `Runnables` can be found in the [LangChain Core API Reference](https://python.langchain.com/api_reference/core/runnables.html). Many of these Runnables are useful when composing custom "chains" in LangChain using the [LangChain Expression Language (LCEL)](/docs/concepts/lcel).
:::

## Overview of Runnable Interface

The Runnable interface defines a standard set of methods that allow a Runnable component to be:

* [Invoked](/docs/how_to/lcel_cheatsheet/#invoke-a-runnable): A single input is transformed into an output.
* [Batched](/docs/how_to/lcel_cheatsheet/#batch-a-runnable/): Multiple inputs are efficiently transformed into outputs.
* [Streamed](/docs/how_to/lcel_cheatsheet/#stream-a-runnable): Outputs are streamed as they are produced.
* Inspected: Schematic information about a Runnable's input, output, and configuration can be accessed.
* Composed: Multiple Runnables can be composed to work together using [the LangChain Expression Language (LCEL)](/docs/concepts/lcel) to create complex pipelines.

Please review the [LCEL Cheatsheet](/docs/how_to/lcel_cheatsheet) for some common patterns that involve the Runnable interface and LCEL expressions.

<a id="batch"></a>
### Optimized Parallel Execution (Batch)
<span data-heading-keywords="batch"></span>

LangChain Runnables offer a built-in `batch` (and `batch_as_completed`) API that allows you to process multiple inputs in parallel.

Using these methods can significantly improve performance when you need to process multiple independent inputs, as the processing can be done in parallel instead of sequentially.
|
||||||
|
|
||||||
|
The two batching options are:
|
||||||
|
|
||||||
|
* `batch`: Process multiple inputs in parallel, returning results in the same order as the inputs.
|
||||||
|
* `batch_as_completed`: Process multiple inputs in parallel, returning results as they complete. Results may arrive out of order, but each includes the input index for matching.
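A minimal sketch of the two options above (assuming `langchain-openai` is installed and an API key is configured; any Runnable works the same way):

```python
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o-mini")

questions = ["What is 2 + 2?", "Name a primary color.", "What day follows Monday?"]

# `batch` returns results in the same order as the inputs.
answers = model.batch(questions)

# `batch_as_completed` yields (index, output) pairs as each call finishes.
for index, answer in model.batch_as_completed(questions):
    print(index, answer.content)
```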
|
||||||
|
|
||||||
|
The default implementation of `batch` and `batch_as_completed` use a thread pool executor to run the `invoke` method in parallel. This allows for efficient parallel execution without the need for users to manage threads, and speeds up code that is I/O-bound (e.g., making API requests, reading files, etc.). It will not be as effective for CPU-bound operations, as the GIL (Global Interpreter Lock) in Python will prevent true parallel execution.
|
||||||
|
|
||||||
|
Some Runnables may provide their own implementations of `batch` and `batch_as_completed` that are optimized for their specific use case (e.g.,
|
||||||
|
rely on a `batch` API provided by a model provider).
|
||||||
|
|
||||||
|
:::note
|
||||||
|
The async versions, `abatch` and `abatch_as_completed`, rely on asyncio's [gather](https://docs.python.org/3/library/asyncio-task.html#asyncio.gather) and [as_completed](https://docs.python.org/3/library/asyncio-task.html#asyncio.as_completed) functions to run the `ainvoke` method in parallel.
|
||||||
|
:::
|
||||||
|
|
||||||
|
:::tip
|
||||||
|
When processing a large number of inputs using `batch` or `batch_as_completed`, users may want to control the maximum number of parallel calls. This can be done by setting the `max_concurrency` attribute in the `RunnableConfig` dictionary. See the [RunnableConfig](/docs/concepts/runnables#runnableconfig) for more information.
|
||||||
|
|
||||||
|
Chat Models also have a built-in [rate limiter](/docs/concepts/chat_models#rate-limiting) that can be used to control the rate at which requests are made.
|
||||||
|
:::
|
||||||
|
|
||||||
|
### Asynchronous Support
|
||||||
|
<span data-heading-keywords="async-api"></span>
|
||||||
|
|
||||||
|
Runnables expose an asynchronous API, allowing them to be called using the `await` syntax in Python. Asynchronous methods can be identified by the "a" prefix (e.g., `ainvoke`, `abatch`, `astream`, `abatch_as_completed`).
|
||||||
|
|
||||||
|
Please refer to the [Async Programming with LangChain](/docs/concepts/async) guide for more details.
|
||||||
|
|
||||||
|
## Streaming APIs
|
||||||
|
<span data-heading-keywords="streaming-api"></span>
|
||||||
|
|
||||||
|
Streaming is critical in making applications based on LLMs feel responsive to end-users.
|
||||||
|
|
||||||
|
Runnables expose the following three streaming APIs:
|
||||||
|
|
||||||
|
1. sync [stream](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.Runnable.html#langchain_core.runnables.base.Runnable.stream) and async [astream](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.Runnable.html#langchain_core.runnables.base.Runnable.astream): yields the output of a Runnable as it is generated.
|
||||||
|
2. The async `astream_events`: a more advanced streaming API that allows streaming intermediate steps and final output
|
||||||
|
3. The **legacy** async `astream_log`: a legacy streaming API that streams intermediate steps and final output
|
||||||
|
|
||||||
|
Please refer to the [Streaming Conceptual Guide](/docs/concepts/streaming) for more details on how to stream in LangChain.
|
||||||
|
|
||||||
|
## Input and Output Types
|
||||||
|
|
||||||
|
Every `Runnable` is characterized by an input and output type. These input and output types can be any Python object, and are defined by the Runnable itself.
|
||||||
|
|
||||||
|
Runnable methods that result in the execution of the Runnable (e.g., `invoke`, `batch`, `stream`, `astream_events`) work with these input and output types.
|
||||||
|
|
||||||
|
* invoke: Accepts an input and returns an output.
|
||||||
|
* batch: Accepts a list of inputs and returns a list of outputs.
|
||||||
|
* stream: Accepts an input and returns a generator that yields outputs.
|
||||||
|
|
||||||
|
The **input type** and **output type** vary by component:
|
||||||
|
|
||||||
|
| Component | Input Type | Output Type |
|
||||||
|
|--------------|--------------------------------------------------|-----------------------|
|
||||||
|
| Prompt | dictionary | PromptValue |
|
||||||
|
| ChatModel | a string, list of chat messages or a PromptValue | ChatMessage |
|
||||||
|
| LLM | a string, list of chat messages or a PromptValue | String |
|
||||||
|
| OutputParser | the output of an LLM or ChatModel | Depends on the parser |
|
||||||
|
| Retriever | a string | List of Documents |
|
||||||
|
| Tool | a string or dictionary, depending on the tool | Depends on the tool |
|
||||||
|
|
||||||
|
Please refer to the individual component documentation for more information on the input and output types and how to use them.
|
||||||
|
|
||||||
|
### Inspecting Schemas
|
||||||
|
|
||||||
|
:::note
|
||||||
|
This is an advanced feature that is unnecessary for most users. You should probably
|
||||||
|
skip this section unless you have a specific need to inspect the schema of a Runnable.
|
||||||
|
:::
|
||||||
|
|
||||||
|
In some advanced uses, you may want to programmatically **inspect** the Runnable and determine what input and output types the Runnable expects and produces.
|
||||||
|
|
||||||
|
The Runnable interface provides methods to get the [JSON Schema](https://json-schema.org/) of the input and output types of a Runnable, as well as [Pydantic schemas](https://docs.pydantic.dev/latest/) for the input and output types.
|
||||||
|
|
||||||
|
These APIs are mostly used internally for unit-testing and by [LangServe](/docs/concepts/langserve) which uses the APIs for input validation and generation of [OpenAPI documentation](https://www.openapis.org/).
|
||||||
|
|
||||||
|
In addition to the input and output types, some Runnables have been set up with additional run time configuration options.
|
||||||
|
There are corresponding APIs to get the Pydantic Schema and JSON Schema of the configuration options for the Runnable.
|
||||||
|
Please see the [Configurable Runnables](#configurable-runnables) section for more information.
|
||||||
|
|
||||||
|
| Method | Description |
|
||||||
|
|-------------------------|------------------------------------------------------------------|
|
||||||
|
| `get_input_schema` | Gives the Pydantic Schema of the input schema for the Runnable. |
|
||||||
|
| `get_output_schema` | Gives the Pydantic Schema of the output schema for the Runnable. |
|
||||||
|
| `config_schema` | Gives the Pydantic Schema of the config schema for the Runnable. |
|
||||||
|
| `get_input_jsonschema` | Gives the JSONSchema of the input schema for the Runnable. |
|
||||||
|
| `get_output_jsonschema` | Gives the JSONSchema of the output schema for the Runnable. |
|
||||||
|
| `get_config_jsonschema` | Gives the JSONSchema of the config schema for the Runnable. |
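For example, a prompt template's input schema can be inspected like this (a sketch; the exact schema contents depend on the Runnable):

```python
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template("tell me a joke about {topic}")

prompt.get_input_schema()      # Pydantic model describing the expected {"topic": ...} input
prompt.get_input_jsonschema()  # the same information as a JSON Schema dictionary
```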
|
||||||
|
|
||||||
|
|
||||||
#### with_types

LangChain will automatically try to infer the input and output types of a Runnable based on available information.

Currently, this inference does not work well for more complex Runnables that are built using [LCEL](/docs/concepts/lcel) composition, and the inferred input and / or output types may be incorrect. In these cases, we recommend that users override the inferred input and output types using the `with_types` method ([API Reference](https://api.python.langchain.com/en/latest/runnables/langchain_core.runnables.base.Runnable.html#langchain_core.runnables.base.Runnable.with_types)).
|
||||||
|
|
||||||
|
## RunnableConfig
|
||||||
|
|
||||||
|
Any of the methods that are used to execute the runnable (e.g., `invoke`, `batch`, `stream`, `astream_events`) accept a second argument called
|
||||||
|
`RunnableConfig` ([API Reference](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.config.RunnableConfig.html#runnableconfig)). This argument is a dictionary that contains configuration for the Runnable that will be used
|
||||||
|
at run time during the execution of the runnable.
|
||||||
|
|
||||||
|
A `RunnableConfig` can have any of the following properties defined:
|
||||||
|
|
||||||
|
| Attribute | Description |
|
||||||
|
|-----------------|--------------------------------------------------------------------------------------------|
|
||||||
|
| run_name | Name used for the given Runnable (not inherited). |
|
||||||
|
| run_id | Unique identifier for this call. sub-calls will get their own unique run ids. |
|
||||||
|
| tags | Tags for this call and any sub-calls. |
|
||||||
|
| metadata | Metadata for this call and any sub-calls. |
|
||||||
|
| callbacks | Callbacks for this call and any sub-calls. |
|
||||||
|
| max_concurrency | Maximum number of parallel calls to make (e.g., used by batch). |
|
||||||
|
| recursion_limit | Maximum number of times a call can recurse (e.g., used by Runnables that return Runnables) |
|
||||||
|
| configurable | Runtime values for configurable attributes of the Runnable. |
|
||||||
|
|
||||||
|
Passing `config` to the `invoke` method is done like so:
|
||||||
|
|
||||||
|
```python
|
||||||
|
some_runnable.invoke(
|
||||||
|
some_input,
|
||||||
|
config={
|
||||||
|
'run_name': 'my_run',
|
||||||
|
'tags': ['tag1', 'tag2'],
|
||||||
|
'metadata': {'key': 'value'}
|
||||||
|
|
||||||
|
}
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Propagation of RunnableConfig
|
||||||
|
|
||||||
|
Many `Runnables` are composed of other Runnables, and it is important that the `RunnableConfig` is propagated to all sub-calls made by the Runnable. This allows providing run time configuration values to the parent Runnable that are inherited by all sub-calls.
|
||||||
|
|
||||||
|
If this were not the case, it would be impossible to set and propagate [callbacks](/docs/concepts/callbacks) or other configuration values like `tags` and `metadata` which
|
||||||
|
are expected to be inherited by all sub-calls.
|
||||||
|
|
||||||
|
There are two main patterns by which new `Runnables` are created:
|
||||||
|
|
||||||
|
1. Declaratively using [LangChain Expression Language (LCEL)](/docs/concepts/lcel):
|
||||||
|
|
||||||
|
```python
|
||||||
|
chain = prompt | chat_model | output_parser
|
||||||
|
```
|
||||||
|
|
||||||
|
2. Using a [custom Runnable](#custom-runnables) (e.g., `RunnableLambda`) or using the `@tool` decorator:
|
||||||
|
|
||||||
|
```python
|
||||||
|
def foo(input):
|
||||||
|
# Note that .invoke() is used directly here
|
||||||
|
return bar_runnable.invoke(input)
|
||||||
|
foo_runnable = RunnableLambda(foo)
|
||||||
|
```
|
||||||
|
|
||||||
|
LangChain will try to propagate `RunnableConfig` automatically for both of the patterns.
|
||||||
|
|
||||||
|
For handling the second pattern, LangChain relies on Python's [contextvars](https://docs.python.org/3/library/contextvars.html).
|
||||||
|
|
||||||
|
In Python 3.11 and above, this works out of the box, and you do not need to do anything special to propagate the `RunnableConfig` to the sub-calls.
|
||||||
|
|
||||||
|
In Python 3.9 and 3.10, if you are using **async code**, you need to manually pass the `RunnableConfig` through to the `Runnable` when invoking it.
|
||||||
|
|
||||||
|
This is due to a limitation in [asyncio's tasks](https://docs.python.org/3/library/asyncio-task.html#asyncio.create_task) in Python 3.9 and 3.10, which did not accept a `context` argument.
|
||||||
|
|
||||||
|
Propagating the `RunnableConfig` manually is done like so:
|
||||||
|
|
||||||
|
```python
|
||||||
|
async def foo(input, config): # <-- Note the config argument
|
||||||
|
return await bar_runnable.ainvoke(input, config=config)
|
||||||
|
|
||||||
|
foo_runnable = RunnableLambda(foo)
|
||||||
|
```
|
||||||
|
|
||||||
|
:::caution
|
||||||
|
When using Python 3.10 or lower and writing async code, `RunnableConfig` cannot be propagated
|
||||||
|
automatically, and you will need to do it manually! This is a common pitfall when
|
||||||
|
attempting to stream data using `astream_events` and `astream_log` as these methods
|
||||||
|
rely on proper propagation of [callbacks](/docs/concepts/callbacks) defined inside of `RunnableConfig`.
|
||||||
|
:::
|
||||||
|
|
||||||
|
### Setting Custom Run Name, Tags, and Metadata
|
||||||
|
|
||||||
|
The `run_name`, `tags`, and `metadata` attributes of the `RunnableConfig` dictionary can be used to set custom values for the run name, tags, and metadata for a given Runnable.
|
||||||
|
|
||||||
|
The `run_name` is a string that can be used to set a custom name for the run. This name will be used in logs and other places to identify the run. It is not inherited by sub-calls.
|
||||||
|
|
||||||
|
The `tags` and `metadata` attributes are lists and dictionaries, respectively, that can be used to set custom tags and metadata for the run. These values are inherited by sub-calls.
|
||||||
|
|
||||||
|
Using these attributes can be useful for tracking and debugging runs, as they will be surfaced in [LangSmith](https://docs.smith.langchain.com/) as trace attributes that you can
|
||||||
|
filter and search on.
|
||||||
|
|
||||||
|
The attributes will also be propagated to [callbacks](/docs/concepts/callbacks), and will appear in streaming APIs like [astream_events](/docs/concepts/streaming) as part of each event in the stream.
|
||||||
|
|
||||||
|
:::note Related
|
||||||
|
* [How-to trace with LangChain](https://docs.smith.langchain.com/how_to_guides/tracing/trace_with_langchain)
|
||||||
|
:::
|
||||||
|
|
||||||
|
### Setting Run ID
|
||||||
|
|
||||||
|
:::note
|
||||||
|
This is an advanced feature that is unnecessary for most users.
|
||||||
|
:::
|
||||||
|
|
||||||
|
You may need to set a custom `run_id` for a given run, in case you want
|
||||||
|
to reference it later or correlate it with other systems.
|
||||||
|
|
||||||
|
The `run_id` MUST be a valid UUID string and **unique** for each run. It is used to identify the parent run; sub-calls will get their own unique run IDs automatically.
|
||||||
|
|
||||||
|
To set a custom `run_id`, you can pass it as a key-value pair in the `config` dictionary when invoking the Runnable:
|
||||||
|
|
||||||
|
```python
|
||||||
|
import uuid
|
||||||
|
|
||||||
|
run_id = uuid.uuid4()
|
||||||
|
|
||||||
|
some_runnable.invoke(
|
||||||
|
some_input,
|
||||||
|
config={
|
||||||
|
'run_id': run_id
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
# do something with the run_id
|
||||||
|
```
|
||||||
|
|
||||||
|
### Setting Recursion Limit
|
||||||
|
|
||||||
|
:::note
|
||||||
|
This is an advanced feature that is unnecessary for most users.
|
||||||
|
:::
|
||||||
|
|
||||||
|
Some Runnables may return other Runnables, which can lead to infinite recursion if not handled properly. To prevent this, you can set a `recursion_limit` in the `RunnableConfig` dictionary. This will limit the number of times a Runnable can recurse.
|
||||||
|
|
||||||
|
### Setting Max Concurrency
|
||||||
|
|
||||||
|
If using the `batch` or `batch_as_completed` methods, you can set the `max_concurrency` attribute in the `RunnableConfig` dictionary to control the maximum number of parallel calls to make. This can be useful when you want to limit the number of parallel calls to prevent overloading a server or API.
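For example (a sketch; `model` and `list_of_inputs` stand in for any Runnable and its inputs):

```python
# Process the inputs in parallel, but never more than 5 calls at a time.
model.batch(
    list_of_inputs,
    config={"max_concurrency": 5},
)
```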
|
||||||
|
|
||||||
|
|
||||||
|
:::tip
|
||||||
|
If you're trying to rate limit the number of requests made by a **Chat Model**, you can use the built-in [rate limiter](/docs/concepts/chat_models#rate-limiting) instead of setting `max_concurrency`, which will be more effective.
|
||||||
|
|
||||||
|
See the [How to handle rate limits](https://python.langchain.com/docs/how_to/chat_model_rate_limiting/) guide for more information.
|
||||||
|
:::
|
||||||
|
|
||||||
|
### Setting configurable
|
||||||
|
|
||||||
|
The `configurable` field is used to pass runtime values for configurable attributes of the Runnable.
|
||||||
|
|
||||||
|
It is used frequently in [LangGraph](/docs/concepts/langgraph) with
|
||||||
|
[LangGraph Persistence](https://langchain-ai.github.io/langgraph/concepts/persistence/)
|
||||||
|
and [memory](https://langchain-ai.github.io/langgraph/concepts/memory/).
|
||||||
|
|
||||||
|
It is used for a similar purpose in [RunnableWithMessageHistory](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.history.RunnableWithMessageHistory.html#langchain_core.runnables.history.RunnableWithMessageHistory) to specify a `session_id` / `conversation_id` used to keep track of the conversation history.
|
||||||
|
|
||||||
|
In addition, you can use it to specify any custom configuration options to pass to any [Configurable Runnable](#configurable-runnables) that they create.
|
||||||
|
|
||||||
|
### Setting Callbacks
|
||||||
|
|
||||||
|
Use this option to configure [callbacks](/docs/concepts/callbacks) for the runnable at
|
||||||
|
runtime. The callbacks will be passed to all sub-calls made by the runnable.
|
||||||
|
|
||||||
|
```python
|
||||||
|
some_runnable.invoke(
|
||||||
|
some_input,
|
||||||
|
{
|
||||||
|
"callbacks": [
|
||||||
|
SomeCallbackHandler(),
|
||||||
|
AnotherCallbackHandler(),
|
||||||
|
]
|
||||||
|
}
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
Please read the [Callbacks Conceptual Guide](/docs/concepts/callbacks) for more information on how to use callbacks in LangChain.
|
||||||
|
|
||||||
|
:::important
|
||||||
|
If you're using Python 3.9 or 3.10 in an async environment, you must propagate
|
||||||
|
the `RunnableConfig` manually to sub-calls in some cases. Please see the
|
||||||
|
[Propagating RunnableConfig](#propagation-of-runnableconfig) section for more information.
|
||||||
|
:::
|
||||||
|
|
||||||
|
## Creating a Runnable from a function
|
||||||
|
|
||||||
|
You may need to create a custom Runnable that runs arbitrary logic. This is especially
|
||||||
|
useful if using [LangChain Expression Language (LCEL)](/docs/concepts/lcel) to compose
|
||||||
|
multiple Runnables and you need to add custom processing logic in one of the steps.
|
||||||
|
|
||||||
|
There are two ways to create a custom Runnable from a function:
|
||||||
|
|
||||||
|
* `RunnableLambda`: Use this for simple transformations where streaming is not required.

* `RunnableGenerator`: Use this for more complex transformations when streaming is needed.
|
||||||
|
|
||||||
|
See the [How to run custom functions](/docs/how_to/functions) guide for more information on how to use `RunnableLambda` and `RunnableGenerator`.
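A minimal sketch of the two options:

```python
from langchain_core.runnables import RunnableGenerator, RunnableLambda

# RunnableLambda: wrap a plain function as a Runnable.
to_upper = RunnableLambda(lambda text: text.upper())
to_upper.invoke("hello")  # -> "HELLO"


# RunnableGenerator: wrap a generator that consumes an iterator of input
# chunks and yields output chunks, so the result can be streamed.
def stream_words(chunks):
    for chunk in chunks:
        for word in chunk.split():
            yield word + " "


word_streamer = RunnableGenerator(stream_words)
for piece in word_streamer.stream("streaming is nice"):
    print(piece)
```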
|
||||||
|
|
||||||
|
:::important
|
||||||
|
Users should not try to subclass Runnables to create a new custom Runnable. It is
|
||||||
|
much more complex and error-prone than simply using `RunnableLambda` or `RunnableGenerator`.
|
||||||
|
:::
|
||||||
|
|
||||||
|
## Configurable Runnables
|
||||||
|
|
||||||
|
:::note
|
||||||
|
This is an advanced feature that is unnecessary for most users.
|
||||||
|
|
||||||
|
It helps with configuration of large "chains" created using the [LangChain Expression Language (LCEL)](/docs/concepts/lcel)
|
||||||
|
and is leveraged by [LangServe](/docs/concepts/langserve) for deployed Runnables.
|
||||||
|
:::
|
||||||
|
|
||||||
|
Sometimes you may want to experiment with, or even expose to the end user, multiple different ways of doing things with your Runnable. This could involve adjusting parameters like the temperature in a chat model or even switching between different chat models.
|
||||||
|
|
||||||
|
To simplify this process, the Runnable interface provides two methods for creating configurable Runnables at runtime:
|
||||||
|
|
||||||
|
* `configurable_fields`: This method allows you to configure specific **attributes** in a Runnable. For example, the `temperature` attribute of a chat model.
|
||||||
|
* `configurable_alternatives`: This method enables you to specify **alternative** Runnables that can be run during run time. For example, you could specify a list of different chat models that can be used.
|
||||||
|
|
||||||
|
See the [How to configure runtime chain internals](/docs/how_to/configure) guide for more information on how to configure runtime chain internals.
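A brief sketch of `configurable_fields` (assuming `langchain-openai` is installed; the field id is arbitrary):

```python
from langchain_core.runnables import ConfigurableField
from langchain_openai import ChatOpenAI

model = ChatOpenAI(temperature=0).configurable_fields(
    temperature=ConfigurableField(
        id="llm_temperature",
        name="LLM Temperature",
        description="Sampling temperature for the model",
    )
)

# Uses the default temperature of 0...
model.invoke("pick a random number")

# ...or overrides it at run time via the `configurable` section of the config.
model.with_config(configurable={"llm_temperature": 0.9}).invoke("pick a random number")
```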
|
||||||
|
@ -1,20 +1,53 @@

# Streaming
<span data-heading-keywords="stream,streaming"></span>

:::info Prerequisites
* [Runnable Interface](/docs/concepts/runnables)
* [Chat Models](/docs/concepts/chat_models)
:::

:::info Related Resources
Please see the following how-to guides for specific examples of streaming in LangChain:
* [How to stream chat models](/docs/how_to/chat_streaming/)
* [How to stream tool calls](/docs/how_to/tool_streaming/)
* [How to stream Runnables](/docs/how_to/streaming/)
:::

Streaming is critical in making applications based on [LLMs](/docs/concepts/chat_models) feel responsive to end-users.

## Why Streaming?

[LLMs](/docs/concepts/chat_models) have noticeable latency on the order of seconds. This is much longer than the typical response time for most APIs, which are usually sub-second. The latency issue compounds quickly as you build more complex applications that involve multiple calls to a model.

Fortunately, LLMs generate output iteratively, which means it's possible to show sensible intermediate results before the final response is ready. Consuming output as soon as it becomes available has therefore become a vital part of the UX around building apps with LLMs to help alleviate latency issues, and LangChain aims to have first-class support for streaming.

## Streaming APIs

Every LangChain component that implements the [Runnable Interface](/docs/concepts/runnables) supports streaming.

There are three main APIs for streaming in LangChain:

1. sync [stream](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.Runnable.html#langchain_core.runnables.base.Runnable.stream) and async [astream](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.Runnable.html#langchain_core.runnables.base.Runnable.astream): yields the output of a Runnable as it is generated.
2. The async [astream_events](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.Runnable.html#langchain_core.runnables.base.Runnable.astream_events): a streaming API that allows streaming intermediate steps from a Runnable. This API returns a stream of events.
3. The **legacy** async [astream_log](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.Runnable.html#langchain_core.runnables.base.Runnable.astream_log): This is an advanced streaming API that allows streaming intermediate steps from a Runnable. Users should **not** use this API when writing new code.

### Streaming with LangGraph

[LangGraph](/docs/concepts/langgraph) compiled graphs are [Runnables](/docs/concepts/runnables) and support the same streaming APIs.

In LangGraph the `stream` and `astream` methods are phrased in terms of changes to the [graph state](https://langchain-ai.github.io/langgraph/concepts/low_level/#state), and as a result are much more helpful for getting intermediate states of the graph as they are generated.

Please review the [LangGraph streaming guide](https://langchain-ai.github.io/langgraph/concepts/streaming/) for more information on how to stream when working with LangGraph.

### `.stream()` and `.astream()`
<span data-heading-keywords="stream"></span>

:::info Related Resources
* [How to use streaming](/docs/how_to/streaming/#using-stream)
:::

The `.stream()` method returns an iterator, which you can consume with a simple `for` loop. Here's an example with a chat model:

```python
from langchain_anthropic import ChatAnthropic
```
|
||||||
@ -34,41 +67,24 @@ Because this method is part of [LangChain Expression Language](/docs/concepts/#l

you can handle formatting differences from different outputs using an [output parser](/docs/concepts/#output-parsers) to transform each yielded chunk.

You can check out [this guide](/docs/how_to/streaming/#using-stream) for more detail on how to use `.stream()`.

## Dispatching Custom Events

You can dispatch custom [callback events](/docs/concepts#callbacks) if you want to add custom data to the event stream of [astream events](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.Runnable.html#langchain_core.runnables.base.Runnable.astream_events).

You can use custom events to provide additional information about the progress of a long-running task.

For example, if you have a long-running [tool](/docs/concepts/tools) that involves multiple steps (e.g., multiple API calls), you can dispatch custom events between the steps and use these custom events to monitor progress. You could also surface these custom events to an end user of your application to show them how the current task is progressing.

:::info Related Resources
* [How to dispatch custom callback events](https://python.langchain.com/docs/how_to/callbacks_custom_events/#astream-events-api)
:::

## Chat Models

### "Auto-Streaming" Chat Models

## Using Astream Events API
<span data-heading-keywords="astream_events,stream_events,stream events"></span>

While the `.stream()` method is intuitive, it can only return the final generated value of your chain. This is fine for single LLM calls, but as you build more complex chains of several LLM calls together, you may want to use the intermediate values of the chain alongside the final output - for example, returning sources alongside the final generation when building a chat over documents app.

There are ways to do this [using callbacks](/docs/concepts/#callbacks-1), or by constructing your chain in such a way that it passes intermediate values to the end with something like chained [`.assign()`](/docs/how_to/passthrough/) calls, but LangChain also includes an `.astream_events()` method that combines the flexibility of callbacks with the ergonomics of `.stream()`. When called, it returns an iterator which yields [various types of events](/docs/how_to/streaming/#event-reference) that you can filter and process according to the needs of your project.

Here's one small example that prints just events containing streamed chat model output:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_anthropic import ChatAnthropic

model = ChatAnthropic(model="claude-3-sonnet-20240229")

prompt = ChatPromptTemplate.from_template("tell me a joke about {topic}")
parser = StrOutputParser()
chain = prompt | model | parser

async for event in chain.astream_events({"topic": "parrot"}, version="v2"):
    kind = event["event"]
    if kind == "on_chat_model_stream":
        print(event, end="|", flush=True)
```

You can roughly think of it as an iterator over callback events (though the format differs) - and you can use it on almost all LangChain components!

See [this guide](/docs/how_to/streaming/#using-stream-events) for more detailed information on how to use `.astream_events()`, including a table listing available events.

### Async throughout

Important LangChain primitives like chat models, output parsers, prompts, retrievers, and agents implement the LangChain Runnable Interface.
|
|
||||||
|
3
docs/docs/concepts/structured_outputs.mdx
Normal file
3
docs/docs/concepts/structured_outputs.mdx
Normal file
@ -0,0 +1,3 @@
|
|||||||
|
# Structured Outputs
|
||||||
|
|
||||||
|
Place holder
|
58
docs/docs/concepts/tokens.mdx
Normal file
@ -0,0 +1,58 @@
|
|||||||
|
# Tokens
|
||||||
|
|
||||||
|
Modern large language models (LLMs) are typically based on a transformer architecture that processes a sequence of units known as tokens. Tokens are the fundamental elements that models use to break down input and generate output. In this section, we'll discuss what tokens are and how they are used by language models.
|
||||||
|
|
||||||
|
## What is a token?
|
||||||
|
|
||||||
|
A **token** is the basic unit that a language model reads, processes, and generates. These units can vary based on how the model provider defines them, but in general, they could represent:
|
||||||
|
|
||||||
|
* A whole word (e.g., "apple"),
|
||||||
|
* A part of a word (e.g., "app"),
|
||||||
|
* Or other linguistic components such as punctuation or spaces.
|
||||||
|
|
||||||
|
The way the model tokenizes the input depends on its **tokenizer algorithm**, which converts the input into tokens. Similarly, the model’s output comes as a stream of tokens, which is then decoded back into human-readable text.
|
||||||
|
|
||||||
|
## How tokens work in language models
|
||||||
|
|
||||||
|
The reason language models use tokens is tied to how they understand and predict language. Rather than processing characters or entire sentences directly, language models focus on **tokens**, which represent meaningful linguistic units. Here's how the process works:
|
||||||
|
|
||||||
|
1. **Input Tokenization**: When you provide a model with a prompt (e.g., "LangChain is cool!"), the tokenizer algorithm splits the text into tokens. For example, the sentence could be tokenized into parts like `["Lang", "Chain", " is", " cool", "!"]`. Note that token boundaries don’t always align with word boundaries.
|
||||||
|

|
||||||
|
|
||||||
|
2. **Processing**: The transformer architecture behind these models processes tokens sequentially to predict the next token in a sentence. It does this by analyzing the relationships between tokens, capturing context and meaning from the input.
|
||||||
|
3. **Output Generation**: The model generates new tokens one by one. These output tokens are then decoded back into human-readable text.
|
||||||
|
|
||||||
|
Using tokens instead of raw characters allows the model to focus on linguistically meaningful units, which helps it capture grammar, structure, and context more effectively.
|
||||||
|
|
||||||
|
## Tokens don’t have to be text
|
||||||
|
|
||||||
|
Although tokens are most commonly used to represent text, they don’t have to be limited to textual data. Tokens can also serve as abstract representations of **multi-modal data**, such as:
|
||||||
|
|
||||||
|
- **Images**,
|
||||||
|
- **Audio**,
|
||||||
|
- **Video**,
|
||||||
|
- And other types of data.
|
||||||
|
|
||||||
|
At the time of writing, virtually no models support **multi-modal output**, and only a few models can handle **multi-modal inputs** (e.g., text combined with images or audio). However, as advancements in AI continue, we expect **multi-modality** to become much more common. This would allow models to process and generate a broader range of media, significantly expanding the scope of what tokens can represent and how models can interact with diverse types of data.
|
||||||
|
|
||||||
|
:::note
|
||||||
|
In principle, **anything that can be represented as a sequence of tokens** could be modeled in a similar way. For example, **DNA sequences**—which are composed of a series of nucleotides (A, T, C, G)—can be tokenized and modeled to capture patterns, make predictions, or generate sequences. This flexibility allows transformer-based models to handle diverse types of sequential data, further broadening their potential applications across various domains, including bioinformatics, signal processing, and other fields that involve structured or unstructured sequences.
|
||||||
|
:::
|
||||||
|
|
||||||
|
Please see the [multimodality](/docs/concepts/multimodality) section for more information on multi-modal inputs and outputs.
|
||||||
|
|
||||||
|
## Why not use characters?
|
||||||
|
|
||||||
|
Using tokens instead of individual characters makes models both more efficient and better at understanding context and grammar. Tokens represent meaningful units, like whole words or parts of words, allowing models to capture language structure more effectively than by processing raw characters. Token-level processing also reduces the number of units the model has to handle, leading to faster computation.
|
||||||
|
|
||||||
|
In contrast, character-level processing would require handling a much larger sequence of input, making it harder for the model to learn relationships and context. Tokens enable models to focus on linguistic meaning, making them more accurate and efficient in generating responses.
|
||||||
|
|
||||||
|
## How tokens correspond to text
|
||||||
|
|
||||||
|
Please see this post from [OpenAI](https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them) for more details on how tokens are counted and how they correspond to text.
|
||||||
|
|
||||||
|
According to the OpenAI post, the approximate token counts for English text are as follows:
|
||||||
|
|
||||||
|
* 1 token ~= 4 chars in English
|
||||||
|
* 1 token ~= ¾ words
|
||||||
|
* 100 tokens ~= 75 words
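As an illustration, you can count tokens for a piece of text with a tokenizer library such as `tiktoken` (a sketch; assumes `tiktoken` is installed and uses the `cl100k_base` encoding employed by many OpenAI models):

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

text = "LangChain is cool!"
tokens = encoding.encode(text)

print(len(tokens))              # number of tokens the model would see
print(encoding.decode(tokens))  # decodes back to the original text
```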
|
3
docs/docs/concepts/tool_calling.mdx
Normal file
@ -0,0 +1,3 @@
|
|||||||
|
# Tool Calling
|
||||||
|
|
||||||
|
Place holder
|
4
docs/docs/concepts/tools.mdx
Normal file
@ -0,0 +1,4 @@
|
|||||||
|
# Tools
|
||||||
|
|
||||||
|
PLACE HOLDER TO BE REPLACED BY ACTUAL DOCUMENTATION
|
||||||
|
USED TO MAKE SURE THAT WE DO NOT FORGET TO ADD LINKS LATER
|