docs: table legend updated (#21351)

Compacted the table column legends. Added links. Similar to #21259
Leonid Ganeline 2024-05-07 14:45:04 -07:00 committed by GitHub
parent d5bde4fa91
commit 7cbf1c31aa
GPG Key ID: B5690EEEBB952194
3 changed files with 47 additions and 54 deletions

@ -25,27 +25,28 @@ That means there are two different axes along which you can customize your text
## Types of Text Splitters ## Types of Text Splitters
LangChain offers many different types of text splitters. These all live in the `langchain-text-splitters` package. Below is a table listing all of them, along with a few characteristics: LangChain offers many different types of `text splitters`.
These all live in the `langchain-text-splitters` package.
**Name**: Name of the text splitter Table columns:
**Splits On**: How this text splitter splits text - **Name**: Name of the text splitter
- **Classes**: Classes that implement this text splitter
**Adds Metadata**: Whether or not this text splitter adds metadata about where each chunk came from. - **Splits On**: How this text splitter splits text
- **Adds Metadata**: Whether or not this text splitter adds metadata about where each chunk came from.
**Description**: Description of the splitter, including recommendation on when to use it. - **Description**: Description of the splitter, including recommendation on when to use it.
| Name | Splits On | Adds Metadata | Description | | Name | Classes | Splits On | Adds Metadata | Description |
|-----------|---------------------------------------|---------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| |----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------|---------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Recursive | A list of user defined characters | | Recursively splits text. Splitting text recursively serves the purpose of trying to keep related pieces of text next to each other. This is the recommended way to start splitting text. | | Recursive | [RecursiveCharacterTextSplitter](/docs/modules/data_connection/document_transformers/recursive_text_splitter), [RecursiveJsonSplitter](/docs/modules/data_connection/document_transformers/recursive_json_splitter) | A list of user defined characters | | Recursively splits text. This splitting is trying to keep related pieces of text next to each other. This is the `recommended way` to start splitting text. |
| HTML | HTML specific characters | ✅ | Splits text based on HTML-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the HTML) | | HTML | [HTMLHeaderTextSplitter](/docs/modules/data_connection/document_transformers/HTML_header_metadata), [HTMLSectionSplitter](/docs/modules/data_connection/document_transformers/HTML_section_aware_splitter) | HTML specific characters | ✅ | Splits text based on HTML-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the HTML) |
| Markdown | Markdown specific characters | ✅ | Splits text based on Markdown-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the Markdown) | | Markdown | [MarkdownHeaderTextSplitter](/docs/modules/data_connection/document_transformers/markdown_header_metadata) | Markdown specific characters | ✅ | Splits text based on Markdown-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the Markdown) |
| Code | Code (Python, JS) specific characters | | Splits text based on characters specific to coding languages. 15 different languages are available to choose from. | | Code | [many languages](/docs/modules/data_connection/document_transformers/code_splitter) | Code (Python, JS) specific characters | | Splits text based on characters specific to coding languages. 15 different languages are available to choose from. |
| Token | Tokens | | Splits text on tokens. There exist a few different ways to measure tokens. | | Token | [many classes](/docs/modules/data_connection/document_transformers/split_by_token) | Tokens | | Splits text on tokens. There exist a few different ways to measure tokens. |
| Character | A user defined character | | Splits text based on a user defined character. One of the simpler methods. | | Character | [CharacterTextSplitter](/docs/modules/data_connection/document_transformers/character_text_splitter) | A user defined character | | Splits text based on a user defined character. One of the simpler methods. |
| [Experimental] Semantic Chunker | Sentences | | First splits on sentences. Then combines ones next to each other if they are semantically similar enough. Taken from [Greg Kamradt](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb) | | [Experimental] Semantic Chunker | [SemanticChunker](/docs/modules/data_connection/document_transformers/semantic-chunker) | Sentences | | First splits on sentences. Then combines ones next to each other if they are semantically similar enough. Taken from [Greg Kamradt](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb) |
| [AI21 Semantic Text Splitter](/docs/integrations/document_transformers/ai21_semantic_text_splitter) | Semantics | ✅ | Identifies distinct topics that form coherent pieces of text and splits along those. | | AI21 Semantic Text Splitter | [AI21SemanticTextSplitter](/docs/integrations/document_transformers/ai21_semantic_text_splitter) | ✅ | Identifies distinct topics that form coherent pieces of text and splits along those. |
## Evaluate text splitters ## Evaluate text splitters
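
The "Recursive" row in the table above describes splitting on a list of user defined characters while trying to keep related text together. The following is a minimal standard-library sketch of that idea, not the code of `RecursiveCharacterTextSplitter`: the function name `recursive_split`, its parameters, and the greedy merge step are assumptions made for illustration.

```python
from typing import List, Sequence


def recursive_split(text: str, separators: Sequence[str], chunk_size: int) -> List[str]:
    """Split on the coarsest separator first; recurse with finer ones for oversized parts.

    Hypothetical helper for illustration. Separators must be non-empty strings.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for part in text.split(sep):
        if not part:
            continue  # skip empty parts produced by repeated separators
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            # Neighbouring parts stay together while they still fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # The part alone is too large: retry with the finer separators.
            chunks.extend(recursive_split(part, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "aaa bbb\n\nccc"
    print(recursive_split(sample, ["\n\n", " "], 8))
    print(recursive_split(sample, ["\n\n", " "], 5))
```

Ordering the separators from coarse (paragraph breaks) to fine (spaces) is what keeps related text together: a finer separator is only tried when a part cannot fit in a chunk on its own.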

@ -10,27 +10,23 @@ A retriever is an interface that returns documents given an unstructured query.
A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used
as the backbone of a retriever, but there are other types of retrievers as well. as the backbone of a retriever, but there are other types of retrievers as well.
Retrievers accept a string query as input and return a list of `Document`'s as output. Retrievers accept a string `query` as input and return a list of `Document`'s as output.
## Advanced Retrieval Types ## Advanced Retrieval Types
LangChain provides several advanced retrieval types. A full list is below, along with the following information: Table columns:
**Name**: Name of the retrieval algorithm. - **Name**: Name of the retrieval algorithm.
- **Index Type**: Which index type (if any) this relies on.
**Index Type**: Which index type (if any) this relies on. - **Uses an LLM**: Whether this retrieval method uses an LLM.
- **When to Use**: Our commentary on when you should considering using this retrieval method.
**Uses an LLM**: Whether this retrieval method uses an LLM. - **Description**: Description of what this retrieval algorithm is doing.
**When to Use**: Our commentary on when you should considering using this retrieval method.
**Description**: Description of what this retrieval algorithm is doing.
| Name | Index Type | Uses an LLM | When to Use | Description | | Name | Index Type | Uses an LLM | When to Use | Description |
|---------------------------|------------------------------|---------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| |---------------------------|------------------------------|---------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [Vectorstore](./vectorstore) | Vectorstore | No | If you are just getting started and looking for something quick and easy. | This is the simplest method and the one that is easiest to get started with. It involves creating embeddings for each piece of text. | | [Vectorstore](./vectorstore) | Vectorstore | No | If you are just getting started and looking for something quick and easy. | This is the simplest method and the one that is easiest to get started with. It creates embeddings for each piece of text. |
| [ParentDocument](./parent_document_retriever) | Vectorstore + Document Store | No | If your pages have lots of smaller pieces of distinct information that are best indexed by themselves, but best retrieved all together. | This involves indexing multiple chunks for each document. Then you find the chunks that are most similar in embedding space, but you retrieve the whole parent document and return that (rather than individual chunks). | | [ParentDocument](./parent_document_retriever) | Vectorstore + Document Store | No | If your pages have lots of smaller pieces of distinct information that are best indexed by themselves, but best retrieved all together. | This indexes multiple chunks for each document. Then you find the chunks that are most similar in embedding space, but you retrieve the whole parent document and return that (rather than individual chunks). |
| [Multi Vector](multi_vector) | Vectorstore + Document Store | Sometimes during indexing | If you are able to extract information from documents that you think is more relevant to index than the text itself. | This involves creating multiple vectors for each document. Each vector could be created in a myriad of ways - examples include summaries of the text and hypothetical questions. | | [Multi Vector](multi_vector) | Vectorstore + Document Store | Sometimes during indexing | If you are able to extract information from documents that you think is more relevant to index than the text itself. | This creates multiple vectors for each document. Each vector could be created in a myriad of ways - examples include summaries of the text and hypothetical questions. |
| [Self Query](./self_query) | Vectorstore | Yes | If users are asking questions that are better answered by fetching documents based on metadata rather than similarity with the text. | This uses an LLM to transform user input into two things: (1) a string to look up semantically, (2) a metadata filer to go along with it. This is useful because oftentimes questions are about the METADATA of documents (not the content itself). | | [Self Query](./self_query) | Vectorstore | Yes | If users are asking questions that are better answered by fetching documents based on metadata rather than similarity with the text. | This uses an LLM to transform user input into two things: (1) a string to look up semantically, (2) a metadata filer to go along with it. This is useful because oftentimes questions are about the METADATA of documents (not the content itself). |
| [Contextual Compression](./contextual_compression) | Any | Sometimes | If you are finding that your retrieved documents contain too much irrelevant information and are distracting the LLM. | This puts a post-processing step on top of another retriever and extracts only the most relevant information from retrieved documents. This can be done with embeddings or an LLM. | | [Contextual Compression](./contextual_compression) | Any | Sometimes | If you are finding that your retrieved documents contain too much irrelevant information and are distracting the LLM. | This puts a post-processing step on top of another retriever and extracts only the most relevant information from retrieved documents. This can be done with embeddings or an LLM. |
| [Time-Weighted Vectorstore](./time_weighted_vectorstore) | Vectorstore | No | If you have timestamps associated with your documents, and you want to retrieve the most recent ones | This fetches documents based on a combination of semantic similarity (as in normal vector retrieval) and recency (looking at timestamps of indexed documents) | | [Time-Weighted Vectorstore](./time_weighted_vectorstore) | Vectorstore | No | If you have timestamps associated with your documents, and you want to retrieve the most recent ones | This fetches documents based on a combination of semantic similarity (as in normal vector retrieval) and recency (looking at timestamps of indexed documents) |
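
The Time-Weighted Vectorstore row describes ranking by a combination of semantic similarity and recency. One way such a combination could look is sketched below using only the standard library. The additive formula, the `decay_rate` parameter, and the names `combined_score` and `rank` are assumptions for illustration, not a statement of how the retriever is implemented.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def combined_score(similarity: float, last_seen: datetime, now: datetime,
                   decay_rate: float = 0.01) -> float:
    """Add a recency bonus that decays exponentially with the hours elapsed."""
    hours = max((now - last_seen).total_seconds() / 3600.0, 0.0)
    return similarity + (1.0 - decay_rate) ** hours


def rank(docs: List[Tuple[str, float, datetime]], now: datetime,
         decay_rate: float = 0.01) -> List[str]:
    """Return document texts ordered from highest to lowest combined score.

    Each entry in docs is (text, similarity, timestamp).
    """
    scored = sorted(
        docs,
        key=lambda d: combined_score(d[1], d[2], now, decay_rate),
        reverse=True,
    )
    return [text for text, _, _ in scored]


if __name__ == "__main__":
    now = datetime(2024, 5, 7, 12, 0, 0)
    docs = [
        ("old but close", 0.9, now - timedelta(hours=100)),
        ("fresh", 0.7, now),
    ]
    print(rank(docs, now))                  # recency bonus applies
    print(rank(docs, now, decay_rate=0.0))  # no decay: similarity decides
```

With a decay rate of zero every document receives the same bonus, so the ordering reduces to plain similarity ranking.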

@ -15,34 +15,30 @@ See [this quick-start guide](./quick_start) for an introduction to output parser
## Output Parser Types ## Output Parser Types
LangChain has lots of different types of output parsers. This is a list of output parsers LangChain supports. The table below has various pieces of information: LangChain has lots of different types of `output parsers`.
**Name**: The name of the output parser Table columns:
**Supports Streaming**: Whether the output parser supports streaming. - **Name**: The name of the output parser
- **Supports Streaming**: Whether the output parser supports streaming.
**Has Format Instructions**: Whether the output parser has format instructions. This is generally available except when (a) the desired schema is not specified in the prompt but rather in other parameters (like OpenAI function calling), or (b) when the OutputParser wraps another OutputParser. - **Has Format Instructions**: Whether the output parser has format instructions. This is generally available except when (a) the desired schema is not specified in the prompt but rather in other parameters (like OpenAI function calling), or (b) when the OutputParser wraps another OutputParser.
- **Calls LLM**: Whether this output parser itself calls an LLM. This is usually only done by output parsers that attempt to correct misformatted output.
**Calls LLM**: Whether this output parser itself calls an LLM. This is usually only done by output parsers that attempt to correct misformatted output. - **Input Type**: Expected input type. Most output parsers work on both strings and messages, but some (like OpenAI Functions) need a message with specific kwargs.
- **Output Type**: The output type of the object returned by the parser.
**Input Type**: Expected input type. Most output parsers work on both strings and messages, but some (like OpenAI Functions) need a message with specific kwargs. - **Description**: Our commentary on this output parser and when to use it.
**Output Type**: The output type of the object returned by the parser.
**Description**: Our commentary on this output parser and when to use it.
| Name | Supports Streaming | Has Format Instructions | Calls LLM | Input Type | Output Type | Description | | Name | Supports Streaming | Has Format Instructions | Calls LLM | Input Type | Output Type | Description |
|-----------------|--------------------|-------------------------------|-----------|----------------------------------|----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| |-----------------|--------------------|-------------------------------|-----------|----------------------------------|----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [OpenAITools](./types/openai_tools) | | (Passes `tools` to model) | | `Message` (with `tool_choice`) | JSON object | Uses latest OpenAI function calling args `tools` and `tool_choice` to structure the return output. If you are using a model that supports function calling, this is generally the most reliable method. | | [OpenAITools](./types/openai_tools) | | (Passes `tools` to model) | | `Message` (with `tool_choice`) | JSON object | Uses latest OpenAI function calling args `tools` and `tool_choice` to structure the return output. If you are using a model that supports function calling, this is generally the most reliable method. |
| [OpenAIFunctions](./types/openai_functions) | ✅ | (Passes `functions` to model) | | `Message` (with `function_call`) | JSON object | Uses legacy OpenAI function calling args `functions` and `function_call` to structure the return output. | | [OpenAIFunctions](./types/openai_functions) | ✅ | (Passes `functions` to model) | | `Message` (with `function_call`) | JSON object | Uses legacy OpenAI function calling args `functions` and `function_call` to structure the return output. |
| [JSON](./types/json) | ✅ | ✅ | | `str` \| `Message` | JSON object | Returns a JSON object as specified. You can specify a Pydantic model and it will return JSON for that model. Probably the most reliable output parser for getting structured data that does NOT use function calling. | | [JSON](./types/json) | ✅ | ✅ | | `str` \| `Message` | JSON object | Returns a JSON object as specified. You specify a Pydantic model and it will return JSON for that model. Probably the most reliable output parser for getting structured data that does NOT use function calling. |
| [XML](./types/xml) | ✅ | ✅ | | `str` \| `Message` | `dict` | Returns a dictionary of tags. Use when XML output is needed. Use with models that are good at writing XML (like Anthropic's). | | [XML](./types/xml) | ✅ | ✅ | | `str` \| `Message` | `dict` | Returns a dictionary of tags. Use when XML output is needed. Use with models that are good at writing XML (like Anthropic's). |
| [CSV](./types/csv) | ✅ | ✅ | | `str` \| `Message` | `List[str]` | Returns a list of comma separated values. | | [CSV](./types/csv) | ✅ | ✅ | | `str` \| `Message` | `List[str]` | Returns a list of comma separated values. |
| [OutputFixing](./types/output_fixing) | | | ✅ | `str` \| `Message` | | Wraps another output parser. If that output parser errors, then this will pass the error message and the bad output to an LLM and ask it to fix the output. | | [OutputFixing](./types/output_fixing) | | | ✅ | `str` \| `Message` | | Wraps another output parser. If that output parser errors, then this will pass the error message and the bad output to an LLM and ask it to fix the output. |
| [RetryWithError](./types/retry) | | | ✅ | `str` \| `Message` | | Wraps another output parser. If that output parser errors, then this will pass the original inputs, the bad output, and the error message to an LLM and ask it to fix it. Compared to OutputFixingParser, this one also sends the original instructions. | | [RetryWithError](./types/retry) | | | ✅ | `str` \| `Message` | | Wraps another output parser. If that output parser errors, then this will pass the original inputs, the bad output, and the error message to an LLM and ask it to fix it. Compared to `OutputFixingParser`, this one also sends the original instructions. |
| [Pydantic](./types/pydantic) | | ✅ | | `str` \| `Message` | `pydantic.BaseModel` | Takes a user defined Pydantic model and returns data in that format. | | [Pydantic](./types/pydantic) | | ✅ | | `str` \| `Message` | `pydantic.BaseModel` | Takes a user defined Pydantic model and returns data in that format. |
| [YAML](./types/yaml) | | ✅ | | `str` \| `Message` | `pydantic.BaseModel` | Takes a user defined Pydantic model and returns data in that format. Uses YAML to encode it. | | [YAML](./types/yaml) | | ✅ | | `str` \| `Message` | `pydantic.BaseModel` | Takes a user defined Pydantic model and returns data in that format. Uses YAML to encode it. |
| [PandasDataFrame](./types/pandas_dataframe) | | ✅ | | `str` \| `Message` | `dict` | Useful for doing operations with pandas DataFrames. | | [PandasDataFrame](./types/pandas_dataframe) | | ✅ | | `str` \| `Message` | `dict` | Useful for doing operations with pandas DataFrames. |
| [Enum](./types/enum) | | ✅ | | `str` \| `Message` | `Enum` | Parses response into one of the provided enum values. | | [Enum](./types/enum) | | ✅ | | `str` \| `Message` | `Enum` | Parses response into one of the provided enum values. |
| [Datetime](./types/datetime) | | ✅ | | `str` \| `Message` | `datetime.datetime` | Parses response into a datetime string. | | [Datetime](./types/datetime) | | ✅ | | `str` \| `Message` | `datetime.datetime` | Parses response into a datetime string. |
| [Structured](./types/structured) | | ✅ | | `str` \| `Message` | `Dict[str, str]` | An output parser that returns structured information. It is less powerful than other output parsers since it only allows for fields to be strings. This can be useful when you are working with smaller LLMs. | | [Structured](./types/structured) | | ✅ | | `str` \| `Message` | `Dict[str, str]` | An output parser that returns structured information. It is less powerful than other output parsers since it only allows for fields to be strings. This useful when you are working with smaller LLMs. |
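
The OutputFixing row describes a parser that wraps another parser and, when that parser errors, passes the error message and the bad output to an LLM to be fixed. The sketch below shows that wrapping pattern with the standard library only. `FixingParser` is a hypothetical class, and `fake_fix` stands in for the LLM call, so none of this reflects the actual `OutputFixingParser` API.

```python
import json
from typing import Any, Callable


class FixingParser:
    """Wrap a parse function; on failure ask a fixer for corrected text and retry."""

    def __init__(self, parse: Callable[[str], Any],
                 fix: Callable[[str, str], str], max_retries: int = 1) -> None:
        self.parse = parse
        self.fix = fix  # receives (bad_output, error_message), returns new text
        self.max_retries = max_retries

    def __call__(self, text: str) -> Any:
        for attempt in range(self.max_retries + 1):
            try:
                return self.parse(text)
            except ValueError as err:
                if attempt == self.max_retries:
                    raise  # out of retries: surface the original error
                text = self.fix(text, str(err))


def fake_fix(bad_output: str, error_message: str) -> str:
    """Stand-in for an LLM call: swaps single quotes for double quotes."""
    return bad_output.replace("'", '"')


if __name__ == "__main__":
    parser = FixingParser(json.loads, fake_fix)
    print(parser("{'a': 1}"))
```

`json.loads` raises `json.JSONDecodeError`, a subclass of `ValueError`, so the wrapper catches it. A real fixer would build a prompt from the bad output and the error message instead of editing the string directly.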