### Description
Changed the value accepted for `content_key` in JSONLoader from a
single key to an expression based on a jq schema.
I created a [similar
PR](https://github.com/langchain-ai/langchain/pull/11255) before, but it
had several conflicts because of the architectural changes associated
with the stable version release, so I re-created this PR to fit the new
architecture.
### Why
For json data like the following, to specify `.data[].attributes.message`
for page_content and values such as `.data[].attributes.id` or
`.data[].attributes.tags` for metadata, the `content_key` itself must
also be parsed as part of the jq schema.
<details>
<summary>sample json data</summary>
```json
{
  "data": [
    {
      "attributes": {
        "message": "message1",
        "tags": [
          "tag1"
        ]
      },
      "id": "1"
    },
    {
      "attributes": {
        "message": "message2",
        "tags": [
          "tag2"
        ]
      },
      "id": "2"
    }
  ]
}
```
</details>
<details>
<summary>sample code</summary>
```python
from langchain_community.document_loaders import JSONLoader

def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["source"] = None
    metadata["id"] = record.get("id")
    metadata["tags"] = record["attributes"].get("tags")
    return metadata

sample_file = "sample1.json"
loader = JSONLoader(
    file_path=sample_file,
    jq_schema=".data[]",
    content_key=".attributes.message",  # content_key is parsable as a jq schema
    is_content_key_jq_parsable=True,  # parameter added in this PR
    metadata_func=metadata_func,
)
data = loader.load()
data
```
</details>
### Dependencies
none
### Twitter handle
[kzk_maeda](https://twitter.com/kzk_maeda)
## PR title
Docs: Updated callbacks/index.mdx adding example on runnable methods
## PR message
- **Description:** Updated callbacks/index.mdx, adding an example of how
to pass callbacks to the runnable methods (invoke, batch, ...); see the
sketch below
- **Issue:** #16379
- **Dependencies:** None
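For reference, a minimal sketch of that pattern (using a trivial
`RunnableLambda` stand-in for a real chain):
```python
from langchain_core.callbacks import StdOutCallbackHandler
from langchain_core.runnables import RunnableLambda

chain = RunnableLambda(lambda x: x + 1)
# Callbacks are passed per-invocation through the `config` dict and
# work the same way for invoke, batch, stream, and the async variants.
chain.invoke(2, config={"callbacks": [StdOutCallbackHandler()]})
```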
### Description
This PR moves the Elasticsearch classes to a partners package.
Note that we will not move (and will later remove) `ElasticKnnSearch`,
as it was previously deprecated.
`ElasticVectorSearch` is going to stay in the community package since it
is still used quite a lot.
Also note that I left the `ElasticsearchTranslator` for self query
untouched because it resides in the main `langchain` package.
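For users, the migration would look roughly like this (a sketch,
assuming the partner package follows the usual naming convention):
```python
# Before: community package
from langchain_community.vectorstores import ElasticsearchStore

# After: dedicated partner package
from langchain_elasticsearch import ElasticsearchStore
```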
### Dependencies
There will be another PR that updates the notebooks (potentially pulling
them into the partners package) and templates and removes the classes
from the community package, see
https://github.com/langchain-ai/langchain/pull/17468
#### Open question
How do we make the transition smooth for users? Do we move the import
aliases and require people to install `langchain-elasticsearch`? Or do
we remove the import aliases from the `langchain` package altogether?
What has worked well for other partner packages?
---------
Co-authored-by: Erick Friis <erick@langchain.dev>
**Description**
Adding different threshold types to the semantic chunker. I've had much
better and more predictable performance when using standard deviations
instead of percentiles.

For all the documents I've tried, the distribution of distances looks
like a positively skewed normal distribution. All skews I've seen are
less than 1, which explains why standard deviations perform well, but
I've included IQR in case anyone wants something more robust.
Also, using the percentile method backwards, you can declare the number
of clusters and use semantic chunking to get an ‘optimal’ splitting.
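As a sketch of the resulting API (assuming the
`breakpoint_threshold_type` parameter introduced here, with
`some_long_text` as a placeholder):
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# "standard_deviation" and "interquartile" are the newly added
# threshold types, alongside the existing "percentile" behavior.
splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="standard_deviation"
)
docs = splitter.create_documents([some_long_text])
```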
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Hi, I'm from the LanceDB team.
Improves the LanceDB integration by making it easier to use: you are no
longer required to create tables manually and pass them to the
constructor, although that remains backward compatible.
Bug fix: pandas was being used even though it is not a dependency of
LanceDB or langchain.
PS: this issue was raised a few months ago but lost traction. It is a
feature improvement for our users, so kindly review it. Thanks!
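As a sketch of the simplified flow (via the standard `from_documents`
constructor; `docs` is a placeholder list of documents):
```python
from langchain_community.vectorstores import LanceDB
from langchain_openai import OpenAIEmbeddings

# No manually created table is required any more; the store can
# set one up itself from the documents and embeddings.
vectorstore = LanceDB.from_documents(docs, OpenAIEmbeddings())
results = vectorstore.similarity_search("query", k=4)
```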
- **Description:** This adds a delete method so that rocksetdb can be
used with `RecordManager` (see the sketch below).
- **Issue:** N/A
- **Dependencies:** N/A
- **Twitter handle:** `@_morgan_adams_`
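For context, deletion support is what the indexing API's cleanup modes
rely on; a rough sketch of the intended use, with `vectorstore` standing
in for a configured Rockset vector store and `docs` for the documents to
index:
```python
from langchain.indexes import SQLRecordManager, index

record_manager = SQLRecordManager(
    "rockset/my_docs", db_url="sqlite:///record_manager_cache.sql"
)
record_manager.create_schema()

# Cleanup modes such as "incremental" delete stale vectors, which
# requires the vector store to support deletion.
index(docs, record_manager, vectorstore, cleanup="incremental", source_id_key="source")
```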
---------
Co-authored-by: Rockset API Bot <admin@rockset.io>
- **Description:** This adds a recursive json splitter class to the
existing text_splitters as well as unit tests
- **Issue:** splitting text from structured data can cause issues: if
you have a large nested json object and split it as regular text, you
may end up losing the structure of the json. To mitigate this you can
split the nested json into large chunks and overlap them, but this
causes unnecessary text processing and there will still be times when
the nested json is so big that the chunks get separated from their
parent keys.
As an example you wouldn't want the following to be split in half:
```shell
{'val0': 'DFWeNdWhapbR',
'val1': {'val10': 'QdJo',
'val11': 'FWSDVFHClW',
'val12': 'bkVnXMMlTiQh',
'val13': 'tdDMKRrOY',
'val14': 'zybPALvL',
'val15': 'JMzGMNH',
'val16': {'val160': 'qLuLKusFw',
'val161': 'DGuotLh',
'val162': 'KztlcSBropT',
-----------------------------------------------------------------------split-----
'val163': 'YlHHDrN',
'val164': 'CtzsxlGBZKf',
'val165': 'bXzhcrWLmBFp',
'val166': 'zZAqC',
'val167': 'ZtyWno',
'val168': 'nQQZRsLnaBhb',
'val169': 'gSpMbJwA'},
'val17': 'JhgiyF',
'val18': 'aJaqjUSFFrI',
'val19': 'glqNSvoyxdg'}}
```
Any llm processing the second chunk of text would lack the context of
val1 and val16, reducing accuracy. Embeddings will also lack this
context, which makes retrieval less accurate.
Instead you want it to be split into chunks that retain the json
structure.
```shell
{'val0': 'DFWeNdWhapbR',
'val1': {'val10': 'QdJo',
'val11': 'FWSDVFHClW',
'val12': 'bkVnXMMlTiQh',
'val13': 'tdDMKRrOY',
'val14': 'zybPALvL',
'val15': 'JMzGMNH',
'val16': {'val160': 'qLuLKusFw',
'val161': 'DGuotLh',
'val162': 'KztlcSBropT',
'val163': 'YlHHDrN',
'val164': 'CtzsxlGBZKf'}}}
```
and
```shell
{'val1':{'val16':{
'val165': 'bXzhcrWLmBFp',
'val166': 'zZAqC',
'val167': 'ZtyWno',
'val168': 'nQQZRsLnaBhb',
'val169': 'gSpMbJwA'},
'val17': 'JhgiyF',
'val18': 'aJaqjUSFFrI',
'val19': 'glqNSvoyxdg'}}
```
This recursive json text splitter does exactly that (see the sketch
below). Values that contain a list can be converted to dicts first by
passing `convert_lists=True` to the split call; otherwise long lists
will not be split and you may end up with chunks larger than the max
chunk size.
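A minimal sketch of the splitter in use (`json_data` is a placeholder
for a large nested object like the one above):
```python
from langchain.text_splitter import RecursiveJsonSplitter

splitter = RecursiveJsonSplitter(max_chunk_size=300)

# Returns a list of smaller dicts, each re-rooted so nested values
# keep their parent keys, as in the chunks shown above.
json_chunks = splitter.split_json(json_data=json_data, convert_lists=True)

# Or produce Documents directly for downstream indexing.
docs = splitter.create_documents(texts=[json_data])
```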
In my testing, large json objects could be split into small chunks with:
✅ Increased question answering accuracy
✅ The ability to split into smaller chunks, meaning retrieval queries
can use fewer tokens
- **Dependencies:** json import added to text_splitter.py, and random
added to the unit test
- **Twitter handle:** @joelsprunger
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Ran
```python
import glob
import re

def update_prompt(x):
    # Rewrite PromptTemplate(template=..., input_variables=...) calls
    # into the equivalent PromptTemplate.from_template(...) form.
    return re.sub(
        r"(?P<start>\b)PromptTemplate\(template=(?P<template>.*), input_variables=(?:.*)\)",
        r"\g<start>PromptTemplate.from_template(\g<template>)",
        x,
    )

for fn in glob.glob("docs/**/*", recursive=True):
    try:
        content = open(fn).readlines()
    except Exception:  # skip directories and non-text files
        continue
    content = [update_prompt(l) for l in content]
    with open(fn, "w") as f:
        f.write("".join(content))
```
- **Description:** Added missing link for Quickstart in Model IO
documentation
- **Issue:** N/A
- **Dependencies:** N/A
- **Twitter handle:** N/A
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
**Description:**
Updated the retry.ipynb notebook, which illustrates RetryOutputParser in
LangChain. The notebook lacked an explanation of how RetryOutputParser
works with existing chains, so this change adds some code to illustrate
the workflow of using RetryOutputParser with a user-defined chain.
Changes:
1. Replaced RetryWithErrorOutputParser with RetryOutputParser, to match
the markdown text.
2. Added code at the end of the notebook to define a chain which passes
the LLM completions to the retry parser, and which can be customised for
user needs (see the sketch after this list).
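A sketch of that chain, following the pattern in the linked issue (here
`prompt`, `llm`, and `parser` are whatever the user already has; names
are illustrative):
```python
from langchain.output_parsers import RetryOutputParser
from langchain_core.runnables import RunnableLambda, RunnableParallel

retry_parser = RetryOutputParser.from_llm(parser=parser, llm=llm)

# Keep the prompt value alongside the completion so the retry parser
# can re-prompt the LLM whenever parsing the completion fails.
main_chain = RunnableParallel(
    completion=prompt | llm, prompt_value=prompt
) | RunnableLambda(lambda x: retry_parser.parse_with_prompt(x["completion"], x["prompt_value"]))

main_chain.invoke({"query": "who is the leader of the united states?"})
```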
**Issue:**
Since RetryOutputParser/RetryWithErrorOutputParser does not implement
the parse function, it cannot be used with LLMChain directly like
[this](https://python.langchain.com/docs/expression_language/cookbook/prompt_llm_parser#prompttemplate-llm-outputparser).
This has also been raised in several still-open issues (#15133, #12175,
#11719); instead of adding new features/code changes, it is best to
explain clearly how to integrate LLMChain with retry parsers with an
example in the corresponding notebook.
Inspired from:
https://github.com/langchain-ai/langchain/issues/15133#issuecomment-1868972580
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
- **Description:** Syntax corrections reflecting the langchain version
update in the 'Retry Parser' tutorial example
- **Issue:** #16698
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
… converters
One way to convert anything to an OAI function:
`convert_to_openai_function`
One way to convert anything to an OAI tool: `convert_to_openai_tool`
Corresponding bind functions on OAI models: `bind_functions`,
`bind_tools`
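A brief sketch of the converter in action (using a Pydantic model, one
of the supported inputs):
```python
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_core.utils.function_calling import convert_to_openai_tool

class Multiply(BaseModel):
    """Multiply two integers."""

    a: int = Field(description="first factor")
    b: int = Field(description="second factor")

# Produces {"type": "function", "function": {"name": "Multiply", ...}},
# ready to pass to an OpenAI tools-enabled chat model.
oai_tool = convert_to_openai_tool(Multiply)
```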
- **Description:**
This PR adds a VectorStore integration for SAP HANA Cloud Vector Engine,
which is an upcoming feature in the SAP HANA Cloud database
(https://blogs.sap.com/2023/11/02/sap-hana-clouds-vector-engine-announcement/).
- **Issue:** N/A
- **Dependencies:** [SAP HANA Python
Client](https://pypi.org/project/hdbcli/)
- **Twitter handle:** @sapopensource
Implementation of the integration:
`libs/community/langchain_community/vectorstores/hanavector.py`
Unit tests:
`libs/community/tests/unit_tests/vectorstores/test_hanavector.py`
Integration tests:
`libs/community/tests/integration_tests/vectorstores/test_hanavector.py`
Example notebook:
`docs/docs/integrations/vectorstores/hanavector.ipynb`
Access credentials for execution of the integration tests can be
provided to the maintainers.
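A rough usage sketch based on the example notebook (treat class and
parameter names as indicative and defer to the notebook; `docs` is a
placeholder list of documents):
```python
from hdbcli import dbapi
from langchain_community.vectorstores.hanavector import HanaDB
from langchain_openai import OpenAIEmbeddings

# Connect to a SAP HANA Cloud instance with the hdbcli client.
connection = dbapi.connect(
    address="<hostname>", port=443, user="<user>", password="<password>"
)
db = HanaDB(
    embedding=OpenAIEmbeddings(), connection=connection, table_name="LANGCHAIN_DEMO"
)
db.add_documents(docs)
results = db.similarity_search("What did the president say?", k=2)
```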
---------
Co-authored-by: sascha <sascha.stoll@sap.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
The callbacks getting-started demo code was updated, replacing the
chain.run() call (which is now deprecated) with the updated
chain.invoke() call, as sketched below.
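The shape of the change (inputs are illustrative):
```python
# Before (deprecated):
chain.run(number=2)

# After:
chain.invoke({"number": 2})
```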
Solving the following issue: #16379
Twitter/X: @Hazxhx
- **Description:** Adds a text splitter based on
[Konlpy](https://konlpy.org/en/latest/#start), a Python package for
natural language processing (NLP) of the Korean language (like Spacy or
NLTK, but for Korean)
- **Dependencies:** Konlpy has to be installed before this splitter is
used (see the sketch below)
- **Twitter handle:** @untilhamza
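A minimal sketch of the splitter (with `korean_document` as a
placeholder; `konlpy` must be installed first):
```python
# pip install konlpy
from langchain.text_splitter import KonlpyTextSplitter

text_splitter = KonlpyTextSplitter()
# Sentences are detected with Konlpy's Korean analyzer, then grouped
# into chunks up to the configured chunk size.
texts = text_splitter.split_text(korean_document)
```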
This PR adds an `astream_events` method to Runnables to make it easier to
stream data from arbitrary chains.
* Streaming only works properly in async right now
* One should use `astream()` if mixing in imperative code, as might
be done with tool implementations
* `astream_log` has been modified with minimal additive changes, so no
breaking changes are expected
* Underlying callback code / tracing code should be refactored at some
point to handle things more consistently (OK for now)
- ~~[ ] verify event for on_retry~~ does not work until we implement
streaming for retry
- ~~[ ] Any renaming? Should we rename "event" to "hook"?~~
- [ ] Any other feedback from the community?
- [x] throw NotImplementedError for `RunnableEach` for now
## Example
See this [Example
Notebook](dbbc7fa0d6/docs/docs/modules/agents/how_to/streaming_events.ipynb)
for an example with streaming in the context of an Agent
## Event Hooks Reference
Here is a reference table that shows some events that might be emitted
by the various Runnable objects.
Definitions for some of the Runnables are included after the table.
| event | name | chunk | input | output |
|----------------------|------------------|----------------------------------|------------------------------------------------|--------------------------------------------------|
| on_chat_model_start | [model name] | | {"messages": [[SystemMessage, HumanMessage]]} | |
| on_chat_model_stream | [model name] | AIMessageChunk(content="hello") | | |
| on_chat_model_end | [model name] | | {"messages": [[SystemMessage, HumanMessage]]} | {"generations": [...], "llm_output": None, ...} |
| on_llm_start | [model name] | | {'input': 'hello'} | |
| on_llm_stream | [model name] | 'Hello' | | |
| on_llm_end | [model name] | | | 'Hello human!' |
| on_chain_start | format_docs | | | |
| on_chain_stream | format_docs | "hello world!, goodbye world!" | | |
| on_chain_end | format_docs | | [Document(...)] | "hello world!, goodbye world!" |
| on_tool_start | some_tool | | {"x": 1, "y": "2"} | |
| on_tool_stream | some_tool | {"x": 1, "y": "2"} | | |
| on_tool_end | some_tool | | | {"x": 1, "y": "2"} |
| on_retriever_start | [retriever name] | | {"query": "hello"} | |
| on_retriever_chunk | [retriever name] | {documents: [...]} | | |
| on_retriever_end | [retriever name] | | {"query": "hello"} | {documents: [...]} |
| on_prompt_start | [template_name] | | {"question": "hello"} | |
| on_prompt_end | [template_name] | | {"question": "hello"} | ChatPromptValue(messages: [SystemMessage, ...]) |
Here are declarations associated with the events shown above:
`format_docs`:
```python
from typing import List
from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda

def format_docs(docs: List[Document]) -> str:
    '''Format the docs.'''
    return ", ".join([doc.page_content for doc in docs])

format_docs = RunnableLambda(format_docs)
```
`some_tool`:
```python
from langchain_core.tools import tool

@tool
def some_tool(x: int, y: str) -> dict:
    '''Some_tool.'''
    return {"x": x, "y": y}
```
`prompt`:
```python
from langchain_core.prompts import ChatPromptTemplate

template = ChatPromptTemplate.from_messages(
    [("system", "You are Cat Agent 007"), ("human", "{question}")]
).with_config({"run_name": "my_template", "tags": ["my_template"]})
```
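Putting it together, a sketch of consuming the event stream from a chain
built on the declarations above (`model` being any chat model):
```python
chain = template | model

async for event in chain.astream_events({"question": "hello"}, version="v1"):
    if event["event"] == "on_chat_model_stream":
        # Each chunk is an AIMessageChunk, as in the table above.
        print(event["data"]["chunk"].content, end="")
```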
- vertex chat
- google
- some pip openai
- percent and openai
- all percent
- more
- pip
- fmt
- docs: google vertex partner docs
- fmt
- docs: more pip installs
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>