mirror of
https://github.com/hwchase17/langchain.git
synced 2025-09-24 03:52:10 +00:00
community[minor]: use jq schema for content_key in json_loader (#18003)
### Description
Changed the value specified for `content_key` in JSONLoader from a single key to a value based on a jq schema.
I created a [similar PR](https://github.com/langchain-ai/langchain/pull/11255) before, but it has several conflicts because of the architectural change associated with the stable version release, so I re-created this PR to fit the new architecture.

### Why
For JSON data like the following, where `.data[].attributes.message` is used for page_content and `.data[].id`, `.data[].attributes.tags`, etc. are used for metadata, the `content_key` must also be able to parse the JSON structure.

<details>
<summary>sample json data</summary>

```json
{
  "data": [
    {
      "attributes": {
        "message": "message1",
        "tags": [
          "tag1"
        ]
      },
      "id": "1"
    },
    {
      "attributes": {
        "message": "message2",
        "tags": [
          "tag2"
        ]
      },
      "id": "2"
    }
  ]
}
```

</details>

<details>
<summary>sample code</summary>

```python
from langchain_community.document_loaders import JSONLoader


def metadata_func(record: dict, metadata: dict) -> dict:
    # record is one element selected by jq_schema (here, one item of .data[])
    metadata["source"] = None
    metadata["id"] = record.get("id")
    metadata["tags"] = record["attributes"].get("tags")
    return metadata


sample_file = "sample1.json"
loader = JSONLoader(
    file_path=sample_file,
    jq_schema=".data[]",
    content_key=".attributes.message",  # content_key is parsed as a jq schema
    is_content_key_jq_parsable=True,  # parameter added in this PR
    metadata_func=metadata_func,
)

data = loader.load()
data
```

</details>

### Dependencies
none

### Twitter handle
[kzk_maeda](https://twitter.com/kzk_maeda)
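
The effect of `metadata_func` from the sample code can be checked without the loader. The following is a minimal sketch that calls it directly on the first record of the sample data. The initial metadata dict (`source` and `seq_num`) is an assumption about what the loader passes in, modeled on the metadata shown in the documentation output, and the path is a placeholder.

```python
def metadata_func(record: dict, metadata: dict) -> dict:
    # Same function as in the sample code above.
    metadata["source"] = None
    metadata["id"] = record.get("id")
    metadata["tags"] = record["attributes"].get("tags")
    return metadata


# First element of .data[] from the sample JSON.
record = {"attributes": {"message": "message1", "tags": ["tag1"]}, "id": "1"}

# Assumed starting metadata; the path is a placeholder, not a real file.
initial_metadata = {"source": "/path/to/sample1.json", "seq_num": 1}

result = metadata_func(record, dict(initial_metadata))
print(result)
# {'source': None, 'seq_num': 1, 'id': '1', 'tags': ['tag1']}
```

Note that `id` sits next to `attributes` in each record, while `tags` sits inside it, which is why the function reads them from different levels.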
@@ -199,6 +199,58 @@ pprint(data)

</CodeOutputBlock>

### JSON file with jq schema `content_key`

To load documents from a JSON file when `content_key` is itself a jq expression, set `is_content_key_jq_parsable=True`.
Ensure that `content_key` is a valid jq expression that can be applied to each record selected by `jq_schema`.

```python
file_path = './sample.json'
pprint(Path(file_path).read_text())
```

<CodeOutputBlock lang="python">

```json
{"data": [
  {"attributes": {
     "message": "message1",
     "tags": [
       "tag1"]},
   "id": "1"},
  {"attributes": {
     "message": "message2",
     "tags": [
       "tag2"]},
   "id": "2"}]}
```

</CodeOutputBlock>

```python
loader = JSONLoader(
    file_path=file_path,
    jq_schema=".data[]",
    content_key=".attributes.message",
    is_content_key_jq_parsable=True,
)

data = loader.load()
```

```python
pprint(data)
```

<CodeOutputBlock lang="python">

```
[Document(page_content='message1', metadata={'source': '/path/to/sample.json', 'seq_num': 1}),
 Document(page_content='message2', metadata={'source': '/path/to/sample.json', 'seq_num': 2})]
```

</CodeOutputBlock>

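To make the two-step selection concrete, here is a rough pure-Python sketch of what the example does conceptually: `jq_schema` picks the records, then `content_key` is resolved inside each record. It does not use jq or `JSONLoader`. The helper names `resolve_dotted_path` and `sketch_load` are hypothetical, and the path handling covers only plain `.key.key` expressions, which is far less than real jq supports.

```python
import json

SAMPLE = """
{"data": [
  {"attributes": {"message": "message1", "tags": ["tag1"]}, "id": "1"},
  {"attributes": {"message": "message2", "tags": ["tag2"]}, "id": "2"}
]}
"""


def resolve_dotted_path(record, path):
    """Hypothetical helper: follow a simple path such as '.attributes.message'.

    Only plain '.key' steps are handled; this is a stand-in for jq, not jq itself.
    """
    value = record
    for key in (part for part in path.split(".") if part):
        value = value[key]
    return value


def sketch_load(raw_json, content_key):
    """Emulate jq_schema='.data[]' followed by a jq-parsable content_key."""
    records = json.loads(raw_json)["data"]  # stands in for jq_schema=".data[]"
    return [
        {"page_content": resolve_dotted_path(record, content_key), "seq_num": i}
        for i, record in enumerate(records, start=1)
    ]


docs = sketch_load(SAMPLE, ".attributes.message")
print(docs)
```

With a plain single-key lookup, `content_key="message"` would not find anything here, because `message` is nested under `attributes` in each record. That is the case the jq-parsable `content_key` is meant to cover.
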
## Extracting metadata