mirror of
https://github.com/hwchase17/langchain.git
synced 2025-07-01 19:03:25 +00:00
docs: Add Dell PowerScale Document Loader (#30209)
# Description Adds documentation on LangChain website for a Dell specific document loader for on-prem storage devices. Additional details on what the document loader is described in the PR as well as on our github repo: [https://github.com/dell/powerscale-rag-connector](https://github.com/dell/powerscale-rag-connector) This PR also creates a category on the document loader webpage as no existing category exists for on-prem. This follows the existing pattern already established as the website has a category for cloud providers. # Issue: New release, no issue. # Dependencies: None # Twitter handle: DellTech --------- Signed-off-by: Adam Brenner <adam@aeb.io> Co-authored-by: Chester Curme <chester.curme@gmail.com>
This commit is contained in:
parent
9fb0db6937
commit
f949d9a3d3
244
docs/docs/integrations/document_loaders/powerscale.ipynb
Normal file
244
docs/docs/integrations/document_loaders/powerscale.ipynb
Normal file
@ -0,0 +1,244 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Dell PowerScale Document Loader\n",
|
||||
"\n",
|
||||
"[Dell PowerScale](https://www.dell.com/en-us/shop/powerscale-family/sf/powerscale) is an enterprise scale out storage system that hosts industry leading OneFS filesystem that can be hosted on-prem or deployed in the cloud.\n",
|
||||
"\n",
|
||||
"This document loader utilizes unique capabilities from PowerScale that can determine what files that have been modified since an application's last run and only returns modified files for processing. This will eliminate the need to re-process (chunk and embed) files that have not been changed, improving the overall data ingestion workflow.\n",
|
||||
"\n",
|
||||
"This loader requires PowerScale's MetadataIQ feature enabled. Additional information can be found on our GitHub Repo: [https://github.com/dell/powerscale-rag-connector](https://github.com/dell/powerscale-rag-connector)\n",
|
||||
"\n",
|
||||
"## Overview\n",
|
||||
"### Integration details\n",
|
||||
"\n",
|
||||
"| Class | Package | Local | Serializable | [JS support](https://js.langchain.com/docs/integrations/document_loaders/web_loaders/__module_name___loader)|\n",
|
||||
"| :--- | :--- | :---: | :---: | :---: |\n",
|
||||
"| [PowerScaleDocumentLoader](https://github.com/dell/powerscale-rag-connector/blob/main/src/powerscale_rag_connector/PowerScaleDocumentLoader.py) | [powerscale-rag-connector](https://github.com/dell/powerscale-rag-connector) | ✅ | ❌ | ❌ | \n",
|
||||
"| [PowerScaleUnstructuredLoader](https://github.com/dell/powerscale-rag-connector/blob/main/src/powerscale_rag_connector/PowerScaleUnstructuredLoader.py) | [powerscale-rag-connector](https://github.com/dell/powerscale-rag-connector) | ✅ | ❌ | ❌ | \n",
|
||||
"### Loader features\n",
|
||||
"| Source | Document Lazy Loading | Native Async Support\n",
|
||||
"| :---: | :---: | :---: | \n",
|
||||
"| PowerScaleDocumentLoader | ✅ | ✅ | \n",
|
||||
"| PowerScaleUnstructuredLoader | ✅ | ✅ | \n",
|
||||
"\n",
|
||||
"## Setup\n",
|
||||
"\n",
|
||||
"This document loader requires the use of a Dell PowerScale system with MetadataIQ enabled. Additional information can be found on our github page: [https://github.com/dell/powerscale-rag-connector](https://github.com/dell/powerscale-rag-connector)\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Installation\n",
|
||||
"\n",
|
||||
"The document loader lives in an external pip package and can be installed using standard tooling"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"%pip install --upgrade --quiet powerscale-rag-connector"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Initialization\n",
|
||||
"\n",
|
||||
"Now we can instantiate document loader:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Generic Document Loader\n",
|
||||
"\n",
|
||||
"Our generic document loader can be used to incrementally load all files from PowerScale in the following manner:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from powerscale_rag_connector import PowerScaleDocumentLoader\n",
|
||||
"\n",
|
||||
"loader = PowerScaleDocumentLoader(\n",
|
||||
" es_host_url=\"http://elasticsearch:9200\",\n",
|
||||
" es_index_name=\"metadataiq\",\n",
|
||||
" es_api_key=\"your-api-key\",\n",
|
||||
" folder_path=\"/ifs/data\",\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### UnstructuredLoader Loader\n",
|
||||
"\n",
|
||||
"Optionally, the `PowerScaleUnstructuredLoader` can be used to locate the changed files _and_ automatically process the files producing elements of the source file. This is done using LangChain's `UnstructuredLoader` class."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from powerscale_rag_connector import PowerScaleUnstructuredLoader\n",
|
||||
"\n",
|
||||
"# Or load files with the Unstructured Loader\n",
|
||||
"loader = PowerScaleUnstructuredLoader(\n",
|
||||
" es_host_url=\"http://elasticsearch:9200\",\n",
|
||||
" es_index_name=\"metadataiq\",\n",
|
||||
" es_api_key=\"your-api-key\",\n",
|
||||
" folder_path=\"/ifs/data\",\n",
|
||||
" # 'elements' mode splits the document into more granular chunks\n",
|
||||
" # Use 'single' mode if you want the entire document as a single chunk\n",
|
||||
" mode=\"elements\",\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The fields:\n",
|
||||
" - `es_host_url` is the endpoint to to MetadataIQ Elasticsearch database\n",
|
||||
" - `es_index_index` is the name of the index where PowerScale writes it file system metadata\n",
|
||||
" - `es_api_key` is the **encoded** version of your elasticsearch API key\n",
|
||||
" - `folder_path` is the path on PowerScale to be queried for changes"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Load\n",
|
||||
"\n",
|
||||
"Internally, all code is asynchronous with PowerScale and MetadataIQ and the load and lazy load methods will return a python generator. We recommend using the lazy load function."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[Document(page_content='' metadata={'source': '/ifs/pdfs/1994-Graph.Theoretic.Obstacles.to.Perfect.Hashing.TR0257.pdf', 'snapshot': 20834, 'change_types': ['ENTRY_ADDED']}),\n",
|
||||
"Document(page_content='' metadata={'source': '/ifs/pdfs/New.sendfile-FreeBSD.20.Feb.2015.pdf', 'snapshot': 20920, 'change_types': ['ENTRY_MODIFIED']}),\n",
|
||||
"Document(page_content='' metadata={'source': '/ifs/pdfs/FAST-Fast.Architecture.Sensitive.Tree.Search.on.Modern.CPUs.and.GPUs-Slides.pdf', 'snapshot': 20924, 'change_types': ['ENTRY_ADDED']})]\n"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"for doc in loader.load():\n",
|
||||
" print(doc)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Returned Object\n",
|
||||
"\n",
|
||||
"Both document loaders will keep track of what files were previously returned to your application. When called again, the document loader will only return new or modified files since your previous run.\n",
|
||||
"\n",
|
||||
" - The `metadata` fields in the returned `Document` will return the path on PowerScale that contains the modified file. You will use this path to read the data via NFS (or S3) and process the data in your application (e.g.: create chunks and embedding). \n",
|
||||
" - The `source` field is the path on PowerScale and not necessarily on your local system (depending on your mount strategy); OneFS expresses the entire storage system as a single tree rooted at `/ifs`.\n",
|
||||
" - The `change_types` property will inform you on what change occurred since the last one - e.g.: new, modified or delete.\n",
|
||||
"\n",
|
||||
"Your RAG application can use the information from `change_types` to add, update or delete entries your chunk and vector store.\n",
|
||||
"\n",
|
||||
"When using `PowerScaleUnstructuredLoader` the `page_content` field will be filled with data from the Unstructured Loader"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Lazy Load\n",
|
||||
"\n",
|
||||
"Internally, all code is asynchronous with PowerScale and MetadataIQ and the load and lazy load methods will return a python generator. We recommend using the lazy load function."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"for doc in loader.lazy_load():\n",
|
||||
" print(doc) # do something specific with the document"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The same `Document` is returned as the load function with all the same properties mentioned above."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Additional Examples\n",
|
||||
"\n",
|
||||
"Additional examples and code can be found on our public github webpage: [https://github.com/dell/powerscale-rag-connector/tree/main/examples](https://github.com/dell/powerscale-rag-connector/tree/main/examples) that provide full working examples. \n",
|
||||
"\n",
|
||||
" - [PowerScale LangChain Document Loader](https://github.com/dell/powerscale-rag-connector/blob/main/examples/powerscale_langchain_doc_loader.py) - Working example of our standard document loader\n",
|
||||
" - [PowerScale LangChain Unstructured Loader](https://github.com/dell/powerscale-rag-connector/blob/main/examples/powerscale_langchain_unstructured_loader.py) - Working example of our standard document loader using unstructured loader for chunking and embedding\n",
|
||||
" - [PowerScale NVIDIA Retriever Microservice Loader](https://github.com/dell/powerscale-rag-connector/blob/main/examples/powerscale_nvingest_example.py) - Working example of our document loader with NVIDIA NeMo Retriever microservices for chunking and embedding"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## API reference\n",
|
||||
"\n",
|
||||
"For detailed documentation of all PowerScale Document Loader features and configurations head to the github page: [https://github.com/dell/powerscale-rag-connector/](https://github.com/dell/powerscale-rag-connector/)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.9"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
22
docs/docs/integrations/providers/dell.mdx
Normal file
22
docs/docs/integrations/providers/dell.mdx
Normal file
@ -0,0 +1,22 @@
|
||||
# Dell
|
||||
|
||||
Dell is a global technology company that provides a range of hardware, software, and
|
||||
services, including AI solutions. Their AI portfolio includes purpose-built
|
||||
infrastructure for AI workloads, including Dell PowerScale storage systems optimized
|
||||
for AI data management.
|
||||
|
||||
## PowerScale
|
||||
|
||||
Dell [PowerScale](https://www.dell.com/en-us/shop/powerscale-family/sf/powerscale) is
|
||||
an enterprise scale out storage system that hosts industry leading OneFS filesystem
|
||||
that can be hosted on-prem or deployed in the cloud.
|
||||
|
||||
### Installation and Setup
|
||||
|
||||
```bash
|
||||
pip install powerscale-rag-connector
|
||||
```
|
||||
|
||||
### Document loaders
|
||||
|
||||
See detail on available loaders [here](/docs/integrations/document_loaders/powerscale).
|
@ -519,4 +519,8 @@ packages:
|
||||
- name: langchain-xinference
|
||||
path: .
|
||||
repo: TheSongg/langchain-xinference
|
||||
|
||||
- name: powerscale-rag-connector
|
||||
name_title: PowerScale RAG Connector
|
||||
path: .
|
||||
repo: dell/powerscale-rag-connector
|
||||
provider_page: dell
|
||||
|
Loading…
Reference in New Issue
Block a user