langchain/libs/community/langchain_community/document_loaders/blob_loaders
Philippe PRADOS 6dd621d636
community[minor]: Add CloudBlobLoader that supports loading data from cloud buckets (#21957)
Thank you for contributing to LangChain!

- [ ] **PR title**: "Add CloudBlobLoader"
  - community: Add CloudBlobLoader

- [ ] **PR message**: Add cloud blob loader
    - **Description:** 
 Langchain provides several approaches to read different file formats:

Specific loaders (`CVSLoader`) or blob-compatible loaders
(`FileSystemBlobLoader`). The only implementation proposed for
BlobLoader is `FileSystemBlobLoader`.
      
Many projects retrieve files from cloud storage. We propose a new
implementation of `BlobLoader` to read files from the three cloud
storage systems. The interface is strictly identical to
`FileSystemBlobLoader`. The only difference is the constructor, which
takes a cloud "url" object such as `s3://my-bucket`, `az://my-bucket`,
or `gs://my-bucket`.
      
By streamlining the process, this novel implementation eliminates the
requirement to pre-download files from cloud storage to local temporary
files (which are seldom removed).
      
The code relies on the
[CloudPathLib](https://cloudpathlib.drivendata.org/stable/) library to
interpret cloud URLs. This has been added as an optional dependency.

```Python
loader = CloudBlobLoader("s3://mybucket/id")
for blob in loader.yield_blobs():
    print(blob)
```

- [X] **Dependencies:** CloudPathLib
- [X] **Twitter handle:** pprados


- [X] **Add tests and docs**: Add unit test, but it's easy to convert to
integration test, with some files in a cloud storage (see
`test_cloud_blob_loader.py`)

- [X] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified.

Hello from Paris @hwchase17. Can you review this PR?

---------

Co-authored-by: Eugene Yurtsev <eugene@langchain.dev>
2024-05-23 10:59:55 -04:00
..
__init__.py community[minor]: Add CloudBlobLoader that supports loading data from cloud buckets (#21957) 2024-05-23 10:59:55 -04:00
cloud_blob_loader.py community[minor]: Add CloudBlobLoader that supports loading data from cloud buckets (#21957) 2024-05-23 10:59:55 -04:00
file_system.py community[patch]: doc loaders mypy fixes (#17368) 2024-02-12 16:51:06 -08:00
schema.py core: Move document loader interfaces to core (#17723) 2024-03-06 13:59:00 -05:00
youtube_audio.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 2023-12-11 13:53:30 -08:00