Compare commits

..

67 Commits

Author SHA1 Message Date
William Fu-Hinthorn
2f679723cb Rm not implemented async in eval methods 2023-08-08 16:23:31 -07:00
Bagatur
95cf7de112 scheduled tests GHA (#8879)
Adding scheduled daily GHA that runs marked integration tests. To start
just marking some tests in test_openai
2023-08-08 14:55:25 -07:00
Joe Reuter
8f0cd91d57 Airbyte based loaders (#8586)
This PR adds 8 new loaders:
* `AirbyteCDKLoader`: This reader can wrap and run all Python-based
Airbyte source connectors.
* Separate loaders for the most commonly used APIs:
  * `AirbyteGongLoader`
  * `AirbyteHubspotLoader`
  * `AirbyteSalesforceLoader`
  * `AirbyteShopifyLoader`
  * `AirbyteStripeLoader`
  * `AirbyteTypeformLoader`
  * `AirbyteZendeskSupportLoader`

## Documentation and getting started
I added the basic shape of the config to the notebooks. This increases
the maintenance effort a bit, but I think it's worth it to make sure
people can get started quickly with these important connectors. This is
also why I linked the spec and the documentation page in the readme as
these two contain all the information to configure a source correctly
(e.g. it won't suggest using oauth if that's avoidable even if the
connector supports it).

## Document generation
The "documents" produced by these loaders won't have a text part
(instead, all the record fields are put into the metadata). If a text is
required by the use case, the caller needs to do custom transformation
suitable for their use case.

## Incremental sync
All loaders support incremental syncs if the underlying streams support
it. Store the `last_state` from the reader instance and pass it back in
when loading again; only updated records will then be loaded.
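
A minimal sketch of an incremental sync with one of these loaders; the
config keys are illustrative (they follow the Stripe source spec linked in
the readme) and it assumes the loaders accept a `state` argument:

```python
from langchain.document_loaders.airbyte import AirbyteStripeLoader

# Illustrative config -- the exact keys come from the Stripe source spec.
config = {
    "client_secret": "<stripe api key>",
    "account_id": "<stripe account id>",
    "start_date": "2023-01-01T00:00:00Z",
}

loader = AirbyteStripeLoader(config=config, stream_name="invoices")
docs = loader.load()

# Persist last_state somewhere, then pass it back in on the next run so
# only updated records are loaded.
last_state = loader.last_state
incremental_loader = AirbyteStripeLoader(
    config=config, stream_name="invoices", state=last_state
)
new_docs = incremental_loader.load()
```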

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-08-08 14:49:25 -07:00
Eugene Yurtsev
15f650ae8c Add base storage interface, 2 implementations and utility encoder (#8895)
This PR defines an abstract interface for key value stores.

It provides 2 implementations: 
1. Local File System
2. In memory -- used to facilitate testing

It also provides an encoder utility to help take care of serialization
from arbitrary data to data that can be stored by the given store
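
A minimal sketch of how such a store could be used, assuming the
batch-style `mset`/`mget`/`mdelete`/`yield_keys` interface:

```python
from langchain.storage import InMemoryStore

store = InMemoryStore()
store.mset([("user:1", b"alice"), ("user:2", b"bob")])

print(store.mget(["user:1", "user:2"]))  # [b'alice', b'bob']

store.mdelete(["user:1"])
print(list(store.yield_keys()))  # ['user:2']
```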
2023-08-08 17:29:06 -04:00
Harrison Chase
7543a3d70e Harrison/image (#845)
Co-authored-by: Ashutosh Sanzgiri <sanzgiri@gmail.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-08-08 13:58:27 -07:00
Bagatur
ab193338aa bump 258 (#8932) 2023-08-08 12:54:51 -07:00
Eugene Yurtsev
bb12184551 Internal code deprecation API (#8763)
Proposal for an internal API to deprecate LangChain code.

This PR is heavily based on:
https://github.com/matplotlib/matplotlib/blob/main/lib/matplotlib/_api/deprecation.py

This PR only includes deprecation functionality (no renaming etc.). 
Additional functionality can be added on a need basis (e.g., renaming
parameters), but best to roll out as an MVP to test this
out.

DeprecationWarnings are ignored by default. We can change the policy for
the deprecation warnings, but we'll need to make sure we're not creating
noise for users due to internal code invoking deprecated functionality.
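
A rough sketch of what usage could look like, following the
matplotlib-style API this is based on; the import path and parameter names
here are assumptions:

```python
from langchain._api import deprecated  # assumed import path


@deprecated(since="0.0.258", removal="0.2.0", alternative="shiny_new_helper")
def old_helper() -> None:
    """Calling this emits a DeprecationWarning pointing at the alternative."""
```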
2023-08-08 15:42:22 -04:00
Leonid Ganeline
33a2f58fbf tensorflow_datasets document loader (#8721)
This PR adds the `tensorflow_datasets` document loader
2023-08-08 15:19:28 -04:00
Holt Skinner
fad26e79a3 fix: Resolve AttributeError in Google Cloud Enterprise Search retriever (#8872)
- Reverting some of the changes made in
https://github.com/langchain-ai/langchain/pull/8369
2023-08-08 12:11:12 -07:00
William FH
b2eb4ff0fc Relax Validation in Eval (#8902)
Just check for missing keys
2023-08-08 11:59:30 -07:00
Leonid Ganeline
2d078c7767 PubMed document loader (#8893)
- added `PubMed Document Loader` artifacts: unit tests and examples
- fixed the `PubMed` utility; unit tests

@hwchase17
2023-08-08 14:26:03 -04:00
Ofer Mendelevitch
a7824f16f2 Added consistent timeout for Vectara calls (#8892)
- Description: consistent timeout at 60s for all calls to Vectara API
- Tag maintainer: @rlancemartin, @eyurtsev

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-08-08 11:10:32 -07:00
Bagatur
642b57c7ff nit (#8927) 2023-08-08 10:54:25 -07:00
manmax31
4a07fba9f0 Improve query prompt of BGE embeddings (#8908)
- Description: improved the query instruction of BGE embeddings after
talking with the devs of BGE embeddings
  - Tag maintainer: @hwchase17
  - Twitter handle: @ManabChetia3

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
2023-08-08 10:20:37 -07:00
Jeremy W
c5c0735fc4 Remove Evaluation from Modules page (#8926)
Remove Evaluation link (which gives 404 now) from Modules page, since it
lives under Guides page now
2023-08-08 10:20:24 -07:00
Seif
6327eecdaf Fix typo in Vectara docs (#8925)
Fixed a typo in the Vectara docs description.
2023-08-08 10:11:07 -07:00
Chris Pappalardo
beab637f04 added filter kwarg to VectorStoreIndexWrapper query and query_with_so… (#8844)
- Description: added filter to query methods in VectorStoreIndexWrapper
for filtering by metadata (i.e. search_kwargs)
- Tag maintainer: @rlancemartin, @eyurtsev

Updated the doc snippet on this topic as well. It took me a long while
to figure out how to filter the vectorstore by filename, so this might
help someone else out.

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-08-08 10:10:45 -07:00
Apurv Agarwal
4a63533216 addition to docs at 'Store and reference chat history' (#8910)
- Description: I have added an example showing how to pass a custom
template to ConversationalRetrievalChain. Instead of
CONDENSE_QUESTION_PROMPT we can pass any prompt via the
condense_question_prompt argument (see the sketch after this list). Look
in Use cases -> QA over Documents -> How to -> Store and reference chat
history,
  - Issue: #8864,
  - Dependencies: NA,
  - Tag maintainer: @hinthornw,
  - Twitter handle:
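
A minimal sketch of the pattern the new example documents, assuming an
existing `retriever`; the prompt wording is illustrative:

```python
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

# Custom condense-question prompt; the wording here is illustrative.
custom_template = """Given the following conversation and a follow up question,
rephrase the follow up question to be a standalone question.

Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""
CUSTOM_QUESTION_PROMPT = PromptTemplate.from_template(custom_template)

qa = ConversationalRetrievalChain.from_llm(
    ChatOpenAI(temperature=0),
    retriever,  # any existing retriever
    condense_question_prompt=CUSTOM_QUESTION_PROMPT,
)
qa({"question": "Where did he work?", "chat_history": [("Who is Harrison?", "Harrison is ...")]})
```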

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-08-08 10:10:11 -07:00
David vonThenen
bf4a112aa6 Fixes to the Nebula LLM Integration (#8918)
This addresses some issues with introducing the Nebula LLM to LangChain
in this PR:
https://github.com/langchain-ai/langchain/pull/8876

This fixes the following:
- Removes `SYMBLAI` from variable names
- Fixes bug with `Bearer` for the API KEY


Thanks again in advance for your help!
cc: @hwchase17, @baskaryan

---------

Co-authored-by: dvonthenen <david.vonthenen@gmail.com>
2023-08-08 10:04:43 -07:00
Jacob Lee
d1e305028f Automatically set docs appearance to system default (#8924)
@baskaryan
2023-08-08 09:54:18 -07:00
Marie-Philippe Gill
6b9f266837 Add user_context to AmazonKendraRetriever (#8869)
### Description 

Now, we can pass information like a JWT token using user_context:  

```python
self.retriever = AmazonKendraRetriever(index_id=kendraIndexId, user_context={"Token": jwt_token})
```

- [x] `make lint`
- [x] `make format`
- [x] `make test`

Also tested by pip installing in my own project, and it allows access
through the token.

### Maintainers 

 @rlancemartin, @eyurtsev

### My twitter handle 

[girlknowstech](https://twitter.com/girlknowstech)
2023-08-08 08:37:03 -07:00
Josh Hart
6116cbf0de Fix imports in awslambda docs (#8916)
Minor doc fix to awslambda tool notebook. 

Add missing import for initialize_agent to awslambda agent example

Co-authored-by: Josh Hart <josharj@amazon.com>
2023-08-08 08:29:28 -07:00
GitHub-L
67718c1d6b Update OpenAPI code to fetch use the requestBody
- Description: The API doc passed to the LLM only included the content of
responses but not the content of the requestBody, so the agent could not
construct the correct request parameters from the requestBody
information. Adding two lines of code fixes the bug.
  - Tag maintainer: @hinthornw
2023-08-08 10:33:21 -04:00
Maurits de Groot
61c2d918c6 Fixed inaccurate import in integrations:providers:bedrock documentation (#8915)
Description:
Fixed inaccurate import in integrations:providers:bedrock documentation

In the current version of the Bedrock documentation
(https://python.langchain.com/docs/integrations/providers/bedrock) it
states that the import is `from langchain import Bedrock`.

This has been changed to `from langchain.llms.bedrock import Bedrock`, as
stated in https://python.langchain.com/docs/integrations/llms/bedrock

Issue:
Not applicable

Dependencies
No dependencies required

Tag maintainer
@baskaryan

Twitter handle:
Not applicable
2023-08-08 07:24:36 -07:00
Leonid Kuligin
52d6b91c18 Fixed a source for documents uploaded from GCS (#8912)
Sets the `source` of documents loaded from GCS to their location on GCS
#8911

Co-authored-by: Leonid Kuligin <kuligin@google.com>
2023-08-08 09:34:43 -04:00
Manuel Soria
e74a605379 SQL use case docs (#8513) 2023-08-08 03:30:18 -07:00
Bagatur
022ef170f8 bump 257 (#8903) 2023-08-08 01:16:33 -07:00
Jacob Lee
fa30a57034 Adds Ollama as an LLM (#8829)
Adds Ollama as an LLM. Ollama can run various open-source models locally
(e.g. Llama 2 and Vicuna), automatically configuring and GPU-optimizing
them.
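
A minimal usage sketch, assuming a local Ollama server on its default port
with the llama2 model already pulled:

```python
from langchain.llms import Ollama

llm = Ollama(base_url="http://localhost:11434", model="llama2")
print(llm("Why is the sky blue?"))
```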

@rlancemartin @hwchase17

---------

Co-authored-by: Lance Martin <lance@langchain.dev>
2023-08-07 21:19:22 -07:00
Ash Vardanian
1f9124ceaa Add: USearch Vector Store (#8835)
## Description

I am excited to propose an integration with USearch, a lightweight
vector-search engine available for both Python and JavaScript, among
other languages.
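
A minimal sketch of how the integration could be used, assuming the store
is exposed as `USearch` with the usual `from_texts`/`similarity_search`
interface:

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import USearch

db = USearch.from_texts(
    ["harrison worked at kensho", "bears like to eat honey"],
    OpenAIEmbeddings(),
)
docs = db.similarity_search("Where did Harrison work?", k=1)
print(docs[0].page_content)
```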

## Dependencies

It introduces a new PyPi dependency - `usearch`. I am unsure if it must
be added to the Poetry file, as this would make the PR too clunky.
Please let me know.

## Profiles

- Maintainers: @ashvardanian @davvard
- Twitter handles: @ashvardanian @unum_cloud

---------

Co-authored-by: Davit Vardanyan <78792753+davvard@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-08-07 20:41:00 -07:00
Leonid Kuligin
b52a3785c9 Allow to specify a custom loader for GcsFileLoader (#8868)
Co-authored-by: Leonid Kuligin <kuligin@google.com>
2023-08-07 22:57:31 -04:00
Jeffrey Wang
ff44fe4e16 Change default Metaphor search example to use prompt optimizer (#8890)
- fix install command
- change example notebook to use Metaphor autoprompt by default

2023-08-07 17:25:36 -07:00
Bruno Bornsztein
d56eff042a Make json output parser handle newlines inside markdown code blocks (#8682)
Update to #8528

Newlines and other special characters within markdown code blocks
returned as `action_input` should be handled correctly (in particular,
unescaped `"` => `\"` and `\n` => `\\n`) so they don't break JSON
parsing.

@baskaryan
2023-08-07 15:49:54 -07:00
Jeffrey Wang
ce3666c28b Fix metaphor install command in guide (#8888) 2023-08-07 15:43:47 -07:00
Oege Dijk
cff52638b2 when encountering error during fetch return "" in web_base.py (#8753)
when e.g. downloading a sitemap with a malformed url (e.g.
"ttp://example.com/index.html" with the h omitted at the beginning of
the url), this will ensure that the sitemap download does not crash, but
just emits a warning. (maybe should be optional with e.g. a
`skip_faulty_urls:bool=True` parameter, but this was the most
straightforward fix)

@rlancemartin, @eyurtsev
---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-08-07 15:35:41 -07:00
Harrison Chase
bbd22b9b76 update metaphor docs (#8886) 2023-08-07 14:44:41 -07:00
Bennji94
33cdb06b5c Async RetryOutputParser, RetryWithErrorOutputParser and OutputFixingParser (#8776)
Added async parsing functions for RetryOutputParser,
RetryWithErrorOutputParser and OutputFixingParser.

The async parse functions call the arun methods of the used LLMChains.

Fix for #7989

---------

Co-authored-by: Benjamin May <benjamin.may94@gmail.com>
2023-08-07 14:42:48 -07:00
Carson
cc908d49a3 Fixes typo in documentation (#8882)
Fixes a simple typo in the google search engine tool documentation
@baskaryan
2023-08-07 14:33:21 -07:00
Joshua Sundance Bailey
7fc07ba5df Create ChatAnyscale (#8770)
- Description: Adds the ChatAnyscale class with llama-2 7b, llama-2 13b,
and llama-2 70b on [Anyscale
Endpoints](https://app.endpoints.anyscale.com/)
- It inherits from ChatOpenAI and requires openai (probably unnecessary
but it made for a quick and easy implementation)
- Inspired by https://github.com/langchain-ai/langchain/pull/8434
(@kylehh and @baskaryan )
2023-08-07 13:21:05 -07:00
idcore
fe78aff1f2 Add new parameter forced_decoder_ids to OpenAIWhisperParserLocal + small bug fix (#8793)
- Description: new parameter `forced_decoder_ids` for
`OpenAIWhisperParserLocal` to force the input language and enable an
optional translate mode. Usage example:

    processor = WhisperProcessor.from_pretrained("openai/whisper-medium")
    forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="transcribe")
    # forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="translate")
    loader = GenericLoader(
        YoutubeAudioLoader(urls, save_dir),
        OpenAIWhisperParserLocal(lang_model="openai/whisper-medium", forced_decoder_ids=forced_decoder_ids),
    )
  - Issue #8792
  - Tag maintainer: @rlancemartin, @eyurtsev

---------

Co-authored-by: idcore <eugene.novozhilov@gmail.com>
2023-08-07 13:17:58 -07:00
David vonThenen
40079d4936 Introduce Nebula LLM to LangChain (#8876)
## Description

This PR adds Nebula to the available LLMs in LangChain.

Nebula is an LLM focused on conversation understanding and enables users
to extract conversation insights from video, audio, text, and chat-based
conversations. These conversations can occur between any mix of human or
AI participants.

Examples of some questions you could ask Nebula from a given
conversation are:
- What could be the customer’s pain points based on the conversation?
- What sales opportunities can be identified from this conversation?
- What best practices can be derived from this conversation for future
customer interactions?

You can read more about Nebula here:

https://symbl.ai/blog/extract-insights-symbl-ai-generative-ai-recall-ai-meetings/

#### Integration Test 

An integration test is added, but it requires network access. Since
Nebula is fully managed like OpenAI, network access is required to
exercise the integration test.

#### Linting

- [x] make lint
- [x] make test (TODO: there seems to be a failure in another, unrelated
test; need to check on this.)
- [x] make format

### Dependencies

No new dependencies were introduced.

### Twitter handle

[@symbldotai](https://twitter.com/symbldotai)
[@dvonthenen](https://twitter.com/dvonthenen)


If you have any questions, please let me know.

cc: @hwchase17, @baskaryan

---------

Co-authored-by: dvonthenen <david.vonthenen@gmail.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-08-07 13:15:26 -07:00
Lance Martin
84c1ad7eaa Fix colab link for extraction ntbk (#8878)
2023-08-07 11:36:46 -07:00
Nuno Campos
9892e95d03 Add flush=True to stream examples (#8862) 2023-08-07 14:33:17 -04:00
Eugene Yurtsev
f616aee35a JsonOutputFunctionParser: Fix mutation in place bug (#8758)
Fixes in-place mutation in the JsonOutputFunctionParser, which caused
issues when trying to re-use the original AI message.
2023-08-07 14:32:46 -04:00
shibuiwilliam
ab47557db3 fix evaluation parse test (#8859)
# What
- fix evaluation parse test

- Description: fix evaluation parse test
- Issue: None
- Dependencies: None
- Tag maintainer: @baskaryan
- Twitter handle: @MLOpsJ
2023-08-07 11:15:41 -07:00
manmax31
40096c73cd Add BGE embeddings support (#8848)
- Description: [BGE-large](https://huggingface.co/BAAI/bge-large-en)
embeddings from BAAI are at the top of [MTEB
leaderboard](https://huggingface.co/spaces/mteb/leaderboard). Hence
adding support for it.
- Tag maintainer: @baskaryan
- Twitter handle: @ManabChetia3
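
A minimal usage sketch, assuming the new class is `HuggingFaceBgeEmbeddings`;
the model name and kwargs here are illustrative:

```python
from langchain.embeddings import HuggingFaceBgeEmbeddings

embeddings = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-large-en",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True},
)
print(len(embeddings.embed_query("What is the capital of France?")))
```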

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
2023-08-07 11:15:30 -07:00
shibuiwilliam
fbc83dfdbb Fix/abstract add message (#8856)
- Description: fix abstract add_message
- Issue: None
- Dependencies: None
- Twitter handle: @MLOpsJ
2023-08-07 11:02:19 -07:00
William FH
91be7eee66 Add concurrency support for run_on_dataset (#8841)
Long-term, would be better to use the lower-level batch() method(s) but
it may take me a bit longer to clean up. This unblocks in the meantime,
though it may fail when the evaluated chain raises a
`NotImplementedError` for a corresponding async method
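
A rough sketch of how the new knob could be used; the `concurrency_level`
name and the dataset below are assumptions for illustration:

```python
from langchain.chat_models import ChatOpenAI
from langchain.smith import run_on_dataset
from langsmith import Client

client = Client()
results = run_on_dataset(
    client,
    dataset_name="my-eval-dataset",  # hypothetical dataset name
    llm_or_chain_factory=lambda: ChatOpenAI(temperature=0),
    concurrency_level=5,  # assumed name of the new concurrency option
)
```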
2023-08-07 09:24:48 -07:00
Bagatur
fc2f450f2d bump 256 (#8870) 2023-08-07 08:29:02 -07:00
Tudor Golubenco
aeaef8f3a3 Add support for Xata as a vector store (#8822)
This adds support for [Xata](https://xata.io) (data platform based on
Postgres) as a vector store. We have recently added [Xata to
Langchain.js](https://github.com/hwchase17/langchainjs/pull/2125) and
would love to have the equivalent in the Python project as well.

The PR includes integration tests and a Jupyter notebook as docs. Please
let me know if anything else would be needed or helpful.

I have added the xata python SDK as an optional dependency.

## To run the integration tests

You will need to create a DB in xata (see the docs), then run something
like:

```
OPENAI_API_KEY=sk-... XATA_API_KEY=xau_... XATA_DB_URL='https://....xata.sh/db/langchain'  poetry run pytest tests/integration_tests/vectorstores/test_xata.py
```
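
And a rough sketch of using the store itself from Python; the constructor
parameter names here are assumptions based on the env vars above:

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import XataVectorStore

vector_store = XataVectorStore.from_texts(
    ["harrison worked at kensho"],
    OpenAIEmbeddings(),
    api_key="xau_...",  # assumed parameter names
    db_url="https://....xata.sh/db/langchain",
    table_name="vectors",
)
docs = vector_store.similarity_search("Where did Harrison work?")
```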


---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Co-authored-by: Philip Krauss <35487337+philkra@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-08-07 08:14:52 -07:00
Harrison Chase
472f00ada7 add moderation example (#8718) 2023-08-07 07:50:11 -07:00
Leonid Kuligin
6e3fa59073 Added chat history to codey models (#8831)
#7469

Since 1.29.0, the Vertex SDK supports a chat history provided to a Codey
chat model.

Co-authored-by: Leonid Kuligin <kuligin@google.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
2023-08-07 07:34:35 -07:00
Massimiliano Pronesti
a616e19975 feat(llms): add support for vLLM (#8806)
Hello langchain maintainers, 
this PR aims at integrating
[vllm](https://vllm.readthedocs.io/en/latest/#) into langchain. This PR
closes #8729.

This feature clearly depends on `vllm`, but I've seen other models
supported here depend on packages that are not included in the
pyproject.toml (e.g. `gpt4all`, `text-generation`) so I thought it was
the case for this as well.
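
A minimal usage sketch, assuming the wrapper is exposed as `VLLM` under
`langchain.llms`; the model name is illustrative:

```python
from langchain.llms import VLLM

llm = VLLM(
    model="mosaicml/mpt-7b",  # any Hugging Face model supported by vLLM
    trust_remote_code=True,   # needed for some Hugging Face models
    max_new_tokens=128,
    temperature=0.8,
)
print(llm("What is the capital of France?"))
```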

@hwchase17, @baskaryan

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
2023-08-07 07:32:02 -07:00
Bagatur
100d9ce4c7 bump 255 (#8865) 2023-08-07 07:25:23 -07:00
Vic Cao
c9da300e4d fix: overwrite stream for ChatOpenAI in runtime (#8288)
@hwchase17, @baskaryan

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Nuno Campos <nuno@boringbits.io>
2023-08-07 10:18:30 +01:00
Karthik Raja A
5a9765b1b5 MultiOn client toolkit update 2.0 (#8750)
- Updated to use the newer, better function interaction
- Previous version had only one callback
- @hinthornw @hwchase17 Can you look into this?
- Shout out to @MultiON_AI @DivGarg9 on Twitter

---------

Co-authored-by: Naman Garg <ngarg3@binghamton.edu>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
2023-08-06 22:24:10 -07:00
Emre
454998c1fb Fix invalid escape sequence warnings (#8771)

Description: The lines I have changed look incorrectly escaped for regex.
In Python 3.11, I receive a DeprecationWarning for these lines. You don't
see any warnings unless you explicitly run Python with the
`-W always::DeprecationWarning` flag. So, this is my attempt to fix it.

Here are the warnings from log files:

```
/usr/local/lib/python3.11/site-packages/langchain/text_splitter.py:919: DeprecationWarning: invalid escape sequence '\s'
/usr/local/lib/python3.11/site-packages/langchain/text_splitter.py:918: DeprecationWarning: invalid escape sequence '\s'
/usr/local/lib/python3.11/site-packages/langchain/text_splitter.py:917: DeprecationWarning: invalid escape sequence '\s'
/usr/local/lib/python3.11/site-packages/langchain/text_splitter.py:916: DeprecationWarning: invalid escape sequence '\c'
/usr/local/lib/python3.11/site-packages/langchain/text_splitter.py:903: DeprecationWarning: invalid escape sequence '\*'
/usr/local/lib/python3.11/site-packages/langchain/text_splitter.py:804: DeprecationWarning: invalid escape sequence '\*'
/usr/local/lib/python3.11/site-packages/langchain/text_splitter.py:804: DeprecationWarning: invalid escape sequence '\*'
```

cc @baskaryan

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
2023-08-06 17:01:18 -07:00
Harrison Chase
0adc282d70 Harrison/as retriever docstring (#8840)
Co-authored-by: Bytestorm <31070777+Bytestorm5@users.noreply.github.com>
2023-08-06 17:00:57 -07:00
Zend
bd4865b6fe Async Recursive URL loader (#8502)
Description: This PR improves recursive_url_loader by adding a limit on
the crawl depth and customizable extractors (from the raw webpage to the
text of the Document object), so that users can use other tools to
extract the webpage content. This PR also includes the documentation and
tests for the new loader.
The old PR was closed due to a project structure change. #7756
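
A minimal sketch of the improved loader, assuming a `max_depth` limit and
an `extractor` callable as described; the URL is illustrative:

```python
from bs4 import BeautifulSoup
from langchain.document_loaders import RecursiveUrlLoader

loader = RecursiveUrlLoader(
    url="https://docs.python.org/3.9/",
    max_depth=2,  # limit how deep the crawler follows links
    extractor=lambda html: BeautifulSoup(html, "html.parser").text,
)
docs = loader.load()
```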

Because socket requests are not allowed, the old unit test was removed.
Issue: N/A
Dependencies: asyncio, aiohttp
Tag maintainer: @rlancemartin
Twitter handle: @ Zend_Nihility

---------

Co-authored-by: Lance Martin <lance@langchain.dev>
2023-08-06 16:22:31 -07:00
fqassemi
485d716c21 Feature faiss delete (#8135)
- Description: the docstore had two main methods, add and search; however,
working with a docstore sometimes requires deleting an entry, so I have
added a simple delete method that removes items from the docstore. I have
also added the same delete method to the FAISS vectorstore for the very
same reason (a short sketch follows this list).
  - Issue: NA
  - Dependencies: NA
  - Tag maintainer:  @rlancemartin, @eyurtsev
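
A minimal sketch of the new method, assuming `delete` accepts the ids
returned by `add_texts`:

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

db = FAISS.from_texts(["foo", "bar"], OpenAIEmbeddings())
ids = db.add_texts(["baz"])  # add_texts returns the ids of the new entries
db.delete(ids)               # remove them again
```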

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
2023-08-06 15:46:30 -07:00
Nicolas
b57fa1a39c docs: Improvements on Mendable Search (#8808)
- Balancing prioritization between keyword / AI search
- Show snippets of highlighted keywords when searching 
- Improved keyword search
- Fixed bugs and issues

Shoutout to @calebpeffer for implementing and gathering feedback on it 

cc: @dev2049 @rlancemartin @hwchase17
2023-08-06 15:32:06 -07:00
Ikko Eltociear Ashimine
6b93670410 Fix typo in long_context_reorder.ipynb (#8811)
begining -> beginning

2023-08-06 15:31:38 -07:00
Harrison Chase
2bb1d256f3 add example of memory and returning retrieved docs (#8830) 2023-08-06 15:25:12 -07:00
Pierre Alexandre SCHEMBRI
4a7ebb7184 Fix issue #7616 (#7617)
Fix Issue #7616 with a simpler approach to extract function names (use
`__name__` attribute)

@hwchase17

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
2023-08-06 15:12:03 -07:00
Ankur Agarwal
797c9e92c8 #8786 Fixed: Callback handler disconnect in between (#8787)
Fixes for  #8786 @agola11 

- Description: The flow of callbacks was breaking before reaching the
last chain, because callbacks were being dropped between chains along
nested paths. This fix provides a full trace and correlates the
parent-child relationship across all nested chains.

  - Issue: the issue #8786 
  - Dependencies: NA
  - Tag maintainer: @agola11 
  - Twitter handle: Agarwal_Ankur
2023-08-06 15:11:45 -07:00
Kshitij Wadhwa
5f1aab5487 Fix docs for Rockset (#8807)
* remove error output for notebook
* add comment about vector length for ingest transformation
* change OPENAI_KEY -> OPENAI_API_KEY

cc @baskaryan
2023-08-06 15:04:01 -07:00
William FH
983678dedc Add Dist Metrics for String Distance Evaluation (#8837)
Co-authored-by: shibuiwilliam <shibuiyusuke@gmail.com>
2023-08-06 14:05:00 -07:00
William FH
f76d50d8dc fix exception inconsistencies (#8812) (#8839)
Merge #8812 with main to fix unrelated test failure

Co-authored-by: shibuiwilliam <shibuiyusuke@gmail.com>
2023-08-06 14:04:49 -07:00
204 changed files with 10552 additions and 1811 deletions

View File

@@ -1,5 +1,5 @@
---
name: libs/langchain-experimental CI
name: libs/experimental CI
on:
push:

View File

@@ -1,5 +1,5 @@
---
name: libs/langchain-experimental Release
name: libs/experimental Release
on:
pull_request:

38
.github/workflows/scheduled_test.yml vendored Normal file
View File

@@ -0,0 +1,38 @@
name: Scheduled tests
on:
scheduled:
- cron: '0 13 * * *'
env:
POETRY_VERSION: "1.4.2"
jobs:
build:
runs-on: ubuntu-latest
environment: Scheduled testing
strategy:
matrix:
python-version:
- "3.8"
- "3.9"
- "3.10"
- "3.11"
name: Python ${{ matrix.python-version }}
steps:
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
uses: "./.github/actions/poetry_setup"
with:
python-version: ${{ matrix.python-version }}
poetry-version: "1.4.2"
install-command: |
echo "Running scheduled tests, installing dependencies with poetry..."
poetry install -E scheduled_testing
- name: Run tests
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: |
make scheduled_tests
shell: bash
secrets: inherit

View File

@@ -18,5 +18,3 @@ Let chains choose which tools to use given high-level directives
Persist application state between runs of a chain
#### [Callbacks](/docs/modules/callbacks/)
Log and stream intermediate steps of any chain
#### [Evaluation](/docs/modules/evaluation/)
Evaluate the performance of a chain.

View File

@@ -128,6 +128,10 @@ const config = {
hideable: true,
},
},
colorMode: {
disableSwitch: false,
respectPrefersColorScheme: true,
},
prism: {
theme: {
...baseLightCodeBlockTheme,

View File

@@ -12,7 +12,7 @@
"@docusaurus/preset-classic": "2.4.0",
"@docusaurus/remark-plugin-npm2yarn": "^2.4.0",
"@mdx-js/react": "^1.6.22",
"@mendable/search": "^0.0.125",
"@mendable/search": "^0.0.137",
"clsx": "^1.2.1",
"json-loader": "^0.5.7",
"process": "^0.11.10",
@@ -3212,10 +3212,11 @@
}
},
"node_modules/@mendable/search": {
"version": "0.0.125",
"resolved": "https://registry.npmjs.org/@mendable/search/-/search-0.0.125.tgz",
"integrity": "sha512-Mb1J3zDhOyBZV9cXqJocSOBNYGpe8+LQDqd9n9laPWxosSJcSTUewqtlIbMerrYsScBsxskoSiWgRsc7xF5z0Q==",
"version": "0.0.137",
"resolved": "https://registry.npmjs.org/@mendable/search/-/search-0.0.137.tgz",
"integrity": "sha512-2J2fd5eqToK+mLzrSDA6NAr4F1kfql7QRiHpD7AUJJX0nqpvInhr/mMJKBCUSCv2z76UKCmF5wLuPSw+C90Qdg==",
"dependencies": {
"html-react-parser": "^4.2.0",
"posthog-js": "^1.45.1"
},
"peerDependencies": {
@@ -8332,6 +8333,33 @@
"safe-buffer": "~5.1.0"
}
},
"node_modules/html-dom-parser": {
"version": "4.0.0",
"resolved": "https://registry.npmjs.org/html-dom-parser/-/html-dom-parser-4.0.0.tgz",
"integrity": "sha512-TUa3wIwi80f5NF8CVWzkopBVqVAtlawUzJoLwVLHns0XSJGynss4jiY0mTWpiDOsuyw+afP+ujjMgRh9CoZcXw==",
"dependencies": {
"domhandler": "5.0.3",
"htmlparser2": "9.0.0"
}
},
"node_modules/html-dom-parser/node_modules/htmlparser2": {
"version": "9.0.0",
"resolved": "https://registry.npmjs.org/htmlparser2/-/htmlparser2-9.0.0.tgz",
"integrity": "sha512-uxbSI98wmFT/G4P2zXx4OVx04qWUmyFPrD2/CNepa2Zo3GPNaCaaxElDgwUrwYWkK1nr9fft0Ya8dws8coDLLQ==",
"funding": [
"https://github.com/fb55/htmlparser2?sponsor=1",
{
"type": "github",
"url": "https://github.com/sponsors/fb55"
}
],
"dependencies": {
"domelementtype": "^2.3.0",
"domhandler": "^5.0.3",
"domutils": "^3.1.0",
"entities": "^4.5.0"
}
},
"node_modules/html-entities": {
"version": "2.4.0",
"resolved": "https://registry.npmjs.org/html-entities/-/html-entities-2.4.0.tgz",
@@ -8375,6 +8403,20 @@
"node": ">= 12"
}
},
"node_modules/html-react-parser": {
"version": "4.2.0",
"resolved": "https://registry.npmjs.org/html-react-parser/-/html-react-parser-4.2.0.tgz",
"integrity": "sha512-gzU55AS+FI6qD7XaKe5BLuLFM2Xw0/LodfMWZlxV9uOHe7LCD5Lukx/EgYuBI3c0kLu0XlgFXnSzO0qUUn3Vrg==",
"dependencies": {
"domhandler": "5.0.3",
"html-dom-parser": "4.0.0",
"react-property": "2.0.0",
"style-to-js": "1.1.3"
},
"peerDependencies": {
"react": "0.14 || 15 || 16 || 17 || 18"
}
},
"node_modules/html-tags": {
"version": "3.3.1",
"resolved": "https://registry.npmjs.org/html-tags/-/html-tags-3.3.1.tgz",
@@ -11762,6 +11804,11 @@
"webpack": ">=4.41.1 || 5.x"
}
},
"node_modules/react-property": {
"version": "2.0.0",
"resolved": "https://registry.npmjs.org/react-property/-/react-property-2.0.0.tgz",
"integrity": "sha512-kzmNjIgU32mO4mmH5+iUyrqlpFQhF8K2k7eZ4fdLSOPFrD1XgEuSBv9LDEgxRXTMBqMd8ppT0x6TIzqE5pdGdw=="
},
"node_modules/react-router": {
"version": "5.3.4",
"resolved": "https://registry.npmjs.org/react-router/-/react-router-5.3.4.tgz",
@@ -13127,6 +13174,22 @@
"url": "https://github.com/sponsors/sindresorhus"
}
},
"node_modules/style-to-js": {
"version": "1.1.3",
"resolved": "https://registry.npmjs.org/style-to-js/-/style-to-js-1.1.3.tgz",
"integrity": "sha512-zKI5gN/zb7LS/Vm0eUwjmjrXWw8IMtyA8aPBJZdYiQTXj4+wQ3IucOLIOnF7zCHxvW8UhIGh/uZh/t9zEHXNTQ==",
"dependencies": {
"style-to-object": "0.4.1"
}
},
"node_modules/style-to-js/node_modules/style-to-object": {
"version": "0.4.1",
"resolved": "https://registry.npmjs.org/style-to-object/-/style-to-object-0.4.1.tgz",
"integrity": "sha512-HFpbb5gr2ypci7Qw+IOhnP2zOU7e77b+rzM+wTzXzfi1PrtBCX0E7Pk4wL4iTLnhzZ+JgEGAhX81ebTg/aYjQw==",
"dependencies": {
"inline-style-parser": "0.1.1"
}
},
"node_modules/style-to-object": {
"version": "0.3.0",
"resolved": "https://registry.npmjs.org/style-to-object/-/style-to-object-0.3.0.tgz",

View File

@@ -23,7 +23,7 @@
"@docusaurus/preset-classic": "2.4.0",
"@docusaurus/remark-plugin-npm2yarn": "^2.4.0",
"@mdx-js/react": "^1.6.22",
"@mendable/search": "^0.0.125",
"@mendable/search": "^0.0.137",
"clsx": "^1.2.1",
"json-loader": "^0.5.7",
"process": "^0.11.10",

(Four binary image files added: 405 KiB, 190 KiB, 119 KiB, 266 KiB.)

View File

@@ -22,7 +22,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 1,
"id": "466b65b3",
"metadata": {},
"outputs": [],
@@ -171,9 +171,7 @@
"cell_type": "code",
"execution_count": 9,
"id": "decf7710",
"metadata": {
"scrolled": false
},
"metadata": {},
"outputs": [
{
"data": {
@@ -202,7 +200,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 10,
"id": "f799664d",
"metadata": {},
"outputs": [],
@@ -347,7 +345,7 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 12,
"id": "5d3d8ffe",
"metadata": {},
"outputs": [],
@@ -368,7 +366,7 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 2,
"id": "33be32af",
"metadata": {},
"outputs": [],
@@ -380,7 +378,7 @@
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 3,
"id": "df3f3fa2",
"metadata": {},
"outputs": [],
@@ -424,9 +422,7 @@
"cell_type": "code",
"execution_count": 18,
"id": "f3040b0c",
"metadata": {
"scrolled": false
},
"metadata": {},
"outputs": [
{
"name": "stderr",
@@ -477,9 +473,7 @@
"cell_type": "code",
"execution_count": 20,
"id": "7ee8b2d4",
"metadata": {
"scrolled": false
},
"metadata": {},
"outputs": [
{
"name": "stderr",
@@ -515,7 +509,7 @@
},
{
"cell_type": "code",
"execution_count": 66,
"execution_count": 4,
"id": "3f30c348",
"metadata": {},
"outputs": [],
@@ -526,7 +520,7 @@
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 5,
"id": "64ab1dbf",
"metadata": {},
"outputs": [],
@@ -544,7 +538,7 @@
},
{
"cell_type": "code",
"execution_count": 22,
"execution_count": 6,
"id": "7d628c97",
"metadata": {},
"outputs": [],
@@ -559,7 +553,7 @@
},
{
"cell_type": "code",
"execution_count": 68,
"execution_count": 7,
"id": "f60a5d0f",
"metadata": {},
"outputs": [],
@@ -572,7 +566,7 @@
},
{
"cell_type": "code",
"execution_count": 69,
"execution_count": 8,
"id": "7d007db6",
"metadata": {},
"outputs": [],
@@ -589,25 +583,29 @@
},
{
"cell_type": "code",
"execution_count": 70,
"execution_count": 16,
"id": "5c32cc89",
"metadata": {},
"outputs": [],
"source": [
"conversational_qa_chain = RunnableMap({\n",
" \"standalone_question\": {\n",
" \"question\": lambda x: x[\"question\"],\n",
" \"chat_history\": lambda x: _format_chat_history(x['chat_history'])\n",
" } | CONDENSE_QUESTION_PROMPT | ChatOpenAI(temperature=0) | StrOutputParser(),\n",
"}) | {\n",
"_inputs = RunnableMap(\n",
" {\n",
" \"standalone_question\": {\n",
" \"question\": lambda x: x[\"question\"],\n",
" \"chat_history\": lambda x: _format_chat_history(x['chat_history'])\n",
" } | CONDENSE_QUESTION_PROMPT | ChatOpenAI(temperature=0) | StrOutputParser(),\n",
" }\n",
")\n",
"_context = {\n",
" \"context\": itemgetter(\"standalone_question\") | retriever | _combine_documents,\n",
" \"question\": lambda x: x[\"standalone_question\"]\n",
"} | ANSWER_PROMPT | ChatOpenAI()"
"}\n",
"conversational_qa_chain = _inputs | _context | ANSWER_PROMPT | ChatOpenAI()"
]
},
{
"cell_type": "code",
"execution_count": 71,
"execution_count": 17,
"id": "135c8205",
"metadata": {},
"outputs": [
@@ -624,7 +622,7 @@
"AIMessage(content='Harrison was employed at Kensho.', additional_kwargs={}, example=False)"
]
},
"execution_count": 71,
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
@@ -638,7 +636,7 @@
},
{
"cell_type": "code",
"execution_count": 62,
"execution_count": 15,
"id": "424e7e7a",
"metadata": {},
"outputs": [
@@ -655,7 +653,7 @@
"AIMessage(content='Harrison worked at Kensho.', additional_kwargs={}, example=False)"
]
},
"execution_count": 62,
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
@@ -667,6 +665,149 @@
"})"
]
},
{
"cell_type": "markdown",
"id": "c5543183",
"metadata": {},
"source": [
"### With Memory and returning source documents\n",
"\n",
"This shows how to use memory with the above. For memory, we need to manage that outside at the memory. For returning the retrieved documents, we just need to pass them through all the way."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "e31dd17c",
"metadata": {},
"outputs": [],
"source": [
"from langchain.memory import ConversationBufferMemory"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "d4bffe94",
"metadata": {},
"outputs": [],
"source": [
"memory = ConversationBufferMemory(return_messages=True, output_key=\"answer\", input_key=\"question\")"
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "733be985",
"metadata": {},
"outputs": [],
"source": [
"# First we add a step to load memory\n",
"# This needs to be a RunnableMap because its the first input\n",
"loaded_memory = RunnableMap(\n",
" {\n",
" \"question\": itemgetter(\"question\"),\n",
" \"memory\": memory.load_memory_variables,\n",
" }\n",
")\n",
"# Next we add a step to expand memory into the variables\n",
"expanded_memory = {\n",
" \"question\": itemgetter(\"question\"),\n",
" \"chat_history\": lambda x: x[\"memory\"][\"history\"]\n",
"}\n",
"\n",
"# Now we calculate the standalone question\n",
"standalone_question = {\n",
" \"standalone_question\": {\n",
" \"question\": lambda x: x[\"question\"],\n",
" \"chat_history\": lambda x: _format_chat_history(x['chat_history'])\n",
" } | CONDENSE_QUESTION_PROMPT | ChatOpenAI(temperature=0) | StrOutputParser(),\n",
"}\n",
"# Now we retrieve the documents\n",
"retrieved_documents = {\n",
" \"docs\": itemgetter(\"standalone_question\") | retriever,\n",
" \"question\": lambda x: x[\"standalone_question\"]\n",
"}\n",
"# Now we construct the inputs for the final prompt\n",
"final_inputs = {\n",
" \"context\": lambda x: _combine_documents(x[\"docs\"]),\n",
" \"question\": itemgetter(\"question\")\n",
"}\n",
"# And finally, we do the part that returns the answers\n",
"answer = {\n",
" \"answer\": final_inputs | ANSWER_PROMPT | ChatOpenAI(),\n",
" \"docs\": itemgetter(\"docs\"),\n",
"}\n",
"# And now we put it all together!\n",
"final_chain = loaded_memory | expanded_memory | standalone_question | retrieved_documents | answer"
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "806e390c",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Number of requested results 4 is greater than number of elements in index 1, updating n_results = 1\n"
]
},
{
"data": {
"text/plain": [
"{'answer': AIMessage(content='Harrison was employed at Kensho.', additional_kwargs={}, example=False),\n",
" 'docs': [Document(page_content='harrison worked at kensho', metadata={})]}"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"inputs = {\"question\": \"where did harrison work?\"}\n",
"result = final_chain.invoke(inputs)\n",
"result"
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "977399fd",
"metadata": {},
"outputs": [],
"source": [
"# Note that the memory does not save automatically\n",
"# This will be improved in the future\n",
"# For now you need to save it yourself\n",
"memory.save_context(inputs, {\"answer\": result[\"answer\"].content})"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "f94f7de4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'history': [HumanMessage(content='where did harrison work?', additional_kwargs={}, example=False),\n",
" AIMessage(content='Harrison was employed at Kensho.', additional_kwargs={}, example=False)]}"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"memory.load_memory_variables({})"
]
},
{
"cell_type": "markdown",
"id": "0f2bf8d3",
@@ -1391,13 +1532,122 @@
"response"
]
},
{
"cell_type": "markdown",
"id": "4927a727-b4c8-453c-8c83-bd87b4fcac14",
"metadata": {},
"source": [
"## Moderation\n",
"\n",
"This shows how to add in moderation (or other safeguards) around your LLM application."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "179d3c03",
"execution_count": 26,
"id": "4f5f6449-940a-4f5c-97c0-39b71c3e2a68",
"metadata": {},
"outputs": [],
"source": []
"source": [
"from langchain.chains import OpenAIModerationChain\n",
"from langchain.llms import OpenAI"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "fcb8312b-7e7a-424f-a3ec-76738c9a9d21",
"metadata": {},
"outputs": [],
"source": [
"moderate = OpenAIModerationChain()"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "b24b9148-f6b0-4091-8ea8-d3fb281bd950",
"metadata": {},
"outputs": [],
"source": [
"model = OpenAI()\n",
"prompt = ChatPromptTemplate.from_messages([\n",
" (\"system\", \"repeat after me: {input}\")\n",
"])"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "1c8ed87c-9ca6-4559-bf60-d40e94a0af08",
"metadata": {},
"outputs": [],
"source": [
"chain = prompt | model"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "5256b9bd-381a-42b0-bfa8-7e6d18f853cb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'\\n\\nYou are stupid.'"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain.invoke({\"input\": \"you are stupid\"})"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "fe6e3b33-dc9a-49d5-b194-ba750c58a628",
"metadata": {},
"outputs": [],
"source": [
"moderated_chain = chain | moderate"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "d8ba0cbd-c739-4d23-be9f-6ae092bd5ffb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'input': '\\n\\nYou are stupid.',\n",
" 'output': \"Text was found that violates OpenAI's content policy.\"}"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"moderated_chain.invoke({\"input\": \"you are stupid\"})"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a0a85ba4-f782-47b8-b16f-8b7a61d6dab7",
"metadata": {},
"outputs": [],
"source": [
"## Conversational Retrieval With Memory"
]
}
],
"metadata": {
@@ -1416,7 +1666,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
"version": "3.10.1"
}
},
"nbformat": 4,

View File

@@ -108,7 +108,7 @@
],
"source": [
"for s in chain.stream({\"topic\": \"bears\"}):\n",
" print(s.content, end=\"\")"
" print(s.content, end=\"\", flush=True)"
]
},
{
@@ -196,7 +196,7 @@
],
"source": [
"async for s in chain.astream({\"topic\": \"bears\"}):\n",
" print(s.content, end=\"\")"
" print(s.content, end=\"\", flush=True)"
]
},
{

View File

@@ -0,0 +1,225 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "642fd21c-600a-47a1-be96-6e1438b421a9",
"metadata": {},
"source": [
"# Anyscale\n",
"\n",
"This notebook demonstrates the use of `langchain.chat_models.ChatAnyscale` for [Anyscale Endpoints](https://endpoints.anyscale.com/).\n",
"\n",
"* Set `ANYSCALE_API_KEY` environment variable\n",
"* or use the `anyscale_api_key` keyword argument"
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# !pip install openai"
],
"metadata": {
"collapsed": false
},
"id": "d00d850917865298"
},
{
"cell_type": "code",
"execution_count": 1,
"id": "72340871-ae2f-415f-b399-0777d32dc379",
"metadata": {},
"outputs": [
{
"name": "stdin",
"output_type": "stream",
"text": [
" ········\n"
]
}
],
"source": [
"import os\n",
"from getpass import getpass\n",
"\n",
"os.environ[\"ANYSCALE_API_KEY\"] = getpass()"
]
},
{
"cell_type": "markdown",
"id": "5d7fc704-3ea0-4c35-96e7-89fcae6c73fa",
"metadata": {},
"source": [
"# Let's try out each model offered on Anyscale Endpoints"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "0dc9428d-4217-47d2-97de-f784b1764186",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"dict_keys(['meta-llama/Llama-2-70b-chat-hf', 'meta-llama/Llama-2-7b-chat-hf', 'meta-llama/Llama-2-13b-chat-hf'])\n"
]
}
],
"source": [
"from langchain.chat_models import ChatAnyscale\n",
"\n",
"chats = {\n",
" model: ChatAnyscale(model_name=model, temperature=1.0)\n",
" for model in ChatAnyscale.get_available_models()\n",
"}\n",
"\n",
"print(chats.keys())"
]
},
{
"cell_type": "markdown",
"id": "7c4f124a-eaf7-4d78-a2c0-b0aa23fb25c4",
"metadata": {},
"source": [
"# We can use async methods and other stuff supported by ChatOpenAI\n",
"\n",
"This way, the three requests will only take as long as the longest individual request."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "1f94f5d2-569e-4a2c-965e-de53c2845fbb",
"metadata": {},
"outputs": [],
"source": [
"import asyncio\n",
"\n",
"from langchain.schema import SystemMessage, HumanMessage\n",
"\n",
"messages = [\n",
" SystemMessage(\n",
" content=\"You are a helpful AI that shares everything you know.\"\n",
" ),\n",
" HumanMessage(\n",
" content=\"Tell me technical facts about yourself. Are you a transformer model? How many billions of parameters do you have?\"\n",
" ),\n",
"]\n",
"\n",
"async def get_msgs():\n",
" tasks = [\n",
" chat.apredict_messages(messages)\n",
" for chat in chats.values()\n",
" ]\n",
" responses = await asyncio.gather(*tasks)\n",
" return dict(zip(chats.keys(), responses))"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "b2ced871-869a-4ca6-a2ec-6bfececdf7da",
"metadata": {},
"outputs": [],
"source": [
"import nest_asyncio\n",
"\n",
"nest_asyncio.apply()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "bc605fa5-9501-470d-a6c9-cd868d2145ef",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\tmeta-llama/Llama-2-70b-chat-hf\n",
"\n",
"Greetings! I'm just an AI, I don't have a personal identity like humans do, but I'm here to help you with any questions you have.\n",
"\n",
"I'm a large language model, which means I'm trained on a large corpus of text data to generate language outputs that are coherent and natural-sounding. My architecture is based on a transformer model, which is a type of neural network that's particularly well-suited for natural language processing tasks.\n",
"\n",
"As for my parameters, I have a few billion parameters, but I don't have access to the exact number as it's not relevant to my functioning. My training data includes a vast amount of text from various sources, including books, articles, and websites, which I use to learn patterns and relationships in language.\n",
"\n",
"I'm designed to be a helpful tool for a variety of tasks, such as answering questions, providing information, and generating text. I'm constantly learning and improving my abilities through machine learning algorithms and feedback from users like you.\n",
"\n",
"I hope this helps! Is there anything else you'd like to know about me or my capabilities?\n",
"\n",
"---\n",
"\n",
"\tmeta-llama/Llama-2-7b-chat-hf\n",
"\n",
"Ah, a fellow tech enthusiast! *adjusts glasses* I'm glad to share some technical details about myself. 🤓\n",
"Indeed, I'm a transformer model, specifically a BERT-like language model trained on a large corpus of text data. My architecture is based on the transformer framework, which is a type of neural network designed for natural language processing tasks. 🏠\n",
"As for the number of parameters, I have approximately 340 million. *winks* That's a pretty hefty number, if I do say so myself! These parameters allow me to learn and represent complex patterns in language, such as syntax, semantics, and more. 🤔\n",
"But don't ask me to do math in my head I'm a language model, not a calculating machine! 😅 My strengths lie in understanding and generating human-like text, so feel free to chat with me anytime you'd like. 💬\n",
"Now, do you have any more technical questions for me? Or would you like to engage in a nice chat? 😊\n",
"\n",
"---\n",
"\n",
"\tmeta-llama/Llama-2-13b-chat-hf\n",
"\n",
"Hello! As a friendly and helpful AI, I'd be happy to share some technical facts about myself.\n",
"\n",
"I am a transformer-based language model, specifically a variant of the BERT (Bidirectional Encoder Representations from Transformers) architecture. BERT was developed by Google in 2018 and has since become one of the most popular and widely-used AI language models.\n",
"\n",
"Here are some technical details about my capabilities:\n",
"\n",
"1. Parameters: I have approximately 340 million parameters, which are the numbers that I use to learn and represent language. This is a relatively large number of parameters compared to some other languages models, but it allows me to learn and understand complex language patterns and relationships.\n",
"2. Training: I was trained on a large corpus of text data, including books, articles, and other sources of written content. This training allows me to learn about the structure and conventions of language, as well as the relationships between words and phrases.\n",
"3. Architectures: My architecture is based on the transformer model, which is a type of neural network that is particularly well-suited for natural language processing tasks. The transformer model uses self-attention mechanisms to allow the model to \"attend\" to different parts of the input text, allowing it to capture long-range dependencies and contextual relationships.\n",
"4. Precision: I am capable of generating text with high precision and accuracy, meaning that I can produce text that is close to human-level quality in terms of grammar, syntax, and coherence.\n",
"5. Generative capabilities: In addition to being able to generate text based on prompts and questions, I am also capable of generating text based on a given topic or theme. This allows me to create longer, more coherent pieces of text that are organized around a specific idea or concept.\n",
"\n",
"Overall, I am a powerful and versatile language model that is capable of a wide range of natural language processing tasks. I am constantly learning and improving, and I am here to help answer any questions you may have!\n",
"\n",
"---\n",
"\n",
"CPU times: user 371 ms, sys: 15.5 ms, total: 387 ms\n",
"Wall time: 12 s\n"
]
}
],
"source": [
"%%time\n",
"\n",
"response_dict = asyncio.run(get_msgs())\n",
"\n",
"for model_name, response in response_dict.items():\n",
" print(f'\\t{model_name}')\n",
" print()\n",
" print(response.content)\n",
" print('\\n---\\n')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,226 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "1f3a5ebf",
"metadata": {},
"source": [
"# Airbyte CDK"
]
},
{
"cell_type": "markdown",
"id": "35ac77b1-449b-44f7-b8f3-3494d55c286e",
"metadata": {},
"source": [
">[Airbyte](https://github.com/airbytehq/airbyte) is a data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes. It has the largest catalog of ELT connectors to data warehouses and databases.\n",
"\n",
"A lot of source connectors are implemented using the [Airbyte CDK](https://docs.airbyte.com/connector-development/cdk-python/). This loader allows to run any of these connectors and return the data as documents."
]
},
{
"cell_type": "markdown",
"id": "3b06fbde",
"metadata": {},
"source": [
"## Installation"
]
},
{
"cell_type": "markdown",
"id": "e3e9dc79",
"metadata": {},
"source": [
"First, you need to install the `airbyte-cdk` python package."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d35e4e0",
"metadata": {},
"outputs": [],
"source": [
"#!pip install airbyte-cdk"
]
},
{
"cell_type": "markdown",
"id": "085aa658",
"metadata": {},
"source": [
"Then, either install an existing connector from the [Airbyte Github repository](https://github.com/airbytehq/airbyte/tree/master/airbyte-integrations/connectors) or create your own connector using the [Airbyte CDK](https://docs.airbyte.io/connector-development/connector-development).\n",
"\n",
"For example, to install the Github connector, run"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f6d04ef4",
"metadata": {},
"outputs": [],
"source": [
"#!pip install \"source_github@git+https://github.com/airbytehq/airbyte.git@master#subdirectory=airbyte-integrations/connectors/source-github\""
]
},
{
"cell_type": "markdown",
"id": "36069b74",
"metadata": {},
"source": [
"Some sources are also published as regular packages on PyPI"
]
},
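{
"cell_type": "code",
"execution_count": null,
"id": "7b3f2a9c",
"metadata": {},
"outputs": [],
"source": [
"# e.g. the Stripe source connector is distributed as a regular PyPI package (shown for illustration):\n",
"#!pip install airbyte-source-stripe"
]
},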
{
"cell_type": "markdown",
"id": "ae855210",
"metadata": {},
"source": [
"## Example"
]
},
{
"cell_type": "markdown",
"id": "02208f52",
"metadata": {},
"source": [
"Now you can create an `AirbyteCDKLoader` based on the imported source. It takes a `config` object that's passed to the connector. You also have to pick the stream you want to retrieve records from by name (`stream_name`). Check the connectors documentation page and spec definition for more information on the config object and available streams. For the Github connectors these are:\n",
"* [https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-github/source_github/spec.json](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-github/source_github/spec.json).\n",
"* [https://docs.airbyte.com/integrations/sources/github/](https://docs.airbyte.com/integrations/sources/github/)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "89a99e58",
"metadata": {},
"outputs": [],
"source": [
"\n",
"from langchain.document_loaders.airbyte import AirbyteCDKLoader\n",
"from source_github.source import SourceGithub # plug in your own source here\n",
"\n",
"config = {\n",
" # your github configuration\n",
" \"credentials\": {\n",
" \"api_url\": \"api.github.com\",\n",
" \"personal_access_token\": \"<token>\"\n",
" },\n",
" \"repository\": \"<repo>\",\n",
" \"start_date\": \"<date from which to start retrieving records from in ISO format, e.g. 2020-10-20T00:00:00Z>\"\n",
"}\n",
"\n",
"issues_loader = AirbyteCDKLoader(source_class=SourceGithub, config=config, stream_name=\"issues\")"
]
},
{
"cell_type": "markdown",
"id": "2cea23fc",
"metadata": {},
"source": [
"Now you can load documents the usual way"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dae75cdb",
"metadata": {},
"outputs": [],
"source": [
"docs = issues_loader.load()"
]
},
{
"cell_type": "markdown",
"id": "4a93dc2a",
"metadata": {},
"source": [
"As `load` returns a list, it will block until all documents are loaded. To have better control over this process, you can also you the `lazy_load` method which returns an iterator instead:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1782db09",
"metadata": {},
"outputs": [],
"source": [
"docs_iterator = issues_loader.lazy_load()"
]
},
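{
"cell_type": "markdown",
"id": "5d41402a",
"metadata": {},
"source": [
"Iterating over the returned iterator pulls documents one at a time, so you can start processing before the whole stream has been read. A minimal sketch (the loop body is a placeholder):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8f14e45f",
"metadata": {},
"outputs": [],
"source": [
"for doc in docs_iterator:\n",
"    # each item is a Document with the record fields in `metadata`;\n",
"    # replace this placeholder with your own processing logic\n",
"    pass"
]
},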
{
"cell_type": "markdown",
"id": "3a124086",
"metadata": {},
"source": [
"Keep in mind that by default the page content is empty and the metadata object contains all the information from the record. To create documents in a different, pass in a record_handler function when creating the loader:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5671395d",
"metadata": {},
"outputs": [],
"source": [
"from langchain.docstore.document import Document\n",
"\n",
"def handle_record(record, id):\n",
" return Document(page_content=record.data[\"title\"] + \"\\n\" + (record.data[\"body\"] or \"\"), metadata=record.data)\n",
"\n",
"issues_loader = AirbyteCDKLoader(source_class=SourceGithub, config=config, stream_name=\"issues\", record_handler=handle_record)\n",
"\n",
"docs = issues_loader.load()"
]
},
{
"cell_type": "markdown",
"id": "223eb8bc",
"metadata": {},
"source": [
"## Incremental loads\n",
"\n",
"Some streams allow incremental loading, this means the source keeps track of synced records and won't load them again. This is useful for sources that have a high volume of data and are updated frequently.\n",
"\n",
"To take advantage of this, store the `last_state` property of the loader and pass it in when creating the loader again. This will ensure that only new records are loaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7061e735",
"metadata": {},
"outputs": [],
"source": [
"last_state = issues_loader.last_state # store safely\n",
"\n",
"incremental_issue_loader = AirbyteCDKLoader(source_class=SourceGithub, config=config, stream_name=\"issues\", state=last_state)\n",
"\n",
"new_docs = incremental_issue_loader.load()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,206 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "1f3a5ebf",
"metadata": {},
"source": [
"# Airbyte Gong"
]
},
{
"cell_type": "markdown",
"id": "35ac77b1-449b-44f7-b8f3-3494d55c286e",
"metadata": {},
"source": [
">[Airbyte](https://github.com/airbytehq/airbyte) is a data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes. It has the largest catalog of ELT connectors to data warehouses and databases.\n",
"\n",
"This loader exposes the Gong connector as a document loader, allowing you to load various Gong objects as documents."
]
},
{
"cell_type": "markdown",
"id": "6847a40c",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"id": "3b06fbde",
"metadata": {},
"source": [
"## Installation"
]
},
{
"cell_type": "markdown",
"id": "e3e9dc79",
"metadata": {},
"source": [
"First, you need to install the `airbyte-source-gong` python package."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d35e4e0",
"metadata": {},
"outputs": [],
"source": [
"#!pip install airbyte-source-gong"
]
},
{
"cell_type": "markdown",
"id": "ae855210",
"metadata": {},
"source": [
"## Example"
]
},
{
"cell_type": "markdown",
"id": "02208f52",
"metadata": {},
"source": [
"Check out the [Airbyte documentation page](https://docs.airbyte.com/integrations/sources/gong/) for details about how to configure the reader.\n",
"The JSON schema the config object should adhere to can be found on Github: [https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-gong/source_gong/spec.yaml](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-gong/source_gong/spec.yaml).\n",
"\n",
"The general shape looks like this:\n",
"```python\n",
"{\n",
" \"access_key\": \"<access key name>\",\n",
" \"access_key_secret\": \"<access key secret>\",\n",
" \"start_date\": \"<date from which to start retrieving records from in ISO format, e.g. 2020-10-20T00:00:00Z>\",\n",
"}\n",
"```\n",
"\n",
"By default all fields are stored as metadata in the documents and the text is set to an empty string. Construct the text of the document by transforming the documents returned by the reader."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "89a99e58",
"metadata": {},
"outputs": [],
"source": [
"\n",
"from langchain.document_loaders.airbyte import AirbyteGongLoader\n",
"\n",
"config = {\n",
" # your gong configuration\n",
"}\n",
"\n",
"loader = AirbyteGongLoader(config=config, stream_name=\"calls\") # check the documentation linked above for a list of all streams"
]
},
{
"cell_type": "markdown",
"id": "2cea23fc",
"metadata": {},
"source": [
"Now you can load documents the usual way"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dae75cdb",
"metadata": {},
"outputs": [],
"source": [
"docs = loader.load()"
]
},
{
"cell_type": "markdown",
"id": "4a93dc2a",
"metadata": {},
"source": [
"As `load` returns a list, it will block until all documents are loaded. To have better control over this process, you can also you the `lazy_load` method which returns an iterator instead:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1782db09",
"metadata": {},
"outputs": [],
"source": [
"docs_iterator = loader.lazy_load()"
]
},
{
"cell_type": "markdown",
"id": "3a124086",
"metadata": {},
"source": [
"Keep in mind that by default the page content is empty and the metadata object contains all the information from the record. To process documents, create a class inheriting from the base loader and implement the `_handle_records` method yourself:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5671395d",
"metadata": {},
"outputs": [],
"source": [
"from langchain.docstore.document import Document\n",
"\n",
"def handle_record(record, id):\n",
" return Document(page_content=record.data[\"title\"], metadata=record.data)\n",
"\n",
"loader = AirbyteGongLoader(config=config, record_handler=handle_record, stream_name=\"calls\")\n",
"docs = loader.load()"
]
},
{
"cell_type": "markdown",
"id": "223eb8bc",
"metadata": {},
"source": [
"## Incremental loads\n",
"\n",
"Some streams allow incremental loading, this means the source keeps track of synced records and won't load them again. This is useful for sources that have a high volume of data and are updated frequently.\n",
"\n",
"To take advantage of this, store the `last_state` property of the loader and pass it in when creating the loader again. This will ensure that only new records are loaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7061e735",
"metadata": {},
"outputs": [],
"source": [
"last_state = loader.last_state # store safely\n",
"\n",
"incremental_loader = AirbyteGongLoader(config=config, stream_name=\"calls\", state=last_state)\n",
"\n",
"new_docs = incremental_loader.load()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,208 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "1f3a5ebf",
"metadata": {},
"source": [
"# Airbyte Hubspot"
]
},
{
"cell_type": "markdown",
"id": "35ac77b1-449b-44f7-b8f3-3494d55c286e",
"metadata": {},
"source": [
">[Airbyte](https://github.com/airbytehq/airbyte) is a data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes. It has the largest catalog of ELT connectors to data warehouses and databases.\n",
"\n",
"This loader exposes the Hubspot connector as a document loader, allowing you to load various Hubspot objects as documents."
]
},
{
"cell_type": "markdown",
"id": "6847a40c",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"id": "3b06fbde",
"metadata": {},
"source": [
"## Installation"
]
},
{
"cell_type": "markdown",
"id": "e3e9dc79",
"metadata": {},
"source": [
"First, you need to install the `airbyte-source-hubspot` python package."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d35e4e0",
"metadata": {},
"outputs": [],
"source": [
"#!pip install airbyte-source-hubspot"
]
},
{
"cell_type": "markdown",
"id": "ae855210",
"metadata": {},
"source": [
"## Example"
]
},
{
"cell_type": "markdown",
"id": "02208f52",
"metadata": {},
"source": [
"Check out the [Airbyte documentation page](https://docs.airbyte.com/integrations/sources/hubspot/) for details about how to configure the reader.\n",
"The JSON schema the config object should adhere to can be found on Github: [https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-hubspot/source_hubspot/spec.yaml](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-hubspot/source_hubspot/spec.yaml).\n",
"\n",
"The general shape looks like this:\n",
"```python\n",
"{\n",
" \"start_date\": \"<date from which to start retrieving records from in ISO format, e.g. 2020-10-20T00:00:00Z>\",\n",
" \"credentials\": {\n",
" \"credentials_title\": \"Private App Credentials\",\n",
" \"access_token\": \"<access token of your private app>\"\n",
" }\n",
"}\n",
"```\n",
"\n",
"By default all fields are stored as metadata in the documents and the text is set to an empty string. Construct the text of the document by transforming the documents returned by the reader."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "89a99e58",
"metadata": {},
"outputs": [],
"source": [
"\n",
"from langchain.document_loaders.airbyte import AirbyteHubspotLoader\n",
"\n",
"config = {\n",
" # your hubspot configuration\n",
"}\n",
"\n",
"loader = AirbyteHubspotLoader(config=config, stream_name=\"products\") # check the documentation linked above for a list of all streams"
]
},
{
"cell_type": "markdown",
"id": "2cea23fc",
"metadata": {},
"source": [
"Now you can load documents the usual way"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dae75cdb",
"metadata": {},
"outputs": [],
"source": [
"docs = loader.load()"
]
},
{
"cell_type": "markdown",
"id": "4a93dc2a",
"metadata": {},
"source": [
"As `load` returns a list, it will block until all documents are loaded. To have better control over this process, you can also you the `lazy_load` method which returns an iterator instead:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1782db09",
"metadata": {},
"outputs": [],
"source": [
"docs_iterator = loader.lazy_load()"
]
},
{
"cell_type": "markdown",
"id": "3a124086",
"metadata": {},
"source": [
"Keep in mind that by default the page content is empty and the metadata object contains all the information from the record. To process documents, create a class inheriting from the base loader and implement the `_handle_records` method yourself:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5671395d",
"metadata": {},
"outputs": [],
"source": [
"from langchain.docstore.document import Document\n",
"\n",
"def handle_record(record, id):\n",
" return Document(page_content=record.data[\"title\"], metadata=record.data)\n",
"\n",
"loader = AirbyteHubspotLoader(config=config, record_handler=handle_record, stream_name=\"products\")\n",
"docs = loader.load()"
]
},
{
"cell_type": "markdown",
"id": "223eb8bc",
"metadata": {},
"source": [
"## Incremental loads\n",
"\n",
"Some streams allow incremental loading, this means the source keeps track of synced records and won't load them again. This is useful for sources that have a high volume of data and are updated frequently.\n",
"\n",
"To take advantage of this, store the `last_state` property of the loader and pass it in when creating the loader again. This will ensure that only new records are loaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7061e735",
"metadata": {},
"outputs": [],
"source": [
"last_state = loader.last_state # store safely\n",
"\n",
"incremental_loader = AirbyteHubspotLoader(config=config, stream_name=\"products\", state=last_state)\n",
"\n",
"new_docs = incremental_loader.load()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,213 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "1f3a5ebf",
"metadata": {},
"source": [
"# Airbyte Salesforce"
]
},
{
"cell_type": "markdown",
"id": "35ac77b1-449b-44f7-b8f3-3494d55c286e",
"metadata": {},
"source": [
">[Airbyte](https://github.com/airbytehq/airbyte) is a data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes. It has the largest catalog of ELT connectors to data warehouses and databases.\n",
"\n",
"This loader exposes the Salesforce connector as a document loader, allowing you to load various Salesforce objects as documents."
]
},
{
"cell_type": "markdown",
"id": "6847a40c",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"id": "3b06fbde",
"metadata": {},
"source": [
"## Installation"
]
},
{
"cell_type": "markdown",
"id": "e3e9dc79",
"metadata": {},
"source": [
"First, you need to install the `airbyte-source-salesforce` python package."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d35e4e0",
"metadata": {},
"outputs": [],
"source": [
"#!pip install airbyte-source-salesforce"
]
},
{
"cell_type": "markdown",
"id": "ae855210",
"metadata": {},
"source": [
"## Example"
]
},
{
"cell_type": "markdown",
"id": "02208f52",
"metadata": {},
"source": [
"Check out the [Airbyte documentation page](https://docs.airbyte.com/integrations/sources/salesforce/) for details about how to configure the reader.\n",
"The JSON schema the config object should adhere to can be found on Github: [https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-salesforce/source_salesforce/spec.yaml](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-salesforce/source_salesforce/spec.yaml).\n",
"\n",
"The general shape looks like this:\n",
"```python\n",
"{\n",
" \"client_id\": \"<oauth client id>\",\n",
" \"client_secret\": \"<oauth client secret>\",\n",
" \"refresh_token\": \"<oauth refresh token>\",\n",
" \"start_date\": \"<date from which to start retrieving records from in ISO format, e.g. 2020-10-20T00:00:00Z>\",\n",
" \"is_sandbox\": False, # set to True if you're using a sandbox environment\n",
" \"streams_criteria\": [ # Array of filters for salesforce objects that should be loadable\n",
" {\"criteria\": \"exacts\", \"value\": \"Account\"}, # Exact name of salesforce object\n",
" {\"criteria\": \"starts with\", \"value\": \"Asset\"}, # Prefix of the name\n",
" # Other allowed criteria: ends with, contains, starts not with, ends not with, not contains, not exacts\n",
" ],\n",
"}\n",
"```\n",
"\n",
"By default all fields are stored as metadata in the documents and the text is set to an empty string. Construct the text of the document by transforming the documents returned by the reader."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "89a99e58",
"metadata": {},
"outputs": [],
"source": [
"\n",
"from langchain.document_loaders.airbyte import AirbyteSalesforceLoader\n",
"\n",
"config = {\n",
" # your salesforce configuration\n",
"}\n",
"\n",
"loader = AirbyteSalesforceLoader(config=config, stream_name=\"asset\") # check the documentation linked above for a list of all streams"
]
},
{
"cell_type": "markdown",
"id": "2cea23fc",
"metadata": {},
"source": [
"Now you can load documents the usual way"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dae75cdb",
"metadata": {},
"outputs": [],
"source": [
"docs = loader.load()"
]
},
{
"cell_type": "markdown",
"id": "4a93dc2a",
"metadata": {},
"source": [
"As `load` returns a list, it will block until all documents are loaded. To have better control over this process, you can also you the `lazy_load` method which returns an iterator instead:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1782db09",
"metadata": {},
"outputs": [],
"source": [
"docs_iterator = loader.lazy_load()"
]
},
{
"cell_type": "markdown",
"id": "3a124086",
"metadata": {},
"source": [
"Keep in mind that by default the page content is empty and the metadata object contains all the information from the record. To create documents in a different, pass in a record_handler function when creating the loader:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5671395d",
"metadata": {},
"outputs": [],
"source": [
"from langchain.docstore.document import Document\n",
"\n",
"def handle_record(record, id):\n",
" return Document(page_content=record.data[\"title\"], metadata=record.data)\n",
"\n",
"loader = AirbyteSalesforceLoader(config=config, record_handler=handle_record, stream_name=\"asset\")\n",
"docs = loader.load()"
]
},
{
"cell_type": "markdown",
"id": "223eb8bc",
"metadata": {},
"source": [
"## Incremental loads\n",
"\n",
"Some streams allow incremental loading, this means the source keeps track of synced records and won't load them again. This is useful for sources that have a high volume of data and are updated frequently.\n",
"\n",
"To take advantage of this, store the `last_state` property of the loader and pass it in when creating the loader again. This will ensure that only new records are loaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7061e735",
"metadata": {},
"outputs": [],
"source": [
"last_state = loader.last_state # store safely\n",
"\n",
"incremental_loader = AirbyteSalesforceLoader(config=config, stream_name=\"asset\", state=last_state)\n",
"\n",
"new_docs = incremental_loader.load()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,209 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "1f3a5ebf",
"metadata": {},
"source": [
"# Airbyte Shopify"
]
},
{
"cell_type": "markdown",
"id": "35ac77b1-449b-44f7-b8f3-3494d55c286e",
"metadata": {},
"source": [
">[Airbyte](https://github.com/airbytehq/airbyte) is a data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes. It has the largest catalog of ELT connectors to data warehouses and databases.\n",
"\n",
"This loader exposes the Shopify connector as a document loader, allowing you to load various Shopify objects as documents."
]
},
{
"cell_type": "markdown",
"id": "6847a40c",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"id": "3b06fbde",
"metadata": {},
"source": [
"## Installation"
]
},
{
"cell_type": "markdown",
"id": "e3e9dc79",
"metadata": {},
"source": [
"First, you need to install the `airbyte-source-shopify` python package."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d35e4e0",
"metadata": {},
"outputs": [],
"source": [
"#!pip install airbyte-source-shopify"
]
},
{
"cell_type": "markdown",
"id": "ae855210",
"metadata": {},
"source": [
"## Example"
]
},
{
"cell_type": "markdown",
"id": "02208f52",
"metadata": {},
"source": [
"Check out the [Airbyte documentation page](https://docs.airbyte.com/integrations/sources/shopify/) for details about how to configure the reader.\n",
"The JSON schema the config object should adhere to can be found on Github: [https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-shopify/source_shopify/spec.json](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-shopify/source_shopify/spec.json).\n",
"\n",
"The general shape looks like this:\n",
"```python\n",
"{\n",
" \"start_date\": \"<date from which to start retrieving records from in ISO format, e.g. 2020-10-20T00:00:00Z>\",\n",
" \"shop\": \"<name of the shop you want to retrieve documents from>\",\n",
" \"credentials\": {\n",
" \"auth_method\": \"api_password\",\n",
" \"api_password\": \"<your api password>\"\n",
" }\n",
"}\n",
"```\n",
"\n",
"By default all fields are stored as metadata in the documents and the text is set to an empty string. Construct the text of the document by transforming the documents returned by the reader."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "89a99e58",
"metadata": {},
"outputs": [],
"source": [
"\n",
"from langchain.document_loaders.airbyte import AirbyteShopifyLoader\n",
"\n",
"config = {\n",
" # your shopify configuration\n",
"}\n",
"\n",
"loader = AirbyteShopifyLoader(config=config, stream_name=\"orders\") # check the documentation linked above for a list of all streams"
]
},
{
"cell_type": "markdown",
"id": "2cea23fc",
"metadata": {},
"source": [
"Now you can load documents the usual way"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dae75cdb",
"metadata": {},
"outputs": [],
"source": [
"docs = loader.load()"
]
},
{
"cell_type": "markdown",
"id": "4a93dc2a",
"metadata": {},
"source": [
"As `load` returns a list, it will block until all documents are loaded. To have better control over this process, you can also you the `lazy_load` method which returns an iterator instead:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1782db09",
"metadata": {},
"outputs": [],
"source": [
"docs_iterator = loader.lazy_load()"
]
},
{
"cell_type": "markdown",
"id": "3a124086",
"metadata": {},
"source": [
"Keep in mind that by default the page content is empty and the metadata object contains all the information from the record. To create documents in a different, pass in a record_handler function when creating the loader:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5671395d",
"metadata": {},
"outputs": [],
"source": [
"from langchain.docstore.document import Document\n",
"\n",
"def handle_record(record, id):\n",
" return Document(page_content=record.data[\"title\"], metadata=record.data)\n",
"\n",
"loader = AirbyteShopifyLoader(config=config, record_handler=handle_record, stream_name=\"orders\")\n",
"docs = loader.load()"
]
},
{
"cell_type": "markdown",
"id": "223eb8bc",
"metadata": {},
"source": [
"## Incremental loads\n",
"\n",
"Some streams allow incremental loading, this means the source keeps track of synced records and won't load them again. This is useful for sources that have a high volume of data and are updated frequently.\n",
"\n",
"To take advantage of this, store the `last_state` property of the loader and pass it in when creating the loader again. This will ensure that only new records are loaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7061e735",
"metadata": {},
"outputs": [],
"source": [
"last_state = loader.last_state # store safely\n",
"\n",
"incremental_loader = AirbyteShopifyLoader(config=config, stream_name=\"orders\", state=last_state)\n",
"\n",
"new_docs = incremental_loader.load()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,206 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "1f3a5ebf",
"metadata": {},
"source": [
"# Airbyte Stripe"
]
},
{
"cell_type": "markdown",
"id": "35ac77b1-449b-44f7-b8f3-3494d55c286e",
"metadata": {},
"source": [
">[Airbyte](https://github.com/airbytehq/airbyte) is a data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes. It has the largest catalog of ELT connectors to data warehouses and databases.\n",
"\n",
"This loader exposes the Stripe connector as a document loader, allowing you to load various Stripe objects as documents."
]
},
{
"cell_type": "markdown",
"id": "6847a40c",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"id": "3b06fbde",
"metadata": {},
"source": [
"## Installation"
]
},
{
"cell_type": "markdown",
"id": "e3e9dc79",
"metadata": {},
"source": [
"First, you need to install the `airbyte-source-stripe` python package."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d35e4e0",
"metadata": {},
"outputs": [],
"source": [
"#!pip install airbyte-source-stripe"
]
},
{
"cell_type": "markdown",
"id": "ae855210",
"metadata": {},
"source": [
"## Example"
]
},
{
"cell_type": "markdown",
"id": "02208f52",
"metadata": {},
"source": [
"Check out the [Airbyte documentation page](https://docs.airbyte.com/integrations/sources/stripe/) for details about how to configure the reader.\n",
"The JSON schema the config object should adhere to can be found on Github: [https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-stripe/source_stripe/spec.yaml](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-stripe/source_stripe/spec.yaml).\n",
"\n",
"The general shape looks like this:\n",
"```python\n",
"{\n",
" \"client_secret\": \"<secret key>\",\n",
" \"account_id\": \"<account id>\",\n",
" \"start_date\": \"<date from which to start retrieving records from in ISO format, e.g. 2020-10-20T00:00:00Z>\",\n",
"}\n",
"```\n",
"\n",
"By default all fields are stored as metadata in the documents and the text is set to an empty string. Construct the text of the document by transforming the documents returned by the reader."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "89a99e58",
"metadata": {},
"outputs": [],
"source": [
"\n",
"from langchain.document_loaders.airbyte import AirbyteStripeLoader\n",
"\n",
"config = {\n",
" # your stripe configuration\n",
"}\n",
"\n",
"loader = AirbyteStripeLoader(config=config, stream_name=\"invoices\") # check the documentation linked above for a list of all streams"
]
},
{
"cell_type": "markdown",
"id": "2cea23fc",
"metadata": {},
"source": [
"Now you can load documents the usual way"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dae75cdb",
"metadata": {},
"outputs": [],
"source": [
"docs = loader.load()"
]
},
{
"cell_type": "markdown",
"id": "4a93dc2a",
"metadata": {},
"source": [
"As `load` returns a list, it will block until all documents are loaded. To have better control over this process, you can also you the `lazy_load` method which returns an iterator instead:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1782db09",
"metadata": {},
"outputs": [],
"source": [
"docs_iterator = loader.lazy_load()"
]
},
{
"cell_type": "markdown",
"id": "3a124086",
"metadata": {},
"source": [
"Keep in mind that by default the page content is empty and the metadata object contains all the information from the record. To create documents in a different, pass in a record_handler function when creating the loader:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5671395d",
"metadata": {},
"outputs": [],
"source": [
"from langchain.docstore.document import Document\n",
"\n",
"def handle_record(record, id):\n",
" return Document(page_content=record.data[\"title\"], metadata=record.data)\n",
"\n",
"loader = AirbyteStripeLoader(config=config, record_handler=handle_record, stream_name=\"invoices\")\n",
"docs = loader.load()"
]
},
{
"cell_type": "markdown",
"id": "223eb8bc",
"metadata": {},
"source": [
"## Incremental loads\n",
"\n",
"Some streams allow incremental loading, this means the source keeps track of synced records and won't load them again. This is useful for sources that have a high volume of data and are updated frequently.\n",
"\n",
"To take advantage of this, store the `last_state` property of the loader and pass it in when creating the loader again. This will ensure that only new records are loaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7061e735",
"metadata": {},
"outputs": [],
"source": [
"last_state = loader.last_state # store safely\n",
"\n",
"incremental_loader = AirbyteStripeLoader(config=config, record_handler=handle_record, stream_name=\"invoices\", state=last_state)\n",
"\n",
"new_docs = incremental_loader.load()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,209 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "1f3a5ebf",
"metadata": {},
"source": [
"# Airbyte Typeform"
]
},
{
"cell_type": "markdown",
"id": "35ac77b1-449b-44f7-b8f3-3494d55c286e",
"metadata": {},
"source": [
">[Airbyte](https://github.com/airbytehq/airbyte) is a data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes. It has the largest catalog of ELT connectors to data warehouses and databases.\n",
"\n",
"This loader exposes the Typeform connector as a document loader, allowing you to load various Typeform objects as documents."
]
},
{
"cell_type": "markdown",
"id": "6847a40c",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"id": "3b06fbde",
"metadata": {},
"source": [
"## Installation"
]
},
{
"cell_type": "markdown",
"id": "e3e9dc79",
"metadata": {},
"source": [
"First, you need to install the `airbyte-source-typeform` python package."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d35e4e0",
"metadata": {},
"outputs": [],
"source": [
"#!pip install airbyte-source-typeform"
]
},
{
"cell_type": "markdown",
"id": "ae855210",
"metadata": {},
"source": [
"## Example"
]
},
{
"cell_type": "markdown",
"id": "02208f52",
"metadata": {},
"source": [
"Check out the [Airbyte documentation page](https://docs.airbyte.com/integrations/sources/typeform/) for details about how to configure the reader.\n",
"The JSON schema the config object should adhere to can be found on Github: [https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-typeform/source_typeform/spec.json](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-typeform/source_typeform/spec.json).\n",
"\n",
"The general shape looks like this:\n",
"```python\n",
"{\n",
" \"credentials\": {\n",
" \"auth_type\": \"Private Token\",\n",
" \"access_token\": \"<your auth token>\"\n",
" },\n",
" \"start_date\": \"<date from which to start retrieving records from in ISO format, e.g. 2020-10-20T00:00:00Z>\",\n",
" \"form_ids\": [\"<id of form to load records for>\"] # if omitted, records from all forms will be loaded\n",
"}\n",
"```\n",
"\n",
"By default all fields are stored as metadata in the documents and the text is set to an empty string. Construct the text of the document by transforming the documents returned by the reader."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "89a99e58",
"metadata": {},
"outputs": [],
"source": [
"\n",
"from langchain.document_loaders.airbyte import AirbyteTypeformLoader\n",
"\n",
"config = {\n",
" # your typeform configuration\n",
"}\n",
"\n",
"loader = AirbyteTypeformLoader(config=config, stream_name=\"forms\") # check the documentation linked above for a list of all streams"
]
},
{
"cell_type": "markdown",
"id": "2cea23fc",
"metadata": {},
"source": [
"Now you can load documents the usual way"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dae75cdb",
"metadata": {},
"outputs": [],
"source": [
"docs = loader.load()"
]
},
{
"cell_type": "markdown",
"id": "4a93dc2a",
"metadata": {},
"source": [
"As `load` returns a list, it will block until all documents are loaded. To have better control over this process, you can also you the `lazy_load` method which returns an iterator instead:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1782db09",
"metadata": {},
"outputs": [],
"source": [
"docs_iterator = loader.lazy_load()"
]
},
{
"cell_type": "markdown",
"id": "3a124086",
"metadata": {},
"source": [
"Keep in mind that by default the page content is empty and the metadata object contains all the information from the record. To create documents in a different, pass in a record_handler function when creating the loader:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5671395d",
"metadata": {},
"outputs": [],
"source": [
"from langchain.docstore.document import Document\n",
"\n",
"def handle_record(record, id):\n",
" return Document(page_content=record.data[\"title\"], metadata=record.data)\n",
"\n",
"loader = AirbyteTypeformLoader(config=config, record_handler=handle_record, stream_name=\"forms\")\n",
"docs = loader.load()"
]
},
{
"cell_type": "markdown",
"id": "223eb8bc",
"metadata": {},
"source": [
"## Incremental loads\n",
"\n",
"Some streams allow incremental loading, this means the source keeps track of synced records and won't load them again. This is useful for sources that have a high volume of data and are updated frequently.\n",
"\n",
"To take advantage of this, store the `last_state` property of the loader and pass it in when creating the loader again. This will ensure that only new records are loaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7061e735",
"metadata": {},
"outputs": [],
"source": [
"last_state = loader.last_state # store safely\n",
"\n",
"incremental_loader = AirbyteTypeformLoader(config=config, record_handler=handle_record, stream_name=\"forms\", state=last_state)\n",
"\n",
"new_docs = incremental_loader.load()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,210 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "1f3a5ebf",
"metadata": {},
"source": [
"# Airbyte Zendesk Support"
]
},
{
"cell_type": "markdown",
"id": "35ac77b1-449b-44f7-b8f3-3494d55c286e",
"metadata": {},
"source": [
">[Airbyte](https://github.com/airbytehq/airbyte) is a data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes. It has the largest catalog of ELT connectors to data warehouses and databases.\n",
"\n",
"This loader exposes the Zendesk Support connector as a document loader, allowing you to load various objects as documents."
]
},
{
"cell_type": "markdown",
"id": "6847a40c",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"id": "3b06fbde",
"metadata": {},
"source": [
"## Installation"
]
},
{
"cell_type": "markdown",
"id": "e3e9dc79",
"metadata": {},
"source": [
"First, you need to install the `airbyte-source-zendesk-support` python package."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d35e4e0",
"metadata": {},
"outputs": [],
"source": [
"#!pip install airbyte-source-zendesk-support"
]
},
{
"cell_type": "markdown",
"id": "ae855210",
"metadata": {},
"source": [
"## Example"
]
},
{
"cell_type": "markdown",
"id": "02208f52",
"metadata": {},
"source": [
"Check out the [Airbyte documentation page](https://docs.airbyte.com/integrations/sources/zendesk-support/) for details about how to configure the reader.\n",
"The JSON schema the config object should adhere to can be found on Github: [https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-zendesk-support/source_zendesk_support/spec.json](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-zendesk-support/source_zendesk_support/spec.json).\n",
"\n",
"The general shape looks like this:\n",
"```python\n",
"{\n",
" \"subdomain\": \"<your zendesk subdomain>\",\n",
" \"start_date\": \"<date from which to start retrieving records from in ISO format, e.g. 2020-10-20T00:00:00Z>\",\n",
" \"credentials\": {\n",
" \"credentials\": \"api_token\",\n",
" \"email\": \"<your email>\",\n",
" \"api_token\": \"<your api token>\"\n",
" }\n",
"}\n",
"```\n",
"\n",
"By default all fields are stored as metadata in the documents and the text is set to an empty string. Construct the text of the document by transforming the documents returned by the reader."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "89a99e58",
"metadata": {},
"outputs": [],
"source": [
"\n",
"from langchain.document_loaders.airbyte import AirbyteZendeskSupportLoader\n",
"\n",
"config = {\n",
" # your zendesk-support configuration\n",
"}\n",
"\n",
"loader = AirbyteZendeskSupportLoader(config=config, stream_name=\"tickets\") # check the documentation linked above for a list of all streams"
]
},
{
"cell_type": "markdown",
"id": "2cea23fc",
"metadata": {},
"source": [
"Now you can load documents the usual way"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dae75cdb",
"metadata": {},
"outputs": [],
"source": [
"docs = loader.load()"
]
},
{
"cell_type": "markdown",
"id": "4a93dc2a",
"metadata": {},
"source": [
"As `load` returns a list, it will block until all documents are loaded. To have better control over this process, you can also you the `lazy_load` method which returns an iterator instead:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1782db09",
"metadata": {},
"outputs": [],
"source": [
"docs_iterator = loader.lazy_load()"
]
},
{
"cell_type": "markdown",
"id": "3a124086",
"metadata": {},
"source": [
"Keep in mind that by default the page content is empty and the metadata object contains all the information from the record. To create documents in a different, pass in a record_handler function when creating the loader:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5671395d",
"metadata": {},
"outputs": [],
"source": [
"from langchain.docstore.document import Document\n",
"\n",
"def handle_record(record, id):\n",
" return Document(page_content=record.data[\"title\"], metadata=record.data)\n",
"\n",
"loader = AirbyteZendeskSupportLoader(config=config, record_handler=handle_record, stream_name=\"tickets\")\n",
"docs = loader.load()"
]
},
{
"cell_type": "markdown",
"id": "223eb8bc",
"metadata": {},
"source": [
"## Incremental loads\n",
"\n",
"Some streams allow incremental loading, this means the source keeps track of synced records and won't load them again. This is useful for sources that have a high volume of data and are updated frequently.\n",
"\n",
"To take advantage of this, store the `last_state` property of the loader and pass it in when creating the loader again. This will ensure that only new records are loaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7061e735",
"metadata": {},
"outputs": [],
"source": [
"last_state = loader.last_state # store safely\n",
"\n",
"incremental_loader = AirbyteZendeskSupportLoader(config=config, stream_name=\"tickets\", state=last_state)\n",
"\n",
"new_docs = incremental_loader.load()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -73,13 +73,27 @@
"loader.load()"
]
},
{
"cell_type": "markdown",
"id": "41c8a46f",
"metadata": {},
"source": [
"If you want to use an alternative loader, you can provide a custom function, for example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eba3002d",
"metadata": {},
"outputs": [],
"source": []
"source": [
"from langchain.document_loaders import PyPDFLoader\n",
"def load_pdf(file_path):\n",
" return PyPDFLoader(file_path)\n",
"\n",
"loader = GCSFileLoader(project_name=\"aist\", bucket=\"testing-hwc\", blob=\"fake.pdf\", loader_func=load_pdf)"
]
}
],
"metadata": {


@@ -0,0 +1,139 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "3df0dcf8",
"metadata": {},
"source": [
"# PubMed\n",
"\n",
">[PubMed®](https://pubmed.ncbi.nlm.nih.gov/) by `The National Center for Biotechnology Information, National Library of Medicine` comprises more than 35 million citations for biomedical literature from `MEDLINE`, life science journals, and online books. Citations may include links to full text content from `PubMed Central` and publisher web sites."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "aecaff63",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import PubMedLoader"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "f2f7e8d3",
"metadata": {},
"outputs": [],
"source": [
"loader = PubMedLoader(\"chatgpt\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "ed115aa1",
"metadata": {},
"outputs": [],
"source": [
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "b68d3264-b893-45e4-8ab0-077b25a586dc",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(docs)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "9f4626d2-068d-4aed-9ffe-ad754ad4b4cd",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'uid': '37548997',\n",
" 'Title': 'Performance of ChatGPT on the Situational Judgement Test-A Professional Dilemmas-Based Examination for Doctors in the United Kingdom.',\n",
" 'Published': '2023-08-07',\n",
" 'Copyright Information': '©Robin J Borchert, Charlotte R Hickman, Jack Pepys, Timothy J Sadler. Originally published in JMIR Medical Education (https://mededu.jmir.org), 07.08.2023.'}"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[1].metadata"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "8000f687-b500-4cce-841b-70d6151304da",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"BACKGROUND: ChatGPT is a large language model that has performed well on professional examinations in the fields of medicine, law, and business. However, it is unclear how ChatGPT would perform on an examination assessing professionalism and situational judgement for doctors.\\nOBJECTIVE: We evaluated the performance of ChatGPT on the Situational Judgement Test (SJT): a national examination taken by all final-year medical students in the United Kingdom. This examination is designed to assess attributes such as communication, teamwork, patient safety, prioritization skills, professionalism, and ethics.\\nMETHODS: All questions from the UK Foundation Programme Office's (UKFPO's) 2023 SJT practice examination were inputted into ChatGPT. For each question, ChatGPT's answers and rationales were recorded and assessed on the basis of the official UK Foundation Programme Office scoring template. Questions were categorized into domains of Good Medical Practice on the basis of the domains referenced in the rationales provided in the scoring sheet. Questions without clear domain links were screened by reviewers and assigned one or multiple domains. ChatGPT's overall performance, as well as its performance across the domains of Good Medical Practice, was evaluated.\\nRESULTS: Overall, ChatGPT performed well, scoring 76% on the SJT but scoring full marks on only a few questions (9%), which may reflect possible flaws in ChatGPT's situational judgement or inconsistencies in the reasoning across questions (or both) in the examination itself. ChatGPT demonstrated consistent performance across the 4 outlined domains in Good Medical Practice for doctors.\\nCONCLUSIONS: Further research is needed to understand the potential applications of large language models, such as ChatGPT, in medical education for standardizing questions and providing consistent rationales for examinations assessing professionalism and ethics.\""
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[1].page_content"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1070e571-697d-4c33-9a4f-0b2dd6909629",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -9,7 +9,7 @@
"\n",
"We may want to process load all URLs under a root directory.\n",
"\n",
"For example, let's look at the [LangChain JS documentation](https://js.langchain.com/docs/).\n",
"For example, let's look at the [Python 3.9 Document](https://docs.python.org/3.9/).\n",
"\n",
"This has many interesting child pages that we may want to read in bulk.\n",
"\n",
@@ -19,13 +19,28 @@
" \n",
"We do this using the `RecursiveUrlLoader`.\n",
"\n",
"This also gives us the flexibility to exclude some children (e.g., the `api` directory with > 800 child pages)."
"This also gives us the flexibility to exclude some children, customize the extractor, and more."
]
},
{
"cell_type": "markdown",
"id": "1be8094f",
"metadata": {},
"source": [
"# Parameters\n",
"- url: str, the target url to crawl.\n",
"- exclude_dirs: Optional[str], webpage directories to exclude.\n",
"- use_async: Optional[bool], wether to use async requests, using async requests is usually faster in large tasks. However, async will disable the lazy loading feature(the function still works, but it is not lazy). By default, it is set to False.\n",
"- extractor: Optional[Callable[[str], str]], a function to extract the text of the document from the webpage, by default it returns the page as it is. It is recommended to use tools like goose3 and beautifulsoup to extract the text. By default, it just returns the page as it is.\n",
"- max_depth: Optional[int] = None, the maximum depth to crawl. By default, it is set to 2. If you need to crawl the whole website, set it to a number that is large enough would simply do the job.\n",
"- timeout: Optional[int] = None, the timeout for each request, in the unit of seconds. By default, it is set to 10.\n",
"- prevent_outside: Optional[bool] = None, whether to prevent crawling outside the root url. By default, it is set to True."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "2e3532b2",
"execution_count": null,
"id": "23c18539",
"metadata": {},
"outputs": [],
"source": [
@@ -42,13 +57,15 @@
},
{
"cell_type": "code",
"execution_count": 2,
"id": "d69e5620",
"execution_count": null,
"id": "55394afe",
"metadata": {},
"outputs": [],
"source": [
"url = \"https://js.langchain.com/docs/modules/memory/examples/\"\n",
"loader = RecursiveUrlLoader(url=url)\n",
"from bs4 import BeautifulSoup as Soup\n",
"\n",
"url = \"https://docs.python.org/3.9/\"\n",
"loader = RecursiveUrlLoader(url=url, max_depth=2, extractor=lambda x: Soup(x, \"html.parser\").text)\n",
"docs = loader.load()"
]
},
@@ -61,7 +78,7 @@
{
"data": {
"text/plain": [
"12"
"'\\n\\n\\n\\n\\nPython Frequently Asked Questions — Python 3.'"
]
},
"execution_count": 3,
@@ -70,19 +87,21 @@
}
],
"source": [
"len(docs)"
"docs[0].page_content[:50]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "89355b7c",
"id": "13bd7e16",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'\\n\\n\\n\\n\\nBuffer Window Memory | 🦜️🔗 Langchain\\n\\n\\n\\n\\n\\nSki'"
"{'source': 'https://docs.python.org/3.9/library/index.html',\n",
" 'title': 'The Python Standard Library — Python 3.9.17 documentation',\n",
" 'language': None}"
]
},
"execution_count": 4,
@@ -91,137 +110,48 @@
}
],
"source": [
"docs[0].page_content[:50]"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "13bd7e16",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'source': 'https://js.langchain.com/docs/modules/memory/examples/buffer_window_memory',\n",
" 'title': 'Buffer Window Memory | 🦜️🔗 Langchain',\n",
" 'description': 'BufferWindowMemory keeps track of the back-and-forths in conversation, and then uses a window of size k to surface the last k back-and-forths to use as memory.',\n",
" 'language': 'en'}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[0].metadata"
"docs[-1].metadata"
]
},
{
"cell_type": "markdown",
"id": "40fc13ef",
"id": "5866e5a6",
"metadata": {},
"source": [
"Now, let's try a more extensive example, the `docs` root dir.\n",
"\n",
"We will skip everything under `api`.\n",
"\n",
"For this, we can `lazy_load` each page as we crawl the tree, using `WebBaseLoader` to load each as we go."
"However, since it's hard to perform a perfect filter, you may still see some irrelevant results in the results. You can perform a filter on the returned documents by yourself, if it's needed. Most of the time, the returned results are good enough."
]
},
{
"cell_type": "markdown",
"id": "4ec8ecef",
"metadata": {},
"source": [
"Testing on LangChain docs."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5c938b9f",
"execution_count": 2,
"id": "349b5598",
"metadata": {},
"outputs": [],
"source": [
"url = \"https://js.langchain.com/docs/\"\n",
"exclude_dirs = [\"https://js.langchain.com/docs/api/\"]\n",
"loader = RecursiveUrlLoader(url=url, exclude_dirs=exclude_dirs)\n",
"# Lazy load each\n",
"docs = [print(doc) or doc for doc in loader.lazy_load()]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "30ff61d3",
"metadata": {},
"outputs": [],
"source": [
"# Load all pages\n",
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "457e30f3",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"188"
"8"
]
},
"execution_count": 8,
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url = \"https://js.langchain.com/docs/modules/memory/integrations/\"\n",
"loader = RecursiveUrlLoader(url=url)\n",
"docs = loader.load()\n",
"len(docs)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "bca80b4a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'\\n\\n\\n\\n\\nAgent Simulations | 🦜️🔗 Langchain\\n\\n\\n\\n\\n\\nSkip t'"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[0].page_content[:50]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "df97cf22",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'source': 'https://js.langchain.com/docs/use_cases/agent_simulations/',\n",
" 'title': 'Agent Simulations | 🦜️🔗 Langchain',\n",
" 'description': 'Agent simulations involve taking multiple agents and having them interact with each other.',\n",
" 'language': 'en'}"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[0].metadata"
]
}
],
"metadata": {

View File

@@ -0,0 +1,320 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "bda1f3f5",
"metadata": {},
"source": [
"# TensorFlow Datasets\n",
"\n",
">[TensorFlow Datasets](https://www.tensorflow.org/datasets) is a collection of datasets ready to use, with TensorFlow or other Python ML frameworks, such as Jax. All datasets are exposed as [tf.data.Datasets](https://www.tensorflow.org/api_docs/python/tf/data/Dataset), enabling easy-to-use and high-performance input pipelines. To get started see the [guide](https://www.tensorflow.org/datasets/overview) and the [list of datasets](https://www.tensorflow.org/datasets/catalog/overview#all_datasets).\n",
"\n",
"This notebook shows how to load `TensorFlow Datasets` into a Document format that we can use downstream."
]
},
{
"cell_type": "markdown",
"id": "1b7a1eef-7bf7-4e7d-8bfc-c4e27c9488cb",
"metadata": {},
"source": [
"## Installation"
]
},
{
"cell_type": "markdown",
"id": "2abd5578-aa3d-46b9-99af-8b262f0b3df8",
"metadata": {},
"source": [
"You need to install `tensorflow` and `tensorflow-datasets` python packages."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2e589036-351e-4c63-b734-c9a05fadb880",
"metadata": {},
"outputs": [],
"source": [
"!pip install tensorflow"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b674aaea-ed3a-4541-8414-260a8f67f623",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!pip install tensorflow-datasets"
]
},
{
"cell_type": "markdown",
"id": "95f05e1c-195e-4e2b-ae8e-8d6637f15be6",
"metadata": {},
"source": [
"## Example"
]
},
{
"cell_type": "markdown",
"id": "e66e211e-9419-4dbb-b3cd-afc3cf984305",
"metadata": {},
"source": [
"As an example, we use the [`mlqa/en` dataset](https://www.tensorflow.org/datasets/catalog/mlqa#mlqaen).\n",
"\n",
">`MLQA` (`Multilingual Question Answering Dataset`) is a benchmark dataset for evaluating multilingual question answering performance. The dataset consists of 7 languages: Arabic, German, Spanish, English, Hindi, Vietnamese, Chinese.\n",
">\n",
">- Homepage: https://github.com/facebookresearch/MLQA\n",
">- Source code: `tfds.datasets.mlqa.Builder`\n",
">- Download size: 72.21 MiB\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8968d645-c81c-4e3b-82bc-a3cbb5ddd93a",
"metadata": {},
"outputs": [],
"source": [
"# Feature structure of `mlqa/en` dataset:\n",
"\n",
"FeaturesDict({\n",
" 'answers': Sequence({\n",
" 'answer_start': int32,\n",
" 'text': Text(shape=(), dtype=string),\n",
" }),\n",
" 'context': Text(shape=(), dtype=string),\n",
" 'id': string,\n",
" 'question': Text(shape=(), dtype=string),\n",
" 'title': Text(shape=(), dtype=string),\n",
"})"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "30fcaba5-cc9b-4a0e-a8f4-c047018451c2",
"metadata": {},
"outputs": [],
"source": [
"import tensorflow as tf\n",
"import tensorflow_datasets as tfds"
]
},
{
"cell_type": "code",
"execution_count": 78,
"id": "e307dd67-029e-4ee3-a65f-e085c09b0b8b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<_TakeDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# try directly access this dataset:\n",
"ds = tfds.load('mlqa/en', split='test')\n",
"ds = ds.take(1) # Only take a single example\n",
"ds"
]
},
{
"cell_type": "markdown",
"id": "5c9c4b08-d94f-4b53-add0-93769811644e",
"metadata": {},
"source": [
"Now we have to create a custom function to convert dataset sample into a Document.\n",
"\n",
"This is a requirement. There is no standard format for the TF datasets that's why we need to make a custom transformation function.\n",
"\n",
"Let's use `context` field as the `Document.page_content` and place other fields in the `Document.metadata`.\n"
]
},
{
"cell_type": "code",
"execution_count": 72,
"id": "78844113-f8d8-48a8-8105-685280b6cfa5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='After completing the journey around South America, on 23 February 2006, Queen Mary 2 met her namesake, the original RMS Queen Mary, which is permanently docked at Long Beach, California. Escorted by a flotilla of smaller ships, the two Queens exchanged a \"whistle salute\" which was heard throughout the city of Long Beach. Queen Mary 2 met the other serving Cunard liners Queen Victoria and Queen Elizabeth 2 on 13 January 2008 near the Statue of Liberty in New York City harbour, with a celebratory fireworks display; Queen Elizabeth 2 and Queen Victoria made a tandem crossing of the Atlantic for the meeting. This marked the first time three Cunard Queens have been present in the same location. Cunard stated this would be the last time these three ships would ever meet, due to Queen Elizabeth 2\\'s impending retirement from service in late 2008. However this would prove not to be the case, as the three Queens met in Southampton on 22 April 2008. Queen Mary 2 rendezvoused with Queen Elizabeth 2 in Dubai on Saturday 21 March 2009, after the latter ship\\'s retirement, while both ships were berthed at Port Rashid. With the withdrawal of Queen Elizabeth 2 from Cunard\\'s fleet and its docking in Dubai, Queen Mary 2 became the only ocean liner left in active passenger service.' metadata={'id': '5116f7cccdbf614d60bcd23498274ffd7b1e4ec7', 'title': 'RMS Queen Mary 2', 'question': 'What year did Queen Mary 2 complete her journey around South America?', 'answer': '2006'}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"2023-08-03 14:27:08.482983: W tensorflow/core/kernels/data/cache_dataset_ops.cc:854] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.\n"
]
}
],
"source": [
"def decode_to_str(item: tf.Tensor) -> str:\n",
" return item.numpy().decode('utf-8')\n",
"\n",
"def mlqaen_example_to_document(example: dict) -> Document:\n",
" return Document(\n",
" page_content=decode_to_str(example[\"context\"]),\n",
" metadata={\n",
" \"id\": decode_to_str(example[\"id\"]),\n",
" \"title\": decode_to_str(example[\"title\"]),\n",
" \"question\": decode_to_str(example[\"question\"]),\n",
" \"answer\": decode_to_str(example[\"answers\"][\"text\"][0]),\n",
" },\n",
" )\n",
" \n",
" \n",
"for example in ds: \n",
" doc = mlqaen_example_to_document(example)\n",
" print(doc)\n",
" break"
]
},
{
"cell_type": "code",
"execution_count": 73,
"id": "2d43c834-5145-4793-9558-8e301ccaf3b4",
"metadata": {},
"outputs": [],
"source": [
"from langchain.schema import Document\n",
"from langchain.document_loaders import TensorflowDatasetLoader\n",
"\n",
"loader = TensorflowDatasetLoader(\n",
" dataset_name=\"mlqa/en\",\n",
" split_name=\"test\",\n",
" load_max_docs=3,\n",
" sample_to_document_function=mlqaen_example_to_document,\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "e29b954c-1407-4797-ae21-6ba8937156be",
"metadata": {},
"source": [
"`TensorflowDatasetLoader` has these parameters:\n",
"- `dataset_name`: the name of the dataset to load\n",
"- `split_name`: the name of the split to load. Defaults to \"train\".\n",
"- `load_max_docs`: a limit to the number of loaded documents. Defaults to 100.\n",
"- `sample_to_document_function`: a function that converts a dataset sample to a Document\n"
]
},
{
"cell_type": "code",
"execution_count": 74,
"id": "700e4ef2",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2023-08-03 14:27:22.998964: W tensorflow/core/kernels/data/cache_dataset_ops.cc:854] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.\n"
]
},
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs = loader.load()\n",
"len(docs)"
]
},
{
"cell_type": "code",
"execution_count": 76,
"id": "9138940a-e9fe-4145-83e8-77589b5272c9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'After completing the journey around South America, on 23 February 2006, Queen Mary 2 met her namesake, the original RMS Queen Mary, which is permanently docked at Long Beach, California. Escorted by a flotilla of smaller ships, the two Queens exchanged a \"whistle salute\" which was heard throughout the city of Long Beach. Queen Mary 2 met the other serving Cunard liners Queen Victoria and Queen Elizabeth 2 on 13 January 2008 near the Statue of Liberty in New York City harbour, with a celebratory fireworks display; Queen Elizabeth 2 and Queen Victoria made a tandem crossing of the Atlantic for the meeting. This marked the first time three Cunard Queens have been present in the same location. Cunard stated this would be the last time these three ships would ever meet, due to Queen Elizabeth 2\\'s impending retirement from service in late 2008. However this would prove not to be the case, as the three Queens met in Southampton on 22 April 2008. Queen Mary 2 rendezvoused with Queen Elizabeth 2 in Dubai on Saturday 21 March 2009, after the latter ship\\'s retirement, while both ships were berthed at Port Rashid. With the withdrawal of Queen Elizabeth 2 from Cunard\\'s fleet and its docking in Dubai, Queen Mary 2 became the only ocean liner left in active passenger service.'"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[0].page_content"
]
},
{
"cell_type": "code",
"execution_count": 77,
"id": "2f7f7832-fe4d-4a58-892d-bb987cdbed0b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'id': '5116f7cccdbf614d60bcd23498274ffd7b1e4ec7',\n",
" 'title': 'RMS Queen Mary 2',\n",
" 'question': 'What year did Queen Mary 2 complete her journey around South America?',\n",
" 'answer': '2006'}"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[0].metadata"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "125d073c-4f4f-4ae6-a0c7-9e9db3cc8d69",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,340 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Ollama\n",
"\n",
"[Ollama](https://ollama.ai/) allows you to run open-source large language models, such as Llama 2, locally.\n",
"\n",
"Ollama bundles model weights, configuration, and data into a single package, defined by a Modelfile. \n",
"\n",
"It optimizes setup and configuration details, including GPU usage.\n",
"\n",
"For a complete list of supported models and model variants, see the [Ollama model library](https://github.com/jmorganca/ollama#model-library).\n",
"\n",
"## Setup\n",
"\n",
"First, follow [these instructions](https://github.com/jmorganca/ollama) to set up and run a local Ollama instance:\n",
"\n",
"* [Download](https://ollama.ai/download)\n",
"* Fetch a model, e.g., `Llama-7b`: `ollama pull llama2`\n",
"* Run `ollama run llama2`\n",
"\n",
"\n",
"## Usage\n",
"\n",
"You can see a full list of supported parameters on the [API reference page](https://api.python.langchain.com/en/latest/llms/langchain.llms.ollama.Ollama.html)."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"from langchain.llms import Ollama\n",
"from langchain.callbacks.manager import CallbackManager\n",
"from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler \n",
"llm = Ollama(base_url=\"http://localhost:11434\", \n",
" model=\"llama2\", \n",
" callback_manager = CallbackManager([StreamingStdOutCallbackHandler()]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With `StreamingStdOutCallbackHandler`, you will see tokens streamed."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Great! The history of Artificial Intelligence (AI) is a fascinating and complex topic that spans several decades. Here's a brief overview:\n",
"\n",
"1. Early Years (1950s-1960s): The term \"Artificial Intelligence\" was coined in 1956 by computer scientist John McCarthy. However, the concept of AI dates back to ancient Greece, where mythical creatures like Talos and Hephaestus were created to perform tasks without any human intervention. In the 1950s and 1960s, researchers began exploring ways to replicate human intelligence using computers, leading to the development of simple AI programs like ELIZA (1966) and PARRY (1972).\n",
"2. Rule-Based Systems (1970s-1980s): As computing power increased, researchers developed rule-based systems, such as Mycin (1976), which could diagnose medical conditions based on a set of rules. This period also saw the rise of expert systems, like EDICT (1985), which mimicked human experts in specific domains.\n",
"3. Machine Learning (1990s-2000s): With the advent of big data and machine learning algorithms, AI evolved to include neural networks, decision trees, and other techniques for training models on large datasets. This led to the development of applications like speech recognition (e.g., Siri, Alexa), image recognition (e.g., Google Image Search), and natural language processing (e.g., chatbots).\n",
"4. Deep Learning (2010s-present): The rise of deep learning techniques, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), has enabled AI to perform complex tasks like image and speech recognition, natural language processing, and even autonomous driving. Companies like Google, Facebook, and Baidu have invested heavily in deep learning research, leading to breakthroughs in areas like facial recognition, object detection, and machine translation.\n",
"5. Current Trends (present-future): AI is currently being applied to various industries, including healthcare, finance, education, and entertainment. With the growth of cloud computing, edge AI, and autonomous systems, we can expect to see more sophisticated AI applications in the near future. However, there are also concerns about the ethical implications of AI, such as data privacy, algorithmic bias, and job displacement.\n",
"\n",
"Remember, AI has a long history, and its development is an ongoing process. As technology advances, we can expect to see even more innovative applications of AI in various fields."
]
},
{
"data": {
"text/plain": [
"'\\nGreat! The history of Artificial Intelligence (AI) is a fascinating and complex topic that spans several decades. Here\\'s a brief overview:\\n\\n1. Early Years (1950s-1960s): The term \"Artificial Intelligence\" was coined in 1956 by computer scientist John McCarthy. However, the concept of AI dates back to ancient Greece, where mythical creatures like Talos and Hephaestus were created to perform tasks without any human intervention. In the 1950s and 1960s, researchers began exploring ways to replicate human intelligence using computers, leading to the development of simple AI programs like ELIZA (1966) and PARRY (1972).\\n2. Rule-Based Systems (1970s-1980s): As computing power increased, researchers developed rule-based systems, such as Mycin (1976), which could diagnose medical conditions based on a set of rules. This period also saw the rise of expert systems, like EDICT (1985), which mimicked human experts in specific domains.\\n3. Machine Learning (1990s-2000s): With the advent of big data and machine learning algorithms, AI evolved to include neural networks, decision trees, and other techniques for training models on large datasets. This led to the development of applications like speech recognition (e.g., Siri, Alexa), image recognition (e.g., Google Image Search), and natural language processing (e.g., chatbots).\\n4. Deep Learning (2010s-present): The rise of deep learning techniques, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), has enabled AI to perform complex tasks like image and speech recognition, natural language processing, and even autonomous driving. Companies like Google, Facebook, and Baidu have invested heavily in deep learning research, leading to breakthroughs in areas like facial recognition, object detection, and machine translation.\\n5. Current Trends (present-future): AI is currently being applied to various industries, including healthcare, finance, education, and entertainment. With the growth of cloud computing, edge AI, and autonomous systems, we can expect to see more sophisticated AI applications in the near future. However, there are also concerns about the ethical implications of AI, such as data privacy, algorithmic bias, and job displacement.\\n\\nRemember, AI has a long history, and its development is an ongoing process. As technology advances, we can expect to see even more innovative applications of AI in various fields.'"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"llm(\"Tell me about the history of AI\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## RAG\n",
"\n",
"We can use Olama with RAG, [just as shown here](https://python.langchain.com/docs/use_cases/question_answering/how_to/local_retrieval_qa).\n",
"\n",
"Let's use the 13b model:\n",
"\n",
"```\n",
"ollama pull llama2:13b\n",
"ollama run llama2:13b \n",
"```\n",
"\n",
"Let's also use local embeddings from `GPT4AllEmbeddings` and `Chroma`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! pip install gpt4all chromadb"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import WebBaseLoader\n",
"loader = WebBaseLoader(\"https://lilianweng.github.io/posts/2023-06-23-agent/\")\n",
"data = loader.load()\n",
"\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)\n",
"all_splits = text_splitter.split_documents(data)"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Found model file at /Users/rlm/.cache/gpt4all/ggml-all-MiniLM-L6-v2-f16.bin\n"
]
}
],
"source": [
"from langchain.vectorstores import Chroma\n",
"from langchain.embeddings import GPT4AllEmbeddings\n",
"\n",
"vectorstore = Chroma.from_documents(documents=all_splits, embedding=GPT4AllEmbeddings())"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"question = \"What are the approaches to Task Decomposition?\"\n",
"docs = vectorstore.similarity_search(question)\n",
"len(docs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain import PromptTemplate\n",
"\n",
"# Prompt\n",
"template = \"\"\"Use the following pieces of context to answer the question at the end. \n",
"If you don't know the answer, just say that you don't know, don't try to make up an answer. \n",
"Use three sentences maximum and keep the answer as concise as possible. \n",
"{context}\n",
"Question: {question}\n",
"Helpful Answer:\"\"\"\n",
"QA_CHAIN_PROMPT = PromptTemplate(\n",
" input_variables=[\"context\", \"question\"],\n",
" template=template,\n",
")\n"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [],
"source": [
"# LLM\n",
"from langchain.llms import Ollama\n",
"from langchain.callbacks.manager import CallbackManager\n",
"from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler\n",
"llm = Ollama(base_url=\"http://localhost:11434\",\n",
" model=\"llama2\",\n",
" verbose=True,\n",
" callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]))"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [],
"source": [
"# QA chain\n",
"from langchain.chains import RetrievalQA\n",
"qa_chain = RetrievalQA.from_chain_type(\n",
" llm,\n",
" retriever=vectorstore.as_retriever(),\n",
" chain_type_kwargs={\"prompt\": QA_CHAIN_PROMPT},\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Task decomposition can be approached in different ways for AI agents, including:\n",
"\n",
"1. Using simple prompts like \"Steps for XYZ.\" or \"What are the subgoals for achieving XYZ?\" to guide the LLM.\n",
"2. Providing task-specific instructions, such as \"Write a story outline\" for writing a novel.\n",
"3. Utilizing human inputs to help the AI agent understand the task and break it down into smaller steps."
]
}
],
"source": [
"question = \"What are the various approaches to Task Decomposition for AI Agents?\"\n",
"result = qa_chain({\"query\": question})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also get logging for tokens."
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Task decomposition can be approached in three ways: (1) using simple prompting like \"Steps for XYZ.\\n1.\", \"What are the subgoals for achieving XYZ?\", (2) by using task-specific instructions, or (3) with human inputs.{'model': 'llama2', 'created_at': '2023-08-08T04:01:09.005367Z', 'done': True, 'context': [1, 29871, 1, 13, 9314, 14816, 29903, 6778, 13, 13, 3492, 526, 263, 8444, 29892, 3390, 1319, 322, 15993, 20255, 29889, 29849, 1234, 408, 1371, 3730, 408, 1950, 29892, 1550, 1641, 9109, 29889, 3575, 6089, 881, 451, 3160, 738, 10311, 1319, 29892, 443, 621, 936, 29892, 11021, 391, 29892, 7916, 391, 29892, 304, 27375, 29892, 18215, 29892, 470, 27302, 2793, 29889, 3529, 9801, 393, 596, 20890, 526, 5374, 635, 443, 5365, 1463, 322, 6374, 297, 5469, 29889, 13, 13, 3644, 263, 1139, 947, 451, 1207, 738, 4060, 29892, 470, 338, 451, 2114, 1474, 16165, 261, 296, 29892, 5649, 2020, 2012, 310, 22862, 1554, 451, 1959, 29889, 960, 366, 1016, 29915, 29873, 1073, 278, 1234, 304, 263, 1139, 29892, 3113, 1016, 29915, 29873, 6232, 2089, 2472, 29889, 13, 13, 29966, 829, 14816, 29903, 6778, 13, 13, 29961, 25580, 29962, 4803, 278, 1494, 12785, 310, 3030, 304, 1234, 278, 1139, 472, 278, 1095, 29889, 29871, 13, 3644, 366, 1016, 29915, 29873, 1073, 278, 1234, 29892, 925, 1827, 393, 366, 1016, 29915, 29873, 1073, 29892, 1016, 29915, 29873, 1018, 304, 1207, 701, 385, 1234, 29889, 29871, 13, 11403, 2211, 25260, 7472, 322, 3013, 278, 1234, 408, 3022, 895, 408, 1950, 29889, 29871, 13, 5398, 26227, 508, 367, 2309, 313, 29896, 29897, 491, 365, 26369, 411, 2560, 9508, 292, 763, 376, 7789, 567, 363, 1060, 29979, 29999, 7790, 29876, 29896, 19602, 376, 5618, 526, 278, 1014, 1484, 1338, 363, 3657, 15387, 1060, 29979, 29999, 29973, 613, 313, 29906, 29897, 491, 773, 3414, 29899, 14940, 11994, 29936, 321, 29889, 29887, 29889, 376, 6113, 263, 5828, 27887, 1213, 363, 5007, 263, 9554, 29892, 470, 313, 29941, 29897, 411, 5199, 10970, 29889, 13, 13, 5398, 26227, 508, 367, 2309, 313, 29896, 29897, 491, 365, 26369, 411, 2560, 9508, 292, 763, 376, 7789, 567, 363, 1060, 29979, 29999, 7790, 29876, 29896, 19602, 376, 5618, 526, 278, 1014, 1484, 1338, 363, 3657, 15387, 1060, 29979, 29999, 29973, 613, 313, 29906, 29897, 491, 773, 3414, 29899, 14940, 11994, 29936, 321, 29889, 29887, 29889, 376, 6113, 263, 5828, 27887, 1213, 363, 5007, 263, 9554, 29892, 470, 313, 29941, 29897, 411, 5199, 10970, 29889, 13, 13, 5398, 26227, 508, 367, 2309, 313, 29896, 29897, 491, 365, 26369, 411, 2560, 9508, 292, 763, 376, 7789, 567, 363, 1060, 29979, 29999, 7790, 29876, 29896, 19602, 376, 5618, 526, 278, 1014, 1484, 1338, 363, 3657, 15387, 1060, 29979, 29999, 29973, 613, 313, 29906, 29897, 491, 773, 3414, 29899, 14940, 11994, 29936, 321, 29889, 29887, 29889, 376, 6113, 263, 5828, 27887, 1213, 363, 5007, 263, 9554, 29892, 470, 313, 29941, 29897, 411, 5199, 10970, 29889, 13, 13, 1451, 16047, 267, 297, 1472, 29899, 8489, 18987, 322, 3414, 26227, 29901, 1858, 9450, 975, 263, 3309, 29891, 4955, 322, 17583, 3902, 8253, 278, 1650, 2913, 3933, 18066, 292, 29889, 365, 26369, 29879, 21117, 304, 10365, 13900, 746, 20050, 411, 15668, 4436, 29892, 3907, 963, 3109, 16424, 9401, 304, 25618, 1058, 5110, 515, 14260, 322, 1059, 29889, 13, 16492, 29901, 1724, 526, 278, 13501, 304, 9330, 897, 510, 3283, 29973, 13, 29648, 1319, 673, 29901, 518, 29914, 25580, 29962, 13, 5398, 26227, 508, 367, 26733, 297, 2211, 5837, 29901, 313, 29896, 29897, 773, 2560, 9508, 292, 763, 376, 7789, 567, 363, 1060, 29979, 29999, 7790, 29876, 29896, 19602, 376, 5618, 526, 278, 1014, 1484, 1338, 363, 
3657, 15387, 1060, 29979, 29999, 29973, 613, 313, 29906, 29897, 491, 773, 3414, 29899, 14940, 11994, 29892, 470, 313, 29941, 29897, 411, 5199, 10970, 29889, 2], 'total_duration': 1364428708, 'load_duration': 1246375, 'sample_count': 62, 'sample_duration': 44859000, 'prompt_eval_count': 1, 'eval_count': 62, 'eval_duration': 1313002000}\n"
]
}
],
"source": [
"from langchain.schema import LLMResult\n",
"from langchain.callbacks.base import BaseCallbackHandler\n",
"\n",
"class GenerationStatisticsCallback(BaseCallbackHandler):\n",
" def on_llm_end(self, response: LLMResult, **kwargs) -> None:\n",
" print(response.generations[0][0].generation_info)\n",
" \n",
"callback_manager = CallbackManager([StreamingStdOutCallbackHandler(), GenerationStatisticsCallback()])\n",
"\n",
"llm = Ollama(base_url=\"http://localhost:11434\",\n",
" model=\"llama2\",\n",
" verbose=True,\n",
" callback_manager=callback_manager)\n",
"\n",
"qa_chain = RetrievalQA.from_chain_type(\n",
" llm,\n",
" retriever=vectorstore.as_retriever(),\n",
" chain_type_kwargs={\"prompt\": QA_CHAIN_PROMPT},\n",
")\n",
"\n",
"question = \"What are the approaches to Task Decomposition?\"\n",
"result = qa_chain({\"query\": question})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`eval_count` / (`eval_duration`/10e9) gets `tok / s`"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"47.22003469910937"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"62 / (1313002000/1000/1000/1000)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -0,0 +1,106 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "9597802c",
"metadata": {},
"source": [
"# Nebula\n",
"\n",
"[Nebula](https://symbl.ai/nebula/) is a fully-managed Conversation platform, on which you can build, deploy, and manage scalable AI applications.\n",
"\n",
"This example goes over how to use LangChain to interact with the [Nebula platform](https://docs.symbl.ai/docs/nebula-llm-overview). \n",
"\n",
"It will send the requests to Nebula Service endpoint, which concatenates `SYMBLAI_NEBULA_SERVICE_URL` and `SYMBLAI_NEBULA_SERVICE_PATH`, with a token defined in `SYMBLAI_NEBULA_SERVICE_TOKEN`"
]
},
{
"cell_type": "markdown",
"id": "f15ebe0d",
"metadata": {},
"source": [
"### Integrate with a LLMChain"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5472a7cd-af26-48ca-ae9b-5f6ae73c74d2",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import os\n",
"\n",
"os.environ[\"NEBULA_SERVICE_URL\"] = NEBULA_SERVICE_URL\n",
"os.environ[\"NEBULA_SERVICE_PATH\"] = NEBULA_SERVICE_PATH\n",
"os.environ[\"NEBULA_SERVICE_API_KEY\"] = NEBULA_SERVICE_API_KEY"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6fb585dd",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.llms import OpenLLM\n",
"\n",
"llm = OpenLLM(\n",
" conversation=\"<Drop your text conversation that you want to ask Nebula to analyze here>\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "035dea0f",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain import PromptTemplate, LLMChain\n",
"\n",
"template = \"Identify the {count} main objectives or goals mentioned in this context concisely in less points. Emphasize on key intents.\"\n",
"\n",
"prompt = PromptTemplate(template=template, input_variables=[\"count\"])\n",
"\n",
"llm_chain = LLMChain(prompt=prompt, llm=llm)\n",
"\n",
"generated = llm_chain.run(count=\"five\")\n",
"print(generated)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.8"
},
"vscode": {
"interpreter": {
"hash": "a0a0263b650d907a3bfe41c0f8d6a63a071b884df3cfdc1579f00cdc1aed6b03"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,196 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "499c3142-2033-437d-a60a-731988ac6074",
"metadata": {},
"source": [
"# vLLM\n",
"\n",
"[vLLM](https://vllm.readthedocs.io/en/latest/index.html) is a fast and easy-to-use library for LLM inference and serving, offering:\n",
"* State-of-the-art serving throughput \n",
"* Efficient management of attention key and value memory with PagedAttention\n",
"* Continuous batching of incoming requests\n",
"* Optimized CUDA kernels\n",
"\n",
"This notebooks goes over how to use a LLM with langchain and vLLM.\n",
"\n",
"To use, you should have the `vllm` python package installed."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "8a3f2666-5c75-4797-967a-7915a247bf33",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"#!pip install vllm -q"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "84e350f7-21f6-455b-b1f0-8b0116a2fd49",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO 08-06 11:37:33 llm_engine.py:70] Initializing an LLM engine with config: model='mosaicml/mpt-7b', tokenizer='mosaicml/mpt-7b', tokenizer_mode=auto, trust_remote_code=True, dtype=torch.bfloat16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)\n",
"INFO 08-06 11:37:41 llm_engine.py:196] # GPU blocks: 861, # CPU blocks: 512\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 2.00it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"What is the capital of France ? The capital of France is Paris.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"from langchain.llms import VLLM\n",
"\n",
"llm = VLLM(model=\"mosaicml/mpt-7b\",\n",
" trust_remote_code=True, # mandatory for hf models\n",
" max_new_tokens=128,\n",
" top_k=10,\n",
" top_p=0.95,\n",
" temperature=0.8,\n",
")\n",
"\n",
"print(llm(\"What is the capital of France ?\"))"
]
},
{
"cell_type": "markdown",
"id": "94a3b41d-8329-4f8f-94f9-453d7f132214",
"metadata": {},
"source": [
"## Integrate the model in an LLMChain"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "5605b7a1-fa63-49c1-934d-8b4ef8d71dd5",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Processed prompts: 100%|██████████| 1/1 [00:01<00:00, 1.34s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"1. The first Pokemon game was released in 1996.\n",
"2. The president was Bill Clinton.\n",
"3. Clinton was president from 1993 to 2001.\n",
"4. The answer is Clinton.\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"from langchain import PromptTemplate, LLMChain\n",
"\n",
"template = \"\"\"Question: {question}\n",
"\n",
"Answer: Let's think step by step.\"\"\"\n",
"prompt = PromptTemplate(template=template, input_variables=[\"question\"])\n",
"\n",
"llm_chain = LLMChain(prompt=prompt, llm=llm)\n",
"\n",
"question = \"Who was the US president in the year the first Pokemon game was released?\"\n",
"\n",
"print(llm_chain.run(question))"
]
},
{
"cell_type": "markdown",
"id": "56826aba-d08b-4838-8bfa-ca96e463b25d",
"metadata": {},
"source": [
"## Distributed Inference\n",
"\n",
"vLLM supports distributed tensor-parallel inference and serving. \n",
"\n",
"To run multi-GPU inference with the LLM class, set the `tensor_parallel_size` argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f8c25c35-47b5-459d-9985-3cf546e9ac16",
"metadata": {},
"outputs": [],
"source": [
"from langchain.llms import VLLM\n",
"\n",
"llm = VLLM(model=\"mosaicml/mpt-30b\",\n",
" tensor_parallel_size=4,\n",
" trust_remote_code=True, # mandatory for hf models\n",
")\n",
"\n",
"llm(\"What is the future of AI?\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "conda_pytorch_p310",
"language": "python",
"name": "conda_pytorch_p310"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -13,7 +13,7 @@ pip install boto3
See a [usage example](/docs/integrations/llms/bedrock).
```python
from langchain import Bedrock
from langchain.llms.bedrock import Bedrock
```
## Text Embedding Models

View File

@@ -0,0 +1,30 @@
# PubMed
>[PubMed®](https://pubmed.ncbi.nlm.nih.gov/) by `The National Center for Biotechnology Information, National Library of Medicine`
> comprises more than 35 million citations for biomedical literature from `MEDLINE`, life science journals, and online books.
> Citations may include links to full text content from `PubMed Central` and publisher web sites.
## Setup
You need to install a python package.
```bash
pip install xmltodict
```
### Retriever
See a [usage example](/docs/integrations/retrievers/pubmed).
```python
from langchain.retrievers import PubMedRetriever
```
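A minimal query sketch (the search term is illustrative):
```python
from langchain.retrievers import PubMedRetriever

retriever = PubMedRetriever()
docs = retriever.get_relevant_documents("chatgpt")
```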
### Document Loader
See a [usage example](/docs/integrations/document_loaders/pubmed).
```python
from langchain.document_loaders import PubMedLoader
```
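A minimal loading sketch (the query is illustrative; `load_max_docs`, assumed here, caps the number of fetched articles):
```python
from langchain.document_loaders import PubMedLoader

loader = PubMedLoader("chatgpt", load_max_docs=3)
docs = loader.load()
```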

View File

@@ -0,0 +1,20 @@
# Nebula
This page covers how to use [Nebula](https://symbl.ai/nebula), [Symbl.ai](https://symbl.ai/)'s LLM, within the LangChain ecosystem.
It is broken into two parts: installation and setup, and then references to specific Nebula wrappers.
## Installation and Setup
- Get a Nebula API Key and set it as environment variables (`SYMBLAI_NEBULA_SERVICE_URL`, `SYMBLAI_NEBULA_SERVICE_PATH`, `SYMBLAI_NEBULA_SERVICE_TOKEN`)
- Sign up for a FREE Symbl.ai/Nebula Account: [https://nebula.symbl.ai/playground/](https://nebula.symbl.ai/playground/)
- Please see the [Nebula documentation](https://docs.symbl.ai/docs/nebula-llm-overview) for more details.
- No time? Visit the [Nebula Quickstart Guide](https://docs.symbl.ai/docs/nebula-quickstart).
## Wrappers
### LLM
There exists a Nebula LLM wrapper, which you can access with
```python
from langchain.llms import Nebula
```
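A minimal usage sketch, assuming the wrapper reads the `SYMBLAI_NEBULA_*` environment variables set above (constructor arguments may differ; see the Nebula documentation linked above):
```python
from langchain.llms import Nebula

llm = Nebula()
print(llm("Identify the main objectives discussed in this conversation."))
```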

View File

@@ -0,0 +1,31 @@
# TensorFlow Datasets
>[TensorFlow Datasets](https://www.tensorflow.org/datasets) is a collection of datasets ready to use,
> with TensorFlow or other Python ML frameworks, such as Jax. All datasets are exposed
> as [tf.data.Datasets](https://www.tensorflow.org/api_docs/python/tf/data/Dataset),
> enabling easy-to-use and high-performance input pipelines. To get started see
> the [guide](https://www.tensorflow.org/datasets/overview) and
> the [list of datasets](https://www.tensorflow.org/datasets/catalog/overview#all_datasets).
## Installation and Setup
You need to install the `tensorflow` and `tensorflow-datasets` Python packages.
```bash
pip install tensorflow
```
```bash
pip install tensorflow-datasets
```
## Document Loader
See a [usage example](/docs/integrations/document_loaders/tensorflow_datasets).
```python
from langchain.document_loaders import TensorflowDatasetLoader
```
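A minimal sketch reusing the `mlqa/en` example from the usage notebook; the converter below is simplified and illustrative:
```python
from langchain.schema import Document
from langchain.document_loaders import TensorflowDatasetLoader

def mlqaen_example_to_document(example: dict) -> Document:
    # Keep the raw context as the page content; other fields could go into metadata.
    return Document(page_content=example["context"].numpy().decode("utf-8"))

loader = TensorflowDatasetLoader(
    dataset_name="mlqa/en",
    split_name="test",
    load_max_docs=3,
    sample_to_document_function=mlqaen_example_to_document,
)
docs = loader.load()
```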

View File

@@ -7,14 +7,15 @@
"source": [
"# PubMed\n",
"\n",
"This notebook goes over how to use `PubMed` as a retriever\n",
"\n",
"`PubMed®` comprises more than 35 million citations for biomedical literature from `MEDLINE`, life science journals, and online books. Citations may include links to full text content from `PubMed Central` and publisher web sites."
">[PubMed®](https://pubmed.ncbi.nlm.nih.gov/) by `The National Center for Biotechnology Information, National Library of Medicine` comprises more than 35 million citations for biomedical literature from `MEDLINE`, life science journals, and online books. Citations may include links to full text content from `PubMed Central` and publisher web sites.\n",
"\n",
"This notebook goes over how to use `PubMed` as a retriever"
]
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 12,
"id": "aecaff63",
"metadata": {},
"outputs": [],
@@ -24,7 +25,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 34,
"id": "f2f7e8d3",
"metadata": {},
"outputs": [],
@@ -34,19 +35,19 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 35,
"id": "ed115aa1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='', metadata={'uid': '37268021', 'title': 'Dermatology in the wake of an AI revolution: who gets a say?', 'pub_date': '<Year>2023</Year><Month>May</Month><Day>31</Day>'}),\n",
" Document(page_content='', metadata={'uid': '37267643', 'title': 'What is ChatGPT and what do we do with it? Implications of the age of AI for nursing and midwifery practice and education: An editorial.', 'pub_date': '<Year>2023</Year><Month>May</Month><Day>30</Day>'}),\n",
" Document(page_content='The nursing field has undergone notable changes over time and is projected to undergo further modifications in the future, owing to the advent of sophisticated technologies and growing healthcare needs. The advent of ChatGPT, an AI-powered language model, is expected to exert a significant influence on the nursing profession, specifically in the domains of patient care and instruction. The present article delves into the ramifications of ChatGPT within the nursing domain and accentuates its capacity and constraints to transform the discipline.', metadata={'uid': '37266721', 'title': 'The Impact of ChatGPT on the Nursing Profession: Revolutionizing Patient Care and Education.', 'pub_date': '<Year>2023</Year><Month>Jun</Month><Day>02</Day>'})]"
"[Document(page_content='', metadata={'uid': '37549050', 'Title': 'ChatGPT: \"To Be or Not to Be\" in Bikini Bottom.', 'Published': '--', 'Copyright Information': ''}),\n",
" Document(page_content=\"BACKGROUND: ChatGPT is a large language model that has performed well on professional examinations in the fields of medicine, law, and business. However, it is unclear how ChatGPT would perform on an examination assessing professionalism and situational judgement for doctors.\\nOBJECTIVE: We evaluated the performance of ChatGPT on the Situational Judgement Test (SJT): a national examination taken by all final-year medical students in the United Kingdom. This examination is designed to assess attributes such as communication, teamwork, patient safety, prioritization skills, professionalism, and ethics.\\nMETHODS: All questions from the UK Foundation Programme Office's (UKFPO's) 2023 SJT practice examination were inputted into ChatGPT. For each question, ChatGPT's answers and rationales were recorded and assessed on the basis of the official UK Foundation Programme Office scoring template. Questions were categorized into domains of Good Medical Practice on the basis of the domains referenced in the rationales provided in the scoring sheet. Questions without clear domain links were screened by reviewers and assigned one or multiple domains. ChatGPT's overall performance, as well as its performance across the domains of Good Medical Practice, was evaluated.\\nRESULTS: Overall, ChatGPT performed well, scoring 76% on the SJT but scoring full marks on only a few questions (9%), which may reflect possible flaws in ChatGPT's situational judgement or inconsistencies in the reasoning across questions (or both) in the examination itself. ChatGPT demonstrated consistent performance across the 4 outlined domains in Good Medical Practice for doctors.\\nCONCLUSIONS: Further research is needed to understand the potential applications of large language models, such as ChatGPT, in medical education for standardizing questions and providing consistent rationales for examinations assessing professionalism and ethics.\", metadata={'uid': '37548997', 'Title': 'Performance of ChatGPT on the Situational Judgement Test-A Professional Dilemmas-Based Examination for Doctors in the United Kingdom.', 'Published': '2023-08-07', 'Copyright Information': '©Robin J Borchert, Charlotte R Hickman, Jack Pepys, Timothy J Sadler. Originally published in JMIR Medical Education (https://mededu.jmir.org), 07.08.2023.'}),\n",
" Document(page_content='', metadata={'uid': '37548971', 'Title': \"Large Language Models Answer Medical Questions Accurately, but Can't Match Clinicians' Knowledge.\", 'Published': '2023-08-07', 'Copyright Information': ''})]"
]
},
"execution_count": 9,
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
@@ -54,6 +55,14 @@
"source": [
"retriever.get_relevant_documents(\"chatgpt\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a9ff7a25-bb4b-4cd5-896d-72f70f4af49b",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@@ -72,7 +81,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
"version": "3.10.12"
}
},
"nbformat": 4,

View File

@@ -0,0 +1,84 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "719619d3",
"metadata": {},
"source": [
"# BGE Hugging Face Embeddings\n",
"\n",
"This notebook shows how to use BGE Embeddings through Hugging Face"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "f7a54279",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# !pip install sentence_transformers"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "9e1d5b6b",
"metadata": {},
"outputs": [],
"source": [
"from langchain.embeddings import HuggingFaceBgeEmbeddings\n",
"\n",
"model_name = \"BAAI/bge-small-en\"\n",
"model_kwargs = {'device': 'cpu'}\n",
"encode_kwargs = {'normalize_embeddings': False}\n",
"hf = HuggingFaceBgeEmbeddings(\n",
" model_name=model_name,\n",
" model_kwargs=model_kwargs,\n",
" encode_kwargs=encode_kwargs\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "e59d1a89",
"metadata": {},
"outputs": [],
"source": [
"embedding = hf.embed_query(\"hi this is harrison\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e596315f",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -5,7 +5,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Multion Toolkit\n",
"# MultiOn Toolkit\n",
"\n",
"This notebook walks you through connecting LangChain to the MultiOn Client in your browser\n",
"\n",
@@ -18,7 +18,32 @@
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade multion > /dev/null"
"!pip install --upgrade multion langchain -q"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.agents.agent_toolkits import MultionToolkit\n",
"import os\n",
"\n",
"\n",
"toolkit = MultionToolkit()\n",
"\n",
"toolkit"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tools = toolkit.get_tools()\n",
"tools"
]
},
{
@@ -38,8 +63,9 @@
"outputs": [],
"source": [
"# Authorize connection to your Browser extention\n",
"import multion \n",
"multion.login()\n"
"import multion\n",
"multion.login()\n",
"\n"
]
},
{
@@ -57,38 +83,18 @@
},
"outputs": [],
"source": [
"from langchain.agents.agent_toolkits import create_multion_agent\n",
"from langchain.tools.multion.tool import MultionClientTool\n",
"from langchain.agents.agent_types import AgentType\n",
"from langchain.chat_models import ChatOpenAI"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"\n",
"agent_executor = create_multion_agent(\n",
" llm=ChatOpenAI(temperature=0),\n",
" tool=MultionClientTool(),\n",
" agent_type=AgentType.OPENAI_FUNCTIONS,\n",
" verbose=True\n",
")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"agent.run(\"show me the weather today\")"
"from langchain import OpenAI\n",
"from langchain.agents import initialize_agent, AgentType\n",
"llm = OpenAI(temperature=0)\n",
"from langchain.agents.agent_toolkits import MultionToolkit\n",
"toolkit = MultionToolkit()\n",
"tools=toolkit.get_tools()\n",
"agent = initialize_agent(\n",
" tools=toolkit.get_tools(),\n",
" llm=llm,\n",
" agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,\n",
" verbose = True\n",
")"
]
},
{
@@ -100,7 +106,7 @@
"outputs": [],
"source": [
"agent.run(\n",
" \"Tweet about Elon Musk\"\n",
" \"Tweet 'Hi from MultiOn'\"\n",
")"
]
}

View File

@@ -66,7 +66,7 @@
"outputs": [],
"source": [
"from langchain import OpenAI\n",
"from langchain.agents import load_tools, AgentType\n",
"from langchain.agents import load_tools, initialize_agent, AgentType\n",
"\n",
"llm = OpenAI(temperature=0)\n",
"\n",

View File

@@ -0,0 +1,181 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dall-E Image Generator\n",
"\n",
"This notebook shows how you can generate images from a prompt synthesized using an OpenAI LLM. The images are generated using Dall-E, which uses the same OpenAI API key as the LLM."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Needed if you would like to display images in the notebook\n",
"!pip install opencv-python scikit-image"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "q-k8wmp0zquh"
},
"outputs": [],
"source": [
"from langchain.llms import OpenAI\n",
"import os\n",
"os.environ[\"OPENAI_API_KEY\"] = \"<your-key-here>\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run as a chain"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from langchain.utilities.dalle_image_generator import DallEAPIWrapper\n",
"from langchain.prompts import PromptTemplate\n",
"from langchain.chains import LLMChain\n",
"\n",
"llm = OpenAI(temperature=0.9)\n",
"prompt = PromptTemplate(\n",
" input_variables=[\"image_desc\"],\n",
" template=\"Generate a detailed prompt to generate an image based on the following description: {image_desc}\",\n",
")\n",
"chain = LLMChain(llm=llm, prompt=prompt)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://oaidalleapiprodscus.blob.core.windows.net/private/org-rocrupyvzgcl4yf25rqq6d1v/user-WsxrbKyP2c8rfhCKWDyMfe8N/img-mg1OWiziXxQN1aR2XRsLNndg.png?st=2023-01-31T07%3A34%3A15Z&se=2023-01-31T09%3A34%3A15Z&sp=r&sv=2021-08-06&sr=b&rscd=inline&rsct=image/png&skoid=6aaadede-4fb3-4698-a8f6-684d7786b067&sktid=a48cca56-e6da-484e-a814-9c849652bcb3&skt=2023-01-30T22%3A19%3A44Z&ske=2023-01-31T22%3A19%3A44Z&sks=b&skv=2021-08-06&sig=XDPee5aEng%2BcbXq2mqhh39uHGZTBmJgGAerSd0g%2BMEs%3D\n"
]
}
],
"source": [
"image_url = DallEAPIWrapper().run(chain.run(\"halloween night at a haunted museum\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You can click on the link above to display the image for\n",
"# Or you can try the options below to display the image inline in this notebook\n",
"\n",
"try:\n",
" import google.colab\n",
" IN_COLAB = True\n",
"except:\n",
" IN_COLAB = False\n",
"\n",
"if IN_COLAB:\n",
" from google.colab.patches import cv2_imshow # for image display\n",
" from skimage import io\n",
"\n",
" image = io.imread(image_url) \n",
" cv2_imshow(image)\n",
"else:\n",
" import cv2\n",
" from skimage import io\n",
"\n",
" image = io.imread(image_url) \n",
" cv2.imshow('image', image)\n",
" cv2.waitKey(0) #wait for a keyboard input\n",
" cv2.destroyAllWindows()\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run as a tool with an agent"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
"\u001b[32;1m\u001b[1;3m What is the best way to turn this description into an image?\n",
"Action: Dall-E Image Generator\n",
"Action Input: A spooky Halloween night at a haunted museum\u001b[0mhttps://oaidalleapiprodscus.blob.core.windows.net/private/org-rocrupyvzgcl4yf25rqq6d1v/user-WsxrbKyP2c8rfhCKWDyMfe8N/img-ogKfqxxOS5KWVSj4gYySR6FY.png?st=2023-01-31T07%3A38%3A25Z&se=2023-01-31T09%3A38%3A25Z&sp=r&sv=2021-08-06&sr=b&rscd=inline&rsct=image/png&skoid=6aaadede-4fb3-4698-a8f6-684d7786b067&sktid=a48cca56-e6da-484e-a814-9c849652bcb3&skt=2023-01-30T22%3A19%3A36Z&ske=2023-01-31T22%3A19%3A36Z&sks=b&skv=2021-08-06&sig=XsomxxBfu2CP78SzR9lrWUlbask4wBNnaMsHamy4VvU%3D\n",
"\n",
"Observation: \u001b[36;1m\u001b[1;3mhttps://oaidalleapiprodscus.blob.core.windows.net/private/org-rocrupyvzgcl4yf25rqq6d1v/user-WsxrbKyP2c8rfhCKWDyMfe8N/img-ogKfqxxOS5KWVSj4gYySR6FY.png?st=2023-01-31T07%3A38%3A25Z&se=2023-01-31T09%3A38%3A25Z&sp=r&sv=2021-08-06&sr=b&rscd=inline&rsct=image/png&skoid=6aaadede-4fb3-4698-a8f6-684d7786b067&sktid=a48cca56-e6da-484e-a814-9c849652bcb3&skt=2023-01-30T22%3A19%3A36Z&ske=2023-01-31T22%3A19%3A36Z&sks=b&skv=2021-08-06&sig=XsomxxBfu2CP78SzR9lrWUlbask4wBNnaMsHamy4VvU%3D\u001b[0m\n",
"Thought:\u001b[32;1m\u001b[1;3m With the image generated, I can now make my final answer.\n",
"Final Answer: An image of a Halloween night at a haunted museum can be seen here: https://oaidalleapiprodscus.blob.core.windows.net/private/org-rocrupyvzgcl4yf25rqq6d1v/user-WsxrbKyP2c8rfhCKWDyMfe8N/img-ogKfqxxOS5KWVSj4gYySR6FY.png?st=2023-01-31T07%3A38%3A25Z&se=2023-01-31T09%3A38%3A25Z&sp=r&sv=2021-08-06&sr=b&rscd=inline&rsct=image/png&skoid=6aaadede-4fb3-4698-a8f6-684d7786b067&sktid=a48cca56-e6da-484e-a814-9c849652bcb3&skt=2023-01-30T22\u001b[0m\n",
"\n",
"\u001b[1m> Finished chain.\u001b[0m\n"
]
}
],
"source": [
"from langchain.agents import load_tools\n",
"from langchain.agents import initialize_agent\n",
"\n",
"tools = load_tools(['dalle-image-generator'])\n",
"agent = initialize_agent(tools, llm, agent=\"zero-shot-react-description\", verbose=True)\n",
"output = agent.run(\"Create an image of a halloween night at a haunted museum\")"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "langchain",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
},
"vscode": {
"interpreter": {
"hash": "3570c8892273ffbeee7ead61dc7c022b73551d9f55fb2584ac0e8e8920b18a89"
}
}
},
"nbformat": 4,
"nbformat_minor": 0
}

View File

@@ -9,7 +9,7 @@
"\n",
"This notebook goes over how to use the google search component.\n",
"\n",
"First, you need to set up the proper API keys and environment variables. To set it up, create the GOOGLE_API_KEY in the Google Cloud credential console (https://console.cloud.google.com/apis/credentials) and a GOOGLE_CSE_ID using the Programmable Search Enginge (https://programmablesearchengine.google.com/controlpanel/create). Next, it is good to follow the instructions found [here](https://stackoverflow.com/questions/37083058/programmatically-searching-google-in-python-using-custom-search).\n",
"First, you need to set up the proper API keys and environment variables. To set it up, create the GOOGLE_API_KEY in the Google Cloud credential console (https://console.cloud.google.com/apis/credentials) and a GOOGLE_CSE_ID using the Programmable Search Engine (https://programmablesearchengine.google.com/controlpanel/create). Next, it is good to follow the instructions found [here](https://stackoverflow.com/questions/37083058/programmatically-searching-google-in-python-using-custom-search).\n",
"\n",
"Then we will need to set some environment variables."
]

File diff suppressed because one or more lines are too long

View File

@@ -1,7 +1,7 @@
# SQL
# SQL Database Chain
This example demonstrates the use of the `SQLDatabaseChain` for answering questions over a SQL database.
import Example from "@snippets/modules/chains/popular/sqlite.mdx"
<Example/>
<Example/>

View File

@@ -80,7 +80,7 @@
"source": [
"from langchain.document_loaders import TextLoader\n",
"\n",
"loader = TextLoader(\"../../../state_of_the_union.txt\")\n",
"loader = TextLoader(\"../../../extras/modules/state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
@@ -90,7 +90,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 3,
"id": "5eabdb75",
"metadata": {
"tags": []
@@ -517,6 +517,67 @@
"for doc in results:\n",
" print(f\"Content: {doc.page_content}, Metadata: {doc.metadata}\")"
]
},
{
"cell_type": "markdown",
"id": "1becca53",
"metadata": {},
"source": [
"## Delete\n",
"\n",
"You can also delete ids. Note that the ids to delete should be the ids in the docstore."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "1408b870",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"db.delete([db.index_to_docstore_id[0]])"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "d13daf33",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Is now missing\n",
"0 in db.index_to_docstore_id"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "30ace43e",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {

View File

@@ -21,7 +21,10 @@
"\n",
"1. Leverage the `Rockset` console to create a [collection](https://rockset.com/docs/collections/) with the Write API as your source. In this walkthrough, we create a collection named `langchain_demo`. \n",
" \n",
" Configure the following [ingest transformation](https://rockset.com/docs/ingest-transformation/) to mark your embeddings field and take advantage of performance and storage optimizations:"
" Configure the following [ingest transformation](https://rockset.com/docs/ingest-transformation/) to mark your embeddings field and take advantage of performance and storage optimizations:\n",
"\n",
"\n",
" (We used OpenAI `text-embedding-ada-002` for this examples, where #length_of_vector_embedding = 1536)"
]
},
{
@@ -75,23 +78,10 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": null,
"id": "29505c1e",
"metadata": {},
"outputs": [
{
"ename": "InitializationException",
"evalue": "The rockset client was initialized incorrectly: An api key must be provided as a parameter to the RocksetClient or the Configuration object.",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mInitializationException\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[5], line 6\u001b[0m\n\u001b[1;32m 4\u001b[0m ROCKSET_API_KEY \u001b[39m=\u001b[39m os\u001b[39m.\u001b[39menviron\u001b[39m.\u001b[39mget(\u001b[39m\"\u001b[39m\u001b[39mROCKSET_API_KEY\u001b[39m\u001b[39m\"\u001b[39m) \u001b[39m# Verify ROCKSET_API_KEY environment variable\u001b[39;00m\n\u001b[1;32m 5\u001b[0m ROCKSET_API_SERVER \u001b[39m=\u001b[39m rockset\u001b[39m.\u001b[39mRegions\u001b[39m.\u001b[39musw2a1 \u001b[39m# Verify Rockset region\u001b[39;00m\n\u001b[0;32m----> 6\u001b[0m rockset_client \u001b[39m=\u001b[39m rockset\u001b[39m.\u001b[39;49mRocksetClient(ROCKSET_API_SERVER, ROCKSET_API_KEY)\n\u001b[1;32m 8\u001b[0m COLLECTION_NAME\u001b[39m=\u001b[39m\u001b[39m'\u001b[39m\u001b[39mlangchain_demo\u001b[39m\u001b[39m'\u001b[39m\n\u001b[1;32m 9\u001b[0m TEXT_KEY\u001b[39m=\u001b[39m\u001b[39m'\u001b[39m\u001b[39mdescription\u001b[39m\u001b[39m'\u001b[39m\n",
"File \u001b[0;32m~/Library/Python/3.9/lib/python/site-packages/rockset/rockset_client.py:242\u001b[0m, in \u001b[0;36mRocksetClient.__init__\u001b[0;34m(self, host, api_key, max_workers, config)\u001b[0m\n\u001b[1;32m 239\u001b[0m config\u001b[39m.\u001b[39mhost \u001b[39m=\u001b[39m host\n\u001b[1;32m 241\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mnot\u001b[39;00m config\u001b[39m.\u001b[39mapi_key:\n\u001b[0;32m--> 242\u001b[0m \u001b[39mraise\u001b[39;00m InitializationException(\n\u001b[1;32m 243\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mAn api key must be provided as a parameter to the RocksetClient or the Configuration object.\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 244\u001b[0m )\n\u001b[1;32m 246\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mapi_client \u001b[39m=\u001b[39m ApiClient(config, max_workers\u001b[39m=\u001b[39mmax_workers)\n\u001b[1;32m 248\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mAliases \u001b[39m=\u001b[39m AliasesApiWrapper(\u001b[39mself\u001b[39m\u001b[39m.\u001b[39mapi_client)\n",
"\u001b[0;31mInitializationException\u001b[0m: The rockset client was initialized incorrectly: An api key must be provided as a parameter to the RocksetClient or the Configuration object."
]
}
],
"outputs": [],
"source": [
"import os\n",
"import rockset\n",
@@ -118,18 +108,7 @@
"execution_count": null,
"id": "9740d8c4",
"metadata": {},
"outputs": [
{
"ename": "",
"evalue": "",
"output_type": "error",
"traceback": [
"\u001b[1;31mRunning cells with '/opt/local/bin/python3.11' requires the ipykernel package.\n",
"\u001b[1;31mRun the following command to install 'ipykernel' into the Python environment. \n",
"\u001b[1;31mCommand: '/opt/local/bin/python3.11 -m pip install ipykernel -U --user --force-reinstall'"
]
}
],
"outputs": [],
"source": [
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
@@ -155,20 +134,9 @@
"execution_count": null,
"id": "85b6a6c5",
"metadata": {},
"outputs": [
{
"ename": "",
"evalue": "",
"output_type": "error",
"traceback": [
"\u001b[1;31mRunning cells with '/opt/local/bin/python3.11' requires the ipykernel package.\n",
"\u001b[1;31mRun the following command to install 'ipykernel' into the Python environment. \n",
"\u001b[1;31mCommand: '/opt/local/bin/python3.11 -m pip install ipykernel -U --user --force-reinstall'"
]
}
],
"outputs": [],
"source": [
"embeddings = OpenAIEmbeddings() # Verify OPENAI_KEY environment variable\n",
"embeddings = OpenAIEmbeddings() # Verify OPENAI_API_KEY environment variable\n",
"\n",
"docsearch = Rockset(\n",
" client=rockset_client,\n",
@@ -194,22 +162,10 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"id": "0bbf3df0",
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'docsearch' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[1], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m query \u001b[39m=\u001b[39m \u001b[39m\"\u001b[39m\u001b[39mWhat did the president say about Ketanji Brown Jackson?\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[0;32m----> 2\u001b[0m output \u001b[39m=\u001b[39m docsearch\u001b[39m.\u001b[39msimilarity_search_with_relevance_scores(query, \u001b[39m4\u001b[39m, Rockset\u001b[39m.\u001b[39mDistanceFunction\u001b[39m.\u001b[39mCOSINE_SIM)\n\u001b[1;32m 4\u001b[0m \u001b[39mprint\u001b[39m(\u001b[39m\"\u001b[39m\u001b[39moutput length:\u001b[39m\u001b[39m\"\u001b[39m, \u001b[39mlen\u001b[39m(output))\n\u001b[1;32m 5\u001b[0m \u001b[39mfor\u001b[39;00m d, dist \u001b[39min\u001b[39;00m output:\n",
"\u001b[0;31mNameError\u001b[0m: name 'docsearch' is not defined"
]
}
],
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"output = docsearch.similarity_search_with_relevance_scores(\n",
@@ -313,7 +269,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
"version": "3.10.12"
}
},
"nbformat": 4,

View File

@@ -0,0 +1,195 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "bb384510-d9b4-4fa1-84c2-f181eb28487d",
"metadata": {},
"source": [
"# USearch\n",
">[USearch](https://unum-cloud.github.io/usearch/) is a Smaller & Faster Single-File Vector Search Engine\n",
"\n",
"USearch's base functionality is identical to FAISS, and the interface should look familiar if you have ever investigated Approximate Nearest Neigbors search. FAISS is a widely recognized standard for high-performance vector search engines. USearch and FAISS both employ the same HNSW algorithm, but they differ significantly in their design principles. USearch is compact and broadly compatible without sacrificing performance, with a primary focus on user-defined metrics and fewer dependencies."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "497fcd89-e832-46a7-a74a-c71199666206",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!pip install usearch"
]
},
{
"cell_type": "markdown",
"id": "38237514-b3fa-44a4-9cff-30cd6bf50073",
"metadata": {},
"source": [
"We want to use OpenAIEmbeddings so we have to get the OpenAI API Key. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "47f9b495-88f1-4286-8d5d-1416103931a7",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import os\n",
"import getpass\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "aac9563e",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"from langchain.vectorstores import USearch\n",
"from langchain.document_loaders import TextLoader"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a3c3999a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.document_loaders import TextLoader\n",
"\n",
"loader = TextLoader(\"../../../extras/modules/state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"embeddings = OpenAIEmbeddings()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "5eabdb75",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"db = USearch.from_documents(docs, embeddings)\n",
"\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = db.similarity_search(query)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "4b172de8",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n"
]
}
],
"source": [
"print(docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"id": "f13473b5",
"metadata": {},
"source": [
"## Similarity Search with score\n",
"The `similarity_search_with_score` method allows you to return not only the documents but also the distance score of the query to them. The returned distance score is L2 distance. Therefore, a lower score is better."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "186ee1d8",
"metadata": {},
"outputs": [],
"source": [
"docs_and_scores = db.similarity_search_with_score(query)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "284e04b5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.', metadata={'source': '../../../extras/modules/state_of_the_union.txt'}),\n",
" 0.1845687)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs_and_scores[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "483f6013-fb32-4756-a9e2-3d529fb81f68",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
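As with other LangChain vector stores, the index can also be built straight from raw strings when a document loader is unnecessary. A small sketch (assuming `USearch` supports the generic `from_texts` constructor with `metadatas`):

```python
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import USearch

texts = [
    "USearch is a compact single-file vector search engine.",
    "FAISS is a widely used library for similarity search.",
]
metadatas = [{"source": "usearch-notes"}, {"source": "faiss-notes"}]

# Build the store directly from strings and attach simple metadata.
db = USearch.from_texts(texts, OpenAIEmbeddings(), metadatas=metadatas)

# Lower L2 distance means a closer match.
print(db.similarity_search_with_score("single-file vector search", k=1)[0])
```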

View File

@@ -8,7 +8,7 @@
"source": [
"# Vectara\n",
"\n",
">[Vectara](https://vectara.com/) is a API platform for building GenAI applications. It provides an easy-to-use API for document indexing and query that is managed by Vectara and is optimized for performance and accuracy. \n",
">[Vectara](https://vectara.com/) is a API platform for building GenAI applications. It provides an easy-to-use API for document indexing and querying that is managed by Vectara and is optimized for performance and accuracy. \n",
"See the [Vectara API documentation ](https://docs.vectara.com/docs/) for more information on how to use the API.\n",
"\n",
"This notebook shows how to use functionality related to the `Vectara`'s integration with langchain.\n",

View File

@@ -0,0 +1,240 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Xata\n",
"\n",
"> [Xata](https://xata.io) is a serverless data platform, based on PostgreSQL. It provides a Python SDK for interacting with your database, and a UI for managing your data.\n",
"> Xata has a native vector type, which can be added to any table, and supports similarity search. LangChain inserts vectors directly to Xata, and queries it for the nearest neighbors of a given vector, so that you can use all the LangChain Embeddings integrations with Xata."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook guides you how to use Xata as a VectorStore."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"### Create a database to use as a vector store\n",
"\n",
"In the [Xata UI](https://app.xata.io) create a new database. You can name it whatever you want, in this notepad we'll use `langchain`.\n",
"Create a table, again you can name it anything, but we will use `vectors`. Add the following columns via the UI:\n",
"\n",
"* `content` of type \"Text\". This is used to store the `Document.pageContent` values.\n",
"* `embedding` of type \"Vector\". Use the dimension used by the model you plan to use. In this notebook we use OpenAI embeddings, which have 1536 dimensions.\n",
"* `search` of type \"Text\". This is used as a metadata column by this example.\n",
"* any other columns you want to use as metadata. They are populated from the `Document.metadata` object. For example, if in the `Document.metadata` object you have a `title` property, you can create a `title` column in the table and it will be populated.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's first install our dependencies:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"!pip install xata==1.0.0a7 openai tiktoken langchain"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's load the OpenAI key to the environemnt. If you don't have one you can create an OpenAI account and create a key on this [page](https://platform.openai.com/account/api-keys)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"import os\n",
"import getpass\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarly, we need to get the environment variables for Xata. You can create a new API key by visiting your [account settings](https://app.xata.io/settings). To find the database URL, go to the Settings page of the database that you have created. The database URL should look something like this: `https://demo-uni3q8.eu-west-1.xata.sh/db/langchain`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"api_key = getpass.getpass(\"Xata API key: \")\n",
"db_url = input(\"Xata database URL (copy it from your DB settings):\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"from langchain.document_loaders import TextLoader\n",
"from langchain.vectorstores.xata import XataVectorStore\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create the Xata vector store\n",
"Let's import our test dataset:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"loader = TextLoader(\"../../../state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"embeddings = OpenAIEmbeddings()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now create the actual vector store, backed by the Xata table."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"vector_store = XataVectorStore.from_documents(docs, embeddings, api_key=api_key, db_url=db_url, table_name=\"vectors\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After running the above command, if you go to the Xata UI, you should see the documents loaded together with their embeddings."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Similarity Search"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"found_docs = vector_store.similarity_search(query)\n",
"print(found_docs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Similarity Search with score (vector distance)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"result = vector_store.similarity_search_with_score(query)\n",
"for doc, score in result:\n",
" print(f\"document={doc}, score={score}\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
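Because table columns are populated from `Document.metadata`, you can also insert your own documents after the store exists. A brief sketch (assuming the store exposes the generic `add_documents` method and that a `title` column was created as described in the setup above):

```python
from langchain.schema import Document

# Metadata keys that match table columns (e.g. "title", "search") are written to them.
new_docs = [
    Document(
        page_content="Xata stores vectors next to the rest of your data.",
        metadata={"title": "Why Xata", "search": "vector"},
    )
]

vector_store.add_documents(new_docs)

# The new document is now searchable alongside the originals.
print(vector_store.similarity_search("where are vectors stored", k=1))
```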

View File

@@ -109,11 +109,11 @@
"source": [
"# Reorder the documents:\n",
"# Less relevant document will be at the middle of the list and more\n",
"# relevant elements at begining / end.\n",
"# relevant elements at beginning / end.\n",
"reordering = LongContextReorder()\n",
"reordered_docs = reordering.transform_documents(docs)\n",
"\n",
"# Confirm that the 4 relevant documents are at begining and end.\n",
"# Confirm that the 4 relevant documents are at beginning and end.\n",
"reordered_docs"
]
},

View File

@@ -7,7 +7,7 @@
"source": [
"# Extraction\n",
"\n",
"[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/use_cases/extraction/extraction.ipynb)\n",
"[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/use_cases/extraction.ipynb)\n",
"\n",
"## Use case\n",
"\n",

View File

@@ -0,0 +1,858 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# SQL\n",
"\n",
"[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/use_cases/sql.ipynb)\n",
"\n",
"## Use case\n",
"\n",
"Enterprise data is often stored in SQL databases.\n",
"\n",
"LLMs make it possible to interact with SQL databases using natural langugae.\n",
"\n",
"LangChain offers SQL Chains and Agents to build and run SQL queries based on natural language prompts. \n",
"\n",
"These are compatible with any SQL dialect supported by SQLAlchemy (e.g., MySQL, PostgreSQL, Oracle SQL, Databricks, SQLite).\n",
"\n",
"They enable use cases such as:\n",
"\n",
"- Generating queries that will be run based on natural language questions\n",
"- Creating chatbots that can answer questions based on database data\n",
"- Building custom dashboards based on insights a user wants to analyze\n",
"\n",
"## Overview\n",
"\n",
"LangChain provides tools to interact with SQL Databases:\n",
"\n",
"1. `Build SQL queries` based on natural language user questions\n",
"2. `Query a SQL database` using chains for query creation and execution\n",
"3. `Interact with a SQL database` using agents for robust and flexible querying \n",
"\n",
"![sql_usecase.png](/img/sql_usecase.png)\n",
"\n",
"## Quickstart\n",
"\n",
"First, get required packages and set environment variables:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"! pip install langchain langchain-experimental openai\n",
"\n",
"# Set env var OPENAI_API_KEY or load from a .env file\n",
"# import dotenv\n",
"\n",
"# dotenv.load_env()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The below example will use a SQLite connection with Chinook database. \n",
" \n",
"Follow [installation steps](https://database.guide/2-sample-databases-sqlite/) to create `Chinook.db` in the same directory as this notebook:\n",
"\n",
"* Save [this file](https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sql) to the directory as `Chinook_Sqlite.sql`\n",
"* Run `sqlite3 Chinook.db`\n",
"* Run `.read Chinook_Sqlite.sql`\n",
"* Test `SELECT * FROM Artist LIMIT 10;`\n",
"\n",
"Now, `Chinhook.db` is in our directory.\n",
"\n",
"Let's create a `SQLDatabaseChain` to create and execute SQL queries."
]
},
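If you prefer to script those steps instead of using the `sqlite3` shell, a rough Python equivalent is sketched below (it assumes the linked dump is UTF-8 encoded and small enough to hold in memory):

```python
import sqlite3
import urllib.request

SQL_URL = (
    "https://raw.githubusercontent.com/lerocha/chinook-database/"
    "master/ChinookDatabase/DataSources/Chinook_Sqlite.sql"
)

# Download the schema + data dump and execute it into a local SQLite file.
sql_script = urllib.request.urlopen(SQL_URL).read().decode("utf-8-sig")
with sqlite3.connect("Chinook.db") as con:
    con.executescript(sql_script)
    print(con.execute("SELECT * FROM Artist LIMIT 10;").fetchall())
```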
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"from langchain.utilities import SQLDatabase\n",
"from langchain.llms import OpenAI\n",
"from langchain_experimental.sql import SQLDatabaseChain\n",
"\n",
"db = SQLDatabase.from_uri(\"sqlite:///Chinook.db\")\n",
"llm = OpenAI(temperature=0, verbose=True)\n",
"db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"\u001b[1m> Entering new SQLDatabaseChain chain...\u001b[0m\n",
"How many employees are there?\n",
"SQLQuery:\u001b[32;1m\u001b[1;3mSELECT COUNT(*) FROM \"Employee\";\u001b[0m\n",
"SQLResult: \u001b[33;1m\u001b[1;3m[(8,)]\u001b[0m\n",
"Answer:\u001b[32;1m\u001b[1;3mThere are 8 employees.\u001b[0m\n",
"\u001b[1m> Finished chain.\u001b[0m\n"
]
},
{
"data": {
"text/plain": [
"'There are 8 employees.'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"db_chain.run(\"How many employees are there?\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that this both creates and executes the query. \n",
"\n",
"In the following sections, we will cover the 3 different use cases mentioned in the overview.\n",
"\n",
"### Go deeper\n",
"\n",
"You can load tabular data from other sources other than SQL Databases.\n",
"For example:\n",
"- [Loading a CSV file](/docs/integrations/document_loaders/csv)\n",
"- [Loading a Pandas DataFrame](/docs/integrations/document_loaders/pandas_dataframe)\n",
"Here you can [check full list of Document Loaders](/docs/integrations/document_loaders/)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Case 1: Text-to-SQL query\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.chains import create_sql_query_chain"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create the chain that will build the SQL Query:\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"SELECT COUNT(*) FROM Employee\n"
]
}
],
"source": [
"chain = create_sql_query_chain(ChatOpenAI(temperature=0), db)\n",
"response = chain.invoke({\"question\":\"How many employees are there\"})\n",
"print(response)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After building the SQL query based on a user question, we can execute the query:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'[(8,)]'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"db.run(response)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, the SQL Query Builder chain **only created** the query, and we handled the **query execution separately**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Go deeper\n",
"\n",
"**Looking under the hood**\n",
"\n",
"We can look at the [LangSmith trace](https://smith.langchain.com/public/c8fa52ea-be46-4829-bde2-52894970b830/r) to unpack this:\n",
"\n",
"[Some papers](https://arxiv.org/pdf/2204.00498.pdf) have reported good performance when prompting with:\n",
" \n",
"* A `CREATE TABLE` description for each table, which include column names, their types, etc\n",
"* Followed by three example rows in a `SELECT` statement\n",
"\n",
"`create_sql_query_chain` adopts this the best practice (see more in this [blog](https://blog.langchain.dev/llms-and-sql/)). \n",
"![sql_usecase.png](/img/create_sql_query_chain.png)\n",
"\n",
"**Improvements**\n",
"\n",
"The query builder can be improved in several ways, such as (but not limited to):\n",
"\n",
"- Customizing database description to your specific use case\n",
"- Hardcoding a few examples of questions and their corresponding SQL query in the prompt\n",
"- Using a vector database to include dynamic examples that are relevant to the specific user question\n",
"\n",
"All these examples involve customizing the chain's prompt. \n",
"\n",
"For example, we can include a few examples in our prompt like so:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"from langchain.prompts import PromptTemplate\n",
"\n",
"TEMPLATE = \"\"\"Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer.\n",
"Use the following format:\n",
"\n",
"Question: \"Question here\"\n",
"SQLQuery: \"SQL Query to run\"\n",
"SQLResult: \"Result of the SQLQuery\"\n",
"Answer: \"Final answer here\"\n",
"\n",
"Only use the following tables:\n",
"\n",
"{table_info}.\n",
"\n",
"Some examples of SQL queries that corrsespond to questions are:\n",
"\n",
"{few_shot_examples}\n",
"\n",
"Question: {input}\"\"\"\n",
"\n",
"CUSTOM_PROMPT = PromptTemplate(\n",
" input_variables=[\"input\", \"few_shot_examples\", \"table_info\", \"dialect\"], template=TEMPLATE\n",
")"
]
},
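The `few_shot_examples` slot is not filled in automatically, so one way to wire this prompt into a chain is to pre-fill it and pass the result through the `prompt` parameter. A hedged sketch (the example question/query pair and the use of `PromptTemplate.partial` are illustrative additions, not part of this notebook; `llm` and `db` come from the earlier cells):

```python
from langchain_experimental.sql import SQLDatabaseChain

few_shots = """Question: How many employees are there?
SQLQuery: SELECT COUNT(*) FROM "Employee";"""

# Pre-fill the few-shot slot, leaving input/table_info/dialect for the chain to supply.
prompt_with_examples = CUSTOM_PROMPT.partial(few_shot_examples=few_shots)

db_chain = SQLDatabaseChain.from_llm(llm, db, prompt=prompt_with_examples, verbose=True)
db_chain.run("How many employees are there?")
```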
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Case 2: Text-to-SQL query and execution\n",
"\n",
"We can use `SQLDatabaseChain` from `langchain_experimental` to create and run SQL queries."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"from langchain.llms import OpenAI\n",
"from langchain_experimental.sql import SQLDatabaseChain\n",
"\n",
"llm = OpenAI(temperature=0, verbose=True)\n",
"db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"\u001b[1m> Entering new SQLDatabaseChain chain...\u001b[0m\n",
"How many employees are there?\n",
"SQLQuery:\u001b[32;1m\u001b[1;3mSELECT COUNT(*) FROM \"Employee\";\u001b[0m\n",
"SQLResult: \u001b[33;1m\u001b[1;3m[(8,)]\u001b[0m\n",
"Answer:\u001b[32;1m\u001b[1;3mThere are 8 employees.\u001b[0m\n",
"\u001b[1m> Finished chain.\u001b[0m\n"
]
},
{
"data": {
"text/plain": [
"'There are 8 employees.'"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"db_chain.run(\"How many employees are there?\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, we get the same result as the previous case.\n",
"\n",
"Here, the chain **also handles the query execution** and provides a final answer based on the user question and the query result.\n",
"\n",
"**Be careful** while using this approach as it is susceptible to `SQL Injection`:\n",
"\n",
"* The chain is executing queries that are created by an LLM, and weren't validated\n",
"* e.g. records may be created, modified or deleted unintentionally_\n",
"\n",
"This is why we see the `SQLDatabaseChain` is inside `langchain_experimental`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Go deeper\n",
"\n",
"**Looking under the hood**\n",
"\n",
"We can use the [LangSmith trace](https://smith.langchain.com/public/7f202a0c-1e35-42d6-a84a-6c2a58f697ef/r) to see what is happening under the hood:\n",
"\n",
"* As discussed above, first we create the query:\n",
"\n",
"```\n",
"text: ' SELECT COUNT(*) FROM \"Employee\";'\n",
"```\n",
"\n",
"* Then, it executes the query and passes the results to an LLM for synthesis.\n",
"\n",
"![sql_usecase.png](/img/sqldbchain_trace.png)\n",
"\n",
"**Improvements**\n",
"\n",
"The performance of the `SQLDatabaseChain` can be enhanced in several ways:\n",
"\n",
"- [Adding sample rows](#adding-sample-rows)\n",
"- [Specifying custom table information](/docs/integrations/tools/sqlite#custom-table-info)\n",
"- [Using Query Checker](/docs/integrations/tools/sqlite#use-query-checker) self-correct invalid SQL using parameter `use_query_checker=True`\n",
"- [Customizing the LLM Prompt](/docs/integrations/tools/sqlite#customize-prompt) include specific instructions or relevant information, using parameter `prompt=CUSTOM_PROMPT`\n",
"- [Get intermediate steps](/docs/integrations/tools/sqlite#return-intermediate-steps) access the SQL statement as well as the final result using parameter `return_intermediate_steps=True`\n",
"- [Limit the number of rows](/docs/integrations/tools/sqlite#choosing-how-to-limit-the-number-of-rows-returned) a query will return using parameter `top_k=5`\n",
"\n",
"You might find [SQLDatabaseSequentialChain](/docs/integrations/tools/sqlite#sqldatabasesequentialchain)\n",
"useful for cases in which the number of tables in the database is large.\n",
"\n",
"This `Sequential Chain` handles the process of:\n",
"\n",
"1. Determining which tables to use based on the user question\n",
"2. Calling the normal SQL database chain using only relevant tables\n",
"\n",
"**Adding Sample Rows**\n",
"\n",
"Providing sample data can help the LLM construct correct queries when the data format is not obvious. \n",
"\n",
"For example, we can tell LLM that artists are saved with their full names by providing two rows from the Track table.\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"db = SQLDatabase.from_uri(\n",
" \"sqlite:///Chinook.db\",\n",
" include_tables=['Track'], # we include only one table to save tokens in the prompt :)\n",
" sample_rows_in_table_info=2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sample rows are added to the prompt after each corresponding table's column information.\n",
"\n",
"We can use `db.table_info` and check which sample rows are included:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"CREATE TABLE \"Track\" (\n",
"\t\"TrackId\" INTEGER NOT NULL, \n",
"\t\"Name\" NVARCHAR(200) NOT NULL, \n",
"\t\"AlbumId\" INTEGER, \n",
"\t\"MediaTypeId\" INTEGER NOT NULL, \n",
"\t\"GenreId\" INTEGER, \n",
"\t\"Composer\" NVARCHAR(220), \n",
"\t\"Milliseconds\" INTEGER NOT NULL, \n",
"\t\"Bytes\" INTEGER, \n",
"\t\"UnitPrice\" NUMERIC(10, 2) NOT NULL, \n",
"\tPRIMARY KEY (\"TrackId\"), \n",
"\tFOREIGN KEY(\"MediaTypeId\") REFERENCES \"MediaType\" (\"MediaTypeId\"), \n",
"\tFOREIGN KEY(\"GenreId\") REFERENCES \"Genre\" (\"GenreId\"), \n",
"\tFOREIGN KEY(\"AlbumId\") REFERENCES \"Album\" (\"AlbumId\")\n",
")\n",
"\n",
"/*\n",
"2 rows from Track table:\n",
"TrackId\tName\tAlbumId\tMediaTypeId\tGenreId\tComposer\tMilliseconds\tBytes\tUnitPrice\n",
"1\tFor Those About To Rock (We Salute You)\t1\t1\t1\tAngus Young, Malcolm Young, Brian Johnson\t343719\t11170334\t0.99\n",
"2\tBalls to the Wall\t2\t2\t1\tNone\t342562\t5510424\t0.99\n",
"*/\n"
]
}
],
"source": [
"print(db.table_info)"
]
},
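Several of the improvement knobs listed above can be combined in a single chain. A sketch using the parameters named in that list (behaviour will vary with your data and model):

```python
from langchain.llms import OpenAI
from langchain.utilities import SQLDatabase
from langchain_experimental.sql import SQLDatabaseChain

db = SQLDatabase.from_uri("sqlite:///Chinook.db")

tuned_chain = SQLDatabaseChain.from_llm(
    OpenAI(temperature=0),
    db,
    use_query_checker=True,           # ask the LLM to double-check its SQL
    return_intermediate_steps=True,   # expose the generated SQL and raw result
    top_k=5,                          # limit rows returned by generated queries
    verbose=True,
)

result = tuned_chain("How many employees are there?")
print(result["intermediate_steps"])
```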
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Case 3: SQL agents\n",
"\n",
"LangChain has an SQL Agent which provides a more flexible way of interacting with SQL Databases than the `SQLDatabaseChain`.\n",
"\n",
"The main advantages of using the SQL Agent are:\n",
"\n",
"- It can answer questions based on the databases' schema as well as on the databases' content (like describing a specific table)\n",
"- It can recover from errors by running a generated query, catching the traceback and regenerating it correctly\n",
"\n",
"To initialize the agent, we use `create_sql_agent` function. \n",
"\n",
"This agent contains the `SQLDatabaseToolkit` which contains tools to: \n",
"\n",
"* Create and execute queries\n",
"* Check query syntax\n",
"* Retrieve table descriptions\n",
"* ... and more"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"from langchain.agents import create_sql_agent\n",
"from langchain.agents.agent_toolkits import SQLDatabaseToolkit\n",
"# from langchain.agents import AgentExecutor\n",
"from langchain.agents.agent_types import AgentType\n",
"\n",
"db = SQLDatabase.from_uri(\"sqlite:///Chinook.db\")\n",
"llm = OpenAI(temperature=0, verbose=True)\n",
"\n",
"agent_executor = create_sql_agent(\n",
" llm=OpenAI(temperature=0),\n",
" toolkit=SQLDatabaseToolkit(db=db, llm=OpenAI(temperature=0)),\n",
" verbose=True,\n",
" agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Agent task example #1 - Running queries\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
"\u001b[32;1m\u001b[1;3mAction: sql_db_list_tables\n",
"Action Input: \u001b[0m\n",
"Observation: \u001b[38;5;200m\u001b[1;3mAlbum, Artist, Customer, Employee, Genre, Invoice, InvoiceLine, MediaType, Playlist, PlaylistTrack, Track\u001b[0m\n",
"Thought:\u001b[32;1m\u001b[1;3m I should query the schema of the Invoice and Customer tables.\n",
"Action: sql_db_schema\n",
"Action Input: Invoice, Customer\u001b[0m\n",
"Observation: \u001b[33;1m\u001b[1;3m\n",
"CREATE TABLE \"Customer\" (\n",
"\t\"CustomerId\" INTEGER NOT NULL, \n",
"\t\"FirstName\" NVARCHAR(40) NOT NULL, \n",
"\t\"LastName\" NVARCHAR(20) NOT NULL, \n",
"\t\"Company\" NVARCHAR(80), \n",
"\t\"Address\" NVARCHAR(70), \n",
"\t\"City\" NVARCHAR(40), \n",
"\t\"State\" NVARCHAR(40), \n",
"\t\"Country\" NVARCHAR(40), \n",
"\t\"PostalCode\" NVARCHAR(10), \n",
"\t\"Phone\" NVARCHAR(24), \n",
"\t\"Fax\" NVARCHAR(24), \n",
"\t\"Email\" NVARCHAR(60) NOT NULL, \n",
"\t\"SupportRepId\" INTEGER, \n",
"\tPRIMARY KEY (\"CustomerId\"), \n",
"\tFOREIGN KEY(\"SupportRepId\") REFERENCES \"Employee\" (\"EmployeeId\")\n",
")\n",
"\n",
"/*\n",
"3 rows from Customer table:\n",
"CustomerId\tFirstName\tLastName\tCompany\tAddress\tCity\tState\tCountry\tPostalCode\tPhone\tFax\tEmail\tSupportRepId\n",
"1\tLuís\tGonçalves\tEmbraer - Empresa Brasileira de Aeronáutica S.A.\tAv. Brigadeiro Faria Lima, 2170\tSão José dos Campos\tSP\tBrazil\t12227-000\t+55 (12) 3923-5555\t+55 (12) 3923-5566\tluisg@embraer.com.br\t3\n",
"2\tLeonie\tKöhler\tNone\tTheodor-Heuss-Straße 34\tStuttgart\tNone\tGermany\t70174\t+49 0711 2842222\tNone\tleonekohler@surfeu.de\t5\n",
"3\tFrançois\tTremblay\tNone\t1498 rue Bélanger\tMontréal\tQC\tCanada\tH2G 1A7\t+1 (514) 721-4711\tNone\tftremblay@gmail.com\t3\n",
"*/\n",
"\n",
"\n",
"CREATE TABLE \"Invoice\" (\n",
"\t\"InvoiceId\" INTEGER NOT NULL, \n",
"\t\"CustomerId\" INTEGER NOT NULL, \n",
"\t\"InvoiceDate\" DATETIME NOT NULL, \n",
"\t\"BillingAddress\" NVARCHAR(70), \n",
"\t\"BillingCity\" NVARCHAR(40), \n",
"\t\"BillingState\" NVARCHAR(40), \n",
"\t\"BillingCountry\" NVARCHAR(40), \n",
"\t\"BillingPostalCode\" NVARCHAR(10), \n",
"\t\"Total\" NUMERIC(10, 2) NOT NULL, \n",
"\tPRIMARY KEY (\"InvoiceId\"), \n",
"\tFOREIGN KEY(\"CustomerId\") REFERENCES \"Customer\" (\"CustomerId\")\n",
")\n",
"\n",
"/*\n",
"3 rows from Invoice table:\n",
"InvoiceId\tCustomerId\tInvoiceDate\tBillingAddress\tBillingCity\tBillingState\tBillingCountry\tBillingPostalCode\tTotal\n",
"1\t2\t2009-01-01 00:00:00\tTheodor-Heuss-Straße 34\tStuttgart\tNone\tGermany\t70174\t1.98\n",
"2\t4\t2009-01-02 00:00:00\tUllevålsveien 14\tOslo\tNone\tNorway\t0171\t3.96\n",
"3\t8\t2009-01-03 00:00:00\tGrétrystraat 63\tBrussels\tNone\tBelgium\t1000\t5.94\n",
"*/\u001b[0m\n",
"Thought:\u001b[32;1m\u001b[1;3m I should query the total sales per country.\n",
"Action: sql_db_query\n",
"Action Input: SELECT Country, SUM(Total) AS TotalSales FROM Invoice INNER JOIN Customer ON Invoice.CustomerId = Customer.CustomerId GROUP BY Country ORDER BY TotalSales DESC LIMIT 10\u001b[0m\n",
"Observation: \u001b[36;1m\u001b[1;3m[('USA', 523.0600000000003), ('Canada', 303.9599999999999), ('France', 195.09999999999994), ('Brazil', 190.09999999999997), ('Germany', 156.48), ('United Kingdom', 112.85999999999999), ('Czech Republic', 90.24000000000001), ('Portugal', 77.23999999999998), ('India', 75.25999999999999), ('Chile', 46.62)]\u001b[0m\n",
"Thought:\u001b[32;1m\u001b[1;3m I now know the final answer\n",
"Final Answer: The country with the highest total sales is the USA, with a total of $523.06.\u001b[0m\n",
"\n",
"\u001b[1m> Finished chain.\u001b[0m\n"
]
},
{
"data": {
"text/plain": [
"'The country with the highest total sales is the USA, with a total of $523.06.'"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"agent_executor.run(\n",
" \"List the total sales per country. Which country's customers spent the most?\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at the [LangSmith trace](https://smith.langchain.com/public/a86dbe17-5782-4020-bce6-2de85343168a/r), we can see:\n",
"\n",
"* The agent is using a ReAct style prompt\n",
"* First, it will look at the tables: `Action: sql_db_list_tables` using tool `sql_db_list_tables`\n",
"* Given the tables as an observation, it `thinks` and then determinates the next `action`:\n",
"\n",
"```\n",
"Observation: Album, Artist, Customer, Employee, Genre, Invoice, InvoiceLine, MediaType, Playlist, PlaylistTrack, Track\n",
"Thought: I should query the schema of the Invoice and Customer tables.\n",
"Action: sql_db_schema\n",
"Action Input: Invoice, Customer\n",
"```\n",
"\n",
"* It then formulates the query using the schema from tool `sql_db_schema`\n",
"\n",
"```\n",
"Thought: I should query the total sales per country.\n",
"Action: sql_db_query\n",
"Action Input: SELECT Country, SUM(Total) AS TotalSales FROM Invoice INNER JOIN Customer ON Invoice.CustomerId = Customer.CustomerId GROUP BY Country ORDER BY TotalSales DESC LIMIT 10\n",
"```\n",
"\n",
"* It finally executes the generated query using tool `sql_db_query`\n",
"\n",
"![sql_usecase.png](/img/SQLDatabaseToolkit.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Agent task example #2 - Describing a Table"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
"\u001b[32;1m\u001b[1;3mAction: sql_db_list_tables\n",
"Action Input: \u001b[0m\n",
"Observation: \u001b[38;5;200m\u001b[1;3mAlbum, Artist, Customer, Employee, Genre, Invoice, InvoiceLine, MediaType, Playlist, PlaylistTrack, Track\u001b[0m\n",
"Thought:\u001b[32;1m\u001b[1;3m The PlaylistTrack table is the most relevant to the question.\n",
"Action: sql_db_schema\n",
"Action Input: PlaylistTrack\u001b[0m\n",
"Observation: \u001b[33;1m\u001b[1;3m\n",
"CREATE TABLE \"PlaylistTrack\" (\n",
"\t\"PlaylistId\" INTEGER NOT NULL, \n",
"\t\"TrackId\" INTEGER NOT NULL, \n",
"\tPRIMARY KEY (\"PlaylistId\", \"TrackId\"), \n",
"\tFOREIGN KEY(\"TrackId\") REFERENCES \"Track\" (\"TrackId\"), \n",
"\tFOREIGN KEY(\"PlaylistId\") REFERENCES \"Playlist\" (\"PlaylistId\")\n",
")\n",
"\n",
"/*\n",
"3 rows from PlaylistTrack table:\n",
"PlaylistId\tTrackId\n",
"1\t3402\n",
"1\t3389\n",
"1\t3390\n",
"*/\u001b[0m\n",
"Thought:\u001b[32;1m\u001b[1;3m I now know the final answer\n",
"Final Answer: The PlaylistTrack table contains two columns, PlaylistId and TrackId, which are both integers and form a primary key. It also has two foreign keys, one to the Track table and one to the Playlist table.\u001b[0m\n",
"\n",
"\u001b[1m> Finished chain.\u001b[0m\n"
]
},
{
"data": {
"text/plain": [
"'The PlaylistTrack table contains two columns, PlaylistId and TrackId, which are both integers and form a primary key. It also has two foreign keys, one to the Track table and one to the Playlist table.'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"agent_executor.run(\"Describe the playlisttrack table\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Go deeper\n",
"\n",
"To learn more about the SQL Agent and how it works we refer to the [SQL Agent Toolkit](/docs/integrations/toolkits/sql_database) documentation.\n",
"\n",
"You can also check Agents for other document types:\n",
"- [Pandas Agent](/docs/integrations/toolkits/pandas.html)\n",
"- [CSV Agent](/docs/integrations/toolkits/csv.html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Elastic Search\n",
"\n",
"Going beyond the above use-case, there are integrations with other databases.\n",
"\n",
"For example, we can interact with Elasticsearch analytics database. \n",
"\n",
"This chain builds search queries via the Elasticsearch DSL API (filters and aggregations).\n",
"\n",
"The Elasticsearch client must have permissions for index listing, mapping description and search queries.\n",
"\n",
"See [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html) for instructions on how to run Elasticsearch locally.\n",
"\n",
"Make sure to install the Elasticsearch Python client before:\n",
"\n",
"```sh\n",
"pip install elasticsearch\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"from elasticsearch import Elasticsearch\n",
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.chains.elasticsearch_database import ElasticsearchDatabaseChain"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Initialize Elasticsearch python client.\n",
"# See https://elasticsearch-py.readthedocs.io/en/v8.8.2/api.html#elasticsearch.Elasticsearch\n",
"ELASTIC_SEARCH_SERVER = \"https://elastic:pass@localhost:9200\"\n",
"db = Elasticsearch(ELASTIC_SEARCH_SERVER)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Uncomment the next cell to initially populate your db."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# customers = [\n",
"# {\"firstname\": \"Jennifer\", \"lastname\": \"Walters\"},\n",
"# {\"firstname\": \"Monica\",\"lastname\":\"Rambeau\"},\n",
"# {\"firstname\": \"Carol\",\"lastname\":\"Danvers\"},\n",
"# {\"firstname\": \"Wanda\",\"lastname\":\"Maximoff\"},\n",
"# {\"firstname\": \"Jennifer\",\"lastname\":\"Takeda\"},\n",
"# ]\n",
"# for i, customer in enumerate(customers):\n",
"# db.create(index=\"customers\", document=customer, id=i)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"llm = ChatOpenAI(model_name=\"gpt-4\", temperature=0)\n",
"chain = ElasticsearchDatabaseChain.from_llm(llm=llm, database=db, verbose=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"question = \"What are the first names of all the customers?\"\n",
"chain.run(question)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can customize the prompt."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.chains.elasticsearch_database.prompts import DEFAULT_DSL_TEMPLATE\n",
"from langchain.prompts.prompt import PromptTemplate\n",
"\n",
"PROMPT_TEMPLATE = \"\"\"Given an input question, create a syntactically correct Elasticsearch query to run. Unless the user specifies in their question a specific number of examples they wish to obtain, always limit your query to at most {top_k} results. You can order the results by a relevant column to return the most interesting examples in the database.\n",
"\n",
"Unless told to do not query for all the columns from a specific index, only ask for a the few relevant columns given the question.\n",
"\n",
"Pay attention to use only the column names that you can see in the mapping description. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which index. Return the query as valid json.\n",
"\n",
"Use the following format:\n",
"\n",
"Question: Question here\n",
"ESQuery: Elasticsearch Query formatted as json\n",
"\"\"\"\n",
"\n",
"PROMPT = PromptTemplate.from_template(\n",
" PROMPT_TEMPLATE,\n",
")\n",
"chain = ElasticsearchDatabaseChain.from_llm(llm=llm, database=db, query_prompt=PROMPT)"
]
}
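As with the SQL chain, the prompt can be grounded in concrete data by including sample documents from each index. A sketch (the `sample_documents_in_index_info` parameter is used the same way as in the standalone Elasticsearch chain notebook; `db` and `ChatOpenAI` come from the cells above):

```python
# Include 2 sample documents from each index in the prompt so the LLM can see
# how the data is shaped before writing its query.
chain = ElasticsearchDatabaseChain.from_llm(
    llm=ChatOpenAI(temperature=0),
    database=db,
    sample_documents_in_index_info=2,
)
chain.run("What are the customers' last names?")
```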
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -1,206 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "dd7ec7af",
"metadata": {},
"source": [
"# Elasticsearch database\n",
"\n",
"Interact with Elasticsearch analytics database via Langchain. This chain builds search queries via the Elasticsearch DSL API (filters and aggregations).\n",
"\n",
"The Elasticsearch client must have permissions for index listing, mapping description and search queries.\n",
"\n",
"See [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html) for instructions on how to run Elasticsearch locally.\n",
"\n",
"Make sure to install the Elasticsearch Python client before:\n",
"\n",
"```sh\n",
"pip install elasticsearch\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "dd8eae75",
"metadata": {},
"outputs": [],
"source": [
"from elasticsearch import Elasticsearch\n",
"\n",
"from langchain.chains.elasticsearch_database import ElasticsearchDatabaseChain\n",
"from langchain.chat_models import ChatOpenAI"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "5cde03bc",
"metadata": {},
"outputs": [],
"source": [
"# Initialize Elasticsearch python client.\n",
"# See https://elasticsearch-py.readthedocs.io/en/v8.8.2/api.html#elasticsearch.Elasticsearch\n",
"ELASTIC_SEARCH_SERVER = \"https://elastic:pass@localhost:9200\"\n",
"db = Elasticsearch(ELASTIC_SEARCH_SERVER)"
]
},
{
"cell_type": "markdown",
"id": "74a41374",
"metadata": {},
"source": [
"Uncomment the next cell to initially populate your db."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "430ada0f",
"metadata": {},
"outputs": [],
"source": [
"# customers = [\n",
"# {\"firstname\": \"Jennifer\", \"lastname\": \"Walters\"},\n",
"# {\"firstname\": \"Monica\",\"lastname\":\"Rambeau\"},\n",
"# {\"firstname\": \"Carol\",\"lastname\":\"Danvers\"},\n",
"# {\"firstname\": \"Wanda\",\"lastname\":\"Maximoff\"},\n",
"# {\"firstname\": \"Jennifer\",\"lastname\":\"Takeda\"},\n",
"# ]\n",
"# for i, customer in enumerate(customers):\n",
"# db.create(index=\"customers\", document=customer, id=i)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "f36ae0d8",
"metadata": {},
"outputs": [],
"source": [
"llm = ChatOpenAI(model_name=\"gpt-4\", temperature=0)\n",
"chain = ElasticsearchDatabaseChain.from_llm(llm=llm, database=db, verbose=True)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "b5d22d9d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"\u001b[1m> Entering new ElasticsearchDatabaseChain chain...\u001b[0m\n",
"What are the first names of all the customers?\n",
"ESQuery:\u001b[32;1m\u001b[1;3m{'size': 10, 'query': {'match_all': {}}, '_source': ['firstname']}\u001b[0m\n",
"ESResult: \u001b[33;1m\u001b[1;3m{'took': 5, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 6, 'relation': 'eq'}, 'max_score': 1.0, 'hits': [{'_index': 'customers', '_id': '0', '_score': 1.0, '_source': {'firstname': 'Jennifer'}}, {'_index': 'customers', '_id': '1', '_score': 1.0, '_source': {'firstname': 'Monica'}}, {'_index': 'customers', '_id': '2', '_score': 1.0, '_source': {'firstname': 'Carol'}}, {'_index': 'customers', '_id': '3', '_score': 1.0, '_source': {'firstname': 'Wanda'}}, {'_index': 'customers', '_id': '4', '_score': 1.0, '_source': {'firstname': 'Jennifer'}}, {'_index': 'customers', '_id': 'firstname', '_score': 1.0, '_source': {'firstname': 'Jennifer'}}]}}\u001b[0m\n",
"Answer:\u001b[32;1m\u001b[1;3mThe first names of all the customers are Jennifer, Monica, Carol, Wanda, and Jennifer.\u001b[0m\n",
"\u001b[1m> Finished chain.\u001b[0m\n"
]
},
{
"data": {
"text/plain": [
"'The first names of all the customers are Jennifer, Monica, Carol, Wanda, and Jennifer.'"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"question = \"What are the first names of all the customers?\"\n",
"chain.run(question)"
]
},
{
"cell_type": "markdown",
"id": "9b4bfada",
"metadata": {},
"source": [
"## Custom prompt\n",
"\n",
"For best results you'll likely need to customize the prompt."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "0a494f5b",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chains.elasticsearch_database.prompts import DEFAULT_DSL_TEMPLATE\n",
"from langchain.prompts.prompt import PromptTemplate\n",
"\n",
"PROMPT_TEMPLATE = \"\"\"Given an input question, create a syntactically correct Elasticsearch query to run. Unless the user specifies in their question a specific number of examples they wish to obtain, always limit your query to at most {top_k} results. You can order the results by a relevant column to return the most interesting examples in the database.\n",
"\n",
"Unless told to do not query for all the columns from a specific index, only ask for a the few relevant columns given the question.\n",
"\n",
"Pay attention to use only the column names that you can see in the mapping description. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which index. Return the query as valid json.\n",
"\n",
"Use the following format:\n",
"\n",
"Question: Question here\n",
"ESQuery: Elasticsearch Query formatted as json\n",
"\"\"\"\n",
"\n",
"PROMPT = PromptTemplate.from_template(\n",
" PROMPT_TEMPLATE,\n",
")\n",
"chain = ElasticsearchDatabaseChain.from_llm(llm=llm, database=db, query_prompt=PROMPT)"
]
},
{
"cell_type": "markdown",
"id": "372b8f93",
"metadata": {},
"source": [
"## Adding example rows from each index\n",
"\n",
"Sometimes, the format of the data is not obvious and it is optimal to include a sample of rows from the indices in the prompt to allow the LLM to understand the data before providing a final query. Here we will use this feature to let the LLM know that artists are saved with their full names by providing ten rows from the index."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eef818de",
"metadata": {},
"outputs": [],
"source": [
"chain = ElasticsearchDatabaseChain.from_llm(\n",
" llm=ChatOpenAI(temperature=0),\n",
" database=db,\n",
" sample_documents_in_index_info=2, # 2 rows from each index will be included in the prompt as sample data\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "venv"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -1,35 +0,0 @@
---
sidebar_position: 1
---
# Analyzing structured data
Lots of data and information is stored in tabular data, whether it be csvs, excel sheets, or SQL tables.
This page covers all resources available in LangChain for working with data in this format.
## Document loading
If you have text data stored in a tabular format, you may want to load the data into a Document and then index it as you would
other text/unstructured data. For this, you should use a document loader like the [CSVLoader](/docs/modules/data_connection/document_loaders/how_to/csv.html)
and then you should [create an index](/docs/modules/data_connection) over that data, and [query it that way](/docs/use_cases/question_answering/how_to/vector_db_qa.html).
## Querying
If you have more numeric tabular data, or have a large amount of data and don't want to index it, you should get started
by looking at various chains and agents we have for dealing with this data.
### Chains
If you are just getting started, and you have relatively small/simple tabular data, you should get started with chains.
Chains are a sequence of predetermined steps, so they are good to get started with as they give you more control and let you
understand what is happening better.
- [SQL Database Chain](/docs/use_cases/tabular/sqlite.html)
### Agents
Agents are more complex, and involve multiple queries to the LLM to understand what to do.
The downside of agents are that you have less control. The upside is that they are more powerful,
which allows you to use them on larger databases and more complex schemas.
- [SQL Agent](/docs/integrations/toolkits/sql_database.html)
- [Pandas Agent](/docs/integrations/toolkits/pandas.html)
- [CSV Agent](/docs/integrations/toolkits/csv.html)

View File

@@ -1,125 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "c04293ac",
"metadata": {},
"source": [
"# SQL Query\n",
"\n",
"This notebook walks through how to load and run a chain that constructs SQL queries that can be run against your database to answer a question. Note that this ONLY constructs the query and does not run it."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "e9063a93",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chains import create_sql_query_chain\n",
"\n",
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.utilities import SQLDatabase"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "a1ff5cee",
"metadata": {},
"outputs": [],
"source": [
"db = SQLDatabase.from_uri(\"sqlite:///../../../../notebooks/Chinook.db\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "cb04579f",
"metadata": {},
"outputs": [],
"source": [
"chain = create_sql_query_chain(ChatOpenAI(temperature=0), db)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "744e6210",
"metadata": {},
"outputs": [],
"source": [
"response = chain.invoke({\"question\":\"How many employees are there\"})"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "28f984f1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"SELECT COUNT(*) FROM Employee\n"
]
}
],
"source": [
"print(response)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "08de511c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'[(8,)]'"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"db.run(response)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e3a006a7",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -44,6 +44,7 @@ This allows for a few different ways to customize, including passing in a custom
```python
from langchain.schema import SystemMessage
from langchain.agents import OpenAIFunctionsAgent
system_message = SystemMessage(content="You are a very powerful assistant, but bad at calculating lengths of words.")
prompt = OpenAIFunctionsAgent.create_prompt(system_message=system_message)
```
@@ -51,7 +52,6 @@ prompt = OpenAIFunctionsAgent.create_prompt(system_message=system_message)
Putting those pieces together, we can now create the agent.
```python
from langchain.agents import OpenAIFunctionsAgent
agent = OpenAIFunctionsAgent(llm=llm, tools=tools, prompt=prompt)
```
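From here, the agent is typically wrapped in an `AgentExecutor` together with the same tools. A minimal sketch, assuming `llm` and `tools` are the objects defined earlier on this page:

```python
from langchain.agents import AgentExecutor

agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
agent_executor.run("How many letters are in the word 'education'?")
```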

View File

@@ -8,7 +8,6 @@ from langchain.chains import ConversationalRetrievalChain
Load in documents. You can replace this with a loader for whatever type of data you want
```python
from langchain.document_loaders import TextLoader
loader = TextLoader("../../state_of_the_union.txt")
@@ -17,7 +16,6 @@ documents = loader.load()
If you had multiple loaders that you wanted to combine, you do something like:
```python
# loaders = [....]
# docs = []
@@ -27,7 +25,6 @@ If you had multiple loaders that you wanted to combine, you do something like:
We now split the documents, create embeddings for them, and put them in a vectorstore. This allows us to do semantic search over them.
```python
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(documents)
@@ -46,7 +43,6 @@ vectorstore = Chroma.from_documents(documents, embeddings)
We can now create a memory object, which is necessary to track the inputs/outputs and hold a conversation.
```python
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
@@ -54,18 +50,15 @@ memory = ConversationBufferMemory(memory_key="chat_history", return_messages=Tru
We now initialize the `ConversationalRetrievalChain`
```python
qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0), vectorstore.as_retriever(), memory=memory)
```
```python
query = "What did the president say about Ketanji Brown Jackson"
result = qa({"question": query})
```
```python
result["answer"]
```
@@ -78,13 +71,11 @@ result["answer"]
</CodeOutputBlock>
```python
query = "Did he mention who she succeeded"
result = qa({"question": query})
```
```python
result['answer']
```
@@ -101,21 +92,18 @@ result['answer']
In the above example, we used a Memory object to track chat history. We can also just pass it in explicitly. In order to do this, we need to initialize a chain without any memory object.
```python
qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0), vectorstore.as_retriever())
```
Here's an example of asking a question with no chat history
```python
chat_history = []
query = "What did the president say about Ketanji Brown Jackson"
result = qa({"question": query, "chat_history": chat_history})
```
```python
result["answer"]
```
@@ -130,14 +118,12 @@ result["answer"]
Here's an example of asking a question with some chat history
```python
chat_history = [(query, result["answer"])]
query = "Did he mention who she succeeded"
result = qa({"question": query, "chat_history": chat_history})
```
```python
result['answer']
```
@@ -154,12 +140,10 @@ result['answer']
This chain has two steps. First, it condenses the current question and the chat history into a standalone question. This is necessary to create a standalone vector to use for retrieval. After that, it does retrieval and then answers the question using retrieval-augmented generation with a separate model. Part of the power of the declarative nature of LangChain is that you can easily use a separate language model for each call. This makes it possible to use a cheaper and faster model for the simpler task of condensing the question, and a more expensive model for answering it. Here is an example of doing so.
```python
from langchain.chat_models import ChatOpenAI
```
```python
qa = ConversationalRetrievalChain.from_llm(
ChatOpenAI(temperature=0, model="gpt-4"),
@@ -168,36 +152,90 @@ qa = ConversationalRetrievalChain.from_llm(
)
```
```python
chat_history = []
query = "What did the president say about Ketanji Brown Jackson"
result = qa({"question": query, "chat_history": chat_history})
```
```python
chat_history = [(query, result["answer"])]
query = "Did he mention who she succeeded"
result = qa({"question": query, "chat_history": chat_history})
```
## Using a custom prompt for condensing the question
By default, `ConversationalRetrievalChain` uses `CONDENSE_QUESTION_PROMPT` to condense the question. Here is how that prompt is defined:
```python
from langchain.prompts.prompt import PromptTemplate
_template = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.
Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""
CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(_template)
```
Instead of this, any custom template can be used to augment the question with additional information or to instruct the LLM to do something. Here is an example:
```python
from langchain.prompts.prompt import PromptTemplate
```
```python
custom_template = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question. At the end of standalone question add this 'Answer the question in German language.' If you do not know the answer reply with 'I am sorry'.
Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""
```
```python
CUSTOM_QUESTION_PROMPT = PromptTemplate.from_template(custom_template)
```
```python
model = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.3)
embeddings = OpenAIEmbeddings()
vectordb = Chroma(embedding_function=embeddings, persist_directory=directory)
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
qa = ConversationalRetrievalChain.from_llm(
model,
vectordb.as_retriever(),
condense_question_prompt=CUSTOM_QUESTION_PROMPT,
memory=memory
)
```
```python
query = "What did the president say about Ketanji Brown Jackson"
result = qa({"question": query})
```
```python
query = "Did he mention who she succeeded"
result = qa({"question": query})
```
## Return Source Documents
You can also easily return source documents from the ConversationalRetrievalChain. This is useful for when you want to inspect what documents were returned.
```python
qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0), vectorstore.as_retriever(), return_source_documents=True)
```
```python
chat_history = []
query = "What did the president say about Ketanji Brown Jackson"
result = qa({"question": query, "chat_history": chat_history})
```
```python
result['source_documents'][0]
```
@@ -211,14 +249,13 @@ result['source_documents'][0]
</CodeOutputBlock>
## ConversationalRetrievalChain with `search_distance`
If you are using a vector store that supports filtering by search distance, you can add a threshold value parameter.
```python
vectordbkwargs = {"search_distance": 0.9}
```
```python
qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0), vectorstore.as_retriever(), return_source_documents=True)
chat_history = []
@@ -227,8 +264,8 @@ result = qa({"question": query, "chat_history": chat_history, "vectordbkwargs":
```
## ConversationalRetrievalChain with `map_reduce`
We can also use different types of combine-document chains with the `ConversationalRetrievalChain`.
```python
from langchain.chains import LLMChain
@@ -236,7 +273,6 @@ from langchain.chains.question_answering import load_qa_chain
from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT
```
```python
llm = OpenAI(temperature=0)
question_generator = LLMChain(llm=llm, prompt=CONDENSE_QUESTION_PROMPT)
@@ -249,14 +285,12 @@ chain = ConversationalRetrievalChain(
)
```
```python
chat_history = []
query = "What did the president say about Ketanji Brown Jackson"
result = chain({"question": query, "chat_history": chat_history})
```
```python
result['answer']
```
@@ -273,12 +307,10 @@ result['answer']
You can also use this chain with the question answering with sources chain.
```python
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
```
```python
llm = OpenAI(temperature=0)
question_generator = LLMChain(llm=llm, prompt=CONDENSE_QUESTION_PROMPT)
@@ -291,14 +323,12 @@ chain = ConversationalRetrievalChain(
)
```
```python
chat_history = []
query = "What did the president say about Ketanji Brown Jackson"
result = chain({"question": query, "chat_history": chat_history})
```
```python
result['answer']
```
@@ -315,7 +345,6 @@ result['answer']
Output from the chain will be streamed to `stdout` token by token in this example.
```python
from langchain.chains.llm import LLMChain
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
@@ -334,7 +363,6 @@ qa = ConversationalRetrievalChain(
retriever=vectorstore.as_retriever(), combine_docs_chain=doc_chain, question_generator=question_generator)
```
```python
chat_history = []
query = "What did the president say about Ketanji Brown Jackson"
@@ -349,7 +377,6 @@ result = qa({"question": query, "chat_history": chat_history})
</CodeOutputBlock>
```python
chat_history = [(query, result["answer"])]
query = "Did he mention who she succeeded"
@@ -365,8 +392,8 @@ result = qa({"question": query, "chat_history": chat_history})
</CodeOutputBlock>
## get_chat_history Function
You can also specify a `get_chat_history` function, which can be used to format the chat_history string.
```python
def get_chat_history(inputs) -> str:
@@ -377,14 +404,12 @@ def get_chat_history(inputs) -> str:
qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0), vectorstore.as_retriever(), get_chat_history=get_chat_history)
```
```python
chat_history = []
query = "What did the president say about Ketanji Brown Jackson"
result = qa({"question": query, "chat_history": chat_history})
```
```python
result['answer']
```

View File

@@ -117,3 +117,38 @@ qa.run(query)
```
</CodeOutputBlock>
## Vectorstore Retriever Options
You can adjust how documents are retrieved from your vectorstore depending on the specific task.
There are two main ways to retrieve documents relevant to a query: Similarity Search and Max Marginal Relevance Search (MMR Search). Similarity Search is the default, but you can use MMR by adding the `search_type` parameter:
```python
docsearch.as_retriever(search_type="mmr")
```
You can also modify the search by passing specific search arguments through the retriever to the search function, using the `search_kwargs` keyword argument.
- `k` defines how many documents are returned; defaults to 4.
- `score_threshold` allows you to set a minimum relevance for documents returned by the retriever, if you are using the "similarity_score_threshold" search type.
- `fetch_k` determines the amount of documents to pass to the MMR algorithm; defaults to 20.
- `lambda_mult` controls the diversity of results returned by the MMR algorithm, with 1 being minimum diversity and 0 being maximum. Defaults to 0.5.
- `filter` allows you to define a filter on what documents should be retrieved, based on the documents' metadata. This has no effect if the Vectorstore doesn't store any metadata.
Some examples for how these parameters can be used:
```python
# Retrieve more documents with higher diversity- useful if your dataset has many similar documents
docsearch.as_retriever(search_type="mmr", search_kwargs={'k': 6, 'lambda_mult': 0.25})
# Fetch more documents for the MMR algorithm to consider, but only return the top 5
docsearch.as_retriever(search_type="mmr", search_kwargs={'k': 5, 'fetch_k': 50})
# Only retrieve documents that have a relevance score above a certain threshold
docsearch.as_retriever(search_type="similarity_score_threshold", search_kwargs={'score_threshold': 0.8})
# Only get the single most similar document from the dataset
docsearch.as_retriever(search_kwargs={'k': 1})
# Use a filter to only retrieve documents from a specific paper
docsearch.as_retriever(search_kwargs={'filter': {'paper_title':'GPT-4 Technical Report'}})
```

View File

@@ -3,7 +3,7 @@ Additionally, we can return the source documents used to answer the question by
```python
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=docsearch.as_retriever(), return_source_documents=True)
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=docsearch.as_retriever(search_type="mmr", search_kwargs={'fetch_k': 30}), return_source_documents=True)
```

View File

@@ -159,6 +159,22 @@ index.vectorstore.as_retriever()
</CodeOutputBlock>
It can also be convenient to filter the vectorstore by the metadata associated with documents, particularly when your vectorstore has multiple sources. This can be done using the `query` method like so:
```python
index.query("Summarize the general content of this document.", retriever_kwargs={"search_kwargs": {"filter": {"source": "../state_of_the_union.txt"}}})
```
<CodeOutputBlock lang="python">
```
" The document is a speech given by President Trump to the nation on the occasion of his 245th birthday. The speech highlights the importance of American values and the challenges facing the country, including the ongoing conflict in Ukraine, the ongoing trade war with China, and the ongoing conflict in Syria. The speech also discusses the importance of investing in emerging technologies and American manufacturing, and calls on Congress to pass the Bipartisan Innovation Act and other important legislation."
```
</CodeOutputBlock>
## Walkthrough
Okay, so what's actually going on? How is this index getting created?

View File

@@ -7,6 +7,10 @@ from unittest import mock
import pydantic
import pytest
from langchain import OpenAI
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts.prompt import PromptTemplate
from langchain_experimental.cpal.base import (
CausalChain,
CPALChain,
@@ -35,10 +39,6 @@ from langchain_experimental.cpal.templates.univariate.narrative import (
from langchain_experimental.cpal.templates.univariate.query import (
template as query_template,
)
from langchain import OpenAI
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts.prompt import PromptTemplate
from tests.unit_tests.llms.fake_llm import FakeLLM

View File

@@ -1,12 +1,11 @@
"""Test SQL Database Chain."""
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine, insert
from langchain.chains.sql_database.base import (
from langchain.llms.openai import OpenAI
from langchain.utilities.sql_database import SQLDatabase
from libs.experimental.langchain_experimental.sql.base import (
SQLDatabaseChain,
SQLDatabaseSequentialChain,
)
from langchain.llms.openai import OpenAI
from langchain.utilities.sql_database import SQLDatabase
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine, insert
metadata_obj = MetaData()

View File

@@ -59,6 +59,9 @@ test_watch:
integration_tests:
poetry run pytest tests/integration_tests
scheduled_tests:
poetry run pytest -m scheduled tests/integration_tests
docker_tests:
docker build -t my-langchain-image:test .
docker run --rm my-langchain-image:test

View File

@@ -0,0 +1,22 @@
"""Helper functions for managing the LangChain API.
This module is only relevant for LangChain developers, not for users.
.. warning::
This module and its submodules are for internal use only. Do not use them
in your own code. We may change the API at any time with no warning.
"""
from .deprecation import (
LangChainDeprecationWarning,
deprecated,
suppress_langchain_deprecation_warning,
)
__all__ = [
"deprecated",
"LangChainDeprecationWarning",
"suppress_langchain_deprecation_warning",
]

View File

@@ -0,0 +1,306 @@
"""Helper functions for deprecating parts of the LangChain API.
This module was adapted from matplotlib's _api/deprecation.py module:
https://github.com/matplotlib/matplotlib/blob/main/lib/matplotlib/_api/deprecation.py
.. warning::
This module is for internal use only. Do not use it in your own code.
We may change the API at any time with no warning.
"""
import contextlib
import functools
import inspect
import warnings
from typing import Any, Callable, Generator, Type, TypeVar
class LangChainDeprecationWarning(DeprecationWarning):
"""A class for issuing deprecation warnings for LangChain users."""
def _warn_deprecated(
since: str,
*,
message: str = "",
name: str = "",
alternative: str = "",
pending: bool = False,
obj_type: str = "",
addendum: str = "",
removal: str = "",
) -> None:
"""Display a standardized deprecation.
Arguments:
since : str
The release at which this API became deprecated.
message : str, optional
Override the default deprecation message. The %(since)s,
%(name)s, %(alternative)s, %(obj_type)s, %(addendum)s,
and %(removal)s format specifiers will be replaced by the
values of the respective arguments passed to this function.
name : str, optional
The name of the deprecated object.
alternative : str, optional
An alternative API that the user may use in place of the
deprecated API. The deprecation warning will tell the user
about this alternative if provided.
pending : bool, optional
If True, uses a PendingDeprecationWarning instead of a
DeprecationWarning. Cannot be used together with removal.
obj_type : str, optional
The object type being deprecated.
addendum : str, optional
Additional text appended directly to the final message.
removal : str, optional
The expected removal version. With the default (an empty
string), a removal version is automatically computed from
since. Set to other Falsy values to not schedule a removal
date. Cannot be used together with pending.
"""
if pending and removal:
raise ValueError("A pending deprecation cannot have a scheduled removal")
if not pending:
if not removal:
removal = f"in {removal}" if removal else "within ?? minor releases"
raise NotImplementedError(
f"Need to determine which default deprecation schedule to use. "
f"{removal}"
)
else:
removal = f"in {removal}"
if not message:
message = ""
if obj_type:
message += f"The {obj_type} `{name}`"
else:
message += f"`{name}`"
if pending:
message += " will be deprecated in a future version"
else:
message += f" was deprecated in LangChain {since}"
if removal:
message += f" and will be removed {removal}"
if alternative:
message += f". Use {alternative} instead."
if addendum:
message += f" {addendum}"
warning_cls = PendingDeprecationWarning if pending else LangChainDeprecationWarning
warning = warning_cls(message)
warnings.warn(warning, category=LangChainDeprecationWarning, stacklevel=2)
# PUBLIC API
T = TypeVar("T", Type, Callable)
def deprecated(
since: str,
*,
message: str = "",
name: str = "",
alternative: str = "",
pending: bool = False,
obj_type: str = "",
addendum: str = "",
removal: str = "",
) -> Callable[[T], T]:
"""Decorator to mark a function, a class, or a property as deprecated.
When deprecating a classmethod, a staticmethod, or a property, the
``@deprecated`` decorator should go *under* ``@classmethod`` and
``@staticmethod`` (i.e., `deprecated` should directly decorate the
underlying callable), but *over* ``@property``.
When deprecating a class ``C`` intended to be used as a base class in a
multiple inheritance hierarchy, ``C`` *must* define an ``__init__`` method
(if ``C`` instead inherited its ``__init__`` from its own base class, then
``@deprecated`` would mess up ``__init__`` inheritance when installing its
own (deprecation-emitting) ``C.__init__``).
Parameters are the same as for `warn_deprecated`, except that *obj_type*
defaults to 'class' if decorating a class, 'attribute' if decorating a
property, and 'function' otherwise.
Arguments:
since : str
The release at which this API became deprecated.
message : str, optional
Override the default deprecation message. The %(since)s,
%(name)s, %(alternative)s, %(obj_type)s, %(addendum)s,
and %(removal)s format specifiers will be replaced by the
values of the respective arguments passed to this function.
name : str, optional
The name of the deprecated object.
alternative : str, optional
An alternative API that the user may use in place of the
deprecated API. The deprecation warning will tell the user
about this alternative if provided.
pending : bool, optional
If True, uses a PendingDeprecationWarning instead of a
DeprecationWarning. Cannot be used together with removal.
obj_type : str, optional
The object type being deprecated.
addendum : str, optional
Additional text appended directly to the final message.
removal : str, optional
The expected removal version. With the default (an empty
string), a removal version is automatically computed from
since. Set to other Falsy values to not schedule a removal
date. Cannot be used together with pending.
Examples
--------
.. code-block:: python
@deprecated('1.4.0')
def the_function_to_deprecate():
pass
"""
def deprecate(
obj: T,
*,
_obj_type: str = obj_type,
_name: str = name,
_message: str = message,
_alternative: str = alternative,
_pending: bool = pending,
_addendum: str = addendum,
) -> T:
"""Implementation of the decorator returned by `deprecated`."""
if isinstance(obj, type):
if not _obj_type:
_obj_type = "class"
wrapped = obj.__init__ # type: ignore
_name = _name or obj.__name__
old_doc = obj.__doc__
def finalize(wrapper: Callable[..., Any], new_doc: str) -> T:
"""Finalize the deprecation of a class."""
try:
obj.__doc__ = new_doc
except AttributeError: # Can't set on some extension objects.
pass
obj.__init__ = functools.wraps(obj.__init__)( # type: ignore[misc]
wrapper
)
return obj
elif isinstance(obj, property):
if not _obj_type:
_obj_type = "attribute"
wrapped = None
_name = _name or obj.fget.__name__
old_doc = obj.__doc__
class _deprecated_property(type(obj)): # type: ignore
"""A deprecated property."""
def __get__(self, instance, owner=None): # type: ignore
if instance is not None or owner is not None:
emit_warning()
return super().__get__(instance, owner)
def __set__(self, instance, value): # type: ignore
if instance is not None:
emit_warning()
return super().__set__(instance, value)
def __delete__(self, instance): # type: ignore
if instance is not None:
emit_warning()
return super().__delete__(instance)
def __set_name__(self, owner, set_name): # type: ignore
nonlocal _name
if _name == "<lambda>":
_name = set_name
def finalize(_: Any, new_doc: str) -> Any: # type: ignore
"""Finalize the property."""
return _deprecated_property(
fget=obj.fget, fset=obj.fset, fdel=obj.fdel, doc=new_doc
)
else:
if not _obj_type:
_obj_type = "function"
wrapped = obj
_name = _name or obj.__name__ # type: ignore
old_doc = wrapped.__doc__
def finalize( # type: ignore
wrapper: Callable[..., Any], new_doc: str
) -> T:
"""Wrap the wrapped function using the wrapper and update the docstring.
Args:
wrapper: The wrapper function.
new_doc: The new docstring.
Returns:
The wrapped function.
"""
wrapper = functools.wraps(wrapped)(wrapper)
wrapper.__doc__ = new_doc
return wrapper
def emit_warning() -> None:
"""Emit the warning."""
_warn_deprecated(
since,
message=_message,
name=_name,
alternative=_alternative,
pending=_pending,
obj_type=_obj_type,
addendum=_addendum,
removal=removal,
)
def warning_emitting_wrapper(*args: Any, **kwargs: Any) -> Any:
"""Wrapper for the original wrapped callable that emits a warning.
Args:
*args: The positional arguments to the function.
**kwargs: The keyword arguments to the function.
Returns:
The return value of the function being wrapped.
"""
emit_warning()
return wrapped(*args, **kwargs)
old_doc = inspect.cleandoc(old_doc or "").strip("\n")
if not old_doc:
new_doc = "[*Deprecated*]"
else:
new_doc = f"[*Deprecated*] {old_doc}"
return finalize(warning_emitting_wrapper, new_doc)
return deprecate
@contextlib.contextmanager
def suppress_langchain_deprecation_warning() -> Generator[None, None, None]:
"""Context manager to suppress LangChainDeprecationWarning."""
with warnings.catch_warnings():
warnings.simplefilter("ignore", LangChainDeprecationWarning)
yield
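To illustrate how these helpers fit together, here is a hedged usage sketch; the names `old_helper` and `LegacyThing` and the version strings are illustrative only, and the import path assumes the package exposes this module as `langchain._api`:

```python
from langchain._api import deprecated, suppress_langchain_deprecation_warning

@deprecated("0.1.0", alternative="new_helper", removal="0.2.0")
def old_helper() -> str:
    """Return a placeholder value."""
    return "legacy"

class LegacyThing:
    @classmethod
    @deprecated("0.1.0", removal="0.2.0")  # @deprecated goes *under* @classmethod
    def from_defaults(cls) -> "LegacyThing":
        return cls()

    @deprecated("0.1.0", removal="0.2.0")  # ...but *over* @property
    @property
    def value(self) -> int:
        return 42

with suppress_langchain_deprecation_warning():
    old_helper()  # any LangChainDeprecationWarning raised here is suppressed
```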

View File

@@ -17,7 +17,7 @@ from langchain.agents.agent_toolkits.gmail.toolkit import GmailToolkit
from langchain.agents.agent_toolkits.jira.toolkit import JiraToolkit
from langchain.agents.agent_toolkits.json.base import create_json_agent
from langchain.agents.agent_toolkits.json.toolkit import JsonToolkit
from langchain.agents.agent_toolkits.multion.base import create_multion_agent
from langchain.agents.agent_toolkits.multion.toolkit import MultionToolkit
from langchain.agents.agent_toolkits.nla.toolkit import NLAToolkit
from langchain.agents.agent_toolkits.office365.toolkit import O365Toolkit
from langchain.agents.agent_toolkits.openapi.base import create_openapi_agent
@@ -52,6 +52,7 @@ __all__ = [
"GmailToolkit",
"JiraToolkit",
"JsonToolkit",
"MultionToolkit",
"NLAToolkit",
"O365Toolkit",
"OpenAPIToolkit",
@@ -65,7 +66,6 @@ __all__ = [
"ZapierToolkit",
"create_csv_agent",
"create_json_agent",
"create_multion_agent",
"create_openapi_agent",
"create_pandas_dataframe_agent",
"create_pbi_agent",

View File

@@ -1,58 +0,0 @@
"""MultiOn agent."""
from typing import Any, Dict, Optional
from langchain.agents.agent import AgentExecutor, BaseSingleActionAgent
from langchain.agents.agent_toolkits.python.prompt import PREFIX
from langchain.agents.mrkl.base import ZeroShotAgent
from langchain.agents.openai_functions_agent.base import OpenAIFunctionsAgent
from langchain.agents.types import AgentType
from langchain.base_language import BaseLanguageModel
from langchain.callbacks.base import BaseCallbackManager
from langchain.chains.llm import LLMChain
from langchain.schema import SystemMessage
from langchain.tools.multion.tool import MultionClientTool
def create_multion_agent(
llm: BaseLanguageModel,
tool: MultionClientTool,
agent_type: AgentType = AgentType.ZERO_SHOT_REACT_DESCRIPTION,
callback_manager: Optional[BaseCallbackManager] = None,
verbose: bool = False,
prefix: str = PREFIX,
agent_executor_kwargs: Optional[Dict[str, Any]] = None,
**kwargs: Dict[str, Any],
) -> AgentExecutor:
"""Construct a multion agent from an LLM and tool."""
tools = [tool]
agent: BaseSingleActionAgent
if agent_type == AgentType.ZERO_SHOT_REACT_DESCRIPTION:
prompt = ZeroShotAgent.create_prompt(tools, prefix=prefix)
llm_chain = LLMChain(
llm=llm,
prompt=prompt,
callback_manager=callback_manager,
)
tool_names = [tool.name for tool in tools]
agent = ZeroShotAgent(llm_chain=llm_chain, allowed_tools=tool_names, **kwargs)
elif agent_type == AgentType.OPENAI_FUNCTIONS:
system_message = SystemMessage(content=prefix)
_prompt = OpenAIFunctionsAgent.create_prompt(system_message=system_message)
agent = OpenAIFunctionsAgent(
llm=llm,
prompt=_prompt,
tools=[tool],
callback_manager=callback_manager,
**kwargs,
)
else:
raise ValueError(f"Agent type {agent_type} not supported at the moment.")
return AgentExecutor.from_agent_and_tools(
agent=agent,
tools=tools,
callback_manager=callback_manager,
verbose=verbose,
**(agent_executor_kwargs or {}),
)

View File

@@ -0,0 +1,22 @@
"""MultiOn agent."""
from __future__ import annotations
from typing import List
from langchain.agents.agent_toolkits.base import BaseToolkit
from langchain.tools import BaseTool
from langchain.tools.multion.create_session import MultionCreateSession
from langchain.tools.multion.update_session import MultionUpdateSession
class MultionToolkit(BaseToolkit):
"""Toolkit for interacting with the Browser Agent"""
class Config:
"""Pydantic config."""
arbitrary_types_allowed = True
def get_tools(self) -> List[BaseTool]:
"""Get the tools in the toolkit."""
return [MultionCreateSession(), MultionUpdateSession()]
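As a hedged sketch of wiring this toolkit into an agent (the prompt is a placeholder, and it assumes the `multion` client is installed and logged in):

```python
from langchain.agents import AgentType, initialize_agent
from langchain.agents.agent_toolkits import MultionToolkit
from langchain.chat_models import ChatOpenAI

# Expose the MultiOn create/update session tools to a structured-chat agent.
toolkit = MultionToolkit()
agent = initialize_agent(
    tools=toolkit.get_tools(),
    llm=ChatOpenAI(temperature=0),
    agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
)
agent.run("Open https://www.langchain.com in a new browsing session")
```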

View File

@@ -97,6 +97,8 @@ def reduce_openapi_spec(spec: dict, dereference: bool = True) -> ReducedOpenAPIS
]
if "200" in docs["responses"]:
out["responses"] = docs["responses"]["200"]
if docs.get("requestBody"):
out["requestBody"] = docs.get("requestBody")
return out
endpoints = [

View File

@@ -11,6 +11,7 @@ from langchain.callbacks.manager import Callbacks
from langchain.chains.api import news_docs, open_meteo_docs, podcast_docs, tmdb_docs
from langchain.chains.api.base import APIChain
from langchain.chains.llm_math.base import LLMMathChain
from langchain.utilities.dalle_image_generator import DallEAPIWrapper
from langchain.utilities.requests import TextRequestsWrapper
from langchain.tools.arxiv.tool import ArxivQueryRun
from langchain.tools.golden_query.tool import GoldenQueryRun
@@ -196,7 +197,7 @@ def _get_golden_query(**kwargs: Any) -> BaseTool:
return GoldenQueryRun(api_wrapper=GoldenQueryAPIWrapper(**kwargs))
def _get_pupmed(**kwargs: Any) -> BaseTool:
def _get_pubmed(**kwargs: Any) -> BaseTool:
return PubmedQueryRun(api_wrapper=PubMedAPIWrapper(**kwargs))
@@ -221,6 +222,14 @@ def _get_serpapi(**kwargs: Any) -> BaseTool:
)
def _get_dalle_image_generator(**kwargs: Any) -> Tool:
return Tool(
"Dall-E Image Generator",
DallEAPIWrapper(**kwargs).run,
"A wrapper around OpenAI DALL-E API. Useful for when you need to generate images from a text description. Input should be an image description.",
)
def _get_twilio(**kwargs: Any) -> BaseTool:
return Tool(
name="Text Message",
@@ -305,6 +314,7 @@ _EXTRA_OPTIONAL_TOOLS: Dict[str, Tuple[Callable[[KwArg(Any)], BaseTool], List[st
["serper_api_key", "aiosession"],
),
"serpapi": (_get_serpapi, ["serpapi_api_key", "aiosession"]),
"dalle-image-generator": (_get_dalle_image_generator, ["openai_api_key"]),
"twilio": (_get_twilio, ["account_sid", "auth_token", "from_number"]),
"searx-search": (_get_searx_search, ["searx_host", "engines", "aiosession"]),
"wikipedia": (_get_wikipedia, ["top_k_results", "lang"]),
@@ -313,10 +323,7 @@ _EXTRA_OPTIONAL_TOOLS: Dict[str, Tuple[Callable[[KwArg(Any)], BaseTool], List[st
["top_k_results", "load_max_docs", "load_all_available_meta"],
),
"golden-query": (_get_golden_query, ["golden_api_key"]),
"pupmed": (
_get_pupmed,
["top_k_results", "load_max_docs", "load_all_available_meta"],
),
"pubmed": (_get_pubmed, ["top_k_results"]),
"human": (_get_human_tool, ["prompt_func", "input_func"]),
"awslambda": (
_get_lambda_api,

View File

@@ -227,7 +227,6 @@ def trace_as_chain_group(
cm = CallbackManager.configure(
inheritable_callbacks=cb,
inheritable_tags=tags,
example_id=example_id,
)
run_manager = cm.on_chain_start({"name": group_name}, {})
@@ -1274,7 +1273,6 @@ class CallbackManager(BaseCallbackManager):
local_tags: Optional[List[str]] = None,
inheritable_metadata: Optional[Dict[str, Any]] = None,
local_metadata: Optional[Dict[str, Any]] = None,
example_id: Optional[Union[str, UUID]] = None,
) -> CallbackManager:
"""Configure the callback manager.
@@ -1292,7 +1290,6 @@ class CallbackManager(BaseCallbackManager):
metadata. Defaults to None.
local_metadata (Optional[Dict[str, Any]], optional): The local metadata.
Defaults to None.
example_id (Optional[UUID], optional): The example ID. Defaults to None.
Returns:
CallbackManager: The configured callback manager.
@@ -1306,7 +1303,6 @@ class CallbackManager(BaseCallbackManager):
local_tags,
inheritable_metadata,
local_metadata,
example_id,
)
@@ -1569,7 +1565,6 @@ class AsyncCallbackManager(BaseCallbackManager):
local_tags: Optional[List[str]] = None,
inheritable_metadata: Optional[Dict[str, Any]] = None,
local_metadata: Optional[Dict[str, Any]] = None,
example_id: Optional[Union[str, UUID]] = None,
) -> AsyncCallbackManager:
"""Configure the async callback manager.
@@ -1587,7 +1582,6 @@ class AsyncCallbackManager(BaseCallbackManager):
metadata. Defaults to None.
local_metadata (Optional[Dict[str, Any]], optional): The local metadata.
Defaults to None.
example_id (Optional[UUID], optional): The ID of the example. Defaults to None.
Returns:
AsyncCallbackManager: The configured async callback manager.
@@ -1601,7 +1595,6 @@ class AsyncCallbackManager(BaseCallbackManager):
local_tags,
inheritable_metadata,
local_metadata,
example_id=example_id,
)
@@ -1634,7 +1627,6 @@ def _configure(
local_tags: Optional[List[str]] = None,
inheritable_metadata: Optional[Dict[str, Any]] = None,
local_metadata: Optional[Dict[str, Any]] = None,
example_id: Optional[Union[str, UUID]] = None,
) -> T:
"""Configure the callback manager.
@@ -1652,12 +1644,10 @@ def _configure(
metadata. Defaults to None.
local_metadata (Optional[Dict[str, Any]], optional): The local metadata.
Defaults to None.
example_id (Optional[UUID], optional): The example ID. Defaults to None.
Returns:
T: The configured callback manager.
"""
example_id = UUID(example_id) if isinstance(example_id, str) else example_id
callback_manager = callback_manager_cls(handlers=[])
if inheritable_callbacks or local_callbacks:
if isinstance(inheritable_callbacks, list) or inheritable_callbacks is None:
@@ -1754,16 +1744,10 @@ def _configure(
for handler in callback_manager.handlers
):
if tracer_v2:
if example_id:
# This can get ugly since we don't manage the un-setting
# of the example_id
tracer_v2.example_id = example_id
callback_manager.add_handler(tracer_v2, True)
else:
try:
handler = LangChainTracer(
project_name=tracer_project, example_id=example_id
)
handler = LangChainTracer(project_name=tracer_project)
callback_manager.add_handler(handler, True)
except Exception as e:
logger.warning(
@@ -1772,13 +1756,6 @@ def _configure(
" unset the LANGCHAIN_TRACING_V2 environment variables.",
e,
)
elif tracing_v2_enabled_ and example_id:
# This can get ugly since we don't manage the un-setting
# of the example_id
for handler in callback_manager.handlers:
if isinstance(handler, LangChainTracer):
handler.example_id = example_id
break
if open_ai is not None and not any(
isinstance(handler, OpenAICallbackHandler)
for handler in callback_manager.handlers

View File

@@ -6,7 +6,6 @@ import warnings
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict, List, Optional, Union
from uuid import UUID
import yaml
from pydantic import Field, root_validator, validator
@@ -207,7 +206,6 @@ class Chain(Serializable, Runnable[Dict[str, Any], Dict[str, Any]], ABC):
tags: Optional[List[str]] = None,
metadata: Optional[Dict[str, Any]] = None,
include_run_info: bool = False,
example_id: Optional[UUID] = None,
) -> Dict[str, Any]:
"""Execute the chain.
@@ -243,7 +241,6 @@ class Chain(Serializable, Runnable[Dict[str, Any], Dict[str, Any]], ABC):
self.tags,
metadata,
self.metadata,
example_id=example_id,
)
new_arg_supported = inspect.signature(self._call).parameters.get("run_manager")
run_manager = callback_manager.on_chain_start(
@@ -276,7 +273,6 @@ class Chain(Serializable, Runnable[Dict[str, Any], Dict[str, Any]], ABC):
tags: Optional[List[str]] = None,
metadata: Optional[Dict[str, Any]] = None,
include_run_info: bool = False,
example_id: Optional[UUID] = None,
) -> Dict[str, Any]:
"""Asynchronously execute the chain.
@@ -298,7 +294,6 @@ class Chain(Serializable, Runnable[Dict[str, Any], Dict[str, Any]], ABC):
metadata: Optional metadata associated with the chain. Defaults to None
include_run_info: Whether to include run info in the response. Defaults
to False.
example_id: Optional UUID of the example being processed. Defaults to None.
Returns:
A dict of named outputs. Should contain all outputs specified in
@@ -313,7 +308,6 @@ class Chain(Serializable, Runnable[Dict[str, Any], Dict[str, Any]], ABC):
self.tags,
metadata,
self.metadata,
example_id=example_id,
)
new_arg_supported = inspect.signature(self._acall).parameters.get("run_manager")
run_manager = await callback_manager.on_chain_start(

View File

@@ -1,7 +1,16 @@
"""Methods for creating chains that use OpenAI function-calling APIs."""
import inspect
import re
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple, Type, Union
from typing import (
Any,
Callable,
Dict,
List,
Optional,
Sequence,
Tuple,
Type,
Union,
)
from pydantic import BaseModel
@@ -25,8 +34,7 @@ PYTHON_TO_JSON_TYPES = {
def _get_python_function_name(function: Callable) -> str:
"""Get the name of a Python function."""
source = inspect.getsource(function)
return re.search(r"^def (.*)\(", source).groups()[0] # type: ignore
return function.__name__
def _parse_python_function_docstring(function: Callable) -> Tuple[str, dict]:
@@ -94,10 +102,16 @@ def _get_python_function_required_args(function: Callable) -> List[str]:
spec = inspect.getfullargspec(function)
required = spec.args[: -len(spec.defaults)] if spec.defaults else spec.args
required += [k for k in spec.kwonlyargs if k not in (spec.kwonlydefaults or {})]
is_class = type(function) is type
if is_class and required[0] == "self":
required = required[1:]
return required
def convert_python_function_to_openai_function(function: Callable) -> Dict[str, Any]:
def convert_python_function_to_openai_function(
function: Callable,
) -> Dict[str, Any]:
"""Convert a Python function to an OpenAI function-calling API compatible dict.
Assumes the Python function has type hints and a docstring with a description. If

View File

@@ -83,6 +83,7 @@ def _load_stuff_chain(
document_variable_name=document_variable_name,
verbose=verbose,
callback_manager=callback_manager,
callbacks=callbacks,
**kwargs,
)
@@ -209,6 +210,7 @@ def _load_refine_chain(
initial_response_name=initial_response_name,
verbose=verbose,
callback_manager=callback_manager,
callbacks=callbacks,
**kwargs,
)

View File

@@ -18,6 +18,7 @@ an interface where "chat messages" are the inputs and outputs.
""" # noqa: E501
from langchain.chat_models.anthropic import ChatAnthropic
from langchain.chat_models.anyscale import ChatAnyscale
from langchain.chat_models.azure_openai import AzureChatOpenAI
from langchain.chat_models.fake import FakeListChatModel
from langchain.chat_models.google_palm import ChatGooglePalm
@@ -39,4 +40,5 @@ __all__ = [
"ChatVertexAI",
"JinaChat",
"HumanInputChatModel",
"ChatAnyscale",
]

View File

@@ -0,0 +1,192 @@
"""Anyscale Endpoints chat wrapper. Relies heavily on ChatOpenAI."""
from __future__ import annotations
import logging
import os
import sys
from typing import TYPE_CHECKING, Optional, Set
import requests
from pydantic import Field, root_validator
from langchain.chat_models.openai import (
ChatOpenAI,
_convert_message_to_dict,
_import_tiktoken,
)
from langchain.schema.messages import BaseMessage
from langchain.utils import get_from_dict_or_env
if TYPE_CHECKING:
import tiktoken
logger = logging.getLogger(__name__)
DEFAULT_API_BASE = "https://api.endpoints.anyscale.com/v1"
DEFAULT_MODEL = "meta-llama/Llama-2-7b-chat-hf"
class ChatAnyscale(ChatOpenAI):
"""Wrapper around Anyscale Chat large language models.
To use, you should have the ``openai`` python package installed, and the
environment variable ``ANYSCALE_API_KEY`` set with your API key.
Alternatively, you can use the anyscale_api_key keyword argument.
Any parameters that are valid to be passed to the `openai.create` call can be passed
in, even if not explicitly saved on this class.
Example:
.. code-block:: python
from langchain.chat_models import ChatAnyscale
chat = ChatAnyscale(model_name="meta-llama/Llama-2-7b-chat-hf")
"""
@property
def _llm_type(self) -> str:
"""Return type of chat model."""
return "anyscale-chat"
@property
def lc_secrets(self) -> dict[str, str]:
return {"anyscale_api_key": "ANYSCALE_API_KEY"}
anyscale_api_key: Optional[str] = None
"""AnyScale Endpoints API keys."""
model_name: str = Field(default=DEFAULT_MODEL, alias="model")
"""Model name to use."""
anyscale_api_base: str = Field(default=DEFAULT_API_BASE)
"""Base URL path for API requests,
leave blank if not using a proxy or service emulator."""
anyscale_proxy: Optional[str] = None
"""To support explicit proxy for Anyscale."""
available_models: Optional[Set[str]] = None
"""Available models from Anyscale API."""
@staticmethod
def get_available_models(
anyscale_api_key: Optional[str] = None,
anyscale_api_base: str = DEFAULT_API_BASE,
) -> Set[str]:
"""Get available models from Anyscale API."""
try:
anyscale_api_key = anyscale_api_key or os.environ["ANYSCALE_API_KEY"]
except KeyError as e:
raise ValueError(
"Anyscale API key must be passed as keyword argument or "
"set in environment variable ANYSCALE_API_KEY.",
) from e
models_url = f"{anyscale_api_base}/models"
models_response = requests.get(
models_url,
headers={
"Authorization": f"Bearer {anyscale_api_key}",
},
)
if models_response.status_code != 200:
raise ValueError(
f"Error getting models from {models_url}: "
f"{models_response.status_code}",
)
return {model["id"] for model in models_response.json()["data"]}
@root_validator(pre=True)
def validate_environment_override(cls, values: dict) -> dict:
"""Validate that api key and python package exists in environment."""
values["openai_api_key"] = get_from_dict_or_env(
values,
"anyscale_api_key",
"ANYSCALE_API_KEY",
)
values["openai_api_base"] = get_from_dict_or_env(
values,
"anyscale_api_base",
"ANYSCALE_API_BASE",
default=DEFAULT_API_BASE,
)
values["openai_proxy"] = get_from_dict_or_env(
values,
"anyscale_proxy",
"ANYSCALE_PROXY",
default="",
)
try:
import openai
except ImportError as e:
raise ValueError(
"Could not import openai python package. "
"Please install it with `pip install openai`.",
) from e
try:
values["client"] = openai.ChatCompletion
except AttributeError as exc:
raise ValueError(
"`openai` has no `ChatCompletion` attribute, this is likely "
"due to an old version of the openai package. Try upgrading it "
"with `pip install --upgrade openai`.",
) from exc
if "model_name" not in values.keys():
values["model_name"] = DEFAULT_MODEL
model_name = values["model_name"]
available_models = cls.get_available_models(
values["openai_api_key"],
values["openai_api_base"],
)
if model_name not in available_models:
raise ValueError(
f"Model name {model_name} not found in available models: "
f"{available_models}.",
)
values["available_models"] = available_models
return values
def _get_encoding_model(self) -> tuple[str, tiktoken.Encoding]:
tiktoken_ = _import_tiktoken()
if self.tiktoken_model_name is not None:
model = self.tiktoken_model_name
else:
model = self.model_name
# Returns the number of tokens used by a list of messages.
try:
encoding = tiktoken_.encoding_for_model("gpt-3.5-turbo-0301")
except KeyError:
logger.warning("Warning: model not found. Using cl100k_base encoding.")
model = "cl100k_base"
encoding = tiktoken_.get_encoding(model)
return model, encoding
def get_num_tokens_from_messages(self, messages: list[BaseMessage]) -> int:
"""Calculate num tokens with tiktoken package.
Official documentation: https://github.com/openai/openai-cookbook/blob/
main/examples/How_to_format_inputs_to_ChatGPT_models.ipynb"""
if sys.version_info[1] <= 7:
return super().get_num_tokens_from_messages(messages)
model, encoding = self._get_encoding_model()
tokens_per_message = 3
tokens_per_name = 1
num_tokens = 0
messages_dict = [_convert_message_to_dict(m) for m in messages]
for message in messages_dict:
num_tokens += tokens_per_message
for key, value in message.items():
# Cast str(value) in case the message value is not a string
# This occurs with function messages
num_tokens += len(encoding.encode(str(value)))
if key == "name":
num_tokens += tokens_per_name
# every reply is primed with <im_start>assistant
num_tokens += 3
return num_tokens
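A minimal usage sketch, assuming `ANYSCALE_API_KEY` is set in the environment and the default Llama 2 endpoint is available:

```python
from langchain.chat_models import ChatAnyscale
from langchain.schema.messages import HumanMessage

chat = ChatAnyscale(model_name="meta-llama/Llama-2-7b-chat-hf", temperature=0.7)
response = chat([HumanMessage(content="Say hello in one short sentence.")])
print(response.content)
```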

View File

@@ -158,7 +158,6 @@ class BaseChatModel(BaseLanguageModel[BaseMessageChunk], ABC):
self.tags,
config.get("metadata"),
self.metadata,
example_id=config.get("example_id"),
)
(run_manager,) = callback_manager.on_chat_model_start(
dumpd(self), [messages], invocation_params=params, options=options

View File

@@ -381,9 +381,10 @@ class ChatOpenAI(BaseChatModel):
messages: List[BaseMessage],
stop: Optional[List[str]] = None,
run_manager: Optional[CallbackManagerForLLMRun] = None,
stream: Optional[bool] = None,
**kwargs: Any,
) -> ChatResult:
if self.streaming:
if stream if stream is not None else self.streaming:
generation: Optional[ChatGenerationChunk] = None
for chunk in self._stream(
messages=messages, stop=stop, run_manager=run_manager, **kwargs
@@ -454,9 +455,10 @@ class ChatOpenAI(BaseChatModel):
messages: List[BaseMessage],
stop: Optional[List[str]] = None,
run_manager: Optional[AsyncCallbackManagerForLLMRun] = None,
stream: Optional[bool] = None,
**kwargs: Any,
) -> ChatResult:
if self.streaming:
if stream if stream is not None else self.streaming:
generation: Optional[ChatGenerationChunk] = None
async for chunk in self._astream(
messages=messages, stop=stop, run_manager=run_manager, **kwargs

View File

@@ -43,13 +43,16 @@ class PromptLayerChatOpenAI(ChatOpenAI):
messages: List[BaseMessage],
stop: Optional[List[str]] = None,
run_manager: Optional[CallbackManagerForLLMRun] = None,
stream: Optional[bool] = None,
**kwargs: Any
) -> ChatResult:
"""Call ChatOpenAI generate and then call PromptLayer API to log the request."""
from promptlayer.utils import get_api_key, promptlayer_api_request
request_start_time = datetime.datetime.now().timestamp()
generated_responses = super()._generate(messages, stop, run_manager, **kwargs)
generated_responses = super()._generate(
messages, stop, run_manager, stream=stream, **kwargs
)
request_end_time = datetime.datetime.now().timestamp()
message_dicts, params = super()._create_message_dicts(messages, stop)
for i, generation in enumerate(generated_responses.generations):
@@ -82,13 +85,16 @@ class PromptLayerChatOpenAI(ChatOpenAI):
messages: List[BaseMessage],
stop: Optional[List[str]] = None,
run_manager: Optional[AsyncCallbackManagerForLLMRun] = None,
stream: Optional[bool] = None,
**kwargs: Any
) -> ChatResult:
"""Call ChatOpenAI agenerate and then call PromptLayer to log."""
from promptlayer.utils import get_api_key, promptlayer_api_request_async
request_start_time = datetime.datetime.now().timestamp()
generated_responses = await super()._agenerate(messages, stop, run_manager)
generated_responses = await super()._agenerate(
messages, stop, run_manager, stream=stream, **kwargs
)
request_end_time = datetime.datetime.now().timestamp()
message_dicts, params = super()._create_message_dicts(messages, stop)
for i, generation in enumerate(generated_responses.generations):

View File

@@ -111,7 +111,7 @@ class ChatVertexAI(_VertexAICommon, BaseChatModel):
values["client"] = ChatModel.from_pretrained(values["model_name"])
except ImportError:
raise_vertex_import_error(minimum_expected_version="1.28.0")
raise_vertex_import_error(minimum_expected_version="1.29.0")
return values
def _generate(
@@ -155,7 +155,7 @@ class ChatVertexAI(_VertexAICommon, BaseChatModel):
context=context, message_history=history.history, **params
)
else:
chat = self.client.start_chat(**params)
chat = self.client.start_chat(message_history=history.history, **params)
response = chat.send_message(question.content)
text = self._enforce_stop_words(response.text, stop)
return ChatResult(generations=[ChatGeneration(message=AIMessage(content=text))])

View File

@@ -1,6 +1,6 @@
"""Interface for accessing the place that stores documents."""
from abc import ABC, abstractmethod
from typing import Dict, Union
from typing import Dict, List, Union
from langchain.docstore.document import Document
@@ -16,6 +16,10 @@ class Docstore(ABC):
If page does not exist, return similar entries.
"""
def delete(self, ids: List) -> None:
"""Delete IDs from the in-memory dictionary."""
raise NotImplementedError
class AddableMixin(ABC):
"""Mixin class that supports adding texts."""

View File

@@ -1,5 +1,5 @@
"""Simple in memory docstore in the form of a dict."""
from typing import Dict, Optional, Union
from typing import Dict, List, Optional, Union
from langchain.docstore.base import AddableMixin, Docstore
from langchain.docstore.document import Document
@@ -26,6 +26,14 @@ class InMemoryDocstore(Docstore, AddableMixin):
raise ValueError(f"Tried to add ids that already exist: {overlapping}")
self._dict = {**self._dict, **texts}
def delete(self, ids: List) -> None:
"""Delete IDs from the in-memory dictionary."""
overlapping = set(ids).intersection(self._dict)
if not overlapping:
raise ValueError(f"Tried to delete ids that do not exist: {ids}")
for _id in ids:
self._dict.pop(_id)
def search(self, search: str) -> Union[str, Document]:
"""Search via direct lookup.

View File

@@ -122,6 +122,7 @@ from langchain.document_loaders.pdf import (
)
from langchain.document_loaders.powerpoint import UnstructuredPowerPointLoader
from langchain.document_loaders.psychic import PsychicLoader
from langchain.document_loaders.pubmed import PubMedLoader
from langchain.document_loaders.pyspark_dataframe import PySparkDataFrameLoader
from langchain.document_loaders.python import PythonLoader
from langchain.document_loaders.readthedocs import ReadTheDocsLoader
@@ -146,6 +147,7 @@ from langchain.document_loaders.telegram import (
)
from langchain.document_loaders.tencent_cos_directory import TencentCOSDirectoryLoader
from langchain.document_loaders.tencent_cos_file import TencentCOSFileLoader
from langchain.document_loaders.tensorflow_datasets import TensorflowDatasetLoader
from langchain.document_loaders.text import TextLoader
from langchain.document_loaders.tomarkdown import ToMarkdownLoader
from langchain.document_loaders.toml import TomlLoader
@@ -184,13 +186,14 @@ PagedPDFSplitter = PyPDFLoader
TelegramChatLoader = TelegramChatFileLoader
__all__ = [
"AcreomLoader",
"AsyncHtmlLoader",
"AZLyricsLoader",
"AcreomLoader",
"AirbyteJSONLoader",
"AirtableLoader",
"AmazonTextractPDFLoader",
"ApifyDatasetLoader",
"ArxivLoader",
"AsyncHtmlLoader",
"AzureBlobStorageContainerLoader",
"AzureBlobStorageFileLoader",
"BSHTMLLoader",
@@ -207,10 +210,11 @@ __all__ = [
"ChatGPTLoader",
"CoNLLULoader",
"CollegeConfidentialLoader",
"ConcurrentLoader",
"ConfluenceLoader",
"CubeSemanticLoader",
"DatadogLogsLoader",
"DataFrameLoader",
"DatadogLogsLoader",
"DiffbotLoader",
"DirectoryLoader",
"DiscordChatLoader",
@@ -246,12 +250,12 @@ __all__ = [
"JSONLoader",
"JoplinLoader",
"LarkSuiteDocLoader",
"MHTMLLoader",
"MWDumpLoader",
"MastodonTootsLoader",
"MathpixPDFLoader",
"MaxComputeLoader",
"MergedDataLoader",
"MHTMLLoader",
"ModernTreasuryLoader",
"NewsURLLoader",
"NotebookLoader",
@@ -263,26 +267,27 @@ __all__ = [
"OneDriveFileLoader",
"OneDriveLoader",
"OnlinePDFLoader",
"OutlookMessageLoader",
"OpenCityDataLoader",
"OutlookMessageLoader",
"PDFMinerLoader",
"PDFMinerPDFasHTMLLoader",
"PDFPlumberLoader",
"PagedPDFSplitter",
"PlaywrightURLLoader",
"PsychicLoader",
"PubMedLoader",
"PyMuPDFLoader",
"PyPDFDirectoryLoader",
"PyPDFLoader",
"PyPDFium2Loader",
"PySparkDataFrameLoader",
"PythonLoader",
"RSSFeedLoader",
"ReadTheDocsLoader",
"RecursiveUrlLoader",
"RedditPostsLoader",
"RoamLoader",
"RocksetLoader",
"RSSFeedLoader",
"S3DirectoryLoader",
"S3FileLoader",
"SRTLoader",
@@ -292,11 +297,12 @@ __all__ = [
"SnowflakeLoader",
"SpreedlyLoader",
"StripeLoader",
"TencentCOSDirectoryLoader",
"TencentCOSFileLoader",
"TelegramChatApiLoader",
"TelegramChatFileLoader",
"TelegramChatLoader",
"TensorflowDatasetLoader",
"TencentCOSDirectoryLoader",
"TencentCOSFileLoader",
"TextLoader",
"ToMarkdownLoader",
"TomlLoader",
@@ -330,6 +336,4 @@ __all__ = [
"XorbitsLoader",
"YoutubeAudioLoader",
"YoutubeLoader",
"ConcurrentLoader",
"AmazonTextractPDFLoader",
]

View File

@@ -0,0 +1,190 @@
"""Loads documents by running Airbyte source connectors built on the Airbyte CDK."""
from typing import Any, Callable, Iterator, List, Mapping, Optional
from langchain.utils.utils import guard_import
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader
RecordHandler = Callable[[Any, Optional[str]], Document]
class AirbyteCDKLoader(BaseLoader):
"""Loads records using an Airbyte source connector implemented using the CDK."""
def __init__(
self,
config: Mapping[str, Any],
source_class: Any,
stream_name: str,
record_handler: Optional[RecordHandler] = None,
state: Optional[Any] = None,
) -> None:
from airbyte_cdk.models.airbyte_protocol import AirbyteRecordMessage
from airbyte_cdk.sources.embedded.base_integration import (
BaseEmbeddedIntegration,
)
from airbyte_cdk.sources.embedded.runner import CDKRunner
class CDKIntegration(BaseEmbeddedIntegration):
def _handle_record(
self, record: AirbyteRecordMessage, id: Optional[str]
) -> Document:
if record_handler:
return record_handler(record, id)
return Document(page_content="", metadata=record.data)
self._integration = CDKIntegration(
config=config,
runner=CDKRunner(source=source_class(), name=source_class.__name__),
)
self._stream_name = stream_name
self._state = state
def load(self) -> List[Document]:
return list(self.lazy_load())
def lazy_load(self) -> Iterator[Document]:
return self._integration._load_data(
stream_name=self._stream_name, state=self._state
)
class AirbyteHubspotLoader(AirbyteCDKLoader):
def __init__(
self,
config: Mapping[str, Any],
stream_name: str,
record_handler: Optional[RecordHandler] = None,
state: Optional[Any] = None,
) -> None:
source_class = guard_import(
"source_hubspot", pip_name="airbyte-source-hubspot"
).SourceHubspot
super().__init__(
config=config,
source_class=source_class,
stream_name=stream_name,
record_handler=record_handler,
state=state,
)
class AirbyteStripeLoader(AirbyteCDKLoader):
def __init__(
self,
config: Mapping[str, Any],
stream_name: str,
record_handler: Optional[RecordHandler] = None,
state: Optional[Any] = None,
) -> None:
source_class = guard_import(
"source_stripe", pip_name="airbyte-source-stripe"
).SourceStripe
super().__init__(
config=config,
source_class=source_class,
stream_name=stream_name,
record_handler=record_handler,
state=state,
)
class AirbyteTypeformLoader(AirbyteCDKLoader):
def __init__(
self,
config: Mapping[str, Any],
stream_name: str,
record_handler: Optional[RecordHandler] = None,
state: Optional[Any] = None,
) -> None:
source_class = guard_import(
"source_typeform", pip_name="airbyte-source-typeform"
).SourceTypeform
super().__init__(
config=config,
source_class=source_class,
stream_name=stream_name,
record_handler=record_handler,
state=state,
)
class AirbyteZendeskSupportLoader(AirbyteCDKLoader):
def __init__(
self,
config: Mapping[str, Any],
stream_name: str,
record_handler: Optional[RecordHandler] = None,
state: Optional[Any] = None,
) -> None:
source_class = guard_import(
"source_zendesk_support", pip_name="airbyte-source-zendesk-support"
).SourceZendeskSupport
super().__init__(
config=config,
source_class=source_class,
stream_name=stream_name,
record_handler=record_handler,
state=state,
)
class AirbyteShopifyLoader(AirbyteCDKLoader):
def __init__(
self,
config: Mapping[str, Any],
stream_name: str,
record_handler: Optional[RecordHandler] = None,
state: Optional[Any] = None,
) -> None:
source_class = guard_import(
"source_shopify", pip_name="airbyte-source-shopify"
).SourceShopify
super().__init__(
config=config,
source_class=source_class,
stream_name=stream_name,
record_handler=record_handler,
state=state,
)
class AirbyteSalesforceLoader(AirbyteCDKLoader):
def __init__(
self,
config: Mapping[str, Any],
stream_name: str,
record_handler: Optional[RecordHandler] = None,
state: Optional[Any] = None,
) -> None:
source_class = guard_import(
"source_salesforce", pip_name="airbyte-source-salesforce"
).SourceSalesforce
super().__init__(
config=config,
source_class=source_class,
stream_name=stream_name,
record_handler=record_handler,
state=state,
)
class AirbyteGongLoader(AirbyteCDKLoader):
def __init__(
self,
config: Mapping[str, Any],
stream_name: str,
record_handler: Optional[RecordHandler] = None,
state: Optional[Any] = None,
) -> None:
source_class = guard_import(
"source_gong", pip_name="airbyte-source-gong"
).SourceGong
super().__init__(
config=config,
source_class=source_class,
stream_name=stream_name,
record_handler=record_handler,
state=state,
)
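A minimal usage sketch for the loaders above (not part of this diff): the module path, the config keys, and the stream name are illustrative placeholders; the real config schema comes from the airbyte-source-stripe spec.

from langchain.document_loaders.airbyte import AirbyteStripeLoader
from langchain.schema import Document

# Placeholder config -- consult the airbyte-source-stripe spec for the exact keys.
config = {
    "client_secret": "<stripe_api_key>",
    "account_id": "<stripe_account_id>",
    "start_date": "2023-01-01T00:00:00Z",
}

def handle_record(record, id):
    # Optional record_handler: copy one field into page_content instead of
    # leaving it empty (by default all fields go into metadata only).
    return Document(page_content=str(record.data.get("description", "")), metadata=record.data)

loader = AirbyteStripeLoader(
    config=config,
    stream_name="invoices",        # any stream name exposed by the connector
    record_handler=handle_record,
)
docs = loader.load()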

View File

@@ -8,7 +8,6 @@ from langchain.utilities.arxiv import ArxivAPIWrapper
class ArxivLoader(BaseLoader):
"""Loads a query result from arxiv.org into a list of Documents.
Each document represents one Document.
The loader converts the original PDF format into text.
"""

View File

@@ -31,7 +31,7 @@ class BlackboardLoader(WebBaseLoader):
)
documents = loader.load()
"""
""" # noqa: E501
base_url: str
"""Base url of the blackboard course."""
@@ -47,6 +47,7 @@ class BlackboardLoader(WebBaseLoader):
load_all_recursively: bool = True,
basic_auth: Optional[Tuple[str, str]] = None,
cookies: Optional[dict] = None,
continue_on_failure: Optional[bool] = False,
):
"""Initialize with blackboard course url.
@@ -58,6 +59,10 @@ class BlackboardLoader(WebBaseLoader):
load_all_recursively: If True, load all documents recursively.
basic_auth: Basic auth credentials.
cookies: Cookies.
continue_on_failure: whether to continue loading the course documents if an error
occurs loading a url, emitting a warning instead of raising an
exception. Setting this to True makes the loader more robust, but also
may result in missing data. Default: False
Raises:
ValueError: If blackboard course url is invalid.
@@ -80,6 +85,7 @@ class BlackboardLoader(WebBaseLoader):
cookies.update({"BbRouter": bbrouter})
self.session.cookies.update(cookies)
self.load_all_recursively = load_all_recursively
self.continue_on_failure = continue_on_failure
self.check_bs4()
def check_bs4(self) -> None:

View File

@@ -1,5 +1,5 @@
"""Loading logic for loading documents from an GCS directory."""
from typing import List
from typing import Callable, List, Optional
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader
@@ -9,17 +9,27 @@ from langchain.document_loaders.gcs_file import GCSFileLoader
class GCSDirectoryLoader(BaseLoader):
"""Loads Documents from GCS."""
def __init__(self, project_name: str, bucket: str, prefix: str = ""):
def __init__(
self,
project_name: str,
bucket: str,
prefix: str = "",
loader_func: Optional[Callable[[str], BaseLoader]] = None,
):
"""Initialize with bucket and key name.
Args:
project_name: The name of the project for the GCS bucket.
bucket: The name of the GCS bucket.
prefix: The prefix of the GCS bucket.
loader_func: A loader function that instantiates a loader based on a
file_path argument. If nothing is provided, the GCSFileLoader
would use its default loader.
"""
self.project_name = project_name
self.bucket = bucket
self.prefix = prefix
self._loader_func = loader_func
def load(self) -> List[Document]:
"""Load documents."""
@@ -37,6 +47,8 @@ class GCSDirectoryLoader(BaseLoader):
# intermediate directories on the fly
if blob.name.endswith("/"):
continue
loader = GCSFileLoader(self.project_name, self.bucket, blob.name)
loader = GCSFileLoader(
self.project_name, self.bucket, blob.name, loader_func=self._loader_func
)
docs.extend(loader.load())
return docs
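A hedged sketch of the new loader_func argument (project, bucket, and prefix are placeholders): the directory loader now forwards loader_func to each GCSFileLoader it creates, so every blob under the prefix can be parsed with, for example, PyPDFLoader instead of the default UnstructuredFileLoader.

from langchain.document_loaders import GCSDirectoryLoader, PyPDFLoader

loader = GCSDirectoryLoader(
    project_name="my-project",
    bucket="my-bucket",
    prefix="reports/",
    loader_func=lambda file_path: PyPDFLoader(file_path),
)
docs = loader.load()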

View File

@@ -1,7 +1,7 @@
"""Load documents from a GCS file."""
import os
import tempfile
from typing import List
from typing import Callable, List, Optional
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader
@@ -11,18 +11,42 @@ from langchain.document_loaders.unstructured import UnstructuredFileLoader
class GCSFileLoader(BaseLoader):
"""Load Documents from a GCS file."""
def __init__(self, project_name: str, bucket: str, blob: str):
def __init__(
self,
project_name: str,
bucket: str,
blob: str,
loader_func: Optional[Callable[[str], BaseLoader]] = None,
):
"""Initialize with bucket and key name.
Args:
project_name: The name of the project to load
bucket: The name of the GCS bucket.
blob: The name of the GCS blob to load.
loader_func: A loader function that instantiates a loader based on a
file_path argument. If nothing is provided, the
UnstructuredFileLoader is used.
Examples:
To use an alternative PDF loader:
>> from langchain.document_loaders import PyPDFLoader
>> loader = GCSFileLoader(..., loader_func=PyPDFLoader)
To use UnstructuredFileLoader with additional arguments:
>> loader = GCSFileLoader(...,
>> loader_func=lambda x: UnstructuredFileLoader(x, mode="elements"))
"""
self.bucket = bucket
self.blob = blob
self.project_name = project_name
def default_loader_func(file_path: str) -> BaseLoader:
return UnstructuredFileLoader(file_path)
self._loader_func = loader_func if loader_func else default_loader_func
def load(self) -> List[Document]:
"""Load documents."""
try:
@@ -44,5 +68,9 @@ class GCSFileLoader(BaseLoader):
os.makedirs(os.path.dirname(file_path), exist_ok=True)
# Download the file to a destination
blob.download_to_filename(file_path)
loader = UnstructuredFileLoader(file_path)
return loader.load()
loader = self._loader_func(file_path)
docs = loader.load()
for doc in docs:
if "source" in doc.metadata:
doc.metadata["source"] = f"gs://{self.bucket}/{self.blob}"
return docs
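A hedged sketch of the second docstring example above (bucket and blob names are placeholders): keep UnstructuredFileLoader but pass extra arguments through loader_func; note that load() now rewrites each document's metadata["source"] to the gs:// URI.

from langchain.document_loaders import GCSFileLoader, UnstructuredFileLoader

loader = GCSFileLoader(
    project_name="my-project",
    bucket="my-bucket",
    blob="reports/q2.pdf",
    loader_func=lambda file_path: UnstructuredFileLoader(file_path, mode="elements"),
)
docs = loader.load()
# every doc.metadata["source"] is now "gs://my-bucket/reports/q2.pdf"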

View File

@@ -19,6 +19,7 @@ class GitbookLoader(WebBaseLoader):
load_all_paths: bool = False,
base_url: Optional[str] = None,
content_selector: str = "main",
continue_on_failure: Optional[bool] = False,
):
"""Initialize with web page and whether to load all paths.
@@ -31,6 +32,10 @@ class GitbookLoader(WebBaseLoader):
appended to this base url. Defaults to `web_page`.
content_selector: The CSS selector for the content to load.
Defaults to "main".
continue_on_failure: whether to continue loading the site if an error
occurs loading a url, emitting a warning instead of raising an
exception. Setting this to True makes the loader more robust, but also
may result in missing data. Default: False
"""
self.base_url = base_url or web_page
if self.base_url.endswith("/"):
@@ -43,6 +48,7 @@ class GitbookLoader(WebBaseLoader):
super().__init__(web_paths)
self.load_all_paths = load_all_paths
self.content_selector = content_selector
self.continue_on_failure = continue_on_failure
def load(self) -> List[Document]:
"""Fetch text from one single GitBook page."""

View File

@@ -43,7 +43,7 @@ class OBSDirectoryLoader(BaseLoader):
try:
from obs import ObsClient
except ImportError:
raise ValueError(
raise ImportError(
"Could not import esdk-obs-python python package. "
"Please install it with `pip install esdk-obs-python`."
)

View File

@@ -67,7 +67,7 @@ class OBSFileLoader(BaseLoader):
try:
from obs import ObsClient
except ImportError:
raise ValueError(
raise ImportError(
"Could not import esdk-obs-python python package. "
"Please install it with `pip install esdk-obs-python`."
)

View File

@@ -1,10 +1,13 @@
import logging
import time
from typing import Iterator, Optional
from typing import Dict, Iterator, Optional, Tuple
from langchain.document_loaders.base import BaseBlobParser
from langchain.document_loaders.blob_loaders import Blob
from langchain.schema import Document
logger = logging.getLogger(__name__)
class OpenAIWhisperParser(BaseBlobParser):
"""Transcribe and parse audio files.
@@ -77,12 +80,31 @@ class OpenAIWhisperParser(BaseBlobParser):
class OpenAIWhisperParserLocal(BaseBlobParser):
"""Transcribe and parse audio files.
Audio transcription is with OpenAI Whisper model locally from transformers
NOTE: By default uses the gpu if available, if you want to use cpu,
please set device = "cpu"
Audio transcription with OpenAI Whisper model locally from transformers
Parameters:
device - device to use
NOTE: By default uses the gpu if available,
if you want to use cpu, please set device = "cpu"
lang_model - whisper model to use, for example "openai/whisper-medium"
forced_decoder_ids - forced prompt ids for the decoder of the multilingual model,
usage example:
from transformers import WhisperProcessor
processor = WhisperProcessor.from_pretrained("openai/whisper-medium")
forced_decoder_ids = WhisperProcessor.get_decoder_prompt_ids(language="french",
task="transcribe")
forced_decoder_ids = WhisperProcessor.get_decoder_prompt_ids(language="french",
task="translate")
"""
def __init__(self, device: str = "0", lang_model: Optional[str] = None):
def __init__(
self,
device: str = "0",
lang_model: Optional[str] = None,
forced_decoder_ids: Optional[Tuple[Dict]] = None,
):
try:
from transformers import pipeline
except ImportError:
@@ -136,10 +158,19 @@ class OpenAIWhisperParserLocal(BaseBlobParser):
# load model for inference
self.pipe = pipeline(
"automatic-speech-recognition",
model="openai/whisper-medium",
model=self.lang_model,
chunk_length_s=30,
device=self.device,
)
if forced_decoder_ids is not None:
try:
self.pipe.model.config.forced_decoder_ids = forced_decoder_ids
except Exception as exception_text:
logger.info(
"Unable to set forced_decoder_ids parameter for whisper model"
f"Text of exception: {exception_text}"
"Therefore whisper model will use default mode for decoder"
)
def lazy_parse(self, blob: Blob) -> Iterator[Document]:
"""Lazily parse the blob."""
@@ -149,14 +180,14 @@ class OpenAIWhisperParserLocal(BaseBlobParser):
try:
from pydub import AudioSegment
except ImportError:
raise ValueError(
"pydub package not found, please install it with " "`pip install pydub`"
raise ImportError(
"pydub package not found, please install it with `pip install pydub`"
)
try:
import librosa
except ImportError:
raise ValueError(
raise ImportError(
"librosa package not found, please install it with "
"`pip install librosa`"
)
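A hedged sketch of the new forced_decoder_ids parameter (the audio path and module path are assumptions): build the ids with the transformers WhisperProcessor, as in the docstring above, and pass them to OpenAIWhisperParserLocal.

from transformers import WhisperProcessor
from langchain.document_loaders.blob_loaders import Blob
from langchain.document_loaders.parsers.audio import OpenAIWhisperParserLocal

processor = WhisperProcessor.from_pretrained("openai/whisper-medium")
forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="transcribe")

parser = OpenAIWhisperParserLocal(
    device="cpu",                          # or "0" to use the first GPU
    lang_model="openai/whisper-medium",
    forced_decoder_ids=forced_decoder_ids,
)
docs = list(parser.lazy_parse(Blob.from_path("interview.mp3")))  # placeholder path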

View File

@@ -183,7 +183,7 @@ class AmazonTextractPDFParser(BaseBlobParser):
else:
self.textract_features = []
except ImportError:
raise ModuleNotFoundError(
raise ImportError(
"Could not import amazon-textract-caller python package. "
"Please install it with `pip install amazon-textract-caller`."
)
@@ -194,7 +194,7 @@ class AmazonTextractPDFParser(BaseBlobParser):
self.boto3_textract_client = boto3.client("textract")
except ImportError:
raise ModuleNotFoundError(
raise ImportError(
"Could not import boto3 python package. "
"Please install it with `pip install boto3`."
)

View File

@@ -0,0 +1,39 @@
from typing import Iterator, List, Optional
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader
from langchain.utilities.pubmed import PubMedAPIWrapper
class PubMedLoader(BaseLoader):
"""Loads a query result from PubMed biomedical library into a list of Documents.
Attributes:
query: The query to be passed to the PubMed API.
load_max_docs: The maximum number of documents to load.
"""
def __init__(
self,
query: str,
load_max_docs: Optional[int] = 3,
):
"""Initialize the PubMedLoader.
Args:
query: The query to be passed to the PubMed API.
load_max_docs: The maximum number of documents to load.
Defaults to 3.
"""
self.query = query
self.load_max_docs = load_max_docs
self._client = PubMedAPIWrapper(
top_k_results=load_max_docs,
)
def load(self) -> List[Document]:
return list(self._client.lazy_load_docs(self.query))
def lazy_load(self) -> Iterator[Document]:
for doc in self._client.lazy_load_docs(self.query):
yield doc
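A short usage sketch for the new loader (the import path assumes it is re-exported from langchain.document_loaders):

from langchain.document_loaders import PubMedLoader

loader = PubMedLoader(query="crispr gene editing", load_max_docs=3)
docs = loader.load()            # eager: materializes the lazy iterator
for doc in loader.lazy_load():  # or stream the documents one at a time
    print(doc.metadata)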

View File

@@ -1,4 +1,6 @@
from typing import Iterator, List, Optional, Set
import asyncio
import re
from typing import Callable, Iterator, List, Optional, Set, Union
from urllib.parse import urljoin, urlparse
import requests
@@ -13,20 +15,117 @@ class RecursiveUrlLoader(BaseLoader):
def __init__(
self,
url: str,
max_depth: Optional[int] = None,
use_async: Optional[bool] = None,
extractor: Optional[Callable[[str], str]] = None,
exclude_dirs: Optional[List[str]] = None,
timeout: Optional[int] = None,
prevent_outside: Optional[bool] = None,
) -> None:
"""Initialize with URL to crawl and any subdirectories to exclude.
Args:
url: The URL to crawl.
exclude_dirs: A list of subdirectories to exclude.
use_async: Whether to use asynchronous loading;
if True, lazy_load is not actually lazy, but it still returns the same documents.
extractor: A function to extract the text from the html;
if the extractor returns an empty string, the document is ignored.
max_depth: The max depth of the recursive loading.
timeout: The timeout for the requests, in seconds.
prevent_outside: If True, only links that start with the root url are followed.
"""
self.url = url
self.exclude_dirs = exclude_dirs
self.use_async = use_async if use_async is not None else False
self.extractor = extractor if extractor is not None else lambda x: x
self.max_depth = max_depth if max_depth is not None else 2
self.timeout = timeout if timeout is not None else 10
self.prevent_outside = prevent_outside if prevent_outside is not None else True
def get_child_links_recursive(
self, url: str, visited: Optional[Set[str]] = None
def _get_sub_links(self, raw_html: str, base_url: str) -> List[str]:
"""This function extracts all the links from the raw html,
and convert them into absolute paths.
Args:
raw_html (str): original html
base_url (str): the base url of the html
Returns:
List[str]: sub links
"""
# Get all links that are relative to the root of the website
all_links = re.findall(r"href=[\"\'](.*?)[\"\']", raw_html)
absolute_paths = []
invalid_prefixes = ("javascript:", "mailto:", "#")
invalid_suffixes = (
".css",
".js",
".ico",
".png",
".jpg",
".jpeg",
".gif",
".svg",
)
# Process the links
for link in all_links:
# Ignore blacklisted patterns
# like javascript: or mailto:, files of svg, ico, css, js
if link.startswith(invalid_prefixes) or link.endswith(invalid_suffixes):
continue
# Some may be absolute links like https://to/path
if link.startswith("http"):
if (not self.prevent_outside) or (
self.prevent_outside and link.startswith(base_url)
):
absolute_paths.append(link)
else:
absolute_paths.append(urljoin(base_url, link))
# Some may be relative links like /to/path
if link.startswith("/") and not link.startswith("//"):
absolute_paths.append(urljoin(base_url, link))
continue
# Some may have omitted the protocol like //to/path
if link.startswith("//"):
absolute_paths.append(f"{urlparse(base_url).scheme}:{link}")
continue
# Remove duplicates
# also do another filter to prevent outside links
absolute_paths = list(
set(
[
path
for path in absolute_paths
if not self.prevent_outside
or path.startswith(base_url)
and path != base_url
]
)
)
return absolute_paths
def _gen_metadata(self, raw_html: str, url: str) -> dict:
"""Build metadata from BeautifulSoup output."""
try:
from bs4 import BeautifulSoup
except ImportError:
print("The bs4 package is required for the RecursiveUrlLoader.")
print("Please install it with `pip install bs4`.")
metadata = {"source": url}
soup = BeautifulSoup(raw_html, "html.parser")
if title := soup.find("title"):
metadata["title"] = title.get_text()
if description := soup.find("meta", attrs={"name": "description"}):
metadata["description"] = description.get("content", None)
if html := soup.find("html"):
metadata["language"] = html.get("lang", None)
return metadata
def _get_child_links_recursive(
self, url: str, visited: Optional[Set[str]] = None, depth: int = 0
) -> Iterator[Document]:
"""Recursively get all child links starting with the path of the input URL.
@@ -35,26 +134,12 @@ class RecursiveUrlLoader(BaseLoader):
visited: A set of visited URLs.
"""
from langchain.document_loaders import WebBaseLoader
try:
from bs4 import BeautifulSoup
except ImportError:
raise ImportError(
"The BeautifulSoup package is required for the RecursiveUrlLoader."
)
# Construct the base and parent URLs
parsed_url = urlparse(url)
base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"
parent_url = "/".join(parsed_url.path.split("/")[:-1])
current_path = parsed_url.path
if depth > self.max_depth:
return []
# Add a trailing slash if not present
if not base_url.endswith("/"):
base_url += "/"
if not parent_url.endswith("/"):
parent_url += "/"
if not url.endswith("/"):
url += "/"
# Exclude the root and parent from a list
visited = set() if visited is None else visited
@@ -63,42 +148,162 @@ class RecursiveUrlLoader(BaseLoader):
if self.exclude_dirs and any(
url.startswith(exclude_dir) for exclude_dir in self.exclude_dirs
):
return visited
return []
# Get all links that are relative to the root of the website
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
all_links = [link.get("href") for link in soup.find_all("a")]
# Get all links that can be accessed from the current URL
try:
response = requests.get(url, timeout=self.timeout)
except Exception:
return []
# Extract only the links that are children of the current URL
child_links = list(
{
link
for link in all_links
if link and link.startswith(current_path) and link != current_path
}
)
# Get absolute path for all root relative links listed
absolute_paths = [urljoin(base_url, link) for link in child_links]
absolute_paths = self._get_sub_links(response.text, url)
# Store the visited links and recursively visit the children
for link in absolute_paths:
# Check all unvisited links
if link not in visited:
visited.add(link)
loaded_link = WebBaseLoader(link).load()
if isinstance(loaded_link, list):
yield from loaded_link
else:
yield loaded_link
yield from self.get_child_links_recursive(link, visited)
return visited
try:
response = requests.get(link, timeout=self.timeout)
text = response.text
except Exception:
# unreachable link, so just ignore it
continue
loaded_link = Document(
page_content=self.extractor(text),
metadata=self._gen_metadata(text, link),
)
yield loaded_link
# If the link is a directory (w/ children) then visit it
if link.endswith("/"):
yield from self._get_child_links_recursive(link, visited, depth + 1)
return []
async def _async_get_child_links_recursive(
self, url: str, visited: Optional[Set[str]] = None, depth: int = 0
) -> List[Document]:
"""Recursively get all child links starting with the path of the input URL.
Args:
url: The URL to crawl.
visited: A set of visited URLs.
depth: To reach the current url, how many pages have been visited.
"""
try:
import aiohttp
except ImportError:
print("The aiohttp package is required for the RecursiveUrlLoader.")
print("Please install it with `pip install aiohttp`.")
if depth > self.max_depth:
return []
# Add a trailing slash if not present
if not url.endswith("/"):
url += "/"
# Exclude the root and parent from a list
visited = set() if visited is None else visited
# Exclude the links that start with any of the excluded directories
if self.exclude_dirs and any(
url.startswith(exclude_dir) for exclude_dir in self.exclude_dirs
):
return []
# Disable SSL verification because websites may have invalid SSL certificates,
# but this won't cause any security issues for us.
async with aiohttp.ClientSession(
connector=aiohttp.TCPConnector(ssl=False),
timeout=aiohttp.ClientTimeout(self.timeout),
) as session:
# Some url may be invalid, so catch the exception
response: aiohttp.ClientResponse
try:
response = await session.get(url)
text = await response.text()
except aiohttp.client_exceptions.InvalidURL:
return []
# There may be some other exceptions, so catch them,
# we don't want to stop the whole process
except Exception:
return []
absolute_paths = self._get_sub_links(text, url)
# Worker will be only called within the current function
# Worker function will process the link
# then recursively call get_child_links_recursive to process the children
async def worker(link: str) -> Union[Document, None]:
try:
async with aiohttp.ClientSession(
connector=aiohttp.TCPConnector(ssl=False),
timeout=aiohttp.ClientTimeout(self.timeout),
) as session:
response = await session.get(link)
text = await response.text()
extracted = self.extractor(text)
if len(extracted) > 0:
return Document(
page_content=extracted,
metadata=self._gen_metadata(text, link),
)
else:
return None
# Despite the fact that we have filtered some links,
# there may still be some invalid links, so catch the exception
except aiohttp.client_exceptions.InvalidURL:
return None
# There may be some other exceptions, so catch them,
# we don't want to stop the whole process
except Exception:
# print(e)
return None
# The coroutines that will be executed
tasks = []
# Generate the tasks
for link in absolute_paths:
# Check all unvisited links
if link not in visited:
visited.add(link)
tasks.append(worker(link))
# Get the not None results
results = list(
filter(lambda x: x is not None, await asyncio.gather(*tasks))
)
# Recursively call the function to get the children of the children
sub_tasks = []
for link in absolute_paths:
sub_tasks.append(
self._async_get_child_links_recursive(link, visited, depth + 1)
)
# sub_tasks is a list of coroutines that each return a list of Documents,
# so we flatten the results of await asyncio.gather(*sub_tasks)
flattened = []
next_results = await asyncio.gather(*sub_tasks)
for sub_result in next_results:
if isinstance(sub_result, Exception):
# We don't want to stop the whole process, so just ignore it
# Not standard html format or invalid url or 404 may cause this
# But we can't do anything about it.
continue
if sub_result is not None:
flattened += sub_result
results += flattened
return list(filter(lambda x: x is not None, results))
def lazy_load(self) -> Iterator[Document]:
"""Lazy load web pages."""
return self.get_child_links_recursive(self.url)
"""Lazy load web pages.
When use_async is True, this function is not actually lazy:
all pages are fetched up front and returned as an iterator."""
if self.use_async:
results = asyncio.run(self._async_get_child_links_recursive(self.url))
if results is None:
return iter([])
else:
return iter(results)
else:
return self._get_child_links_recursive(self.url)
def load(self) -> List[Document]:
"""Load web pages."""

View File

@@ -33,6 +33,7 @@ class SitemapLoader(WebBaseLoader):
blocknum: int = 0,
meta_function: Optional[Callable] = None,
is_local: bool = False,
continue_on_failure: bool = False,
):
"""Initialize with webpage path and optional filter URLs.
@@ -48,6 +49,10 @@ class SitemapLoader(WebBaseLoader):
remember when setting this method to also copy metadata["loc"]
to metadata["source"] if you are using this field
is_local: whether the sitemap is a local file. Default: False
continue_on_failure: whether to continue loading the sitemap if an error
occurs loading a url, emitting a warning instead of raising an
exception. Setting this to True makes the loader more robust, but also
may result in missing data. Default: False
"""
if blocksize is not None and blocksize < 1:
@@ -71,6 +76,7 @@ class SitemapLoader(WebBaseLoader):
self.blocksize = blocksize
self.blocknum = blocknum
self.is_local = is_local
self.continue_on_failure = continue_on_failure
def parse_sitemap(self, soup: Any) -> List[dict]:
"""Parse sitemap xml and load into a list of dicts.

View File

@@ -0,0 +1,79 @@
from typing import Callable, Dict, Iterator, List, Optional
from langchain.document_loaders.base import BaseLoader
from langchain.schema import Document
from langchain.utilities.tensorflow_datasets import TensorflowDatasets
class TensorflowDatasetLoader(BaseLoader):
"""Loads from TensorFlow Datasets into a list of Documents.
Attributes:
dataset_name: the name of the dataset to load
split_name: the name of the split to load.
load_max_docs: a limit to the number of loaded documents. Defaults to 100.
sample_to_document_function: a function that converts a dataset sample
into a Document
Example:
.. code-block:: python
from langchain.document_loaders import TensorflowDatasetLoader
def mlqaen_example_to_document(example: dict) -> Document:
return Document(
page_content=decode_to_str(example["context"]),
metadata={
"id": decode_to_str(example["id"]),
"title": decode_to_str(example["title"]),
"question": decode_to_str(example["question"]),
"answer": decode_to_str(example["answers"]["text"][0]),
},
)
tsds_client = TensorflowDatasetLoader(
dataset_name="mlqa/en",
split_name="test",
load_max_docs=100,
sample_to_document_function=mlqaen_example_to_document,
)
"""
def __init__(
self,
dataset_name: str,
split_name: str,
load_max_docs: Optional[int] = 100,
sample_to_document_function: Optional[Callable[[Dict], Document]] = None,
):
"""Initialize the TensorflowDatasetLoader.
Args:
dataset_name: the name of the dataset to load
split_name: the name of the split to load.
load_max_docs: a limit to the number of loaded documents. Defaults to 100.
sample_to_document_function: a function that converts a dataset sample
into a Document.
"""
self.dataset_name: str = dataset_name
self.split_name: str = split_name
self.load_max_docs = load_max_docs
"""The maximum number of documents to load."""
self.sample_to_document_function: Optional[
Callable[[Dict], Document]
] = sample_to_document_function
"""Custom function that transform a dataset sample into a Document."""
self._tfds_client = TensorflowDatasets(
dataset_name=self.dataset_name,
split_name=self.split_name,
load_max_docs=self.load_max_docs,
sample_to_document_function=self.sample_to_document_function,
)
def lazy_load(self) -> Iterator[Document]:
yield from self._tfds_client.lazy_load()
def load(self) -> List[Document]:
return list(self.lazy_load())
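A hedged companion to the docstring example above: decode_to_str is not defined in this diff, so the helper below is one possible implementation that decodes the byte tensors returned by TFDS into strings.

import tensorflow as tf
from langchain.schema import Document
from langchain.document_loaders import TensorflowDatasetLoader

def decode_to_str(item: tf.Tensor) -> str:
    # TFDS text fields arrive as scalar byte tensors
    return item.numpy().decode("utf-8")

def mlqaen_example_to_document(example: dict) -> Document:
    return Document(
        page_content=decode_to_str(example["context"]),
        metadata={"id": decode_to_str(example["id"])},
    )

loader = TensorflowDatasetLoader(
    dataset_name="mlqa/en",
    split_name="test",
    load_max_docs=10,
    sample_to_document_function=mlqaen_example_to_document,
)
docs = loader.load()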

View File

@@ -62,6 +62,7 @@ class WebBaseLoader(BaseLoader):
header_template: Optional[dict] = None,
verify_ssl: Optional[bool] = True,
proxies: Optional[dict] = None,
continue_on_failure: Optional[bool] = False,
):
"""Initialize with webpage path."""
@@ -96,6 +97,7 @@ class WebBaseLoader(BaseLoader):
self.session = requests.Session()
self.session.headers = dict(headers)
self.session.verify = verify_ssl
self.continue_on_failure = continue_on_failure
if proxies:
self.session.proxies.update(proxies)
@@ -133,7 +135,20 @@ class WebBaseLoader(BaseLoader):
self, url: str, semaphore: asyncio.Semaphore
) -> str:
async with semaphore:
return await self._fetch(url)
try:
return await self._fetch(url)
except Exception as e:
if self.continue_on_failure:
logger.warning(
f"Error fetching {url}, skipping due to"
f" continue_on_failure=True"
)
return ""
logger.exception(
f"Error fetching {url} and aborting, use continue_on_failure=True "
"to continue loading urls after encountering an error."
)
raise e
async def fetch_all(self, urls: List[str]) -> Any:
"""Fetch all urls concurrently with rate limiting."""

View File

@@ -31,6 +31,7 @@ from langchain.embeddings.fake import DeterministicFakeEmbedding, FakeEmbeddings
from langchain.embeddings.google_palm import GooglePalmEmbeddings
from langchain.embeddings.gpt4all import GPT4AllEmbeddings
from langchain.embeddings.huggingface import (
HuggingFaceBgeEmbeddings,
HuggingFaceEmbeddings,
HuggingFaceInstructEmbeddings,
)
@@ -97,6 +98,7 @@ __all__ = [
"XinferenceEmbeddings",
"LocalAIEmbeddings",
"AwaEmbeddings",
"HuggingFaceBgeEmbeddings",
]

View File

@@ -6,10 +6,15 @@ from langchain.embeddings.base import Embeddings
DEFAULT_MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
DEFAULT_INSTRUCT_MODEL = "hkunlp/instructor-large"
DEFAULT_BGE_MODEL = "BAAI/bge-large-en"
DEFAULT_EMBED_INSTRUCTION = "Represent the document for retrieval: "
DEFAULT_QUERY_INSTRUCTION = (
"Represent the question for retrieving supporting documents: "
)
DEFAULT_QUERY_BGE_INSTRUCTION_EN = (
"Represent this question for searching relevant passages: "
)
DEFAULT_QUERY_BGE_INSTRUCTION_ZH = "为这个句子生成表示以用于检索相关文章:"
class HuggingFaceEmbeddings(BaseModel, Embeddings):
@@ -169,3 +174,88 @@ class HuggingFaceInstructEmbeddings(BaseModel, Embeddings):
instruction_pair = [self.query_instruction, text]
embedding = self.client.encode([instruction_pair], **self.encode_kwargs)[0]
return embedding.tolist()
class HuggingFaceBgeEmbeddings(BaseModel, Embeddings):
"""HuggingFace BGE sentence_transformers embedding models.
To use, you should have the ``sentence_transformers`` python package installed.
Example:
.. code-block:: python
from langchain.embeddings import HuggingFaceBgeEmbeddings
model_name = "BAAI/bge-large-en"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': True}
hf = HuggingFaceBgeEmbeddings(
model_name=model_name,
model_kwargs=model_kwargs,
encode_kwargs=encode_kwargs
)
"""
client: Any #: :meta private:
model_name: str = DEFAULT_BGE_MODEL
"""Model name to use."""
cache_folder: Optional[str] = None
"""Path to store models.
Can also be set by the SENTENCE_TRANSFORMERS_HOME environment variable."""
model_kwargs: Dict[str, Any] = Field(default_factory=dict)
"""Key word arguments to pass to the model."""
encode_kwargs: Dict[str, Any] = Field(default_factory=dict)
"""Key word arguments to pass when calling the `encode` method of the model."""
query_instruction: str = DEFAULT_QUERY_BGE_INSTRUCTION_EN
"""Instruction to use for embedding query."""
def __init__(self, **kwargs: Any):
"""Initialize the sentence_transformer."""
super().__init__(**kwargs)
try:
import sentence_transformers
except ImportError as exc:
raise ImportError(
"Could not import sentence_transformers python package. "
"Please install it with `pip install sentence_transformers`."
) from exc
self.client = sentence_transformers.SentenceTransformer(
self.model_name, cache_folder=self.cache_folder, **self.model_kwargs
)
if "-zh" in self.model_name:
self.query_instruction = DEFAULT_QUERY_BGE_INSTRUCTION_ZH
class Config:
"""Configuration for this pydantic object."""
extra = Extra.forbid
def embed_documents(self, texts: List[str]) -> List[List[float]]:
"""Compute doc embeddings using a HuggingFace transformer model.
Args:
texts: The list of texts to embed.
Returns:
List of embeddings, one for each text.
"""
texts = [t.replace("\n", " ") for t in texts]
embeddings = self.client.encode(texts, **self.encode_kwargs)
return embeddings.tolist()
def embed_query(self, text: str) -> List[float]:
"""Compute query embeddings using a HuggingFace transformer model.
Args:
text: The text to embed.
Returns:
Embeddings for the text.
"""
text = text.replace("\n", " ")
embedding = self.client.encode(
self.query_instruction + text, **self.encode_kwargs
)
return embedding.tolist()
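A short usage sketch based on the docstring above: queries are prefixed with query_instruction before encoding, documents are encoded as-is, and normalize_embeddings=True matches the docstring example.

from langchain.embeddings import HuggingFaceBgeEmbeddings

hf = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-large-en",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True},
)
doc_vectors = hf.embed_documents(["LangChain supports many embedding backends."])
query_vector = hf.embed_query("Which embedding backends does LangChain support?")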

View File

@@ -295,7 +295,13 @@ class OpenAIEmbeddings(BaseModel, Embeddings):
if self.openai_api_type in ("azure", "azure_ad", "azuread"):
openai_args["engine"] = self.deployment
if self.openai_proxy:
import openai
try:
import openai
except ImportError:
raise ImportError(
"Could not import openai python package. "
"Please install it with `pip install openai`."
)
openai.proxy = {
"http": self.openai_proxy,

View File

@@ -100,14 +100,14 @@ class PairwiseStringResultOutputParser(BaseOutputParser[dict]):
"""
return "pairwise_string_result"
def parse(self, text: str) -> Any:
def parse(self, text: str) -> Dict[str, Any]:
"""Parse the output text.
Args:
text (str): The output text to parse.
Returns:
Any: The parsed output.
Dict: The parsed output.
Raises:
ValueError: If the verdict is invalid.

Some files were not shown because too many files have changed in this diff.