Add a method that exposes a similarity search with corresponding
normalized similarity scores. Implement only for FAISS now.
### Motivation:
Some memory definitions combine `relevance` with other scores, like
recency, importance, etc.
While many (but not all) `VectorStore`s expose a
`similarity_search_with_score` method, they don't all interpret the
units of that score the same way (it depends on the distance metric and
on whether the embeddings are normalized).
This PR proposes a `similarity_search_with_normalized_similarities`
method so that consumers of the vector store don't have to worry about
the metric and embedding scale.
*Most providers default to euclidean distance, with Pinecone being one
exception (defaults to cosine _similarity_).*
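To illustrate what "normalized" could mean here, the following is a minimal sketch and not the FAISS implementation from this PR. It assumes unit-length embeddings, so Euclidean distance falls in [0, 2] and cosine similarity in [-1, 1]; the helper names `euclidean_to_similarity` and `cosine_to_similarity` are hypothetical. Both mappings put "closer" at 1 and "farther" at 0, although they are not numerically identical to each other.

```python
import math
from typing import List, Sequence


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in vec]


def euclidean_to_similarity(distance: float) -> float:
    """Map a Euclidean distance between unit vectors to a score in [0, 1]."""
    # Assumption: both embeddings are unit-normalized, so distance <= 2.
    clamped = min(max(distance, 0.0), 2.0)
    return 1.0 - clamped / 2.0


def cosine_to_similarity(cosine: float) -> float:
    """Map a cosine similarity in [-1, 1] to a score in [0, 1]."""
    clamped = min(max(cosine, -1.0), 1.0)
    return (clamped + 1.0) / 2.0


a = normalize([3.0, 4.0])
b = normalize([-3.0, -4.0])
same = euclidean_to_similarity(math.dist(a, a))      # identical vectors
opposite = euclidean_to_similarity(math.dist(a, b))  # antipodal vectors
```

With scores on a shared 0 to 1 scale, a caller can combine relevance with recency or importance without knowing which metric the store uses.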
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
The encoding fetch was out of date. Luckily OpenAI has a nice
[`encoding_for_model`](46287bfa49/tiktoken/model.py)
function in `tiktoken` we can use now.
Title, lang and description are on almost every web page, and are
incredibly useful pieces of information that aren't captured
by the current web base loader.
I thought about adding the title and description to the content of the
document, as that content could be useful in search, but I left it out
for right now. If you think it'd be worth adding, happy to add it.
I've found it's nice to have the title/description in the metadata to
have some structured data when retrieving rows from vectordbs for use
with summary and source citation, so if we do want to add it to the
`page_content`, I'd advocate for it to also be included in metadata.
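As a rough sketch of the idea (not the loader's actual code, which may well rely on a different HTML parser), the snippet below pulls title, language, and description out of a page using only the standard library. The function name `extract_metadata` and the metadata keys are assumptions made for illustration.

```python
from html.parser import HTMLParser
from typing import Dict


class _MetadataParser(HTMLParser):
    """Collects <title>, <html lang=...> and <meta name="description">."""

    def __init__(self) -> None:
        super().__init__()
        self.metadata: Dict[str, str] = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)  # attribute names arrive lowercased
        if tag == "html" and attr_map.get("lang"):
            self.metadata["language"] = attr_map["lang"]
        elif tag == "title":
            self._in_title = True
        elif tag == "meta" and (attr_map.get("name") or "").lower() == "description":
            self.metadata["description"] = attr_map.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data


def extract_metadata(html: str, source: str) -> Dict[str, str]:
    """Return structured metadata to store alongside the page content."""
    parser = _MetadataParser()
    parser.feed(html)
    parser.close()
    metadata = {"source": source, **parser.metadata}
    if "title" in metadata:
        metadata["title"] = metadata["title"].strip()
    return metadata


page = (
    '<html lang="en"><head><title> Example </title>'
    '<meta name="description" content="A sample page."></head>'
    "<body>Hello</body></html>"
)
meta = extract_metadata(page, "https://example.com")
```

Keeping these fields in metadata rather than in `page_content` leaves them available as structured data for citations, whether or not they are also added to the searchable text later.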
Same as `similarity_search`, allows child classes to add vector
store-specific args (this was technically already happening in a couple
of places, but now the typing is correct).
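The pattern can be sketched as follows. `BaseStore`, `NamespacedStore`, the method name `search`, and the `namespace` argument are all hypothetical stand-ins; the point is only that accepting `**kwargs: Any` on the base signature lets a subclass add its own keyword arguments without violating the parent's type.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class BaseStore(ABC):
    """Hypothetical base class: **kwargs leaves room for store-specific args."""

    @abstractmethod
    def search(self, query: str, k: int = 4, **kwargs: Any) -> List[str]:
        """Return up to k matching texts."""


class NamespacedStore(BaseStore):
    """Hypothetical subclass that adds a store-specific `namespace` argument."""

    def __init__(self, texts_by_namespace: Dict[str, List[str]]) -> None:
        self._texts = texts_by_namespace

    def search(
        self,
        query: str,
        k: int = 4,
        namespace: Optional[str] = None,
        **kwargs: Any,
    ) -> List[str]:
        pool = self._texts.get(namespace or "default", [])
        # Toy substring match standing in for a real vector search.
        return [t for t in pool if query.lower() in t.lower()][:k]


store = NamespacedStore(
    {"default": ["Hello world"], "docs": ["hello docs", "other"]}
)
```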
Minor cosmetic changes
- Activeloop environment cred authentication in notebooks with
`getpass.getpass` (instead of the CLI, which doesn't always work)
- much faster tests with Deep Lake pytest mode on
- Deep Lake kwargs pass
Notes
- I put the pytest environment creds inside `vectorstores/conftest.py`, but
feel free to suggest a better location. For context, if I put them in
`test_deeplake.py`, `ruff` doesn't let me set them before
`import deeplake`
---------
Co-authored-by: Davit Buniatyan <d@activeloop.ai>
Note to self: Always run integration tests, even on "that last minute
change you thought would be safe" :)
---------
Co-authored-by: Mike Lambert <mike.lambert@anthropic.com>
**About**
Specify encoding to avoid UnicodeDecodeError when reading .txt for users
who are following the tutorial.
**Reference**
```
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1205: character maps to <undefined>
```
**Environment**
OS: Win 11
Python: 3.8
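A small reproduction of the failure mode, as a sketch: it assumes the tutorial file is UTF-8 and contains a curly closing quote, whose last byte is `0x9d`, and that the Windows default codec is `cp1252`. The file name is made up for the demo.

```python
import os
import tempfile

# U+201D (right double quote) is encoded as bytes e2 80 9d in UTF-8.
text = "She said \u201chello\u201d."

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Simulate the Windows default: cp1252 has no mapping for byte 0x9d.
    try:
        with open(path, encoding="cp1252") as f:
            f.read()
        legacy_failed = False
    except UnicodeDecodeError:
        legacy_failed = True

    # Passing the encoding explicitly makes the read platform-independent.
    with open(path, encoding="utf-8") as f:
        loaded = f.read()
```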
* Adds an Anthropic ChatModel
* Factors out common code in our LLMModel and ChatModel
* Supports streaming llm-tokens to the callbacks on a delta basis (until
a future V2 API does that for us)
* Some fixes
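The delta-based streaming in the list above can be sketched like this, assuming the API streams the cumulative completion text on each event rather than only the new tokens. `stream_deltas` and `on_new_token` are hypothetical names, not the callback interface itself.

```python
from typing import Callable, Iterable


def stream_deltas(
    cumulative_chunks: Iterable[str],
    on_new_token: Callable[[str], None],
) -> str:
    """Forward only the newly added text of each cumulative chunk."""
    previous = ""
    for current in cumulative_chunks:
        # Assumption: each chunk extends the previous one as a prefix.
        delta = current[len(previous):]
        if delta:
            on_new_token(delta)
        previous = current
    return previous


received = []
final_text = stream_deltas(
    ["He", "Hello", "Hello wor", "Hello world"],
    received.append,
)
```

Joining the deltas reproduces the final completion, so callbacks see each piece of text exactly once.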
Allows users to specify what files should be loaded instead of
indiscriminately loading the entire repo.
extends #2851
NOTE: for reviewers, `hide whitespace` option recommended since I
changed the indentation of an if-block to use `continue` instead so it
looks less like a Christmas tree :)
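For illustration only, here is a minimal sketch of a filter-driven walk that uses `continue` to keep the loop body flat. `load_paths` and its `file_filter` parameter are hypothetical names, and the real loader reads from a repository rather than a plain directory.

```python
import os
import tempfile
from typing import Callable, List, Optional


def load_paths(
    root: str,
    file_filter: Optional[Callable[[str], bool]] = None,
) -> List[str]:
    """Return the files under root that pass file_filter (all if None)."""
    selected: List[str] = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip early instead of nesting the loading logic in an if-block.
            if file_filter is not None and not file_filter(path):
                continue
            selected.append(path)
    return sorted(selected)


with tempfile.TemporaryDirectory() as root:
    for rel in ("a.py", "notes.txt", os.path.join("pkg", "b.py")):
        full = os.path.join(root, rel)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "w", encoding="utf-8") as f:
            f.write("")
    python_files = [
        os.path.relpath(p, root)
        for p in load_paths(root, lambda p: p.endswith(".py"))
    ]
    all_files = load_paths(root)
```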