community: add ZoteroRetriever (#30270)

**Description** 
This contribution adds a retriever for the Zotero API.
[Zotero](https://www.zotero.org/) is an open source reference management
system for bibliographic data and related research materials. A retriever
allows LangChain applications to retrieve relevant documents from
personal or shared group libraries, which I believe will be helpful for
numerous applications, such as RAG systems and personal research
assistants. Tests and docs were added.

The documentation provided assumes the retriever will be part of the
langchain-community package, as this seemed customary. Please let me
know if this is not the preferred way to do it. I also uploaded the
implementation to PyPI.

**Dependencies**
The retriever requires the `pyzotero` package for API access. This
dependency is stated in the docs, and the retriever will return an error
if the package is not found. However, this dependency is not added to
the langchain package itself.
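
A minimal sketch of how such an optional-dependency check could look. The helper name `require_package` and the wording of the error message are assumptions for illustration only, not the retriever's actual code:

```python
import importlib.util


def require_package(module_name: str, pip_name: str) -> None:
    """Raise ImportError with an install hint if a top-level module is missing.

    Hypothetical helper: the real retriever may perform this check differently.
    """
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"Could not import {module_name}. "
            f"Install it with `pip install {pip_name}`."
        )


# A module from the standard library is always found, so this does not raise.
require_package("json", "json")
```

Calling `require_package("pyzotero", "pyzotero")` before the first API call would surface a missing install as a readable error without making `pyzotero` a hard dependency.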

**Twitter handle**
I'm no longer using Twitter, but I'd appreciate a shoutout on
[Bluesky](https://bsky.app/profile/koenigt.bsky.social) or
[LinkedIn](https://www.linkedin.com/in/dr-tim-k%C3%B6nig-534aa2324/)!


Let me know if there are any issues, I'll gladly try and sort them out!

---------

Co-authored-by: Chester Curme <chester.curme@gmail.com>
Tim König 2025-03-20 01:19:32 +01:00 committed by GitHub
parent aa5ac9279a
commit b5992695ae
3 changed files with 301 additions and 1 deletions

@ -0,0 +1,35 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Zotero\n",
"\n",
"[Zotero](https://www.zotero.org/) is an open source reference management system for bibliographic data and related research materials. You can connect to your personal library, as well as shared group libraries, via the [API](https://www.zotero.org/support/dev/web_api/v3/start). This retriever implementation uses [PyZotero](https://github.com/urschrei/pyzotero) to access libraries.\n",
"\n",
"\n",
"## Installation\n",
"\n",
"```bash\n",
"pip install langchain-zotero-retriever pyzotero\n",
"```\n",
"\n",
"## Retriever\n",
"\n",
"See a [usage example](/docs/integrations/retrievers/zotero).\n",
"\n",
"```python\n",
"from langchain_zotero_retriever.retrievers import ZoteroRetriever\n",
"```"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@ -0,0 +1,260 @@
{
"cells": [
{
"cell_type": "raw",
"id": "afaf8039",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"---\n",
"sidebar_label: Zotero\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "e49f1e0d",
"metadata": {},
"source": [
"# ZoteroRetriever\n",
"\n",
"This guide will help you get started with the Zotero [retriever](/docs/concepts/retrievers). For detailed documentation of all ZoteroRetriever features and configurations, head to the [GitHub page](https://github.com/TimBMK/langchain-zotero-retriever).\n",
"\n",
"### Integration details\n",
"\n",
"\n",
"| Retriever | Source | Package |\n",
"| :--- | :--- | :---: |\n",
"| [ZoteroRetriever](https://github.com/TimBMK/langchain-zotero-retriever) | [Zotero API](https://www.zotero.org/support/dev/web_api/v3/start) | langchain-zotero-retriever |\n",
"\n",
"## Setup\n"
]
},
{
"cell_type": "markdown",
"id": "72ee0c4b-9764-423a-9dbf-95129e185210",
"metadata": {},
"source": [
"If you want to get automated tracing from individual queries, you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a15d341e-3e26-4ca3-830b-5aab30ed66de",
"metadata": {},
"outputs": [],
"source": [
"# os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass(\"Enter your LangSmith API key: \")\n",
"# os.environ[\"LANGSMITH_TRACING\"] = \"true\""
]
},
{
"cell_type": "markdown",
"id": "0730d6a1-c893-4840-9817-5e5251676d5d",
"metadata": {},
"source": [
"### Installation\n",
"\n",
"This retriever lives in the `langchain-zotero-retriever` package. We also require the `pyzotero` dependency:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "652d6238-1f87-422a-b135-f5abbb8652fc",
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU langchain-zotero-retriever pyzotero"
]
},
{
"cell_type": "markdown",
"id": "a38cde65-254d-4219-a441-068766c0d4b5",
"metadata": {},
"source": [
"## Instantiation\n",
"\n",
"`ZoteroRetriever` parameters include:\n",
"- `k`: Number of results to include (Default: 50)\n",
"- `type`: Type of search to perform. \"top\" retrieves only top-level Zotero library items, \"items\" retrieves any Zotero library items. (Default: \"top\")\n",
"- `get_fulltext`: Whether to retrieve full texts attached to items in the library. If False, or if no text is attached, `page_content` is an empty string. (Default: True)\n",
"- `library_id`: ID of the Zotero library to search. Required to connect to a library.\n",
"- `library_type`: Type of library to search. \"user\" for a personal library, \"group\" for shared group libraries. (Default: \"user\")\n",
"- `api_key`: Zotero API key. Optional for public group libraries; required to access non-public group libraries or your personal library. Read automatically from the `ZOTERO_API_KEY` environment variable if set.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "70cc8e65-2a02-408a-bbc6-8ef649057d82",
"metadata": {},
"outputs": [],
"source": [
"from langchain_zotero_retriever.retrievers import ZoteroRetriever\n",
"\n",
"retriever = ZoteroRetriever(\n",
" k=10,\n",
" library_id=\"2319375\", # a public group library that does not require an API key for access\n",
" library_type=\"group\", # set this to \"user\" if you are using a personal library. Personal libraries require an API key\n",
")"
]
},
{
"cell_type": "markdown",
"id": "5c5f2839-4020-424e-9fc9-07777eede442",
"metadata": {},
"source": [
"## Usage\n",
"\n",
"Apart from the `query`, the retriever provides these additional search parameters:\n",
"- `itemType`: Type of item to search for (e.g. \"book\" or \"journalArticle\")\n",
"- `tag`: Tag(s) to filter library items by (see the search syntax for combining multiple tags)\n",
"- `qmode`: Search mode to use. Changes what the query searches over. \"everything\" includes full-text content; \"titleCreatorYear\" searches over title, authors and year.\n",
"- `since`: Return only objects modified after the specified library version. By default, everything is returned.\n",
"\n",
"For the search syntax, see the Zotero API documentation: https://www.zotero.org/support/dev/web_api/v3/basics#search_syntax\n",
"\n",
"For the full API schema (including available itemTypes) see: https://github.com/zotero/zotero-schema"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "51a60dbe-9f2e-4e04-bb62-23968f17164a",
"metadata": {},
"outputs": [],
"source": [
"query = \"Zuboff\"\n",
"\n",
"retriever.invoke(query)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "add7bbb0",
"metadata": {},
"outputs": [],
"source": [
"tags = [\n",
" \"Surveillance\",\n",
" \"Digital Capitalism\",\n",
"] # note that providing tags as a list will result in a logical AND operation\n",
"\n",
"retriever.invoke(\"\", tag=tags)"
]
},
{
"cell_type": "markdown",
"id": "d1ee55bc-ffc8-4cfa-801c-993953a08cfd",
"metadata": {},
"source": [
"## Use within a chain\n",
"\n",
"Due to the way the Zotero API search operates, directly passing a user question to the ZoteroRetriever will often not return satisfactory results. For use in chains or agentic frameworks, it is recommended to turn the ZoteroRetriever into a [tool](https://python.langchain.com/docs/how_to/custom_tools/#creating-tools-from-functions). This way, the LLM can turn the user query into a more concise search query for the API. Furthermore, this allows the LLM to fill in additional search parameters, such as tag or item type."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0b19a04d",
"metadata": {},
"outputs": [],
"source": [
"from typing import List, Optional, Union\n",
"\n",
"from langchain_core.output_parsers import PydanticToolsParser\n",
"from langchain_core.tools import StructuredTool\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"\n",
"def retrieve(\n",
" query: str,\n",
" itemType: Optional[str] = None,\n",
" tag: Optional[Union[str, List[str]]] = None,\n",
" qmode: str = \"everything\",\n",
" since: Optional[int] = None,\n",
"):\n",
" # Query the Zotero library and serialize the metadata of each hit for the LLM\n",
" retrieved_docs = retriever.invoke(\n",
" query, itemType=itemType, tag=tag, qmode=qmode, since=since\n",
" )\n",
" serialized_docs = \"\\n\\n\".join(\n",
" (\n",
" f\"Metadata: { {key: doc.metadata[key] for key in doc.metadata if key != 'abstractNote'} }\\n\"\n",
" f\"Abstract: {doc.metadata['abstractNote']}\\n\"\n",
" )\n",
" for doc in retrieved_docs\n",
" )\n",
"\n",
" return serialized_docs, retrieved_docs\n",
"\n",
"\n",
"description = \"\"\"Search and return relevant documents from a Zotero library. The following search parameters can be used:\n",
"\n",
" Args:\n",
" query: str: The search query to be used. Try to keep this specific and short, e.g. a specific topic or author name\n",
" itemType: Optional. Type of item to search for (e.g. \"book\" or \"journalArticle\"). Multiple types can be passed as a string separated by \"||\", e.g. \"book || journalArticle\". Defaults to all types.\n",
" tag: Optional. For searching over tags attached to library items. If documents tagged with multiple tags are to be retrieved, pass them as a list. If documents with any of the tags are to be retrieved, pass them as a string separated by \"||\", e.g. \"tag1 || tag2\"\n",
" qmode: Search mode to use. Changes what the query searches over. \"everything\" includes full-text content. \"titleCreatorYear\" to search over title, authors and year. Defaults to \"everything\".\n",
" since: Return only objects modified after the specified library version. Defaults to return everything.\n",
" \"\"\"\n",
"\n",
"retriever_tool = StructuredTool.from_function(\n",
" func=retrieve,\n",
" name=\"retrieve\",\n",
" description=description,\n",
" return_direct=True,\n",
")\n",
"\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-4o-mini-2024-07-18\")\n",
"\n",
"# Bind the StructuredTool so that the description above is passed to the LLM\n",
"llm_with_tools = llm.bind_tools([retriever_tool])\n",
"\n",
"q = \"What journal articles do I have on Surveillance in the zotero library?\"\n",
"\n",
"chain = llm_with_tools | PydanticToolsParser(tools=[retrieve])\n",
"\n",
"chain.invoke(q)"
]
},
{
"cell_type": "markdown",
"id": "3a5bb5ca-c3ae-4a58-be67-2cd18574b9a3",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all ZoteroRetriever features and configurations, head to the [GitHub page](https://github.com/TimBMK/langchain-zotero-retriever).\n",
"\n",
"For detailed documentation on the Zotero API, head to the [Zotero API reference](https://www.zotero.org/support/dev/web_api/v3/start)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
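
The notebook above notes that a list of tags is combined with a logical AND, while a single string joined by `||` means OR. The sketch below illustrates how those two forms differ at the query-string level, following the Zotero Web API search syntax. The helper `build_search_params` is hypothetical and standard-library only; it is an assumption about the mapping, and the retriever itself delegates requests to `pyzotero` and may build them differently.

```python
from typing import List, Optional, Union
from urllib.parse import urlencode


def build_search_params(
    query: str,
    tag: Optional[Union[str, List[str]]] = None,
    item_type: Optional[str] = None,
    qmode: str = "everything",
) -> str:
    """Build a Zotero-style search query string (illustrative helper only)."""
    params = [("q", query), ("qmode", qmode)]
    if item_type:
        params.append(("itemType", item_type))
    if isinstance(tag, str):
        # One parameter; "a || b" inside it is read as OR by the API.
        params.append(("tag", tag))
    elif tag:
        # Repeated tag parameters are combined with AND by the API.
        params.extend(("tag", t) for t in tag)
    return urlencode(params)


# AND: both tags must be present on an item.
print(build_search_params("", tag=["Surveillance", "Digital Capitalism"]))
# → q=&qmode=everything&tag=Surveillance&tag=Digital+Capitalism

# OR: either tag is sufficient.
print(build_search_params("", tag="Surveillance || Digital Capitalism"))
```

The same `||` convention applies to `itemType`, which is why the tool description asks the LLM to pass multiple item types as one string rather than as a list.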

@ -526,4 +526,9 @@ packages:
  provider_page: dell
- name: langchain-tavily
  path: .
  repo: tavily-ai/langchain-tavily
- name: langchain-zotero-retriever
  path: .
  repo: TimBMK/langchain-zotero-retriever
  name_title: Zotero
  provider_page: zotero