Next version of PrivateGPT (#1077)

* Dockerize private-gpt

* Use port 8001 for local development

* Add setup script

* Add CUDA Dockerfile

* Create README.md

* Make the API use OpenAI response format

* Truncate prompt

* refactor: add models and __pycache__ to .gitignore

* Better naming

* Update readme

* Move models ignore to its folder

* Add scaffolding

* Apply formatting

* Fix tests

* Working sagemaker custom llm

* Fix linting

* Fix linting

* Enable streaming

* Allow all Python 3.11 versions

* Use llama 2 prompt format and fix completion

* Restructure (#3)

Co-authored-by: Pablo Orgaz <pablo@Pablos-MacBook-Pro.local>

* Fix Dockerfile

* Use a specific build stage

* Cleanup

* Add FastAPI skeleton

* Cleanup openai package

* Fix DI and tests

* Split tests and tests with coverage

* Remove old scaffolding

* Add settings logic (#4)

* Add settings logic

* Add settings for sagemaker

---------

Co-authored-by: Pablo Orgaz <pablo@Pablos-MacBook-Pro.local>

* Local LLM (#5)

* Add settings logic

* Add settings for sagemaker

* Add settings-local-example.yaml

* Delete terraform files

* Refactor tests to use fixtures

* Join deltas

* Add local model support

---------

Co-authored-by: Pablo Orgaz <pablo@Pablos-MacBook-Pro.local>

* Update README.md

* Fix tests

* Version bump

* Enable simple llamaindex observability (#6)

* Enable simple llamaindex observability

* Improve code through linting

* Update README.md

* Move to async (#7)

* Migrate implementation to use asyncio

* Formatting

* Cleanup

* Linting

---------

Co-authored-by: Pablo Orgaz <pablo@Pablos-MacBook-Pro.local>

* Query Docs and gradio UI

* Remove unnecessary files

* Git ignore chromadb folder

* Async migration + DI Cleanup

* Fix tests

* Add integration test

* Use fastapi responses

* Retrieval service with partial implementation

* Cleanup

* Run formatter

* Fix types

* Fetch nodes asynchronously

* Install local dependencies in tests

* Install ui dependencies in tests

* Install dependencies for llama-cpp

* Fix sudo

* Attempt to fix cuda issues

* Attempt to fix cuda issues

* Try to reclaim some space from ubuntu machine

* Retrieval with context

* Fix lint and imports

* Fix mypy

* Make retrieval API a POST

* Make Completions body a dataclass

* Fix LLM chat message order

* Add Query Chunks to Gradio UI

* Improve rag query prompt

* Rollback CI Changes

* Move to sync code

* Using Llamaindex abstraction for query retrieval

* Fix types

* Default to CONDENSED chat mode for contextualized chat

* Rename route function

* Add Chat endpoint

* Remove webhooks

* Add IntelliJ run config to gitignore

* .gitignore applied

* Sync chat completion

* Total refactor

* Typo in context_files.py

* Add embeddings component and service

* Remove wrong dataclass from IngestService

* Filter by context file id implementation

* Fix typing

* Implement context_filter and separate from the bool use_context in the API

* Change chunks api to avoid conceptual clash of the context concept

* Deprecate completions and fix tests

* Remove remaining dataclasses

* Use embedding component in ingest service

* Fix ingestion to have multipart and local upload

* Fix ingestion API

* Add chunk tests

* Add configurable paths

* Cleaning up

* Add more docs

* IngestResponse includes a list of IngestedDocs

* Use IngestedDoc in the Chunk document reference

* Rename ingest routes to ingest_router.py

* Fix test working directory for intellij

* Set testpaths for pytest

* Remove unused as_chat_engine

* Add .fleet ide to gitignore

* Make LLM and Embedding model configurable

* Fix imports and checks

* Let local_data folder exist empty in the repository

* Don't use certain metadata in LLM

* Remove long lines

* Fix windows installation

* Typos

* Update poetry.lock

* Add TODO for linux

* Script and first version of docs

* No Jekyll build

* Fix relative url to openapi json

* Change default docs values

* Move chromadb dependency to the general group

* Fix tests to use separate local_data

* Create CNAME

* Update CNAME

* Fix openapi.json relative path

* PrivateGPT logo

* WIP OpenAPI documentation metadata

* Add ingest script (#11)

* Add ingest script

* Fix broken name refactor

* Add ingest docs and Makefile script

* Linting

* Move transformers to main dependency

* Move torch to main dependencies

* Don't load HuggingFaceEmbedding in tests

* Fix lint

---------

Co-authored-by: Pablo Orgaz <pablo@Pablos-MacBook-Pro.local>

* Rename file to camel_case

* Commit settings-local.yaml

* Move documentation to public docs

* Fix docker image for linux

* Installation and Running the Server documentation

* Move back to docs folder, as it is the only supported by github pages

* Delete CNAME

* Create CNAME

* Delete CNAME

* Create CNAME

* Improved API documentation

* Fix lint

* Completions documentation

* Updated openapi schema

* Ingestion API doc

* Minor doc changes

* Updated openapi schema

* Chunks API documentation

* Embeddings and Health API, and homogeneous responses

* Revamp README with new skeleton of content

* More docs

* PrivateGPT logo

* Improve UI

* Update ingestion docu

* Update README with new sections

* Use context window in the retriever

* Gradio Documentation

* Add logo to UI

* Include Contributing and Community sections to README

* Update links to resources in the README

* Small README.md updates

* Wrap lines of README.md

* Don't put health under /v1

* Add copy button to Chat

* Architecture documentation

* Updated openapi.json

* Updated openapi.json

* Updated openapi.json

* Change UI label

* Update documentation

* Add releases link to README.md

* Gradio avatar and stop debug

* Readme update

* Clean old files

* Remove unused terraform checks

* Update twitter link.

* Disable minimum coverage

* Clean install message in README.md

---------

Co-authored-by: Pablo Orgaz <pablo@Pablos-MacBook-Pro.local>
Co-authored-by: Iván Martínez <ivanmartit@gmail.com>
Co-authored-by: RubenGuerrero <ruben.guerrero@boopos.com>
Co-authored-by: Daniel Gallego Vico <daniel.gallego@bq.com>
Authored by Pablo Orgaz on 2023-10-19 16:04:35 +02:00, committed via GitHub
Commit 51cc638758 (parent 78d1ef44ad)
98 changed files with 7067 additions and 3397 deletions

@@ -0,0 +1 @@
"""private-gpt server."""

@@ -0,0 +1,82 @@
from fastapi import APIRouter
from llama_index.llms import ChatMessage, MessageRole
from pydantic import BaseModel
from starlette.responses import StreamingResponse

from private_gpt.di import root_injector
from private_gpt.open_ai.extensions.context_filter import ContextFilter
from private_gpt.open_ai.openai_models import (
    OpenAICompletion,
    OpenAIMessage,
    to_openai_response,
    to_openai_sse_stream,
)
from private_gpt.server.chat.chat_service import ChatService

chat_router = APIRouter(prefix="/v1")


class ChatBody(BaseModel):
    messages: list[OpenAIMessage]
    use_context: bool = False
    context_filter: ContextFilter | None = None
    stream: bool = False

    model_config = {
        "json_schema_extra": {
            "examples": [
                {
                    "messages": [
                        {
                            "role": "user",
                            "content": "How do you fry an egg?",
                        }
                    ],
                    "stream": False,
                    "use_context": True,
                    "context_filter": {
                        "docs_ids": ["c202d5e6-7b69-4869-81cc-dd574ee8ee11"]
                    },
                }
            ]
        }
    }


@chat_router.post(
    "/chat/completions",
    response_model=None,
    responses={200: {"model": OpenAICompletion}},
    tags=["Contextual Completions"],
)
def chat_completion(body: ChatBody) -> OpenAICompletion | StreamingResponse:
    """Given a list of messages comprising a conversation, return a response.

    If `use_context` is set to `true`, the model will use context coming
    from the ingested documents to create the response. The documents being used can
    be filtered using the `context_filter` and passing the document IDs to be used.
    Ingested documents IDs can be found using `/ingest/list` endpoint. If you want
    all ingested documents to be used, remove `context_filter` altogether.

    When using `'stream': true`, the API will return data chunks following [OpenAI's
    streaming model](https://platform.openai.com/docs/api-reference/chat/streaming):
    ```
    {"id":"12345","object":"completion.chunk","created":1694268190,
    "model":"private-gpt","choices":[{"index":0,"delta":{"content":"Hello"},
    "finish_reason":null}]}
    ```
    """
    service = root_injector.get(ChatService)
    all_messages = [
        ChatMessage(content=m.content, role=MessageRole(m.role)) for m in body.messages
    ]
    if body.stream:
        stream = service.stream_chat(
            all_messages, body.use_context, body.context_filter
        )
        return StreamingResponse(
            to_openai_sse_stream(stream), media_type="text/event-stream"
        )
    else:
        response = service.chat(all_messages, body.use_context, body.context_filter)
        return to_openai_response(response)
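
For reference, a minimal client sketch for this endpoint (not part of the diff). It uses the `requests` library and assumes the server is reachable on the local development port 8001 mentioned above, and that the SSE stream follows OpenAI's `data: ... [DONE]` framing:

```python
import json

import requests

response = requests.post(
    "http://localhost:8001/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "How do you fry an egg?"}],
        "use_context": True,
        "stream": True,
    },
    stream=True,
)
for line in response.iter_lines():
    # Each SSE data line is assumed to carry an OpenAI-style completion chunk
    if line.startswith(b"data: ") and line != b"data: [DONE]":
        chunk = json.loads(line[len(b"data: "):])
        delta = chunk["choices"][0].get("delta", {})
        print(delta.get("content", ""), end="", flush=True)
```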

@@ -0,0 +1,116 @@
from collections.abc import Sequence
from typing import TYPE_CHECKING, Any

from injector import inject, singleton
from llama_index import ServiceContext, StorageContext, VectorStoreIndex
from llama_index.chat_engine import ContextChatEngine
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor
from llama_index.llm_predictor.utils import stream_chat_response_to_tokens
from llama_index.llms import ChatMessage
from llama_index.types import TokenGen

from private_gpt.components.embedding.embedding_component import EmbeddingComponent
from private_gpt.components.llm.llm_component import LLMComponent
from private_gpt.components.node_store.node_store_component import NodeStoreComponent
from private_gpt.components.vector_store.vector_store_component import (
    VectorStoreComponent,
)
from private_gpt.open_ai.extensions.context_filter import ContextFilter

if TYPE_CHECKING:
    from llama_index.chat_engine.types import (
        AgentChatResponse,
        StreamingAgentChatResponse,
    )


@singleton
class ChatService:
    @inject
    def __init__(
        self,
        llm_component: LLMComponent,
        vector_store_component: VectorStoreComponent,
        embedding_component: EmbeddingComponent,
        node_store_component: NodeStoreComponent,
    ) -> None:
        self.llm_service = llm_component
        self.vector_store_component = vector_store_component
        self.storage_context = StorageContext.from_defaults(
            vector_store=vector_store_component.vector_store,
            docstore=node_store_component.doc_store,
            index_store=node_store_component.index_store,
        )
        self.service_context = ServiceContext.from_defaults(
            llm=llm_component.llm, embed_model=embedding_component.embedding_model
        )
        self.index = VectorStoreIndex.from_vector_store(
            vector_store_component.vector_store,
            storage_context=self.storage_context,
            service_context=self.service_context,
            show_progress=True,
        )

    def _chat_with_context(
        self,
        message: str,
        context_filter: ContextFilter | None = None,
        chat_history: Sequence[ChatMessage] | None = None,
        streaming: bool = False,
    ) -> Any:
        vector_index_retriever = self.vector_store_component.get_retriever(
            index=self.index, context_filter=context_filter
        )
        chat_engine = ContextChatEngine.from_defaults(
            retriever=vector_index_retriever,
            service_context=self.service_context,
            node_postprocessors=[
                MetadataReplacementPostProcessor(target_metadata_key="window"),
            ],
        )
        if streaming:
            result = chat_engine.stream_chat(message, chat_history)
        else:
            result = chat_engine.chat(message, chat_history)
        return result

    def stream_chat(
        self,
        messages: list[ChatMessage],
        use_context: bool = False,
        context_filter: ContextFilter | None = None,
    ) -> TokenGen:
        if use_context:
            last_message = messages[-1].content
            response: StreamingAgentChatResponse = self._chat_with_context(
                message=last_message if last_message is not None else "",
                chat_history=messages[:-1],
                context_filter=context_filter,
                streaming=True,
            )
            response_gen = response.response_gen
        else:
            stream = self.llm_service.llm.stream_chat(messages)
            response_gen = stream_chat_response_to_tokens(stream)
        return response_gen

    def chat(
        self,
        messages: list[ChatMessage],
        use_context: bool = False,
        context_filter: ContextFilter | None = None,
    ) -> str:
        if use_context:
            last_message = messages[-1].content
            wrapped_response: AgentChatResponse = self._chat_with_context(
                message=last_message if last_message is not None else "",
                chat_history=messages[:-1],
                context_filter=context_filter,
                streaming=False,
            )
            response = wrapped_response.response
        else:
            chat_response = self.llm_service.llm.chat(messages)
            response_content = chat_response.message.content
            response = response_content if response_content is not None else ""
        return response
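
As a usage sketch only (not part of the diff): the service can also be driven directly through the DI container, mirroring what the chat router above does; the question text is illustrative:

```python
from llama_index.llms import ChatMessage, MessageRole

from private_gpt.di import root_injector
from private_gpt.server.chat.chat_service import ChatService

# Resolve the singleton service the same way the routers do
chat_service = root_injector.get(ChatService)
answer = chat_service.chat(
    messages=[ChatMessage(role=MessageRole.USER, content="What were Q3 sales?")],
    use_context=True,  # ground the answer in ingested documents
)
print(answer)
```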

@@ -0,0 +1,53 @@
from fastapi import APIRouter
from pydantic import BaseModel, Field

from private_gpt.di import root_injector
from private_gpt.open_ai.extensions.context_filter import ContextFilter
from private_gpt.server.chunks.chunks_service import Chunk, ChunksService

chunks_router = APIRouter(prefix="/v1")


class ChunksBody(BaseModel):
    text: str = Field(examples=["Q3 2023 sales"])
    context_filter: ContextFilter | None = None
    limit: int = 10
    prev_next_chunks: int = Field(default=0, examples=[2])


class ChunksResponse(BaseModel):
    object: str = Field(enum=["list"])
    model: str = Field(enum=["private-gpt"])
    data: list[Chunk]


@chunks_router.post("/chunks", tags=["Context Chunks"])
def chunks_retrieval(body: ChunksBody) -> ChunksResponse:
    """Given a `text`, returns the most relevant chunks from the ingested documents.

    The returned information can be used to generate prompts that can be
    passed to `/completions` or `/chat/completions` APIs. Note: it is usually a very
    fast API, because only the Embeddings model is involved, not the LLM. The
    returned information contains the relevant chunk `text` together with the source
    `document` it is coming from. It also contains a score that can be used to
    compare different results.

    The max number of chunks to be returned is set using the `limit` param.

    Previous and next chunks (pieces of text that appear right before or after in the
    document) can be fetched by using the `prev_next_chunks` field.

    The documents being used can be filtered using the `context_filter` and passing
    the document IDs to be used. Ingested documents IDs can be found using
    `/ingest/list` endpoint. If you want all ingested documents to be used,
    remove `context_filter` altogether.
    """
    service = root_injector.get(ChunksService)
    results = service.retrieve_relevant(
        body.text, body.context_filter, body.limit, body.prev_next_chunks
    )
    return ChunksResponse(
        object="list",
        model="private-gpt",
        data=results,
    )
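
A minimal client sketch for the chunks endpoint (not part of the diff), again assuming the `requests` library and a server on the local development port 8001:

```python
import requests

resp = requests.post(
    "http://localhost:8001/v1/chunks",
    json={"text": "Q3 2023 sales", "limit": 5, "prev_next_chunks": 2},
)
# Each returned chunk carries a relevance score, its text and the source document
for chunk in resp.json()["data"]:
    print(chunk["score"], chunk["text"][:80])
```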

@@ -0,0 +1,119 @@
from typing import TYPE_CHECKING

from injector import inject, singleton
from llama_index import ServiceContext, StorageContext, VectorStoreIndex
from llama_index.schema import NodeWithScore
from pydantic import BaseModel, Field

from private_gpt.components.embedding.embedding_component import EmbeddingComponent
from private_gpt.components.llm.llm_component import LLMComponent
from private_gpt.components.node_store.node_store_component import NodeStoreComponent
from private_gpt.components.vector_store.vector_store_component import (
    VectorStoreComponent,
)
from private_gpt.open_ai.extensions.context_filter import ContextFilter
from private_gpt.server.ingest.ingest_service import IngestedDoc

if TYPE_CHECKING:
    from llama_index.schema import RelatedNodeInfo


class Chunk(BaseModel):
    object: str = Field(enum=["context.chunk"])
    score: float = Field(examples=[0.023])
    document: IngestedDoc
    text: str = Field(examples=["Outbound sales increased 20%, driven by new leads."])
    previous_texts: list[str] | None = Field(
        examples=[["SALES REPORT 2023", "Inbound didn't show major changes."]]
    )
    next_texts: list[str] | None = Field(
        examples=[
            [
                "New leads came from Google Ads campaign.",
                "The campaign was run by the Marketing Department",
            ]
        ]
    )


@singleton
class ChunksService:
    @inject
    def __init__(
        self,
        llm_component: LLMComponent,
        vector_store_component: VectorStoreComponent,
        embedding_component: EmbeddingComponent,
        node_store_component: NodeStoreComponent,
    ) -> None:
        self.vector_store_component = vector_store_component
        self.storage_context = StorageContext.from_defaults(
            vector_store=vector_store_component.vector_store,
            docstore=node_store_component.doc_store,
            index_store=node_store_component.index_store,
        )
        self.query_service_context = ServiceContext.from_defaults(
            llm=llm_component.llm, embed_model=embedding_component.embedding_model
        )

    def _get_sibling_nodes_text(
        self, node_with_score: NodeWithScore, related_number: int, forward: bool = True
    ) -> list[str]:
        explored_nodes_texts = []
        current_node = node_with_score.node
        for _ in range(related_number):
            explored_node_info: RelatedNodeInfo | None = (
                current_node.next_node if forward else current_node.prev_node
            )
            if explored_node_info is None:
                break
            explored_node = self.storage_context.docstore.get_node(
                explored_node_info.node_id
            )
            explored_nodes_texts.append(explored_node.get_content())
            current_node = explored_node
        return explored_nodes_texts

    def retrieve_relevant(
        self,
        text: str,
        context_filter: ContextFilter | None = None,
        limit: int = 10,
        prev_next_chunks: int = 0,
    ) -> list[Chunk]:
        index = VectorStoreIndex.from_vector_store(
            self.vector_store_component.vector_store,
            storage_context=self.storage_context,
            service_context=self.query_service_context,
            show_progress=True,
        )
        vector_index_retriever = self.vector_store_component.get_retriever(
            index=index, context_filter=context_filter, similarity_top_k=limit
        )
        nodes = vector_index_retriever.retrieve(text)
        nodes.sort(key=lambda n: n.score or 0.0, reverse=True)

        retrieved_nodes = []
        for node in nodes:
            doc_id = node.node.ref_doc_id if node.node.ref_doc_id is not None else "-"
            retrieved_nodes.append(
                Chunk(
                    object="context.chunk",
                    score=node.score or 0.0,
                    document=IngestedDoc(
                        object="ingest.document",
                        doc_id=doc_id,
                        doc_metadata=node.metadata,
                    ),
                    text=node.get_content(),
                    previous_texts=self._get_sibling_nodes_text(
                        node, prev_next_chunks, False
                    ),
                    next_texts=self._get_sibling_nodes_text(node, prev_next_chunks),
                )
            )
        return retrieved_nodes

@@ -0,0 +1 @@
"""Deprecated Openai compatibility endpoint."""

@@ -0,0 +1,66 @@
from fastapi import APIRouter
from pydantic import BaseModel
from starlette.responses import StreamingResponse

from private_gpt.open_ai.extensions.context_filter import ContextFilter
from private_gpt.open_ai.openai_models import (
    OpenAICompletion,
    OpenAIMessage,
)
from private_gpt.server.chat.chat_router import ChatBody, chat_completion

completions_router = APIRouter(prefix="/v1")


class CompletionsBody(BaseModel):
    prompt: str
    use_context: bool = False
    context_filter: ContextFilter | None = None
    stream: bool = False

    model_config = {
        "json_schema_extra": {
            "examples": [
                {
                    "prompt": "How do you fry an egg?",
                    "stream": False,
                    "use_context": False,
                }
            ]
        }
    }


@completions_router.post(
    "/completions",
    response_model=None,
    summary="Completion",
    responses={200: {"model": OpenAICompletion}},
    tags=["Contextual Completions"],
)
def prompt_completion(body: CompletionsBody) -> OpenAICompletion | StreamingResponse:
    """We recommend most users use our Chat completions API.

    Given a prompt, the model will return one predicted completion. If `use_context`
    is set to `true`, the model will use context coming from the ingested documents
    to create the response. The documents being used can be filtered using the
    `context_filter` and passing the document IDs to be used. Ingested documents IDs
    can be found using `/ingest/list` endpoint. If you want all ingested documents to
    be used, remove `context_filter` altogether.

    When using `'stream': true`, the API will return data chunks following [OpenAI's
    streaming model](https://platform.openai.com/docs/api-reference/chat/streaming):
    ```
    {"id":"12345","object":"completion.chunk","created":1694268190,
    "model":"private-gpt","choices":[{"index":0,"delta":{"content":"Hello"},
    "finish_reason":null}]}
    ```
    """
    message = OpenAIMessage(content=body.prompt, role="user")
    chat_body = ChatBody(
        messages=[message],
        use_context=body.use_context,
        stream=body.stream,
        context_filter=body.context_filter,
    )
    return chat_completion(chat_body)
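
A minimal client sketch for this completions endpoint (not part of the diff), assuming `requests` and a local server on port 8001; beyond following the OpenAI response format, the exact field layout of the JSON response is not asserted here:

```python
import requests

resp = requests.post(
    "http://localhost:8001/v1/completions",
    json={"prompt": "How do you fry an egg?", "use_context": False},
)
# Non-streaming call: the body is an OpenAI-style completion object
print(resp.json())
```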

@@ -0,0 +1,33 @@
from fastapi import APIRouter
from pydantic import BaseModel, Field

from private_gpt.di import root_injector
from private_gpt.server.embeddings.embeddings_service import (
    Embedding,
    EmbeddingsService,
)

embeddings_router = APIRouter(prefix="/v1")


class EmbeddingsBody(BaseModel):
    input: str | list[str]


class EmbeddingsResponse(BaseModel):
    object: str = Field(enum=["list"])
    model: str = Field(enum=["private-gpt"])
    data: list[Embedding]


@embeddings_router.post("/embeddings", tags=["Embeddings"])
def embeddings_generation(body: EmbeddingsBody) -> EmbeddingsResponse:
    """Get a vector representation of a given input.

    That vector representation can be easily consumed
    by machine learning models and algorithms.
    """
    service = root_injector.get(EmbeddingsService)
    input_texts = body.input if isinstance(body.input, list) else [body.input]
    embeddings = service.texts_embeddings(input_texts)
    return EmbeddingsResponse(object="list", model="private-gpt", data=embeddings)
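
A minimal client sketch for the embeddings endpoint (not part of the diff), assuming `requests` and a local server on port 8001; `input` accepts a single string or a list of strings:

```python
import requests

resp = requests.post(
    "http://localhost:8001/v1/embeddings",
    json={"input": ["SALES REPORT 2023", "Q3 2023 sales"]},
)
# Each item carries its position in the input list and the embedding vector
for item in resp.json()["data"]:
    print(item["index"], len(item["embedding"]))
```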

@@ -0,0 +1,28 @@
from injector import inject, singleton
from pydantic import BaseModel, Field

from private_gpt.components.embedding.embedding_component import EmbeddingComponent


class Embedding(BaseModel):
    index: int
    object: str = Field(enum=["embedding"])
    embedding: list[float] = Field(examples=[[0.0023064255, -0.009327292]])


@singleton
class EmbeddingsService:
    @inject
    def __init__(self, embedding_component: EmbeddingComponent) -> None:
        self.embedding_model = embedding_component.embedding_model

    def texts_embeddings(self, texts: list[str]) -> list[Embedding]:
        texts_embeddings = self.embedding_model.get_text_embedding_batch(texts)
        return [
            Embedding(
                index=texts_embeddings.index(embedding),
                object="embedding",
                embedding=embedding,
            )
            for embedding in texts_embeddings
        ]

@@ -0,0 +1,14 @@
from fastapi import APIRouter
from pydantic import BaseModel, Field

health_router = APIRouter()


class HealthResponse(BaseModel):
    status: str = Field(enum=["ok"])


@health_router.get("/health", tags=["Health"])
def health() -> HealthResponse:
    """Return ok if the system is up."""
    return HealthResponse(status="ok")

@@ -0,0 +1,49 @@
from fastapi import APIRouter, HTTPException, UploadFile
from pydantic import BaseModel, Field

from private_gpt.di import root_injector
from private_gpt.server.ingest.ingest_service import IngestedDoc, IngestService

ingest_router = APIRouter(prefix="/v1")


class IngestResponse(BaseModel):
    object: str = Field(enum=["list"])
    model: str = Field(enum=["private-gpt"])
    data: list[IngestedDoc]


@ingest_router.post("/ingest", tags=["Ingestion"])
def ingest(file: UploadFile) -> IngestResponse:
    """Ingests and processes a file, storing its chunks to be used as context.

    The context obtained from files is later used in
    `/chat/completions`, `/completions`, and `/chunks` APIs.

    Most common document formats are supported, but you may be prompted to
    install an extra dependency to manage a specific file type.

    A file can generate different Documents (for example a PDF generates one Document
    per page). All Documents IDs are returned in the response, together with the
    extracted Metadata (which is later used to improve context retrieval). Those IDs
    can be used to filter the context used to create responses in
    `/chat/completions`, `/completions`, and `/chunks` APIs.
    """
    service = root_injector.get(IngestService)
    if file.filename is None:
        raise HTTPException(400, "No file name provided")
    ingested_documents = service.ingest(file.filename, file.file.read())
    return IngestResponse(object="list", model="private-gpt", data=ingested_documents)


@ingest_router.get("/ingest/list", tags=["Ingestion"])
def list_ingested() -> IngestResponse:
    """Lists already ingested Documents including their Document ID and metadata.

    Those IDs can be used to filter the context used to create responses
    in `/chat/completions`, `/completions`, and `/chunks` APIs.
    """
    service = root_injector.get(IngestService)
    ingested_documents = service.list_ingested()
    return IngestResponse(object="list", model="private-gpt", data=ingested_documents)
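
A minimal client sketch for ingestion (not part of the diff), assuming `requests`, a local server on port 8001, and an illustrative file path:

```python
import requests

# Upload a document as multipart form data; the form field matches the
# `file` parameter of the /v1/ingest endpoint above
with open("Sales Report Q3 2023.pdf", "rb") as f:
    resp = requests.post("http://localhost:8001/v1/ingest", files={"file": f})
for doc in resp.json()["data"]:
    print(doc["doc_id"], doc["doc_metadata"])

# List everything that has been ingested so far
listed = requests.get("http://localhost:8001/v1/ingest/list").json()
print(len(listed["data"]), "documents ingested")
```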

@@ -0,0 +1,159 @@
import tempfile
from pathlib import Path
from typing import TYPE_CHECKING, Any, AnyStr

from injector import inject, singleton
from llama_index import (
    Document,
    ServiceContext,
    StorageContext,
    StringIterableReader,
    VectorStoreIndex,
)
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.readers.file.base import DEFAULT_FILE_READER_CLS
from pydantic import BaseModel, Field

from private_gpt.components.embedding.embedding_component import EmbeddingComponent
from private_gpt.components.llm.llm_component import LLMComponent
from private_gpt.components.node_store.node_store_component import NodeStoreComponent
from private_gpt.components.vector_store.vector_store_component import (
    VectorStoreComponent,
)
from private_gpt.paths import local_data_path

if TYPE_CHECKING:
    from llama_index.readers.base import BaseReader


class IngestedDoc(BaseModel):
    object: str = Field(enum=["ingest.document"])
    doc_id: str = Field(examples=["c202d5e6-7b69-4869-81cc-dd574ee8ee11"])
    doc_metadata: dict[str, Any] | None = Field(
        examples=[
            {
                "page_label": "2",
                "file_name": "Sales Report Q3 2023.pdf",
            }
        ]
    )

    @staticmethod
    def curate_metadata(metadata: dict[str, Any]) -> dict[str, Any]:
        """Remove unwanted metadata keys."""
        metadata.pop("doc_id", None)
        metadata.pop("window", None)
        metadata.pop("original_text", None)
        return metadata


@singleton
class IngestService:
    @inject
    def __init__(
        self,
        llm_component: LLMComponent,
        vector_store_component: VectorStoreComponent,
        embedding_component: EmbeddingComponent,
        node_store_component: NodeStoreComponent,
    ) -> None:
        self.llm_service = llm_component
        self.storage_context = StorageContext.from_defaults(
            vector_store=vector_store_component.vector_store,
            docstore=node_store_component.doc_store,
            index_store=node_store_component.index_store,
        )
        self.ingest_service_context = ServiceContext.from_defaults(
            llm=self.llm_service.llm,
            embed_model=embedding_component.embedding_model,
            node_parser=SentenceWindowNodeParser.from_defaults(),
        )

    def ingest(self, file_name: str, file_data: AnyStr | Path) -> list[IngestedDoc]:
        extension = Path(file_name).suffix
        reader_cls = DEFAULT_FILE_READER_CLS.get(extension)
        documents: list[Document]
        if reader_cls is None:
            # Read as plain text
            string_reader = StringIterableReader()
            if isinstance(file_data, Path):
                text = file_data.read_text()
                documents = string_reader.load_data([text])
            elif isinstance(file_data, bytes):
                documents = string_reader.load_data([file_data.decode("utf-8")])
            elif isinstance(file_data, str):
                documents = string_reader.load_data([file_data])
            else:
                raise ValueError(f"Unsupported data type {type(file_data)}")
        else:
            reader: BaseReader = reader_cls()
            if isinstance(file_data, Path):
                # Already a path, nothing to do
                documents = reader.load_data(file_data)
            else:
                # llama-index mainly supports reading from files, so
                # we have to create a tmp file for it to read from
                with tempfile.NamedTemporaryFile() as tmp:
                    path_to_tmp = Path(tmp.name)
                    if isinstance(file_data, bytes):
                        path_to_tmp.write_bytes(file_data)
                    else:
                        path_to_tmp.write_text(str(file_data))
                    documents = reader.load_data(path_to_tmp)
        for document in documents:
            document.metadata["file_name"] = file_name
        return self._save_docs(documents)

    def _save_docs(self, documents: list[Document]) -> list[IngestedDoc]:
        for document in documents:
            document.metadata["doc_id"] = document.doc_id
            # We don't want the Embeddings search to receive this metadata
            document.excluded_embed_metadata_keys = ["doc_id"]
            # We don't want the LLM to receive these metadata in the context
            document.excluded_llm_metadata_keys = ["file_name", "doc_id", "page_label"]
        # Create the vector store index
        VectorStoreIndex.from_documents(
            documents,
            storage_context=self.storage_context,
            service_context=self.ingest_service_context,
            store_nodes_override=True,  # Force store nodes in index and document stores
            show_progress=True,
        )
        # Persist the index and nodes
        self.storage_context.persist(persist_dir=local_data_path)
        return [
            IngestedDoc(
                object="ingest.document",
                doc_id=document.doc_id,
                doc_metadata=IngestedDoc.curate_metadata(document.metadata),
            )
            for document in documents
        ]

    def list_ingested(self) -> list[IngestedDoc]:
        ingested_docs = []
        try:
            docstore = self.storage_context.docstore
            ingested_docs_ids: set[str] = set()
            for node in docstore.docs.values():
                if node.ref_doc_id is not None:
                    ingested_docs_ids.add(node.ref_doc_id)
            for doc_id in ingested_docs_ids:
                ref_doc_info = docstore.get_ref_doc_info(ref_doc_id=doc_id)
                doc_metadata = None
                if ref_doc_info is not None and ref_doc_info.metadata is not None:
                    doc_metadata = IngestedDoc.curate_metadata(ref_doc_info.metadata)
                ingested_docs.append(
                    IngestedDoc(
                        object="ingest.document",
                        doc_id=doc_id,
                        doc_metadata=doc_metadata,
                    )
                )
            return ingested_docs
        except ValueError:
            pass
        return ingested_docs

@@ -0,0 +1,46 @@
from collections.abc import Callable
from pathlib import Path
from typing import Any

from watchdog.events import (
    DirCreatedEvent,
    DirModifiedEvent,
    FileCreatedEvent,
    FileModifiedEvent,
    FileSystemEventHandler,
)
from watchdog.observers import Observer


class IngestWatcher:
    def __init__(
        self, watch_path: Path, on_file_changed: Callable[[Path], None]
    ) -> None:
        self.watch_path = watch_path
        self.on_file_changed = on_file_changed

        class Handler(FileSystemEventHandler):
            def on_modified(self, event: DirModifiedEvent | FileModifiedEvent) -> None:
                if isinstance(event, FileModifiedEvent):
                    on_file_changed(Path(event.src_path))

            def on_created(self, event: DirCreatedEvent | FileCreatedEvent) -> None:
                if isinstance(event, FileCreatedEvent):
                    on_file_changed(Path(event.src_path))

        event_handler = Handler()
        observer: Any = Observer()
        self._observer = observer
        self._observer.schedule(event_handler, str(watch_path), recursive=True)

    def start(self) -> None:
        self._observer.start()
        while self._observer.is_alive():
            try:
                self._observer.join(1)
            except KeyboardInterrupt:
                break

    def stop(self) -> None:
        self._observer.stop()
        self._observer.join()
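
As a usage sketch only (not part of the diff): a watcher can be wired to the ingest service roughly as below; the import path for `IngestWatcher`, the watched folder, and the re-ingestion callback are assumptions for illustration:

```python
from pathlib import Path

from private_gpt.di import root_injector
from private_gpt.server.ingest.ingest_service import IngestService
from private_gpt.server.ingest.ingest_watcher import IngestWatcher  # module path assumed

ingest_service = root_injector.get(IngestService)


def re_ingest(path: Path) -> None:
    # Re-ingest any file created or modified under the watched folder
    print(f"Ingesting {path}")
    ingest_service.ingest(path.name, path)


watcher = IngestWatcher(Path("local_data"), re_ingest)  # folder is illustrative
watcher.start()  # blocks, polling the observer until interrupted
```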