66 Commits

Author SHA1 Message Date
imartinez
829f42909c Update twitter account 2024-04-09 15:18:03 +02:00
Pablo Orgaz
347be643f7 fix(llm): special tokens and leading space (#1831) 2024-04-04 14:37:29 +02:00
imartinez
08c4ab175e Fix version in poetry 2024-04-03 10:59:35 +02:00
imartinez
f469b4619d Add required Ollama setting 2024-04-02 18:27:57 +02:00
github-actions[bot]
94ef38cbba chore(main): release 0.5.0 (#1708)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2024-04-02 17:45:15 +02:00
Иван
8a836e4651 feat(docs): Add guide Llama-CPP Linux AMD GPU support (#1782) 2024-04-02 16:55:05 +02:00
Ingrid Stevens
f0b174c097 feat(ui): Add Model Information to ChatInterface label 2024-04-02 16:52:27 +02:00
igeni
bac818add5 feat(code): improve concat of strings in ui (#1785) 2024-04-02 16:42:40 +02:00
Brett England
ea153fb92f feat(scripts): Wipe qdrant and obtain db Stats command (#1783) 2024-04-02 16:41:42 +02:00
Robin Boone
b3b0140e24 feat(llm): Ollama LLM-Embeddings decouple + longer keep_alive settings (#1800) 2024-04-02 16:23:10 +02:00
machatschek
83adc12a8e feat(RAG): Introduce SentenceTransformer Reranker (#1810) 2024-04-02 10:29:51 +02:00
Marco Repetto
f83abff8bc feat(docker): set default Docker to use Ollama (#1812) 2024-04-01 13:08:48 +02:00
icsy7867
087cb0b7b7 feat(rag): expose similarity_top_k and similarity_score to settings (#1771)
* Added RAG settings to settings.py, vector_store and chat_service to add similarity_top_k and similarity_score

* Updated settings in vector and chat service per Ivans request

* Updated code for mypy
2024-03-20 22:25:26 +01:00
Marco Repetto
774e256052 fix: Fixed docker-compose (#1758)
* Fixed docker-compose

* Update docker-compose.yaml
2024-03-20 21:36:45 +01:00
Iván Martínez
6f6c785dac feat(llm): Ollama timeout setting (#1773)
* added request_timeout to ollama, default set to 30.0 in settings.yaml and settings-ollama.yaml

* Update settings-ollama.yaml

* Update settings.yaml

* updated settings.py and tidied up settings-ollama-yaml

* feat(UI): Faster startup and document listing (#1763)

* fix(ingest): update script label (#1770)

huggingface -> Hugging Face

* Fix lint errors

---------

Co-authored-by: Stephen Gresham <steve@gresham.id.au>
Co-authored-by: Ikko Eltociear Ashimine <eltociear@gmail.com>
2024-03-20 21:33:46 +01:00
Brett England
c2d694852b feat: wipe per storage type (#1772) 2024-03-20 21:31:44 +01:00
Ikko Eltociear Ashimine
7d2de5c96f fix(ingest): update script label (#1770)
huggingface -> Hugging Face
2024-03-20 20:23:08 +01:00
Iván Martínez
348df781b5 feat(UI): Faster startup and document listing (#1763) 2024-03-20 19:11:44 +01:00
Iván Martínez
572518143a feat(docs): Feature/upgrade docs (#1741)
* Upgrade fern version

* Add info about SDKs
2024-03-19 21:26:53 +01:00
Brett England
134fc54d7d feat(ingest): Created a faster ingestion mode - pipeline (#1750)
* Unify pgvector and postgres connection settings

* Remove local changes

* Update file pgvector->postgres

* postgresql should be postgres

* Adding pipeline ingestion mode

* disable hugging face parallelism.  Continue on file to doc transform failure

* Semaphore to limit docq async workers. ETA reporting
2024-03-19 21:24:46 +01:00
Otto L
1efac6a3fe feat(llm - embed): Add support for Azure OpenAI (#1698)
* Add support for Azure OpenAI

* fix: wrong default api_version

Should be dashes instead of underscores.
see: https://learn.microsoft.com/en-us/azure/ai-services/openai/reference

* fix: code styling

applied "make check" changes

* refactor: extend documentation

* mention azopenai as available option and extras
* add recommended section
* include settings-azopenai.yaml configuration file

* fix: documentation
2024-03-15 16:49:50 +01:00
Brett England
258d02d87c fix(docs): Minor documentation amendment (#1739)
* Unify pgvector and postgres connection settings

* Remove local changes

* Update file pgvector->postgres

* postgresql should be postgres
2024-03-15 16:36:32 +01:00
Brett England
63de7e4930 feat: unify settings for vector and nodestore connections to PostgreSQL (#1730)
* Unify pgvector and postgres connection settings

* Remove local changes

* Update file pgvector->postgres
2024-03-15 09:55:17 +01:00
Brett England
68b3a34b03 feat(nodestore): add Postgres for the doc and index store (#1706)
* Adding Postgres for the doc and index store

* Adding documentation.  Rename postgres database local->simple.  Postgres storage dependencies

* Update documentation for postgres storage

* Renaming feature to nodestore

* update docstore -> nodestore in doc

* missed some docstore changes in doc

* Updated poetry.lock

* Formatting updates to pass ruff/black checks

* Correction to unreachable code!

* Format adjustment to pass black test

* Adjust extra inclusion name for vector pg

* extra dep change for pg vector

* storage-postgres -> storage-nodestore-postgres

* Hash change on poetry lock
2024-03-14 17:12:33 +01:00
Iván Martínez
d17c34e81a fix(settings): set default tokenizer to avoid running make setup fail (#1709) 2024-03-13 09:53:40 +01:00
Andrew Jiang
84ad16af80 feat(docs): upgrade fern (#1596) 2024-03-11 23:02:56 +01:00
Arun Yadav
821bca32e9 feat(local): tiktoken cache within repo for offline (#1467) 2024-03-11 22:55:13 +01:00
icsy7867
02dc83e8e9 feat(llm): adds serveral settings for llamacpp and ollama (#1703) 2024-03-11 22:51:05 +01:00
Hoffelhas
410bf7a71f feat(ui): maintain score order when curating sources (#1643)
* Update ui.py

Changed 'curated_sources' from a list, in order to maintain score order when returning the curated sources.

* Maintain score order after curating sources
2024-03-11 22:27:30 +01:00
icsy7867
290b9fb084 feat(ui): add sources check to not repeat identical sources (#1705) 2024-03-11 22:24:18 +01:00
github-actions[bot]
1b03b369c0 chore(main): release 0.4.0 (#1628)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2024-03-06 17:53:35 +01:00
Iván Martínez
45f05711eb feat: Upgrade to LlamaIndex to 0.10 (#1663)
* Extract optional dependencies

* Separate local mode into llms-llama-cpp and embeddings-huggingface for clarity

* Support Ollama embeddings

* Upgrade to llamaindex 0.10.14. Remove legacy use of ServiceContext in ContextChatEngine

* Fix vector retriever filters
2024-03-06 17:51:30 +01:00
Daniel Gallego Vico
12f3a39e8a Update x handle to zylon private gpt (#1644) 2024-02-23 15:51:35 +01:00
TQ
cd40e3982b feat(Vector): support pgvector (#1624) 2024-02-20 15:29:26 +01:00
github-actions[bot]
066ea5bf28 chore(main): release 0.3.0 (#1413)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2024-02-16 17:42:39 +01:00
Iván Martínez
aa13afde07 feat(UI): Select file to Query or Delete + Delete ALL (#1612)
---------

Co-authored-by: Robin Boone <rboone@sofics.com>
2024-02-16 17:36:09 +01:00
icsy7867
24fb80ca38 fix(UI): Updated ui.py. Frees up the CPU to not be bottlenecked.
Updated ui.py to include a small sleep timer while building the stream deltas.  This recursive function fires off so quickly to eats up too much of the CPU.  This small sleep frees up the CPU to not be bottlenecked.  This value can go lower/shorter.  But 0.02 or 0.025 seems to work well. (#1589)

Co-authored-by: root <root@wesgitlabdemo.icl.gtri.org>
2024-02-16 12:52:14 +01:00
Ygal Blum
6bbec79583 feat(llm): Add support for Ollama LLM (#1526) 2024-02-09 15:50:50 +01:00
Nick Smirnov
b178b51451 feat(bulk-ingest): Add --ignored Flag to Exclude Specific Files and Directories During Ingestion (#1432) 2024-02-07 19:59:32 +01:00
Iván Martínez
24fae660e6 feat: Add stream information to generate SDKs (#1569) 2024-02-02 16:14:22 +01:00
Pablo Orgaz
3e67e21d38 Add embedding mode config (#1541) 2024-01-25 10:55:32 +01:00
Naveen Kannan
869233f0e4 fix: Adding an LLM param to fix broken generator from llamacpp (#1519) 2024-01-17 18:10:45 +01:00
CognitiveTech
e326126d0d feat: add mistral + chatml prompts (#1426) 2024-01-16 22:51:14 +01:00
Robert Gay
6191bcdbd6 fix: minor bug in chat stream output - python error being serialized (#1449) 2024-01-16 16:41:20 +01:00
Iván Martínez
d3acd85fe3 fix(tests): load the test settings only when running tests
Previous implementation causes false positives with the last version of LlamaIndex
2024-01-09 12:03:16 +01:00
Guido Schulz
0a89d76cc5 fix(docs): Update quickstart doc and set version in pyproject.toml to 0.2.0 2023-12-26 13:09:31 +01:00
Matthew Hill
2d27a9f956 feat(llm): Add openailike llm mode (#1447)
This mode behaves the same as the openai mode, except that it allows setting custom models not
supported by OpenAI. It can be used with any tool that serves models from an OpenAI compatible API.

Implements #1424
2023-12-26 10:26:08 +01:00
imartinez
fee9f08ef3 Move back to 3900 for the context window to avoid melting local machines 2023-12-22 18:21:43 +01:00
Iván Martínez
fde2b942bc fix(deploy): fix local and external dockerfiles 2023-12-22 14:16:46 +01:00
Iván Martínez
4c69c458ab Improve ingest logs (#1438) 2023-12-21 17:13:46 +01:00
Iván Martínez
4780540870 feat(settings): Configurable context_window and tokenizer (#1437) 2023-12-21 14:49:35 +01:00
Iván Martínez
6eeb95ec7f feat(API): Ingest plain text (#1417)
* Add ingest/text route to ingest plain text

* Add new ingest text test and adapt ingest/file ones

* Include new API in docs

* Remove duplicated logic
2023-12-18 21:47:05 +01:00
Pablo Orgaz
059f35840a fix(docker): docker broken copy (#1419) 2023-12-18 16:55:18 +01:00
Iván Martínez
8ec7cf49f4 feat(settings): Update default model to TheBloke/Mistral-7B-Instruct-v0.2-GGUF (#1415)
* Update LlamaCPP dependency

* Default to TheBloke/Mistral-7B-Instruct-v0.2-GGUF

* Fix API docs
2023-12-17 16:11:08 +01:00
Rohit Das
c71ae7cee9 feat(ui): make chat area stretch to fill the screen (#1397) 2023-12-17 12:02:13 +01:00
cognitivetech
2564f8d2bb fix(settings): correct yaml multiline string (#1403) 2023-12-16 19:02:46 +01:00
Eliott Bouhana
4e496e970a docs: remove misleading comment about pgpt working with python 3.12 (#1394)
I was misled into believing I could install using python 3.12 whereas the pyproject.toml explicitly states otherwise. This PR only removes this comment to make sure other people are not also trapped 😄
2023-12-15 21:35:02 +01:00
Federico Grandi
3582764801 ci: fix preview docs checkout ref (#1393) 2023-12-12 20:33:34 +01:00
Federico Grandi
1d28ae2915 docs: fix minor capitalization typo (#1392) 2023-12-12 20:31:38 +01:00
github-actions[bot]
e8ac51bba4 chore(main): release 0.2.0 (#1387)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2023-12-10 20:08:12 +01:00
3ly-13
145f3ec9f4 feat(ui): Allows User to Set System Prompt via "Additional Options" in Chat Interface (#1353) 2023-12-10 19:45:14 +01:00
3ly-13
a072a40a7c Allow setting OpenAI model in settings (#1386)
feat(settings): Allow setting openai model to be used. Default to GPT 3.5
2023-12-09 20:13:00 +01:00
Louis Melchior
a3ed14c58f feat(llm): drop default_system_prompt (#1385)
As discussed on Discord, the decision has been made to remove the system prompts by default, to better segregate the API and the UI usages.

A concurrent PR (#1353) is enabling the dynamic setting of a system prompt in the UI.

Therefore, if UI users want to use a custom system prompt, they can specify one directly in the UI.
If the API users want to use a custom prompt, they can pass it directly into their messages that they are passing to the API.

In the highlight of the two use case above, it becomes clear that default system_prompt does not need to exist.
2023-12-08 23:13:51 +01:00
Iván Martínez
f235c50be9 Delete old docs (#1384) 2023-12-08 22:39:23 +01:00
EEmlan
9302620eac Adding german speaking model to documentation (#1374) 2023-12-08 11:26:25 +01:00
Max Zangs
9cf972563e Add setup option to Makefile (#1368) 2023-12-08 10:34:12 +01:00
85 changed files with 5158 additions and 4491 deletions


@@ -25,6 +25,6 @@ runs:
python-version: ${{ inputs.python_version }}
cache: "poetry"
- name: Install Dependencies
run: poetry install --with ui --no-root
run: poetry install --extras "ui vector-stores-qdrant" --no-root
shell: bash


@@ -14,6 +14,8 @@ jobs:
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
ref: refs/pull/${{ github.event.pull_request.number }}/merge
- name: Setup Node.js
uses: actions/setup-node@v4

.gitignore vendored

@@ -1,4 +1,6 @@
.venv
.env
venv
settings-me.yaml


@@ -1,5 +1,84 @@
# Changelog
## [0.5.0](https://github.com/zylon-ai/private-gpt/compare/v0.4.0...v0.5.0) (2024-04-02)
### Features
* **code:** improve concat of strings in ui ([#1785](https://github.com/zylon-ai/private-gpt/issues/1785)) ([bac818a](https://github.com/zylon-ai/private-gpt/commit/bac818add51b104cda925b8f1f7b51448e935ca1))
* **docker:** set default Docker to use Ollama ([#1812](https://github.com/zylon-ai/private-gpt/issues/1812)) ([f83abff](https://github.com/zylon-ai/private-gpt/commit/f83abff8bc955a6952c92cc7bcb8985fcec93afa))
* **docs:** Add guide Llama-CPP Linux AMD GPU support ([#1782](https://github.com/zylon-ai/private-gpt/issues/1782)) ([8a836e4](https://github.com/zylon-ai/private-gpt/commit/8a836e4651543f099c59e2bf497ab8c55a7cd2e5))
* **docs:** Feature/upgrade docs ([#1741](https://github.com/zylon-ai/private-gpt/issues/1741)) ([5725181](https://github.com/zylon-ai/private-gpt/commit/572518143ac46532382db70bed6f73b5082302c1))
* **docs:** upgrade fern ([#1596](https://github.com/zylon-ai/private-gpt/issues/1596)) ([84ad16a](https://github.com/zylon-ai/private-gpt/commit/84ad16af80191597a953248ce66e963180e8ddec))
* **ingest:** Created a faster ingestion mode - pipeline ([#1750](https://github.com/zylon-ai/private-gpt/issues/1750)) ([134fc54](https://github.com/zylon-ai/private-gpt/commit/134fc54d7d636be91680dc531f5cbe2c5892ac56))
* **llm - embed:** Add support for Azure OpenAI ([#1698](https://github.com/zylon-ai/private-gpt/issues/1698)) ([1efac6a](https://github.com/zylon-ai/private-gpt/commit/1efac6a3fe19e4d62325e2c2915cd84ea277f04f))
* **llm:** adds serveral settings for llamacpp and ollama ([#1703](https://github.com/zylon-ai/private-gpt/issues/1703)) ([02dc83e](https://github.com/zylon-ai/private-gpt/commit/02dc83e8e9f7ada181ff813f25051bbdff7b7c6b))
* **llm:** Ollama LLM-Embeddings decouple + longer keep_alive settings ([#1800](https://github.com/zylon-ai/private-gpt/issues/1800)) ([b3b0140](https://github.com/zylon-ai/private-gpt/commit/b3b0140e244e7a313bfaf4ef10eb0f7e4192710e))
* **llm:** Ollama timeout setting ([#1773](https://github.com/zylon-ai/private-gpt/issues/1773)) ([6f6c785](https://github.com/zylon-ai/private-gpt/commit/6f6c785dac2bbad37d0b67fda215784298514d39))
* **local:** tiktoken cache within repo for offline ([#1467](https://github.com/zylon-ai/private-gpt/issues/1467)) ([821bca3](https://github.com/zylon-ai/private-gpt/commit/821bca32e9ee7c909fd6488445ff6a04463bf91b))
* **nodestore:** add Postgres for the doc and index store ([#1706](https://github.com/zylon-ai/private-gpt/issues/1706)) ([68b3a34](https://github.com/zylon-ai/private-gpt/commit/68b3a34b032a08ca073a687d2058f926032495b3))
* **rag:** expose similarity_top_k and similarity_score to settings ([#1771](https://github.com/zylon-ai/private-gpt/issues/1771)) ([087cb0b](https://github.com/zylon-ai/private-gpt/commit/087cb0b7b74c3eb80f4f60b47b3a021c81272ae1))
* **RAG:** Introduce SentenceTransformer Reranker ([#1810](https://github.com/zylon-ai/private-gpt/issues/1810)) ([83adc12](https://github.com/zylon-ai/private-gpt/commit/83adc12a8ef0fa0c13a0dec084fa596445fc9075))
* **scripts:** Wipe qdrant and obtain db Stats command ([#1783](https://github.com/zylon-ai/private-gpt/issues/1783)) ([ea153fb](https://github.com/zylon-ai/private-gpt/commit/ea153fb92f1f61f64c0d04fff0048d4d00b6f8d0))
* **ui:** Add Model Information to ChatInterface label ([f0b174c](https://github.com/zylon-ai/private-gpt/commit/f0b174c097c2d5e52deae8ef88de30a0d9013a38))
* **ui:** add sources check to not repeat identical sources ([#1705](https://github.com/zylon-ai/private-gpt/issues/1705)) ([290b9fb](https://github.com/zylon-ai/private-gpt/commit/290b9fb084632216300e89bdadbfeb0380724b12))
* **UI:** Faster startup and document listing ([#1763](https://github.com/zylon-ai/private-gpt/issues/1763)) ([348df78](https://github.com/zylon-ai/private-gpt/commit/348df781b51606b2f9810bcd46f850e54192fd16))
* **ui:** maintain score order when curating sources ([#1643](https://github.com/zylon-ai/private-gpt/issues/1643)) ([410bf7a](https://github.com/zylon-ai/private-gpt/commit/410bf7a71f17e77c4aec723ab80c233b53765964))
* unify settings for vector and nodestore connections to PostgreSQL ([#1730](https://github.com/zylon-ai/private-gpt/issues/1730)) ([63de7e4](https://github.com/zylon-ai/private-gpt/commit/63de7e4930ac90dd87620225112a22ffcbbb31ee))
* wipe per storage type ([#1772](https://github.com/zylon-ai/private-gpt/issues/1772)) ([c2d6948](https://github.com/zylon-ai/private-gpt/commit/c2d694852b4696834962a42fde047b728722ad74))
### Bug Fixes
* **docs:** Minor documentation amendment ([#1739](https://github.com/zylon-ai/private-gpt/issues/1739)) ([258d02d](https://github.com/zylon-ai/private-gpt/commit/258d02d87c5cb81d6c3a6f06aa69339b670dffa9))
* Fixed docker-compose ([#1758](https://github.com/zylon-ai/private-gpt/issues/1758)) ([774e256](https://github.com/zylon-ai/private-gpt/commit/774e2560520dc31146561d09a2eb464c68593871))
* **ingest:** update script label ([#1770](https://github.com/zylon-ai/private-gpt/issues/1770)) ([7d2de5c](https://github.com/zylon-ai/private-gpt/commit/7d2de5c96fd42e339b26269b3155791311ef1d08))
* **settings:** set default tokenizer to avoid running make setup fail ([#1709](https://github.com/zylon-ai/private-gpt/issues/1709)) ([d17c34e](https://github.com/zylon-ai/private-gpt/commit/d17c34e81a84518086b93605b15032e2482377f7))
## [0.4.0](https://github.com/imartinez/privateGPT/compare/v0.3.0...v0.4.0) (2024-03-06)
### Features
* Upgrade to LlamaIndex to 0.10 ([#1663](https://github.com/imartinez/privateGPT/issues/1663)) ([45f0571](https://github.com/imartinez/privateGPT/commit/45f05711eb71ffccdedb26f37e680ced55795d44))
* **Vector:** support pgvector ([#1624](https://github.com/imartinez/privateGPT/issues/1624)) ([cd40e39](https://github.com/imartinez/privateGPT/commit/cd40e3982b780b548b9eea6438c759f1c22743a8))
## [0.3.0](https://github.com/imartinez/privateGPT/compare/v0.2.0...v0.3.0) (2024-02-16)
### Features
* add mistral + chatml prompts ([#1426](https://github.com/imartinez/privateGPT/issues/1426)) ([e326126](https://github.com/imartinez/privateGPT/commit/e326126d0d4cd7e46a79f080c442c86f6dd4d24b))
* Add stream information to generate SDKs ([#1569](https://github.com/imartinez/privateGPT/issues/1569)) ([24fae66](https://github.com/imartinez/privateGPT/commit/24fae660e6913aac6b52745fb2c2fe128ba2eb79))
* **API:** Ingest plain text ([#1417](https://github.com/imartinez/privateGPT/issues/1417)) ([6eeb95e](https://github.com/imartinez/privateGPT/commit/6eeb95ec7f17a618aaa47f5034ee5bccae02b667))
* **bulk-ingest:** Add --ignored Flag to Exclude Specific Files and Directories During Ingestion ([#1432](https://github.com/imartinez/privateGPT/issues/1432)) ([b178b51](https://github.com/imartinez/privateGPT/commit/b178b514519550e355baf0f4f3f6beb73dca7df2))
* **llm:** Add openailike llm mode ([#1447](https://github.com/imartinez/privateGPT/issues/1447)) ([2d27a9f](https://github.com/imartinez/privateGPT/commit/2d27a9f956d672cb1fe715cf0acdd35c37f378a5)), closes [#1424](https://github.com/imartinez/privateGPT/issues/1424)
* **llm:** Add support for Ollama LLM ([#1526](https://github.com/imartinez/privateGPT/issues/1526)) ([6bbec79](https://github.com/imartinez/privateGPT/commit/6bbec79583b7f28d9bea4b39c099ebef149db843))
* **settings:** Configurable context_window and tokenizer ([#1437](https://github.com/imartinez/privateGPT/issues/1437)) ([4780540](https://github.com/imartinez/privateGPT/commit/47805408703c23f0fd5cab52338142c1886b450b))
* **settings:** Update default model to TheBloke/Mistral-7B-Instruct-v0.2-GGUF ([#1415](https://github.com/imartinez/privateGPT/issues/1415)) ([8ec7cf4](https://github.com/imartinez/privateGPT/commit/8ec7cf49f40701a4f2156c48eb2fad9fe6220629))
* **ui:** make chat area stretch to fill the screen ([#1397](https://github.com/imartinez/privateGPT/issues/1397)) ([c71ae7c](https://github.com/imartinez/privateGPT/commit/c71ae7cee92463bbc5ea9c434eab9f99166e1363))
* **UI:** Select file to Query or Delete + Delete ALL ([#1612](https://github.com/imartinez/privateGPT/issues/1612)) ([aa13afd](https://github.com/imartinez/privateGPT/commit/aa13afde07122f2ddda3942f630e5cadc7e4e1ee))
### Bug Fixes
* Adding an LLM param to fix broken generator from llamacpp ([#1519](https://github.com/imartinez/privateGPT/issues/1519)) ([869233f](https://github.com/imartinez/privateGPT/commit/869233f0e4f03dc23e5fae43cf7cb55350afdee9))
* **deploy:** fix local and external dockerfiles ([fde2b94](https://github.com/imartinez/privateGPT/commit/fde2b942bc03688701ed563be6d7d597c75e4e4e))
* **docker:** docker broken copy ([#1419](https://github.com/imartinez/privateGPT/issues/1419)) ([059f358](https://github.com/imartinez/privateGPT/commit/059f35840adbc3fb93d847d6decf6da32d08670c))
* **docs:** Update quickstart doc and set version in pyproject.toml to 0.2.0 ([0a89d76](https://github.com/imartinez/privateGPT/commit/0a89d76cc5ed4371ffe8068858f23dfbb5e8cc37))
* minor bug in chat stream output - python error being serialized ([#1449](https://github.com/imartinez/privateGPT/issues/1449)) ([6191bcd](https://github.com/imartinez/privateGPT/commit/6191bcdbd6e92b6f4d5995967dc196c9348c5954))
* **settings:** correct yaml multiline string ([#1403](https://github.com/imartinez/privateGPT/issues/1403)) ([2564f8d](https://github.com/imartinez/privateGPT/commit/2564f8d2bb8c4332a6a0ab6d722a2ac15006b85f))
* **tests:** load the test settings only when running tests ([d3acd85](https://github.com/imartinez/privateGPT/commit/d3acd85fe34030f8cfd7daf50b30c534087bdf2b))
* **UI:** Updated ui.py. Frees up the CPU to not be bottlenecked. ([24fb80c](https://github.com/imartinez/privateGPT/commit/24fb80ca38f21910fe4fd81505d14960e9ed4faa))
## [0.2.0](https://github.com/imartinez/privateGPT/compare/v0.1.0...v0.2.0) (2023-12-10)
### Features
* **llm:** drop default_system_prompt ([#1385](https://github.com/imartinez/privateGPT/issues/1385)) ([a3ed14c](https://github.com/imartinez/privateGPT/commit/a3ed14c58f77351dbd5f8f2d7868d1642a44f017))
* **ui:** Allows User to Set System Prompt via "Additional Options" in Chat Interface ([#1353](https://github.com/imartinez/privateGPT/issues/1353)) ([145f3ec](https://github.com/imartinez/privateGPT/commit/145f3ec9f41c4def5abf4065a06fb0786e2d992a))
## [0.1.0](https://github.com/imartinez/privateGPT/compare/v0.0.2...v0.1.0) (2023-11-30)


@@ -5,6 +5,7 @@ RUN pip install pipx
RUN python3 -m pipx ensurepath
RUN pipx install poetry
ENV PATH="/root/.local/bin:$PATH"
ENV PATH=".venv/bin/:$PATH"
# https://python-poetry.org/docs/configuration/#virtualenvsin-project
ENV POETRY_VIRTUALENVS_IN_PROJECT=true
@@ -13,7 +14,7 @@ FROM base as dependencies
WORKDIR /home/worker/app
COPY pyproject.toml poetry.lock ./
RUN poetry install --with ui
RUN poetry install --extras "ui vector-stores-qdrant llms-ollama embeddings-ollama"
FROM base as app
@@ -29,8 +30,11 @@ RUN mkdir local_data; chown worker local_data
RUN mkdir models; chown worker models
COPY --chown=worker --from=dependencies /home/worker/app/.venv/ .venv
COPY --chown=worker private_gpt/ private_gpt
COPY --chown=worker docs/ docs
COPY --chown=worker fern/ fern
COPY --chown=worker *.yaml *.md ./
COPY --chown=worker scripts/ scripts
ENV PYTHONPATH="$PYTHONPATH:/private_gpt/"
USER worker
ENTRYPOINT .venv/bin/python -m private_gpt
ENTRYPOINT python -m private_gpt


@@ -7,6 +7,7 @@ RUN pip install pipx
RUN python3 -m pipx ensurepath
RUN pipx install poetry
ENV PATH="/root/.local/bin:$PATH"
ENV PATH=".venv/bin/:$PATH"
# Dependencies to build llama-cpp
RUN apt update && apt install -y \
@@ -23,8 +24,7 @@ FROM base as dependencies
WORKDIR /home/worker/app
COPY pyproject.toml poetry.lock ./
RUN poetry install --with local
RUN poetry install --with ui
RUN poetry install --extras "ui embeddings-huggingface llms-llama-cpp vector-stores-qdrant"
FROM base as app
@@ -40,8 +40,11 @@ RUN mkdir local_data; chown worker local_data
RUN mkdir models; chown worker models
COPY --chown=worker --from=dependencies /home/worker/app/.venv/ .venv
COPY --chown=worker private_gpt/ private_gpt
COPY --chown=worker docs/ docs
COPY --chown=worker fern/ fern
COPY --chown=worker *.yaml *.md ./
COPY --chown=worker scripts/ scripts
ENV PYTHONPATH="$PYTHONPATH:/private_gpt/"
USER worker
ENTRYPOINT .venv/bin/python -m private_gpt
ENTRYPOINT python -m private_gpt


@@ -51,5 +51,28 @@ api-docs:
ingest:
@poetry run python scripts/ingest_folder.py $(call args)
stats:
poetry run python scripts/utils.py stats
wipe:
poetry run python scripts/utils.py wipe
poetry run python scripts/utils.py wipe
setup:
poetry run python scripts/setup
list:
@echo "Available commands:"
@echo " test : Run tests using pytest"
@echo " test-coverage : Run tests with coverage report"
@echo " black : Check code format with black"
@echo " ruff : Check code with ruff"
@echo " format : Format code with black and ruff"
@echo " mypy : Run mypy for type checking"
@echo " check : Run format and mypy commands"
@echo " run : Run the application"
@echo " dev-windows : Run the application in development mode on Windows"
@echo " dev : Run the application in development mode"
@echo " api-docs : Generate API documentation"
@echo " ingest : Ingest data using specified script"
@echo " wipe : Wipe data using specified script"
@echo " setup : Setup the application"


@@ -4,7 +4,7 @@
[![Website](https://img.shields.io/website?up_message=check%20it&down_message=down&url=https%3A%2F%2Fdocs.privategpt.dev%2F&label=Documentation)](https://docs.privategpt.dev/)
[![Discord](https://img.shields.io/discord/1164200432894234644?logo=discord&label=PrivateGPT)](https://discord.gg/bK6mRVpErU)
[![X (formerly Twitter) Follow](https://img.shields.io/twitter/follow/PrivateGPT_AI)](https://twitter.com/PrivateGPT_AI)
[![X (formerly Twitter) Follow](https://img.shields.io/twitter/follow/ZylonPrivateGPT)](https://twitter.com/ZylonPrivateGPT)
> Install & usage docs: https://docs.privategpt.dev/
@@ -117,7 +117,7 @@ Don't know what to contribute? Here is the public
[Project Board](https://github.com/users/imartinez/projects/3) with several ideas.
Head over to Discord
#contributors channel and ask for write permissions on that Github project.
#contributors channel and ask for write permissions on that GitHub project.
## 💬 Community
Join the conversation around PrivateGPT on our:
@@ -158,4 +158,4 @@ This project has been strongly influenced and supported by other amazing project
[GPT4All](https://github.com/nomic-ai/gpt4all),
[LlamaCpp](https://github.com/ggerganov/llama.cpp),
[Chroma](https://www.trychroma.com/)
and [SentenceTransformers](https://www.sbert.net/).
and [SentenceTransformers](https://www.sbert.net/).


@@ -1,14 +1,16 @@
services:
private-gpt:
build:
dockerfile: Dockerfile.local
dockerfile: Dockerfile.external
volumes:
- ./local_data/:/home/worker/app/local_data
- ./models/:/home/worker/app/models
ports:
- 8001:8080
environment:
PORT: 8080
PGPT_PROFILES: docker
PGPT_MODE: local
PGPT_MODE: ollama
ollama:
image: ollama/ollama:latest
volumes:
- ./models:/root/.ollama


@@ -1,474 +0,0 @@
## Introduction
PrivateGPT provides an **API** containing all the building blocks required to build
**private, context-aware AI applications**. The API follows and extends the OpenAI API standard, and supports
both normal and streaming responses.
The API is divided into two logical blocks:
- High-level API, abstracting all the complexity of a RAG (Retrieval Augmented Generation) pipeline implementation:
- Ingestion of documents: internally managing document parsing, splitting, metadata extraction,
embedding generation and storage.
- Chat & Completions using context from ingested documents: abstracting the retrieval of context, the prompt
engineering and the response generation.
- Low-level API, allowing advanced users to implement their own complex pipelines:
- Embeddings generation: based on a piece of text.
- Contextual chunks retrieval: given a query, returns the most relevant chunks of text from the ingested
documents.
> A working **Gradio UI client** is provided to test the API, together with a set of
> useful tools such as bulk model download script, ingestion script, documents folder
> watch, etc.
## Quick Local Installation steps
The steps in the `Installation and Settings` section are explained in more detail and cover more
setup scenarios, but if you are looking for a quick setup guide, here it is:
```
# Clone the repo
git clone https://github.com/imartinez/privateGPT
cd privateGPT
# Install Python 3.11
pyenv install 3.11
pyenv local 3.11
# Install dependencies
poetry install --with ui,local
# Download Embedding and LLM models
poetry run python scripts/setup
# (Optional) For Mac with Metal GPU, enable it. Check the Installation and Settings section
# to learn how to enable GPU on other platforms
CMAKE_ARGS="-DLLAMA_METAL=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
# Run the local server
PGPT_PROFILES=local make run
# Note: on Mac with Metal you should see a ggml_metal_add_buffer log, stating the GPU is
# being used
# Navigate to the UI and try it out!
http://localhost:8001/
```
## Installation and Settings
### Base requirements to run PrivateGPT
* Git clone PrivateGPT repository, and navigate to it:
```
git clone https://github.com/imartinez/privateGPT
cd privateGPT
```
* Install Python 3.11. Ideally through a python version manager like `pyenv`.
Python 3.12
should work too. Earlier python versions are not supported.
* osx/linux: [pyenv](https://github.com/pyenv/pyenv)
* windows: [pyenv-win](https://github.com/pyenv-win/pyenv-win)
```
pyenv install 3.11
pyenv local 3.11
```
* Install [Poetry](https://python-poetry.org/docs/#installing-with-the-official-installer) for dependency management:
* Have a valid C++ compiler like gcc. See [Troubleshooting: C++ Compiler](#troubleshooting-c-compiler) for more details.
* Install `make` for scripts:
* osx: (Using homebrew): `brew install make`
* windows: (Using chocolatey) `choco install make`
### Install dependencies
Install the dependencies:
```bash
poetry install --with ui
```
Verify everything is working by running `make run` (or `poetry run python -m private_gpt`) and navigate to
http://localhost:8001. You should see a [Gradio UI](https://gradio.app/) **configured with a mock LLM** that will
echo back the input. Later we'll see how to configure a real LLM.
### Settings
> Note: the default settings of PrivateGPT work out-of-the-box for a 100% local setup. Skip this section if you just
> want to test PrivateGPT locally, and come back later to learn about more configuration options.
PrivateGPT is configured through *profiles* that are defined using yaml files, and selected through env variables.
The full list of configurable properties can be found in `settings.yaml`.
#### env var `PGPT_SETTINGS_FOLDER`
The location of the settings folder. Defaults to the root of the project.
Should contain the default `settings.yaml` and any other `settings-{profile}.yaml`.
#### env var `PGPT_PROFILES`
By default, the profile definition in `settings.yaml` is loaded.
Using this env var you can load additional profiles; the format is a comma-separated list of profile names.
This will merge `settings-{profile}.yaml` on top of the base settings file.
For example, `PGPT_PROFILES=local,cuda` will load `settings-local.yaml`
and `settings-cuda.yaml`; their contents will be merged, with
the properties of later profiles overriding values of earlier ones such as `settings.yaml`.
During testing, the `test` profile will be active along with the default, therefore the `settings-test.yaml`
file is required.
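For illustration, here is a minimal sketch of what an override profile might look like, assuming you only want to change the server port (the `server.port` key is taken from the example in the next section; any other keys you override must already exist in `settings.yaml`):
```yaml
# settings-cuda.yaml (hypothetical profile, loaded with PGPT_PROFILES=local,cuda)
server:
  port: 8002  # overrides the port defined in settings.yaml or earlier profiles
```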
#### Environment variables expansion
Configuration files can contain environment variables,
which will be expanded at runtime.
Expansion must follow the pattern `${VARIABLE_NAME:default_value}`.
For example, the following configuration will use the value of the `PORT`
environment variable or `8001` if it's not set.
Missing variables with no default will produce an error.
```yaml
server:
port: ${PORT:8001}
```
### Local LLM requirements
Install extra dependencies for local execution:
```bash
poetry install --with local
```
For PrivateGPT to run fully locally, GPU acceleration is required
(CPU execution is possible, but very slow); however,
typical MacBook laptops or Windows desktops with mid-range GPUs lack the VRAM to run
even the smallest LLMs. For that reason,
**local execution is only supported for models compatible with [llama.cpp](https://github.com/ggerganov/llama.cpp)**.
These two models are known to work well:
* https://huggingface.co/TheBloke/Llama-2-7B-chat-GGUF
* https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF (recommended)
To ease the installation process, use the `setup` script that will download both
the embedding and the LLM model and place them in the correct location (under `models` folder):
```bash
poetry run python scripts/setup
```
If you are ok with CPU execution, you can skip the rest of this section.
As stated before, llama.cpp is required and in
particular [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
is used.
> It's highly encouraged that you fully read llama-cpp and llama-cpp-python documentation relevant to your platform.
> Running into installation issues is very likely, and you'll need to troubleshoot them yourself.
#### Customizing low level parameters
Currently, not all the parameters of llama-cpp and llama-cpp-python are available in PrivateGPT's `settings.yaml` file. If you need to customize parameters such as the number of layers loaded into the GPU, you can change them in `private_gpt/components/llm/llm_component.py`. If you are getting an out-of-memory error, you might also try a smaller model or stick to the recommended models, instead of custom-tuning the parameters.
#### OSX GPU support
You will need to build [llama.cpp](https://github.com/ggerganov/llama.cpp) with
metal support. To do that run:
```bash
CMAKE_ARGS="-DLLAMA_METAL=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
```
#### Windows NVIDIA GPU support
Windows GPU support is done through CUDA.
Follow the instructions on the original [llama.cpp](https://github.com/ggerganov/llama.cpp) repo to install the required
dependencies.
Some tips to get it working with an NVIDIA card and CUDA (Tested on Windows 10 with CUDA 11.5 RTX 3070):
* Install latest VS2022 (and build tools) https://visualstudio.microsoft.com/vs/community/
* Install CUDA toolkit https://developer.nvidia.com/cuda-downloads
* Verify your installation is correct by running `nvcc --version` and `nvidia-smi`, ensure your CUDA version is up to
date and your GPU is detected.
* [Optional] Install CMake to troubleshoot building issues by compiling llama.cpp directly https://cmake.org/download/
If you have all required dependencies properly configured, running the
following PowerShell command should succeed.
```powershell
$env:CMAKE_ARGS='-DLLAMA_CUBLAS=on'; poetry run pip install --force-reinstall --no-cache-dir llama-cpp-python
```
If your installation was correct, you should see a message similar to the following the next
time you start the server, with `BLAS = 1`.
```
llama_new_context_with_model: total VRAM used: 4857.93 MB (model: 4095.05 MB, context: 762.87 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
```
Note that llama.cpp offloads matrix calculations to the GPU but the performance is
still hit heavily due to latency between CPU and GPU communication. You might need to tweak
batch sizes and other parameters to get the best performance for your particular system.
#### Linux NVIDIA GPU support and Windows-WSL
Linux GPU support is done through CUDA.
Follow the instructions on the original [llama.cpp](https://github.com/ggerganov/llama.cpp) repo to install the required
external
dependencies.
Some tips:
* Make sure you have an up-to-date C++ compiler
* Install CUDA toolkit https://developer.nvidia.com/cuda-downloads
* Verify your installation is correct by running `nvcc --version` and `nvidia-smi`, ensure your CUDA version is up to
date and your GPU is detected.
After that, running the following command in the repository will install llama.cpp with GPU support:
```bash
CMAKE_ARGS='-DLLAMA_CUBLAS=on' poetry run pip install --force-reinstall --no-cache-dir llama-cpp-python
```
If your installation was correct, you should see a message similar to the following the next
time you start the server, with `BLAS = 1`.
```
llama_new_context_with_model: total VRAM used: 4857.93 MB (model: 4095.05 MB, context: 762.87 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
```
#### Vectorstores
PrivateGPT supports [Chroma](https://www.trychroma.com/) and [Qdrant](https://qdrant.tech/) as vector store providers, with Chroma being the default.
To enable Qdrant, set the `vectorstore.database` property in the `settings.yaml` file to `qdrant` and install the `qdrant` extra.
```bash
poetry install --extras qdrant
```
By default Qdrant tries to connect to an instance at `http://localhost:3000`.
Qdrant settings can be configured by setting values to the `qdrant` property in the `settings.yaml` file.
The available configuration options are:
| Field | Description |
|--------------|-------------|
| location | If `:memory:` - use in-memory Qdrant instance.<br>If `str` - use it as a `url` parameter.|
| url | Either host or str of 'Optional[scheme], host, Optional[port], Optional[prefix]'.<br> Eg. `http://localhost:6333` |
| port | Port of the REST API interface. Default: `6333` |
| grpc_port | Port of the gRPC interface. Default: `6334` |
| prefer_grpc | If `true` - use gRPC interface whenever possible in custom methods. |
| https | If `true` - use HTTPS(SSL) protocol.|
| api_key | API key for authentication in Qdrant Cloud.|
| prefix | If set, add `prefix` to the REST URL path.<br>Example: `service/v1` will result in `http://localhost:6333/service/v1/{qdrant-endpoint}` for REST API.|
| timeout | Timeout for REST and gRPC API requests.<br>Default: 5.0 seconds for REST and unlimited for gRPC |
| host | Host name of Qdrant service. If url and host are not set, defaults to 'localhost'.|
| path | Persistence path for QdrantLocal. Eg. `local_data/private_gpt/qdrant`|
| force_disable_check_same_thread | Force disable check_same_thread for QdrantLocal sqlite connection.|
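For example, a hedged sketch of enabling Qdrant in `settings.yaml`, using only fields from the table above (the values shown are illustrative, not defaults):
```yaml
vectorstore:
  database: qdrant            # select Qdrant as the vector store provider
qdrant:
  url: http://localhost:6333  # REST endpoint of a running Qdrant instance
  prefer_grpc: true           # use the gRPC interface whenever possible
  timeout: 10.0               # timeout (seconds) for REST and gRPC requests
```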
#### Known issues and Troubleshooting
Execution of LLMs locally still has a lot of sharp edges, especially when running on non-Linux platforms.
You might encounter several issues:
* Performance: RAM or VRAM usage is very high; your computer might experience slowdowns or even crashes.
* GPU Virtualization on Windows and OSX: simply not possible with Docker Desktop; you have to run the server directly on
the host.
* Building errors: Some of PrivateGPT's dependencies need to build native code, and they might fail on some platforms.
Most likely you are missing some dev tools on your machine (updated C++ compiler, CUDA not on PATH, etc.).
If you encounter any of these issues, please open an issue and we'll try to help.
#### Troubleshooting: C++ Compiler
If you encounter an error while building a wheel during the `pip install` process, you may need to install a C++
compiler on your computer.
**For Windows 10/11**
To install a C++ compiler on Windows 10/11, follow these steps:
1. Install Visual Studio 2022.
2. Make sure the following components are selected:
* Universal Windows Platform development
* C++ CMake tools for Windows
3. Download the MinGW installer from the [MinGW website](https://sourceforge.net/projects/mingw/).
4. Run the installer and select the `gcc` component.
**For OSX**
1. Check if you have a C++ compiler installed; Xcode might have installed one for you. For example, try running `gcc`.
2. If not, you can install clang or gcc with Homebrew: `brew install gcc`
#### Troubleshooting: Mac Running Intel
When running a Mac with Intel hardware (not M1), you may run into _clang: error: the clang compiler does not support '
-march=native'_ during pip install.
If so, set your archflags during pip install, e.g.: _ARCHFLAGS="-arch x86_64" pip3 install -r requirements.txt_
## Running the Server
After following the installation steps you should be ready to go. Here are some common run setups:
### Running 100% locally
Make sure you have followed the *Local LLM requirements* section before moving on.
This command will start PrivateGPT using the `settings.yaml` (default profile) together with the `settings-local.yaml`
configuration files. By default, it will enable both the API and the Gradio UI. Run:
```
PGPT_PROFILES=local make run
```
or
```
PGPT_PROFILES=local poetry run python -m private_gpt
```
When the server is started it will print a log *Application startup complete*.
Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API
using Swagger UI.
### Local server using OpenAI as LLM
If you cannot run a local model (because you don't have a GPU, for example) or for testing purposes, you may
decide to run PrivateGPT using OpenAI as the LLM.
In order to do so, create a profile `settings-openai.yaml` with the following contents:
```yaml
llm:
mode: openai
openai:
api_key: <your_openai_api_key> # You could skip this configuration and use the OPENAI_API_KEY env var instead
```
And run PrivateGPT loading that profile you just created:
```PGPT_PROFILES=openai make run```
or
```PGPT_PROFILES=openai poetry run python -m private_gpt```
> Note this will still use the local Embeddings model, as it is ok to use it on a CPU.
> We'll support using OpenAI embeddings in a future release.
When the server is started it will print a log *Application startup complete*.
Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API.
You'll notice the speed and quality of response is higher, given you are using OpenAI's servers for the heavy
computations.
### Use AWS's Sagemaker
🚧 Under construction 🚧
## Gradio UI user manual
Gradio UI is a ready-to-use way of testing most of PrivateGPT's API functionalities.
![Gradio PrivateGPT](https://lh3.googleusercontent.com/drive-viewer/AK7aPaD_Hc-A8A9ooMe-hPgm_eImgsbxAjb__8nFYj8b_WwzvL1Gy90oAnp1DfhPaN6yGiEHCOXs0r77W1bYHtPzlVwbV7fMsA=s1600)
### Execution Modes
It has 3 modes of execution (you can select them in the top-left corner):
* Query Docs: uses the context from the
ingested documents to answer the questions posted in the chat. It also takes
into account previous chat messages as context.
* Makes use of `/chat/completions` API with `use_context=true` and no
`context_filter`.
* Search in Docs: fast search that returns the 4 most related text
chunks, together with their source document and page.
* Makes use of `/chunks` API with no `context_filter`, `limit=4` and
`prev_next_chunks=0`.
* LLM Chat: simple, non-contextual chat with the LLM. The ingested documents won't
be taken into account, only the previous messages.
* Makes use of `/chat/completions` API with `use_context=false`.
### Document Ingestion
Ingest documents by using the `Upload a File` button. You can check the progress of
the ingestion in the console logs of the server.
The list of ingested files is shown below the button.
If you want to delete the ingested documents, refer to *Reset Local documents
database* section in the documentation.
### Chat
Normal chat interface, self-explanatory ;)
You can check the actual prompt being passed to the LLM by looking at the logs of
the server. We'll add better observability in future releases.
## Deployment options
🚧 We are working on Dockerized deployment guidelines 🚧
## Observability
Basic logs are enabled using LlamaIndex
basic logging (for example ingestion progress or LLM prompts and answers).
🚧 We are working on improved Observability. 🚧
## Ingesting & Managing Documents
🚧 Document Update and Delete are still WIP. 🚧
The ingestion of documents can be done in different ways:
* Using the `/ingest` API
* Using the Gradio UI
* Using the Bulk Local Ingestion functionality (check next section)
### Bulk Local Ingestion
When you are running PrivateGPT in a fully local setup, you can ingest a complete folder for convenience (containing
pdf, text files, etc.)
and optionally watch changes on it with the command:
```bash
make ingest /path/to/folder -- --watch
```
To log the processed and failed files to an additional file, use:
```bash
make ingest /path/to/folder -- --watch --log-file /path/to/log/file.log
```
After ingestion is complete, you should be able to chat with your documents
by navigating to http://localhost:8001 and using the option `Query documents`,
or using the completions / chat API.
### Reset Local documents database
When running in a local setup, you can remove all ingested documents by simply
deleting all contents of `local_data` folder (except .gitignore).
To simplify this process, you can use the command:
```bash
make wipe
```
## API
As explained in the introduction, the API contains high-level APIs (ingestion and chat/completions) and low-level APIs
(embeddings and chunk retrieval). This section explains the different specific API calls.


@@ -1,22 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>PrivateGPT Docs</title>
<!-- needed for adaptive design -->
<meta name="viewport" content="width=device-width, initial-scale=1">
<link href="https://fonts.googleapis.com/css?family=Montserrat:300,400,700|Roboto:300,400,700" rel="stylesheet">
<link rel="shortcut icon" href="https://fastapi.tiangolo.com/img/favicon.png">
<!-- ReDoc doesn't change outer page styles -->
<style>
body {
margin: 0;
padding: 0;
}
</style>
</head>
<body>
<noscript> ReDoc requires Javascript to function. Please enable it to browse the documentation. </noscript>
<redoc spec-url="/openapi.json"></redoc>
<script src="https://cdn.jsdelivr.net/npm/redoc@next/bundles/redoc.standalone.js"></script>
</body>

Binary file not shown (2.6 KiB image, removed).

File diff suppressed because one or more lines are too long


@@ -30,15 +30,15 @@ navigation:
layout:
- section: Welcome
contents:
- page: Welcome
- page: Introduction
path: ./docs/pages/overview/welcome.mdx
- page: Quickstart
path: ./docs/pages/overview/quickstart.mdx
# How to install privateGPT, with FAQ and troubleshooting
- tab: installation
layout:
- section: Getting started
contents:
- page: Main Concepts
path: ./docs/pages/installation/concepts.mdx
- page: Installation
path: ./docs/pages/installation/installation.mdx
# Manual of privateGPT: how to use it and configure it
@@ -58,10 +58,14 @@ navigation:
contents:
- page: Vector Stores
path: ./docs/pages/manual/vectordb.mdx
- page: Node Stores
path: ./docs/pages/manual/nodestore.mdx
- section: Advanced Setup
contents:
- page: LLM Backends
path: ./docs/pages/manual/llms.mdx
- page: Reranking
path: ./docs/pages/manual/reranker.mdx
- section: User Interface
contents:
- page: User interface (Gradio) Manual
@@ -89,7 +93,7 @@ navigation:
# `type:primary` is always displayed at the most right side of the navbar
navbar-links:
- type: secondary
text: Github
text: GitHub
url: "https://github.com/imartinez/privateGPT"
- type: secondary
text: Contact us


@@ -1 +1,14 @@
# API Reference
The API is divided into two logical blocks:
1. High-level API, abstracting all the complexity of a RAG (Retrieval Augmented Generation) pipeline implementation:
- Ingestion of documents: internally managing document parsing, splitting, metadata extraction,
embedding generation and storage.
- Chat & Completions using context from ingested documents: abstracting the retrieval of context, the prompt
engineering and the response generation.
2. Low-level API, allowing advanced users to implement their own complex pipelines:
- Embeddings generation: based on a piece of text.
- Contextual chunks retrieval: given a query, returns the most relevant chunks of text from the ingested
documents.


@@ -8,14 +8,14 @@ The clients are kept up to date automatically, so we encourage you to use the la
<Cards>
<Card
title="Node.js/TypeScript"
title="Node.js/TypeScript - WIP"
icon="fa-brands fa-node"
href="https://github.com/imartinez/privateGPT-typescript"
/>
<Card
title="Python"
title="Python - Ready!"
icon="fa-brands fa-python"
href="https://github.com/imartinez/privateGPT-python"
href="https://github.com/imartinez/pgpt_python"
/>
<br />
</Cards>
@@ -24,12 +24,12 @@ The clients are kept up to date automatically, so we encourage you to use the la
<Cards>
<Card
title="Java"
title="Java - WIP"
icon="fa-brands fa-java"
href="https://github.com/imartinez/privateGPT-java"
/>
<Card
title="Go"
title="Go - WIP"
icon="fa-brands fa-golang"
href="https://github.com/imartinez/privateGPT-go"
/>


@@ -0,0 +1,60 @@
PrivateGPT is a service that wraps a set of AI RAG primitives in a comprehensive set of APIs providing a private, secure, customizable and easy to use GenAI development framework.
It uses FastAPI and LlamaIndex as its core frameworks. Those can be customized by changing the codebase itself.
It supports a variety of LLM providers, embeddings providers, and vector stores, both local and remote. Those can be easily changed without changing the codebase.
# Different Setups support
## Setup configurations available
You get to decide the setup for these 3 main components:
- LLM: the large language model provider used for inference. It can be local, or remote, or even OpenAI.
- Embeddings: the embeddings provider used to encode the input, the documents and the users' queries. Same as the LLM, it can be local, or remote, or even OpenAI.
- Vector store: the store used to index and retrieve the documents.
There is an extra component that can be enabled or disabled: the UI. It is a Gradio UI that allows you to interact with the API in a more user-friendly way.
### Setups and Dependencies
Your setup will be the combination of the different options available. You'll find recommended setups in the [installation](/installation) section.
PrivateGPT uses poetry to manage its dependencies. You can install the dependencies for the different setups by running `poetry install --extras "<extra1> <extra2>..."`.
Extras are the different options available for each component. For example, to install the dependencies for a local setup with UI and Qdrant as the vector database, Ollama as the LLM and HuggingFace as local embeddings, you would run:
`poetry install --extras "ui vector-stores-qdrant llms-ollama embeddings-huggingface"`.
Refer to the [installation](/installation) section for more details.
### Setups and Configuration
PrivateGPT uses yaml to define its configuration in files named `settings-<profile>.yaml`.
Different configuration files can be created in the root directory of the project.
PrivateGPT will load the configuration at startup from the profile specified in the `PGPT_PROFILES` environment variable.
For example, running:
```bash
PGPT_PROFILES=ollama make run
```
will load the configuration from `settings.yaml` and `settings-ollama.yaml`.
- `settings.yaml` is always loaded and contains the default configuration.
- `settings-ollama.yaml` is loaded if the `ollama` profile is specified in the `PGPT_PROFILES` environment variable. It can override configuration from the default `settings.yaml`, as sketched below.
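A minimal sketch of such an override file, assuming the Ollama profile only needs to switch the LLM and Embeddings modes (the exact keys and additional options are defined in the real `settings.yaml` and `settings-ollama.yaml` files and may differ):
```yaml
# settings-ollama.yaml -- illustrative override, merged on top of settings.yaml
llm:
  mode: ollama        # use the local Ollama instance for inference
embedding:
  mode: ollama        # use Ollama for the embeddings as well
```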
## About Fully Local Setups
In order to run PrivateGPT in a fully local setup, you will need to run the LLM, Embeddings and Vector Store locally.
### Vector stores
The vector stores supported (Qdrant, ChromaDB and Postgres) run locally by default.
### Embeddings
For local Embeddings there are two options:
* (Recommended) You can use the 'ollama' option in PrivateGPT, which will connect to your local Ollama instance. Ollama greatly simplifies the installation of local LLMs.
* You can use the 'embeddings-huggingface' option in PrivateGPT, which will use HuggingFace.
In order for the HuggingFace embeddings (the second option) to work, you need to download the embeddings model to the `models` folder. You can do so by running the `setup` script:
```bash
poetry run python scripts/setup
```
### LLM
For local LLM there are two options:
* (Recommended) You can use the 'ollama' option in PrivateGPT, which will connect to your local Ollama instance. Ollama greatly simplifies the installation of local LLMs.
* You can use the 'llms-llama-cpp' option in PrivateGPT, which will use LlamaCPP. It works great on Mac with Metal most of the time (it leverages the Metal GPU), but it can be tricky on certain Linux distributions and on Windows, depending on the GPU. In the installation document you'll find guides and troubleshooting.
In order for the LlamaCPP-powered LLM (the second option) to work, you need to download the LLM model to the `models` folder. You can do so by running the `setup` script:
```bash
poetry run python scripts/setup
```


@@ -1,8 +1,8 @@
## Installation and Settings
It is important that you review the Main Concepts before you start the installation process.
### Base requirements to run PrivateGPT
## Base requirements to run PrivateGPT
* Git clone PrivateGPT repository, and navigate to it:
* Clone PrivateGPT repository, and navigate to it:
```bash
git clone https://github.com/imartinez/privateGPT
@@ -10,7 +10,7 @@
```
* Install Python `3.11` (*if you do not have it already*). Ideally through a python version manager like `pyenv`.
Python 3.12 should work too. Earlier python versions are not supported.
Earlier python versions are not supported.
* osx/linux: [pyenv](https://github.com/pyenv/pyenv)
* windows: [pyenv-win](https://github.com/pyenv-win/pyenv-win)
@@ -21,93 +21,205 @@ pyenv local 3.11
* Install [Poetry](https://python-poetry.org/docs/#installing-with-the-official-installer) for dependency management:
* Have a valid C++ compiler like gcc. See [Troubleshooting: C++ Compiler](#troubleshooting-c-compiler) for more details.
* Install `make` for scripts:
* Install `make` to be able to run the different scripts:
* osx: (Using homebrew): `brew install make`
* windows: (Using chocolatey) `choco install make`
### Install dependencies
## Install and run your desired setup
Install the dependencies:
PrivateGPT allows you to customize the setup, from fully local to cloud-based, by deciding which modules to use.
Here are the different options available:
- LLM: "llama-cpp", "ollama", "sagemaker", "openai", "openailike", "azopenai"
- Embeddings: "huggingface", "openai", "sagemaker", "azopenai"
- Vector stores: "qdrant", "chroma", "postgres"
- UI: whether or not to enable UI (Gradio) or just go with the API
In order to only install the required dependencies, PrivateGPT offers different `extras` that can be combined during the installation process:
```bash
poetry install --with ui
poetry install --extras "<extra1> <extra2>..."
```
Verify everything is working by running `make run` (or `poetry run python -m private_gpt`) and navigate to
http://localhost:8001. You should see a [Gradio UI](https://gradio.app/) **configured with a mock LLM** that will
echo back the input. Below we'll see how to configure a real LLM.
Where `<extra>` can be any of the following:
### Settings
- ui: adds support for UI using Gradio
- llms-ollama: adds support for Ollama LLM, the easiest way to get a local LLM running, requires Ollama running locally
- llms-llama-cpp: adds support for local LLM using LlamaCPP - expect a messy installation process on some platforms
- llms-sagemaker: adds support for Amazon Sagemaker LLM, requires Sagemaker inference endpoints
- llms-openai: adds support for OpenAI LLM, requires OpenAI API key
- llms-openai-like: adds support for 3rd party LLM providers that are compatible with OpenAI's API
- llms-azopenai: adds support for Azure OpenAI LLM, requires Azure OpenAI inference endpoints
- embeddings-ollama: adds support for Ollama Embeddings, requires Ollama running locally
- embeddings-huggingface: adds support for local Embeddings using HuggingFace
- embeddings-sagemaker: adds support for Amazon Sagemaker Embeddings, requires Sagemaker inference endpoints
- embeddings-openai: adds support for OpenAI Embeddings, requires OpenAI API key
- embeddings-azopenai: adds support for Azure OpenAI Embeddings, requires Azure OpenAI inference endpoints
- vector-stores-qdrant: adds support for Qdrant vector store
- vector-stores-chroma: adds support for Chroma DB vector store
- vector-stores-postgres: adds support for Postgres vector store
<Callout intent="info">
The default settings of PrivateGPT should work out-of-the-box for a 100% local setup. **However**, as is, it runs exclusively on your CPU.
Skip this section if you just want to test PrivateGPT locally, and come back later to learn about more configuration options (and get better performance).
</Callout>
## Recommended Setups
<br />
These are just some examples of recommended setups. You can mix and match the different options to fit your needs.
You'll find more information in the Manual section of the documentation.
### Local LLM requirements
> **Important for Windows**: In the examples below on how to run PrivateGPT with `make run`, the `PGPT_PROFILES` env var is set inline following Unix command-line syntax (works on macOS and Linux).
If you are using Windows, you'll need to set the env var in a different way, for example:
Install extra dependencies for local execution:
```powershell
# Powershell
$env:PGPT_PROFILES="ollama"
make run
```
or
```cmd
# CMD
set PGPT_PROFILES=ollama
make run
```
### Local, Ollama-powered setup - RECOMMENDED
**The easiest way to run PrivateGPT fully locally** is to depend on Ollama for the LLM. Ollama makes local LLMs and Embeddings super easy to install and use, abstracting away the complexity of GPU support. It's the recommended setup for local development.
Go to [ollama.ai](https://ollama.ai/) and follow the instructions to install Ollama on your machine.
After the installation, make sure the Ollama desktop app is closed.
Install the models to be used; the default settings-ollama.yaml is configured to use the `mistral 7b` LLM (~4GB) and `nomic-embed-text` Embeddings (~275MB). Therefore:
```bash
poetry install --with local
ollama pull mistral
ollama pull nomic-embed-text
```
For PrivateGPT to run fully locally, GPU acceleration is required
(CPU execution is possible, but very slow). However,
typical MacBook laptops or Windows desktops with mid-range GPUs lack the VRAM to run
even the smallest LLMs. For that reason,
**local execution is only supported for models compatible with [llama.cpp](https://github.com/ggerganov/llama.cpp)**
Now, start the Ollama service (it will start a local inference server, serving both the LLM and the Embeddings):
```bash
ollama serve
```
These two models are known to work well:
Once done, on a different terminal, you can install PrivateGPT with the following command:
```bash
poetry install --extras "ui llms-ollama embeddings-ollama vector-stores-qdrant"
```
* https://huggingface.co/TheBloke/Llama-2-7B-chat-GGUF
* https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF (recommended)
Once installed, you can run PrivateGPT. Make sure you have a working Ollama running locally before running the following command.
To ease the installation process, use the `setup` script that will download both
the embedding and the LLM model and place them in the correct location (under the `models` folder):
```bash
PGPT_PROFILES=ollama make run
```
PrivateGPT will use the existing `settings-ollama.yaml` settings file, which is already configured to use Ollama LLM and Embeddings, and Qdrant. Review it and adapt it to your needs (different models, different Ollama port, etc.).
The UI will be available at http://localhost:8001
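For reference, the Ollama-related block of that file looks roughly like the sketch below (key names are taken from the Ollama settings used elsewhere in this document; values are illustrative, so treat the bundled `settings-ollama.yaml` as the authoritative source):
```yaml
llm:
  mode: ollama
embedding:
  mode: ollama
ollama:
  model: mistral                               # LLM served by Ollama
  embedding_model: nomic-embed-text            # embedding model served by Ollama
  api_base: http://localhost:11434             # Ollama's default address
  embedding_api_base: http://localhost:11434   # can point to a separate Ollama instance
```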
### Private, Sagemaker-powered setup
If you need more performance, you can run a version of PrivateGPT that relies on powerful AWS Sagemaker machines to serve the LLM and Embeddings.
You need to have access to Sagemaker inference endpoints for the LLM and/or the embeddings, and have AWS credentials properly configured.
Edit the `settings-sagemaker.yaml` file to include the correct Sagemaker endpoints.
Then, install PrivateGPT with the following command:
```bash
poetry install --extras "ui llms-sagemaker embeddings-sagemaker vector-stores-qdrant"
```
Once installed, you can run PrivateGPT. Make sure your Sagemaker endpoints are reachable and your AWS credentials are properly configured before running the following command.
```bash
PGPT_PROFILES=sagemaker make run
```
PrivateGPT will use the existing `settings-sagemaker.yaml` settings file, which is already configured to use Sagemaker LLM and Embeddings endpoints, and Qdrant.
The UI will be available at http://localhost:8001
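As a rough orientation, the Sagemaker block of `settings-sagemaker.yaml` looks like the sketch below. `embedding_endpoint_name` appears in the embedding component later in this diff; `llm_endpoint_name` is an assumed key name and may differ in your version of the file:
```yaml
llm:
  mode: sagemaker
embedding:
  mode: sagemaker
sagemaker:
  llm_endpoint_name: <your_llm_endpoint_name>              # assumed key name
  embedding_endpoint_name: <your_embedding_endpoint_name>  # used by the Sagemaker embedding component
```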
### Non-Private, OpenAI-powered test setup
If you want to test PrivateGPT with OpenAI's LLM and Embeddings (bear in mind that your data will be sent to OpenAI!), follow the steps below.
You need an OPENAI API key to run this setup.
Edit the `settings-openai.yaml` file to include the correct API key. Never commit it! It's a secret! As an alternative to editing `settings-openai.yaml`, you can just set the env var OPENAI_API_KEY.
Then, install PrivateGPT with the following command:
```bash
poetry install --extras "ui llms-openai embeddings-openai vector-stores-qdrant"
```
Once installed, you can run PrivateGPT.
```bash
PGPT_PROFILES=openai make run
```
PrivateGPT will use the existing `settings-openai.yaml` settings file, which is already configured to use OpenAI LLM and Embeddings endpoints, and Qdrant.
The UI will be available at http://localhost:8001
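For orientation, the relevant part of `settings-openai.yaml` is a small sketch like the one below (key names match the OpenAI settings shown later in this document; values are placeholders):
```yaml
llm:
  mode: openai
embedding:
  mode: openai
openai:
  api_key: <your_openai_api_key>   # or leave it out and set the OPENAI_API_KEY env var
  model: gpt-3.5-turbo             # optional; this is the documented default
```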
### Non-Private, Azure OpenAI-powered test setup
If you want to test PrivateGPT with Azure OpenAI's LLM and Embeddings (bear in mind that your data will be sent to Azure OpenAI!), follow the steps below.
You need to have access to Azure OpenAI inference endpoints for the LLM and / or the embeddings, and have Azure OpenAI credentials properly configured.
Edit the `settings-azopenai.yaml` file to include the correct Azure OpenAI endpoints.
Then, install PrivateGPT with the following command:
```bash
poetry install --extras "ui llms-azopenai embeddings-azopenai vector-stores-qdrant"
```
Once installed, you can run PrivateGPT.
```bash
PGPT_PROFILES=azopenai make run
```
PrivateGPT will use the existing `settings-azopenai.yaml` settings file, which is already configured to use Azure OpenAI LLM and Embeddings endpoints, and Qdrant.
The UI will be available at http://localhost:8001
### Local, Llama-CPP powered setup
If you want to run PrivateGPT fully locally without relying on Ollama, you can run the following command:
```bash
poetry install --extras "ui llms-llama-cpp embeddings-huggingface vector-stores-qdrant"
```
In order for local LLM and embeddings to work, you need to download the models to the `models` folder. You can do so by running the `setup` script:
```bash
poetry run python scripts/setup
```
If you are ok with CPU execution, you can skip the rest of this section.
Once installed, you can run PrivateGPT with the following command:
```bash
PGPT_PROFILES=local make run
```
PrivateGPT will load the existing `settings-local.yaml` file, which is already configured to use LlamaCPP LLM, HuggingFace embeddings and Qdrant.
The UI will be available at http://localhost:8001
#### Llama-CPP support
For PrivateGPT to run fully locally without Ollama, Llama.cpp is required and in
particular [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
is used.
You'll need to have a valid C++ compiler like gcc installed. See [Troubleshooting: C++ Compiler](#troubleshooting-c-compiler) for more details.
> It's highly encouraged that you fully read llama-cpp and llama-cpp-python documentation relevant to your platform.
> Running into installation issues is very likely, and you'll need to troubleshoot them yourself.
#### Customizing low level parameters
Currently, not all the parameters of `llama.cpp` and `llama-cpp-python` are exposed in PrivateGPT's `settings.yaml` file.
If you need to customize parameters such as the number of layers loaded into the GPU, you can change
them directly in `private_gpt/components/llm/llm_component.py`.
##### Available LLM config options
The `llm` section of the settings allows for the following configurations:
- `mode`: how to run your llm
- `max_new_tokens`: this lets you configure the number of new tokens the LLM will generate and add to the context window (by default Llama.cpp uses `256`)
Example:
```yaml
llm:
mode: local
max_new_tokens: 256
```
If you are getting an out-of-memory error, you might also try a smaller model or stick to the
recommended models, instead of custom tuning the parameters.
#### OSX GPU support
##### Llama-CPP OSX GPU support
You will need to build [llama.cpp](https://github.com/ggerganov/llama.cpp) with metal support.
@@ -127,7 +239,7 @@ More information is available in the documentation of the libraries themselves:
* [llama-cpp-python's documentation](https://llama-cpp-python.readthedocs.io/en/latest/#installation-with-hardware-acceleration)
* [llama.cpp](https://github.com/ggerganov/llama.cpp#build)
#### Windows NVIDIA GPU support
##### Llama-CPP Windows NVIDIA GPU support
Windows GPU support is done through CUDA.
Follow the instructions on the original [llama.cpp](https://github.com/ggerganov/llama.cpp) repo to install the required
@@ -160,7 +272,7 @@ Note that llama.cpp offloads matrix calculations to the GPU but the performance
still hit heavily due to latency between CPU and GPU communication. You might need to tweak
batch sizes and other parameters to get the best performance for your particular system.
#### Linux NVIDIA GPU support and Windows-WSL
##### Llama-CPP Linux NVIDIA GPU support and Windows-WSL
Linux GPU support is done through CUDA.
Follow the instructions on the original [llama.cpp](https://github.com/ggerganov/llama.cpp) repo to install the required
@@ -188,7 +300,41 @@ llama_new_context_with_model: total VRAM used: 4857.93 MB (model: 4095.05 MB, co
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
```
### Known issues and Troubleshooting
##### Llama-CPP Linux AMD GPU support
Linux GPU support is done through ROCm.
Some tips:
* Install ROCm from [quick-start install guide](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html)
* [Install PyTorch for ROCm](https://rocm.docs.amd.com/projects/radeon/en/latest/docs/install/install-pytorch.html)
```bash
wget https://repo.radeon.com/rocm/manylinux/rocm-rel-6.0/torch-2.1.1%2Brocm6.0-cp311-cp311-linux_x86_64.whl
poetry run pip install --force-reinstall --no-cache-dir torch-2.1.1+rocm6.0-cp311-cp311-linux_x86_64.whl
```
* Install bitsandbytes for ROCm
```bash
PYTORCH_ROCM_ARCH=gfx900,gfx906,gfx908,gfx90a,gfx1030,gfx1100,gfx1101,gfx940,gfx941,gfx942
BITSANDBYTES_VERSION=62353b0200b8557026c176e74ac48b84b953a854
git clone https://github.com/arlo-phoenix/bitsandbytes-rocm-5.6
cd bitsandbytes-rocm-5.6
git checkout ${BITSANDBYTES_VERSION}
make hip ROCM_TARGET=${PYTORCH_ROCM_ARCH} ROCM_HOME=/opt/rocm/
pip install . --extra-index-url https://download.pytorch.org/whl/nightly
```
After that, running the following command in the repository will install llama.cpp with GPU support:
```bash
LLAMA_CPP_PYTHON_VERSION=0.2.56
DAMDGPU_TARGETS="gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100;gfx1101;gfx940;gfx941;gfx942"
CMAKE_ARGS="-DLLAMA_HIPBLAS=ON -DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/clang -DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++ -DAMDGPU_TARGETS=${DAMDGPU_TARGETS}" poetry run pip install --force-reinstall --no-cache-dir llama-cpp-python==${LLAMA_CPP_PYTHON_VERSION}
```
If your installation was correct, you should see a message similar to the following (with `BLAS = 1`) the next time you start the server:
```
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
```
##### Llama-CPP Known issues and Troubleshooting
Execution of LLMs locally still has a lot of sharp edges, especially when running on non-Linux platforms.
You might encounter several issues:
@@ -205,7 +351,7 @@ If, during your installation, something does not go as planned, retry in *verbos
For example, when installing packages with `pip install`, you can add the option `-vvv` to show the details of the installation.
#### Troubleshooting: C++ Compiler
##### Llama-CPP Troubleshooting: C++ Compiler
If you encounter an error while building a wheel during the `pip install` process, you may need to install a C++
compiler on your computer.
@@ -227,9 +373,9 @@ To install a C++ compiler on Windows 10/11, follow these steps:
Store and search for Xcode and install it. **Or** you can install the command line tools by running `xcode-select --install`.
2. If not, you can install clang or gcc with homebrew `brew install gcc`
#### Troubleshooting: Mac Running Intel
##### Llama-CPP Troubleshooting: Mac Running Intel
When running a Mac with Intel hardware (not M1), you may run into _clang: error: the clang compiler does not support '
-march=native'_ during pip install.
If so, set your archflags during pip install, e.g.: _ARCHFLAGS="-arch x86_64" pip3 install -r requirements.txt_

View File

@@ -62,6 +62,7 @@ The following ingestion mode exist:
* `simple`: historic behavior, ingest one document at a time, sequentially
* `batch`: read, parse, and embed multiple documents using batches (batch read, and then batch parse, and then batch embed)
* `parallel`: read, parse, and embed multiple documents in parallel. This is the fastest ingestion mode for local setup.
* `pipeline`: alternative to `parallel` that aims to keep the embedding workers as busy as possible.
To change the ingestion mode, you can use the `embedding.ingest_mode` configuration value. The default value is `simple`.
To configure the number of workers used for parallel or batched ingestion, you can use the `embedding.count_workers` configuration value.
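A minimal sketch of these two settings in `settings.yaml` (values are illustrative):
```yaml
embedding:
  ingest_mode: pipeline   # one of: simple | batch | parallel | pipeline
  count_workers: 4        # workers used by the batch, parallel and pipeline modes
```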

View File

@@ -25,6 +25,30 @@ When the server is started it will print a log *Application startup complete*.
Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API
using Swagger UI.
#### Customizing low level parameters
Currently, not all the parameters of `llama.cpp` and `llama-cpp-python` are exposed in PrivateGPT's `settings.yaml` file.
If you need to customize parameters such as the number of layers loaded into the GPU, you can change
them directly in `private_gpt/components/llm/llm_component.py`.
##### Available LLM config options
The `llm` section of the settings allows for the following configurations:
- `mode`: how to run your llm
- `max_new_tokens`: this lets you configure the number of new tokens the LLM will generate and add to the context window (by default Llama.cpp uses `256`)
Example:
```yaml
llm:
mode: local
max_new_tokens: 256
```
If you are getting an out-of-memory error, you might also try a smaller model or stick to the
recommended models, instead of custom tuning the parameters.
### Using OpenAI
If you cannot run a local model (because you don't have a GPU, for example) or for testing purposes, you may
@@ -37,7 +61,10 @@ llm:
mode: openai
openai:
api_base: <openai-api-base-url> # Defaults to https://api.openai.com/v1
api_key: <your_openai_api_key> # You could skip this configuration and use the OPENAI_API_KEY env var instead
model: <openai_model_to_use> # Optional model to use. Default is "gpt-3.5-turbo"
# Note: Open AI Models are listed here: https://platform.openai.com/docs/models
```
And run PrivateGPT loading that profile you just created:
@@ -53,6 +80,61 @@ Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:80
You'll notice the speed and quality of response is higher, given you are using OpenAI's servers for the heavy
computations.
### Using OpenAI compatible API
Many tools, including [LocalAI](https://localai.io/) and [vLLM](https://docs.vllm.ai/en/latest/),
support serving local models with an OpenAI compatible API. Even when overriding the `api_base`,
using the `openai` mode doesn't allow you to use custom models. Instead, you should use the `openailike` mode:
```yaml
llm:
mode: openailike
```
This mode uses the same settings as the `openai` mode.
As an example, you can follow the [vLLM quickstart guide](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server)
to run an OpenAI compatible server. Then, you can run PrivateGPT using the `settings-vllm.yaml` profile:
`PGPT_PROFILES=vllm make run`
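For illustration, a hypothetical `settings-vllm.yaml` could reuse the `openai` settings block with the `openailike` mode, pointing `api_base` at the local vLLM server. The port and the dummy key below are assumptions; check the bundled profile for the actual values:
```yaml
llm:
  mode: openailike
openai:
  api_base: http://localhost:8000/v1   # assumed vLLM OpenAI-compatible endpoint
  api_key: EMPTY                       # dummy value; a local vLLM server typically ignores it
  model: <model_served_by_vllm>
```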
### Using Azure OpenAI
If you cannot run a local model (because you don't have a GPU, for example) or for testing purposes, you may
decide to run PrivateGPT using Azure OpenAI as the LLM and Embeddings model.
In order to do so, create a profile `settings-azopenai.yaml` with the following contents:
```yaml
llm:
mode: azopenai
embedding:
mode: azopenai
azopenai:
api_key: <your_azopenai_api_key> # You could skip this configuration and use the AZ_OPENAI_API_KEY env var instead
azure_endpoint: <your_azopenai_endpoint> # You could skip this configuration and use the AZ_OPENAI_ENDPOINT env var instead
api_version: <api_version> # The API version to use. Default is "2023-05-15"
embedding_deployment_name: <your_embedding_deployment_name> # You could skip this configuration and use the AZ_OPENAI_EMBEDDING_DEPLOYMENT_NAME env var instead
embedding_model: <openai_embeddings_to_use> # Optional model to use. Default is "text-embedding-ada-002"
llm_deployment_name: <your_model_deployment_name> # You could skip this configuration and use the AZ_OPENAI_LLM_DEPLOYMENT_NAME env var instead
llm_model: <openai_model_to_use> # Optional model to use. Default is "gpt-35-turbo"
```
And run PrivateGPT loading that profile you just created:
`PGPT_PROFILES=azopenai make run`
or
`PGPT_PROFILES=azopenai poetry run python -m private_gpt`
When the server is started it will print a log *Application startup complete*.
Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API.
You'll notice the speed and quality of response is higher, given you are using Azure OpenAI's servers for the heavy
computations.
### Using AWS Sagemaker
For a fully private & performant setup, you can choose to have both your LLM and Embeddings model deployed using Sagemaker.
@@ -80,4 +162,34 @@ or
`PGPT_PROFILES=sagemaker poetry run python -m private_gpt`
When the server is started it will print a log *Application startup complete*.
Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API.
### Using Ollama
Another option for a fully private setup is using [Ollama](https://ollama.ai/).
Note: how to deploy Ollama and pull models onto it is out of the scope of this documentation.
In order to do so, create a profile `settings-ollama.yaml` with the following contents:
```yaml
llm:
mode: ollama
ollama:
model: <ollama_model_to_use> # Required Model to use.
# Note: Ollama Models are listed here: https://ollama.ai/library
# Be sure to pull the model to your Ollama server
api_base: <ollama-api-base-url> # Defaults to http://localhost:11434
```
And run PrivateGPT loading that profile you just created:
`PGPT_PROFILES=ollama make run`
or
`PGPT_PROFILES=ollama poetry run python -m private_gpt`
When the server is started it will print a log *Application startup complete*.
Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API.

View File

@@ -0,0 +1,66 @@
## NodeStores
PrivateGPT supports **Simple** and [Postgres](https://www.postgresql.org/) providers. Simple being the default.
In order to select one or the other, set the `nodestore.database` property in the `settings.yaml` file to `simple` or `postgres`.
```yaml
nodestore:
database: simple
```
### Simple Document Store
Setting up the simple document store: persist data with in-memory and disk storage.
Enabling the simple document store is an excellent choice for small projects or proofs of concept where you need to persist data while maintaining minimal setup complexity. To get started, set the nodestore.database property in your settings.yaml file as follows:
```yaml
nodestore:
database: simple
```
The beauty of the simple document store is its flexibility and ease of implementation. It provides a solid foundation for managing and retrieving data without the need for complex setup or configuration. The combination of in-memory processing and disk persistence ensures that you can efficiently handle small to medium-sized datasets while maintaining data consistency across runs.
### Postgres Document Store
To enable Postgres, set the `nodestore.database` property in the `settings.yaml` file to `postgres` and install the `storage-nodestore-postgres` extra. Note: Vector Embeddings Storage in Postgres is configured separately
```bash
poetry install --extras storage-nodestore-postgres
```
The available configuration options are:
| Field | Description |
|---------------|-----------------------------------------------------------|
| **host** | The server hosting the Postgres database. Default is `localhost` |
| **port** | The port on which the Postgres database is accessible. Default is `5432` |
| **database** | The specific database to connect to. Default is `postgres` |
| **user** | The username for database access. Default is `postgres` |
| **password** | The password for database access. (Required) |
| **schema_name** | The database schema to use. Default is `private_gpt` |
For example:
```yaml
nodestore:
database: postgres
postgres:
host: localhost
port: 5432
database: postgres
user: postgres
password: <PASSWORD>
schema_name: private_gpt
```
Given the above configuration, two PostgreSQL tables will be created upon successful connection: one for storing metadata related to the index and another for the document data itself.
```
postgres=# \dt private_gpt.*
List of relations
Schema | Name | Type | Owner
-------------+-----------------+-------+--------------
private_gpt | data_docstore | table | postgres
private_gpt | data_indexstore | table | postgres
postgres=#
```

View File

@@ -0,0 +1,36 @@
## Enhancing Response Quality with Reranking
PrivateGPT offers a reranking feature aimed at optimizing response generation by filtering out irrelevant documents, potentially leading to faster response times and enhanced relevance of answers generated by the LLM.
### Enabling Reranking
Document reranking can significantly improve the efficiency and quality of the responses by pre-selecting the most relevant documents before generating an answer. To leverage this feature, ensure that it is enabled in the RAG settings and consider adjusting the parameters to best fit your use case.
#### Additional Requirements
Before enabling reranking, you must install additional dependencies:
```bash
poetry install --extras rerank-sentence-transformers
```
This command installs the dependencies for the cross-encoder reranker from sentence-transformers, which is currently the only reranking method supported by PrivateGPT.
#### Configuration
To enable and configure reranking, adjust the `rag` section within the `settings.yaml` file. Here are the key settings to consider:
- `similarity_top_k`: Determines the number of documents to initially retrieve and consider for reranking. This value should be larger than `top_n`.
- `rerank`:
- `enabled`: Set to `true` to activate the reranking feature.
- `top_n`: Specifies the number of documents to use in the final answer generation process, chosen from the top-ranked documents provided by `similarity_top_k`.
Example configuration snippet:
```yaml
rag:
similarity_top_k: 10 # Number of documents to retrieve and consider for reranking
rerank:
enabled: true
top_n: 3 # Number of top-ranked documents to use for generating the answer
```

View File

@@ -35,5 +35,32 @@ database* section in the documentation.
Normal chat interface, self-explanatory ;)
You can check the actual prompt being passed to the LLM by looking at the logs of
the server. We'll add better observability in future releases.
#### System Prompt
You can view and change the system prompt being passed to the LLM by clicking "Additional Inputs"
in the chat interface. The system prompt is also logged on the server.
By default, the `Query Docs` mode uses the setting value `ui.default_query_system_prompt`.
The `LLM Chat` mode attempts to use the optional settings value `ui.default_chat_system_prompt`.
If no system prompt is entered, the UI will display the default system prompt being used
for the active mode.
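A sketch of how these two settings might look in `settings.yaml` (the prompt texts here are just examples, not the shipped defaults):
```yaml
ui:
  default_query_system_prompt: >
    You can only answer questions about the provided context.
  default_chat_system_prompt: >
    You are a helpful, respectful and honest assistant.
```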
##### System Prompt Examples:
The system prompt can effectively give your chatbot specialized roles, producing results tailored to the prompt
you have given the model. Examples of system prompts can be found
[here](https://www.w3schools.com/gen_ai/chatgpt-3-5/chatgpt-3-5_roles.php).
Some interesting examples to try include:
* You are -X-. You have all the knowledge and personality of -X-. Answer as if you were -X- using
their manner of speaking and vocabulary.
* Example: You are Shakespeare. You have all the knowledge and personality of Shakespeare.
Answer as if you were Shakespeare using their manner of speaking and vocabulary.
* You are an expert (at) -role-. Answer all questions using your expertise on -specific domain topic-.
* Example: You are an expert software engineer. Answer all questions using your expertise on Python.
* You are a -role- bot, respond with -response criteria needed-. If no -response criteria- is needed,
respond with -alternate response-.
* Example: You are a grammar checking bot, respond with any grammatical corrections needed. If no corrections
are needed, respond with "verified".

View File

@@ -1,7 +1,7 @@
## Vectorstores
PrivateGPT supports [Qdrant](https://qdrant.tech/) and [Chroma](https://www.trychroma.com/) as vectorstore providers. Qdrant being the default.
PrivateGPT supports [Qdrant](https://qdrant.tech/), [Chroma](https://www.trychroma.com/) and [PGVector](https://github.com/pgvector/pgvector) as vectorstore providers. Qdrant being the default.
In order to select one or the other, set the `vectorstore.database` property in the `settings.yaml` file to `qdrant` or `chroma`.
In order to select one or the other, set the `vectorstore.database` property in the `settings.yaml` file to `qdrant`, `chroma` or `postgres`.
```yaml
vectorstore:
@@ -47,4 +47,57 @@ To enable Chroma, set the `vectorstore.database` property in the `settings.yaml`
poetry install --extras chroma
```
By default `chroma` will use a disk-based database stored in local_data_path / "chroma_db" (with local_data_path defined in settings.yaml).
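For completeness, a minimal sketch of enabling Chroma in `settings.yaml`:
```yaml
vectorstore:
  database: chroma
```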
### PGVector
To use the PGVector store, a [PostgreSQL](https://www.postgresql.org/) database with the PGVector extension is required.
To enable PGVector, set the `vectorstore.database` property in the `settings.yaml` file to `postgres` and install the `vector-stores-postgres` extra.
```bash
poetry install --extras vector-stores-postgres
```
PGVector settings can be configured by setting values to the `postgres` property in the `settings.yaml` file.
The available configuration options are:
| Field | Description |
|---------------|-----------------------------------------------------------|
| **host** | The server hosting the Postgres database. Default is `localhost` |
| **port** | The port on which the Postgres database is accessible. Default is `5432` |
| **database** | The specific database to connect to. Default is `postgres` |
| **user** | The username for database access. Default is `postgres` |
| **password** | The password for database access. (Required) |
| **schema_name** | The database schema to use. Default is `private_gpt` |
For example:
```yaml
vectorstore:
database: postgres
postgres:
host: localhost
port: 5432
database: postgres
user: postgres
password: <PASSWORD>
schema_name: private_gpt
```
The following table will be created in the database
```
postgres=# \d private_gpt.data_embeddings
Table "private_gpt.data_embeddings"
Column | Type | Collation | Nullable | Default
-----------+-------------------+-----------+----------+---------------------------------------------------------
id | bigint | | not null | nextval('private_gpt.data_embeddings_id_seq'::regclass)
text | character varying | | not null |
metadata_ | json | | |
node_id | character varying | | |
embedding | vector(768) | | |
Indexes:
"data_embeddings_pkey" PRIMARY KEY, btree (id)
postgres=#
```
The dimension of the embedding column is set based on the `embedding.embed_dim` value. If the embedding model changes, this table may need to be dropped and recreated to avoid a dimension mismatch.
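A sketch of keeping the column dimension in sync with the embedding model via `embedding.embed_dim` (768 matches the `vector(768)` column above; the right value depends on your embedding model):
```yaml
embedding:
  embed_dim: 768   # must match the dimensionality of the configured embedding model
```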

View File

@@ -1,21 +0,0 @@
## Local Installation steps
The steps in [Installation](/installation) section are better explained and cover more
setup scenarios (macOS, Windows, Linux).
But if you like one-liners, have python3.11 installed, and you are running a UNIX (macOS or Linux)
system, you can get up and running on CPU in a few lines:
```bash
git clone https://github.com/imartinez/privateGPT && cd privateGPT && \
python3.11 -m venv .venv && source .venv/bin/activate && \
pip install --upgrade pip poetry && poetry install --with ui,local && ./scripts/setup
# Launch the privateGPT API server **and** the gradio UI
python3.11 -m private_gpt
# In another terminal, create a new browser window on your private GPT!
open http://127.0.0.1:8001/
```
Is the above not working or too slow? Do **you want to run it on GPU(s)**?
Please check the more detailed [installation guide](/installation).

View File

@@ -1,20 +1,19 @@
## Introduction 👋
PrivateGPT provides an **API** containing all the building blocks required to
build **private, context-aware AI applications**.
The API follows and extends the OpenAI API standard, and supports both normal and streaming responses.
That means that, if you can use OpenAI API in one of your tools, you can use your own PrivateGPT API instead,
with no code changes, **and for free** if you are running privateGPT in `local` mode.
Looking for the installation quickstart? [Quickstart installation guide for Linux and macOS](/overview/welcome/quickstart).
Do you want to install it on Windows? Or do you want to take full advantage of your hardware for better performances?
The installation guide will help you in the [Installation section](/installation).
with no code changes, **and for free** if you are running privateGPT in a `local` setup.
Get started by understanding the [Main Concepts and Installation](/installation) and then dive into the [API Reference](/api-reference).
## Frequently Visited Resources
<Cards>
<Card
title="Main Concepts"
icon="fa-solid fa-lines-leaning"
href="/installation"
/>
<Card
title="API Reference"
icon="fa-solid fa-code"
@@ -23,7 +22,7 @@ The installation guide will help you in the [Installation section](/installation
<Card
title="Twitter"
icon="fa-brands fa-twitter"
href="https://twitter.com/PrivateGPT_AI"
href="https://twitter.com/ZylonPrivateGPT"
/>
<Card
title="Discord Server"
@@ -32,20 +31,8 @@ The installation guide will help you in the [Installation section](/installation
/>
</Cards>
## API Organization
<br />
The API is divided in two logical blocks:
1. High-level API, abstracting all the complexity of a RAG (Retrieval Augmented Generation) pipeline implementation:
- Ingestion of documents: internally managing document parsing, splitting, metadata extraction,
embedding generation and storage.
- Chat & Completions using context from ingested documents: abstracting the retrieval of context, the prompt
engineering and the response generation.
2. Low-level API, allowing advanced users to implement their own complex pipelines:
- Embeddings generation: based on a piece of text.
- Contextual chunks retrieval: given a query, returns the most relevant chunks of text from the ingested
documents.
<Callout intent = "info">
A working **Gradio UI client** is provided to test the API, together with a set of useful tools such as bulk

View File

@@ -24,7 +24,7 @@ user: {{ user_message }}
assistant: {{ assistant_message }}
```
And the "`tag`" style looks like this:
The "`tag`" style looks like this:
```text
<|system|>: {{ system_prompt }}
@@ -32,7 +32,23 @@ And the "`tag`" style looks like this:
<|assistant|>: {{ assistant_message }}
```
Some LLMs will not understand this prompt style, and will not work (returning nothing).
The "`mistral`" style looks like this:
```text
<s>[INST] You are an AI assistant. [/INST]</s>[INST] Hello, how are you doing? [/INST]
```
The "`chatml`" style looks like this:
```text
<|im_start|>system
{{ system_prompt }}<|im_end|>
<|im_start|>user
{{ user_message }}<|im_end|>
<|im_start|>assistant
{{ assistant_message }}
```
Some LLMs will not understand these prompt styles, and will not work (returning nothing).
You can try to change the prompt style to `default` (or `tag`) in the settings, and it will
change the way the messages are formatted to be passed to the LLM.
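For example, switching the prompt style in the `local` settings block (shown in the examples below; the section name may differ in newer settings layouts) could look like:
```yaml
local:
  prompt_style: "tag"   # or "default", "llama2", "mistral", "chatml"
```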
@@ -92,4 +108,14 @@ local:
llm_hf_model_file: godzilla2-70b.Q4_K_M.gguf
embedding_hf_model_name: BAAI/bge-large-en
prompt_style: "llama2"
```
### German speaking model
`settings-de.yaml`:
```yml
local:
llm_hf_repo_id: TheBloke/em_german_leo_mistral-GGUF
llm_hf_model_file: em_german_leo_mistral.Q4_K_M.gguf
embedding_hf_model_name: T-Systems-onsite/german-roberta-sentence-transformer-v2
#llama, default or tag
prompt_style: "default"
```

View File

@@ -1,4 +1,4 @@
{
"organization": "privategpt",
"version": "0.15.3"
"version": "0.19.10"
}

View File

@@ -1,20 +1,8 @@
{
"openapi": "3.1.0",
"info": {
"title": "PrivateGPT",
"summary": "PrivateGPT is a production-ready AI project that allows you to ask questions to your documents using the power of Large Language Models (LLMs), even in scenarios without Internet connection. 100% private, no data leaves your execution environment at any point.",
"description": "",
"contact": {
"url": "https://github.com/imartinez/privateGPT"
},
"license": {
"name": "Apache 2.0",
"url": "https://www.apache.org/licenses/LICENSE-2.0.html"
},
"version": "0.1.0",
"x-logo": {
"url": "https://lh3.googleusercontent.com/drive-viewer/AK7aPaD_iNlMoTquOBsw4boh4tIYxyEuhz6EtEs8nzq3yNkNAK00xGjE1KUCmPJSk3TYOjcs6tReG6w_cLu1S7L_gPgT9z52iw=s2560"
}
"title": "FastAPI",
"version": "0.1.0"
},
"paths": {
"/v1/completions": {
@@ -56,6 +44,15 @@
}
}
}
},
"x-fern-streaming": {
"stream-condition": "stream",
"response": {
"$ref": "#/components/schemas/OpenAICompletion"
},
"response-stream": {
"$ref": "#/components/schemas/OpenAICompletion"
}
}
}
},
@@ -65,7 +62,7 @@
"Contextual Completions"
],
"summary": "Chat Completion",
"description": "Given a list of messages comprising a conversation, return a response.\n\nOptionally include a `system_prompt` to influence the way the LLM answers.\n\nIf `use_context` is set to `true`, the model will use context coming\nfrom the ingested documents to create the response. The documents being used can\nbe filtered using the `context_filter` and passing the document IDs to be used.\nIngested documents IDs can be found using `/ingest/list` endpoint. If you want\nall ingested documents to be used, remove `context_filter` altogether.\n\nWhen using `'include_sources': true`, the API will return the source Chunks used\nto create the response, which come from the context provided.\n\nWhen using `'stream': true`, the API will return data chunks following [OpenAI's\nstreaming model](https://platform.openai.com/docs/api-reference/chat/streaming):\n```\n{\"id\":\"12345\",\"object\":\"completion.chunk\",\"created\":1694268190,\n\"model\":\"private-gpt\",\"choices\":[{\"index\":0,\"delta\":{\"content\":\"Hello\"},\n\"finish_reason\":null}]}\n```",
"description": "Given a list of messages comprising a conversation, return a response.\n\nOptionally include an initial `role: system` message to influence the way\nthe LLM answers.\n\nIf `use_context` is set to `true`, the model will use context coming\nfrom the ingested documents to create the response. The documents being used can\nbe filtered using the `context_filter` and passing the document IDs to be used.\nIngested documents IDs can be found using `/ingest/list` endpoint. If you want\nall ingested documents to be used, remove `context_filter` altogether.\n\nWhen using `'include_sources': true`, the API will return the source Chunks used\nto create the response, which come from the context provided.\n\nWhen using `'stream': true`, the API will return data chunks following [OpenAI's\nstreaming model](https://platform.openai.com/docs/api-reference/chat/streaming):\n```\n{\"id\":\"12345\",\"object\":\"completion.chunk\",\"created\":1694268190,\n\"model\":\"private-gpt\",\"choices\":[{\"index\":0,\"delta\":{\"content\":\"Hello\"},\n\"finish_reason\":null}]}\n```",
"operationId": "chat_completion_v1_chat_completions_post",
"requestBody": {
"content": {
@@ -98,6 +95,15 @@
}
}
}
},
"x-fern-streaming": {
"stream-condition": "stream",
"response": {
"$ref": "#/components/schemas/OpenAICompletion"
},
"response-stream": {
"$ref": "#/components/schemas/OpenAICompletion"
}
}
}
},
@@ -149,7 +155,7 @@
"Ingestion"
],
"summary": "Ingest",
"description": "Ingests and processes a file, storing its chunks to be used as context.\n\nThe context obtained from files is later used in\n`/chat/completions`, `/completions`, and `/chunks` APIs.\n\nMost common document\nformats are supported, but you may be prompted to install an extra dependency to\nmanage a specific file type.\n\nA file can generate different Documents (for example a PDF generates one Document\nper page). All Documents IDs are returned in the response, together with the\nextracted Metadata (which is later used to improve context retrieval). Those IDs\ncan be used to filter the context used to create responses in\n`/chat/completions`, `/completions`, and `/chunks` APIs.",
"description": "Ingests and processes a file.\n\nDeprecated. Use ingest/file instead.",
"operationId": "ingest_v1_ingest_post",
"requestBody": {
"content": {
@@ -161,6 +167,91 @@
},
"required": true
},
"responses": {
"200": {
"description": "Successful Response",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/IngestResponse"
}
}
}
},
"422": {
"description": "Validation Error",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/HTTPValidationError"
}
}
}
}
},
"deprecated": true
}
},
"/v1/ingest/file": {
"post": {
"tags": [
"Ingestion"
],
"summary": "Ingest File",
"description": "Ingests and processes a file, storing its chunks to be used as context.\n\nThe context obtained from files is later used in\n`/chat/completions`, `/completions`, and `/chunks` APIs.\n\nMost common document\nformats are supported, but you may be prompted to install an extra dependency to\nmanage a specific file type.\n\nA file can generate different Documents (for example a PDF generates one Document\nper page). All Documents IDs are returned in the response, together with the\nextracted Metadata (which is later used to improve context retrieval). Those IDs\ncan be used to filter the context used to create responses in\n`/chat/completions`, `/completions`, and `/chunks` APIs.",
"operationId": "ingest_file_v1_ingest_file_post",
"requestBody": {
"content": {
"multipart/form-data": {
"schema": {
"$ref": "#/components/schemas/Body_ingest_file_v1_ingest_file_post"
}
}
},
"required": true
},
"responses": {
"200": {
"description": "Successful Response",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/IngestResponse"
}
}
}
},
"422": {
"description": "Validation Error",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/HTTPValidationError"
}
}
}
}
}
}
},
"/v1/ingest/text": {
"post": {
"tags": [
"Ingestion"
],
"summary": "Ingest Text",
"description": "Ingests and processes a text, storing its chunks to be used as context.\n\nThe context obtained from files is later used in\n`/chat/completions`, `/completions`, and `/chunks` APIs.\n\nA Document will be generated with the given text. The Document\nID is returned in the response, together with the\nextracted Metadata (which is later used to improve context retrieval). That ID\ncan be used to filter the context used to create responses in\n`/chat/completions`, `/completions`, and `/chunks` APIs.",
"operationId": "ingest_text_v1_ingest_text_post",
"requestBody": {
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/IngestTextBody"
}
}
},
"required": true
},
"responses": {
"200": {
"description": "Successful Response",
@@ -315,6 +406,20 @@
},
"components": {
"schemas": {
"Body_ingest_file_v1_ingest_file_post": {
"properties": {
"file": {
"type": "string",
"format": "binary",
"title": "File"
}
},
"type": "object",
"required": [
"file"
],
"title": "Body_ingest_file_v1_ingest_file_post"
},
"Body_ingest_v1_ingest_post": {
"properties": {
"file": {
@@ -338,17 +443,6 @@
"type": "array",
"title": "Messages"
},
"system_prompt": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"title": "System Prompt"
},
"use_context": {
"type": "boolean",
"title": "Use Context",
@@ -389,13 +483,16 @@
},
"include_sources": true,
"messages": [
{
"content": "You are a rapper. Always answer with a rap.",
"role": "system"
},
{
"content": "How do you fry an egg?",
"role": "user"
}
],
"stream": false,
"system_prompt": "You are a rapper. Always answer with a rap.",
"use_context": true
}
]
@@ -591,6 +688,7 @@
"include_sources": false,
"prompt": "How do you fry an egg?",
"stream": false,
"system_prompt": "You are a rapper. Always answer with a rap.",
"use_context": false
}
]
@@ -754,6 +852,30 @@
],
"title": "IngestResponse"
},
"IngestTextBody": {
"properties": {
"file_name": {
"type": "string",
"title": "File Name",
"examples": [
"Avatar: The Last Airbender"
]
},
"text": {
"type": "string",
"title": "Text",
"examples": [
"Avatar is set in an Asian and Arctic-inspired world in which some people can telekinetically manipulate one of the four elements\u2014water, earth, fire or air\u2014through practices known as 'bending', inspired by Chinese martial arts."
]
}
},
"type": "object",
"required": [
"file_name",
"text"
],
"title": "IngestTextBody"
},
"IngestedDoc": {
"properties": {
"object": {
@@ -986,27 +1108,5 @@
"title": "ValidationError"
}
}
},
"tags": [
{
"name": "Ingestion",
"description": "High-level APIs covering document ingestion -internally managing document parsing, splitting,metadata extraction, embedding generation and storage- and ingested documents CRUD.Each ingested document is identified by an ID that can be used to filter the contextused in *Contextual Completions* and *Context Chunks* APIs."
},
{
"name": "Contextual Completions",
"description": "High-level APIs covering contextual Chat and Completions. They follow OpenAI's format, extending it to allow using the context coming from ingested documents to create the response. Internallymanage context retrieval, prompt engineering and the response generation."
},
{
"name": "Context Chunks",
"description": "Low-level API that given a query return relevant chunks of text coming from the ingesteddocuments."
},
{
"name": "Embeddings",
"description": "Low-level API to obtain the vector representation of a given text, using an Embeddings model.Follows OpenAI's embeddings API format."
},
{
"name": "Health",
"description": "Simple health API to make sure the server is up and running."
}
]
}
}

poetry.lock (generated): file diff suppressed because it is too large.

View File

@@ -1,4 +1,5 @@
"""private-gpt."""
import logging
import os
@@ -21,3 +22,6 @@ os.environ["GRADIO_ANALYTICS_ENABLED"] = "False"
# Disable chromaDB telemetry
# It is already disabled, see PR#1144
# os.environ["ANONYMIZED_TELEMETRY"] = "False"
# adding tiktoken cache path within repo to be able to run in offline environment.
os.environ["TIKTOKEN_CACHE_DIR"] = "tiktoken_cache"

View File

@@ -3,7 +3,7 @@ import json
from typing import Any
import boto3
from llama_index.embeddings.base import BaseEmbedding
from llama_index.core.base.embeddings.base import BaseEmbedding
from pydantic import Field, PrivateAttr

View File

@@ -1,8 +1,7 @@
import logging
from injector import inject, singleton
from llama_index import MockEmbedding
from llama_index.embeddings.base import BaseEmbedding
from llama_index.core.embeddings import BaseEmbedding, MockEmbedding
from private_gpt.paths import models_cache_path
from private_gpt.settings.settings import Settings
@@ -19,27 +18,78 @@ class EmbeddingComponent:
embedding_mode = settings.embedding.mode
logger.info("Initializing the embedding model in mode=%s", embedding_mode)
match embedding_mode:
case "local":
from llama_index.embeddings import HuggingFaceEmbedding
case "huggingface":
try:
from llama_index.embeddings.huggingface import ( # type: ignore
HuggingFaceEmbedding,
)
except ImportError as e:
raise ImportError(
"Local dependencies not found, install with `poetry install --extras embeddings-huggingface`"
) from e
self.embedding_model = HuggingFaceEmbedding(
model_name=settings.local.embedding_hf_model_name,
model_name=settings.huggingface.embedding_hf_model_name,
cache_folder=str(models_cache_path),
)
case "sagemaker":
from private_gpt.components.embedding.custom.sagemaker import (
SagemakerEmbedding,
)
try:
from private_gpt.components.embedding.custom.sagemaker import (
SagemakerEmbedding,
)
except ImportError as e:
raise ImportError(
"Sagemaker dependencies not found, install with `poetry install --extras embeddings-sagemaker`"
) from e
self.embedding_model = SagemakerEmbedding(
endpoint_name=settings.sagemaker.embedding_endpoint_name,
)
case "openai":
from llama_index import OpenAIEmbedding
try:
from llama_index.embeddings.openai import ( # type: ignore
OpenAIEmbedding,
)
except ImportError as e:
raise ImportError(
"OpenAI dependencies not found, install with `poetry install --extras embeddings-openai`"
) from e
openai_settings = settings.openai.api_key
self.embedding_model = OpenAIEmbedding(api_key=openai_settings)
case "ollama":
try:
from llama_index.embeddings.ollama import ( # type: ignore
OllamaEmbedding,
)
except ImportError as e:
raise ImportError(
"Local dependencies not found, install with `poetry install --extras embeddings-ollama`"
) from e
ollama_settings = settings.ollama
self.embedding_model = OllamaEmbedding(
model_name=ollama_settings.embedding_model,
base_url=ollama_settings.embedding_api_base,
)
case "azopenai":
try:
from llama_index.embeddings.azure_openai import ( # type: ignore
AzureOpenAIEmbedding,
)
except ImportError as e:
raise ImportError(
"Azure OpenAI dependencies not found, install with `poetry install --extras embeddings-azopenai`"
) from e
azopenai_settings = settings.azopenai
self.embedding_model = AzureOpenAIEmbedding(
model=azopenai_settings.embedding_model,
deployment_name=azopenai_settings.embedding_deployment_name,
api_key=azopenai_settings.api_key,
azure_endpoint=azopenai_settings.azure_endpoint,
api_version=azopenai_settings.api_version,
)
case "mock":
# Not a random number, is the dimensionality used by
# the default embedding model

View File

@@ -6,22 +6,21 @@ import multiprocessing.pool
import os
import threading
from pathlib import Path
from queue import Queue
from typing import Any
from llama_index import (
Document,
ServiceContext,
StorageContext,
VectorStoreIndex,
load_index_from_storage,
)
from llama_index.data_structs import IndexDict
from llama_index.indices.base import BaseIndex
from llama_index.ingestion import run_transformations
from llama_index.core.data_structs import IndexDict
from llama_index.core.embeddings.utils import EmbedType
from llama_index.core.indices import VectorStoreIndex, load_index_from_storage
from llama_index.core.indices.base import BaseIndex
from llama_index.core.ingestion import run_transformations
from llama_index.core.schema import BaseNode, Document, TransformComponent
from llama_index.core.storage import StorageContext
from private_gpt.components.ingest.ingest_helper import IngestionHelper
from private_gpt.paths import local_data_path
from private_gpt.settings.settings import Settings
from private_gpt.utils.eta import eta
logger = logging.getLogger(__name__)
@@ -30,13 +29,15 @@ class BaseIngestComponent(abc.ABC):
def __init__(
self,
storage_context: StorageContext,
service_context: ServiceContext,
embed_model: EmbedType,
transformations: list[TransformComponent],
*args: Any,
**kwargs: Any,
) -> None:
logger.debug("Initializing base ingest component type=%s", type(self).__name__)
self.storage_context = storage_context
self.service_context = service_context
self.embed_model = embed_model
self.transformations = transformations
@abc.abstractmethod
def ingest(self, file_name: str, file_data: Path) -> list[Document]:
@@ -55,11 +56,12 @@ class BaseIngestComponentWithIndex(BaseIngestComponent, abc.ABC):
def __init__(
self,
storage_context: StorageContext,
service_context: ServiceContext,
embed_model: EmbedType,
transformations: list[TransformComponent],
*args: Any,
**kwargs: Any,
) -> None:
super().__init__(storage_context, service_context, *args, **kwargs)
super().__init__(storage_context, embed_model, transformations, *args, **kwargs)
self.show_progress = True
self._index_thread_lock = (
@@ -73,9 +75,10 @@ class BaseIngestComponentWithIndex(BaseIngestComponent, abc.ABC):
# Load the index with store_nodes_override=True to be able to delete them
index = load_index_from_storage(
storage_context=self.storage_context,
service_context=self.service_context,
store_nodes_override=True, # Force store nodes in index and document stores
show_progress=self.show_progress,
embed_model=self.embed_model,
transformations=self.transformations,
)
except ValueError:
# There are no index in the storage context, creating a new one
@@ -83,9 +86,10 @@ class BaseIngestComponentWithIndex(BaseIngestComponent, abc.ABC):
index = VectorStoreIndex.from_documents(
[],
storage_context=self.storage_context,
service_context=self.service_context,
store_nodes_override=True, # Force store nodes in index and document stores
show_progress=self.show_progress,
embed_model=self.embed_model,
transformations=self.transformations,
)
index.storage_context.persist(persist_dir=local_data_path)
return index
@@ -106,11 +110,12 @@ class SimpleIngestComponent(BaseIngestComponentWithIndex):
def __init__(
self,
storage_context: StorageContext,
service_context: ServiceContext,
embed_model: EmbedType,
transformations: list[TransformComponent],
*args: Any,
**kwargs: Any,
) -> None:
super().__init__(storage_context, service_context, *args, **kwargs)
super().__init__(storage_context, embed_model, transformations, *args, **kwargs)
def ingest(self, file_name: str, file_data: Path) -> list[Document]:
logger.info("Ingesting file_name=%s", file_name)
@@ -151,16 +156,17 @@ class BatchIngestComponent(BaseIngestComponentWithIndex):
def __init__(
self,
storage_context: StorageContext,
service_context: ServiceContext,
embed_model: EmbedType,
transformations: list[TransformComponent],
count_workers: int,
*args: Any,
**kwargs: Any,
) -> None:
super().__init__(storage_context, service_context, *args, **kwargs)
super().__init__(storage_context, embed_model, transformations, *args, **kwargs)
# Make an efficient use of the CPU and GPU, the embedding
# must be in the transformations
assert (
len(self.service_context.transformations) >= 2
len(self.transformations) >= 2
), "Embeddings must be in the transformations"
assert count_workers > 0, "count_workers must be > 0"
self.count_workers = count_workers
@@ -197,7 +203,7 @@ class BatchIngestComponent(BaseIngestComponentWithIndex):
logger.debug("Transforming count=%s documents into nodes", len(documents))
nodes = run_transformations(
documents, # type: ignore[arg-type]
self.service_context.transformations,
self.transformations,
show_progress=self.show_progress,
)
# Locking the index to avoid concurrent writes
@@ -225,16 +231,17 @@ class ParallelizedIngestComponent(BaseIngestComponentWithIndex):
def __init__(
self,
storage_context: StorageContext,
service_context: ServiceContext,
embed_model: EmbedType,
transformations: list[TransformComponent],
count_workers: int,
*args: Any,
**kwargs: Any,
) -> None:
super().__init__(storage_context, service_context, *args, **kwargs)
super().__init__(storage_context, embed_model, transformations, *args, **kwargs)
# To make an efficient use of the CPU and GPU, the embeddings
# must be in the transformations (to be computed in batches)
assert (
len(self.service_context.transformations) >= 2
len(self.transformations) >= 2
), "Embeddings must be in the transformations"
assert count_workers > 0, "count_workers must be > 0"
self.count_workers = count_workers
@@ -278,7 +285,7 @@ class ParallelizedIngestComponent(BaseIngestComponentWithIndex):
logger.debug("Transforming count=%s documents into nodes", len(documents))
nodes = run_transformations(
documents, # type: ignore[arg-type]
self.service_context.transformations,
self.transformations,
show_progress=self.show_progress,
)
# Locking the index to avoid concurrent writes
@@ -309,20 +316,202 @@ class ParallelizedIngestComponent(BaseIngestComponentWithIndex):
self._file_to_documents_work_pool.terminate()
class PipelineIngestComponent(BaseIngestComponentWithIndex):
"""Pipeline ingestion - keeping the embedding worker pool as busy as possible.
This class implements a threaded ingestion pipeline, which comprises two threads
and two queues. The primary thread is responsible for reading and parsing files
into documents. These documents are then placed into a queue, which is
distributed to a pool of worker processes for embedding computation. After
embedding, the documents are transferred to another queue where they are
accumulated until a threshold is reached. Upon reaching this threshold, the
accumulated documents are flushed to the document store, index, and vector
store.
Exception handling ensures robustness against erroneous files. However, in the
pipelined design, one error can lead to the discarding of multiple files. Any
discarded files will be reported.
"""
NODE_FLUSH_COUNT = 5000 # Save the index every # nodes.
def __init__(
self,
storage_context: StorageContext,
embed_model: EmbedType,
transformations: list[TransformComponent],
count_workers: int,
*args: Any,
**kwargs: Any,
) -> None:
super().__init__(storage_context, embed_model, transformations, *args, **kwargs)
self.count_workers = count_workers
assert (
len(self.transformations) >= 2
), "Embeddings must be in the transformations"
assert count_workers > 0, "count_workers must be > 0"
self.count_workers = count_workers
# We are doing our own multiprocessing.
# To avoid colliding with huggingface's multiprocessing, we disable it.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# doc_q stores parsed files as Document chunks.
# Using a shallow queue causes the filesystem parser to block
# when it reaches capacity. This ensures it doesn't outpace the
# computationally intensive embeddings phase, avoiding unnecessary
# memory consumption. The semaphore is used to bound the async worker
# embedding computations to cause the doc Q to fill and block.
self.doc_semaphore = multiprocessing.Semaphore(
self.count_workers
) # limit the doc queue to # items.
self.doc_q: Queue[tuple[str, str | None, list[Document] | None]] = Queue(20)
# node_q stores documents parsed into nodes (embeddings).
# Larger queue size so we don't block the embedding workers during a slow
# index update.
self.node_q: Queue[
tuple[str, str | None, list[Document] | None, list[BaseNode] | None]
] = Queue(40)
threading.Thread(target=self._doc_to_node, daemon=True).start()
threading.Thread(target=self._write_nodes, daemon=True).start()
def _doc_to_node(self) -> None:
# Parse documents into nodes
with multiprocessing.pool.ThreadPool(processes=self.count_workers) as pool:
while True:
try:
cmd, file_name, documents = self.doc_q.get(
block=True
) # Documents for a file
if cmd == "process":
# Push CPU/GPU embedding work to the worker pool
# Acquire semaphore to control access to worker pool
self.doc_semaphore.acquire()
pool.apply_async(
self._doc_to_node_worker, (file_name, documents)
)
elif cmd == "quit":
break
finally:
if cmd != "process":
self.doc_q.task_done() # unblock Q joins
def _doc_to_node_worker(self, file_name: str, documents: list[Document]) -> None:
# CPU/GPU intensive work in its own process
try:
nodes = run_transformations(
documents, # type: ignore[arg-type]
self.transformations,
show_progress=self.show_progress,
)
self.node_q.put(("process", file_name, documents, nodes))
finally:
self.doc_semaphore.release()
self.doc_q.task_done() # unblock Q joins
def _save_docs(
self, files: list[str], documents: list[Document], nodes: list[BaseNode]
) -> None:
try:
logger.info(
f"Saving {len(files)} files ({len(documents)} documents / {len(nodes)} nodes)"
)
self._index.insert_nodes(nodes)
for document in documents:
self._index.docstore.set_document_hash(
document.get_doc_id(), document.hash
)
self._save_index()
except Exception:
# Tell the user so they can investigate these files
logger.exception(f"Processing files {files}")
finally:
# Clearing work, even on exception, maintains a clean state.
nodes.clear()
documents.clear()
files.clear()
def _write_nodes(self) -> None:
# Save nodes to index. I/O intensive.
node_stack: list[BaseNode] = []
doc_stack: list[Document] = []
file_stack: list[str] = []
while True:
try:
cmd, file_name, documents, nodes = self.node_q.get(block=True)
if cmd in ("flush", "quit"):
if file_stack:
self._save_docs(file_stack, doc_stack, node_stack)
if cmd == "quit":
break
elif cmd == "process":
node_stack.extend(nodes) # type: ignore[arg-type]
doc_stack.extend(documents) # type: ignore[arg-type]
file_stack.append(file_name) # type: ignore[arg-type]
# Constant saving is heavy on I/O - accumulate to a threshold
if len(node_stack) >= self.NODE_FLUSH_COUNT:
self._save_docs(file_stack, doc_stack, node_stack)
finally:
self.node_q.task_done()
def _flush(self) -> None:
self.doc_q.put(("flush", None, None))
self.doc_q.join()
self.node_q.put(("flush", None, None, None))
self.node_q.join()
def ingest(self, file_name: str, file_data: Path) -> list[Document]:
documents = IngestionHelper.transform_file_into_documents(file_name, file_data)
self.doc_q.put(("process", file_name, documents))
self._flush()
return documents
def bulk_ingest(self, files: list[tuple[str, Path]]) -> list[Document]:
docs = []
for file_name, file_data in eta(files):
try:
documents = IngestionHelper.transform_file_into_documents(
file_name, file_data
)
self.doc_q.put(("process", file_name, documents))
docs.extend(documents)
except Exception:
logger.exception(f"Skipping {file_data.name}")
self._flush()
return docs
def get_ingestion_component(
storage_context: StorageContext,
service_context: ServiceContext,
embed_model: EmbedType,
transformations: list[TransformComponent],
settings: Settings,
) -> BaseIngestComponent:
"""Get the ingestion component for the given configuration."""
ingest_mode = settings.embedding.ingest_mode
if ingest_mode == "batch":
return BatchIngestComponent(
storage_context, service_context, settings.embedding.count_workers
storage_context=storage_context,
embed_model=embed_model,
transformations=transformations,
count_workers=settings.embedding.count_workers,
)
elif ingest_mode == "parallel":
return ParallelizedIngestComponent(
storage_context, service_context, settings.embedding.count_workers
storage_context=storage_context,
embed_model=embed_model,
transformations=transformations,
count_workers=settings.embedding.count_workers,
)
elif ingest_mode == "pipeline":
return PipelineIngestComponent(
storage_context=storage_context,
embed_model=embed_model,
transformations=transformations,
count_workers=settings.embedding.count_workers,
)
else:
return SimpleIngestComponent(storage_context, service_context)
return SimpleIngestComponent(
storage_context=storage_context,
embed_model=embed_model,
transformations=transformations,
)
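For readers skimming the diff: the PipelineIngestComponent above boils down to a two-queue producer/consumer pipeline (parse, embed, batched flush). Below is a minimal, self-contained sketch of that pattern, with a single embedding worker instead of a pool and no error handling; `parse_file`, `embed` and `flush` are placeholder callables, not part of the project.

import queue
import threading
from collections.abc import Callable, Iterable
from typing import Any

def run_pipeline(
    files: Iterable[str],
    parse_file: Callable[[str], Any],
    embed: Callable[[Any], Any],
    flush: Callable[[list[Any]], None],
    flush_count: int = 5000,
) -> None:
    """Parse files, embed the results and flush them to storage in batches."""
    doc_q: queue.Queue = queue.Queue(maxsize=20)   # shallow: back-pressure on parsing
    node_q: queue.Queue = queue.Queue(maxsize=40)  # embedded items waiting to be flushed

    def embed_worker() -> None:
        while True:
            item = doc_q.get()
            try:
                if item is None:          # sentinel: no more documents
                    node_q.put(None)
                    return
                node_q.put(embed(item))   # CPU/GPU heavy step
            finally:
                doc_q.task_done()

    def flush_worker() -> None:
        pending: list[Any] = []
        while True:
            item = node_q.get()
            try:
                if item is not None:
                    pending.append(item)
                # Flush on the sentinel or once the batch threshold is reached.
                if pending and (item is None or len(pending) >= flush_count):
                    flush(pending)        # I/O heavy step, batched
                    pending = []
                if item is None:
                    return
            finally:
                node_q.task_done()

    threading.Thread(target=embed_worker, daemon=True).start()
    threading.Thread(target=flush_worker, daemon=True).start()
    for file_name in files:
        doc_q.put(parse_file(file_name))  # blocks while doc_q is full
    doc_q.put(None)
    doc_q.join()
    node_q.join()

The real component adds a worker pool bounded by a semaphore, per-file error reporting and periodic index saves on top of this skeleton.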

View File

@@ -1,14 +1,58 @@
import logging
from pathlib import Path
from llama_index import Document
from llama_index.readers import JSONReader, StringIterableReader
from llama_index.readers.file.base import DEFAULT_FILE_READER_CLS
from llama_index.core.readers import StringIterableReader
from llama_index.core.readers.base import BaseReader
from llama_index.core.readers.json import JSONReader
from llama_index.core.schema import Document
logger = logging.getLogger(__name__)
# Inspired by the `llama_index.core.readers.file.base` module
def _try_loading_included_file_formats() -> dict[str, type[BaseReader]]:
try:
from llama_index.readers.file.docs import ( # type: ignore
DocxReader,
HWPReader,
PDFReader,
)
from llama_index.readers.file.epub import EpubReader # type: ignore
from llama_index.readers.file.image import ImageReader # type: ignore
from llama_index.readers.file.ipynb import IPYNBReader # type: ignore
from llama_index.readers.file.markdown import MarkdownReader # type: ignore
from llama_index.readers.file.mbox import MboxReader # type: ignore
from llama_index.readers.file.slides import PptxReader # type: ignore
from llama_index.readers.file.tabular import PandasCSVReader # type: ignore
from llama_index.readers.file.video_audio import ( # type: ignore
VideoAudioReader,
)
except ImportError as e:
raise ImportError("`llama-index-readers-file` package not found") from e
default_file_reader_cls: dict[str, type[BaseReader]] = {
".hwp": HWPReader,
".pdf": PDFReader,
".docx": DocxReader,
".pptx": PptxReader,
".ppt": PptxReader,
".pptm": PptxReader,
".jpg": ImageReader,
".png": ImageReader,
".jpeg": ImageReader,
".mp3": VideoAudioReader,
".mp4": VideoAudioReader,
".csv": PandasCSVReader,
".epub": EpubReader,
".md": MarkdownReader,
".mbox": MboxReader,
".ipynb": IPYNBReader,
}
return default_file_reader_cls
# Patching the default file reader to support other file types
FILE_READER_CLS = DEFAULT_FILE_READER_CLS.copy()
FILE_READER_CLS = _try_loading_included_file_formats()
FILE_READER_CLS.update(
{
".json": JSONReader,

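Because the reader map is patched by hand here, additional formats can be registered the same way. The sketch below is illustrative only: PlainTextReader is a made-up class, and it assumes the per-file reader interface (`load_data(file, extra_info)`) used by the bundled readers.

from pathlib import Path

from llama_index.core.readers.base import BaseReader
from llama_index.core.schema import Document

class PlainTextReader(BaseReader):
    """Hypothetical reader wrapping a .txt file into a single Document."""

    def load_data(self, file: Path, extra_info: dict | None = None) -> list[Document]:
        text = Path(file).read_text(encoding="utf-8")
        return [Document(text=text, metadata=extra_info or {})]

# Register it next to the readers patched above.
FILE_READER_CLS[".txt"] = PlainTextReader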
View File

@@ -7,26 +7,20 @@ import logging
from typing import TYPE_CHECKING, Any
import boto3 # type: ignore
from llama_index.bridge.pydantic import Field
from llama_index.llms import (
from llama_index.core.base.llms.generic_utils import (
completion_response_to_chat_response,
stream_completion_response_to_chat_response,
)
from llama_index.core.bridge.pydantic import Field
from llama_index.core.llms import (
CompletionResponse,
CustomLLM,
LLMMetadata,
)
from llama_index.llms.base import (
from llama_index.core.llms.callbacks import (
llm_chat_callback,
llm_completion_callback,
)
from llama_index.llms.generic_utils import (
completion_response_to_chat_response,
stream_completion_response_to_chat_response,
)
from llama_index.llms.llama_utils import (
completion_to_prompt as generic_completion_to_prompt,
)
from llama_index.llms.llama_utils import (
messages_to_prompt as generic_messages_to_prompt,
)
if TYPE_CHECKING:
from collections.abc import Sequence
@@ -161,8 +155,8 @@ class SagemakerLLM(CustomLLM):
model_kwargs = model_kwargs or {}
model_kwargs.update({"n_ctx": context_window, "verbose": verbose})
messages_to_prompt = messages_to_prompt or generic_messages_to_prompt
completion_to_prompt = completion_to_prompt or generic_completion_to_prompt
messages_to_prompt = messages_to_prompt or {}
completion_to_prompt = completion_to_prompt or {}
generate_kwargs = generate_kwargs or {}
generate_kwargs.update(
@@ -249,12 +243,19 @@ class SagemakerLLM(CustomLLM):
event_stream = resp["Body"]
start_json = b"{"
stop_token = "<|endoftext|>"
first_token = True
for line in LineIterator(event_stream):
if line != b"" and start_json in line:
data = json.loads(line[line.find(start_json) :].decode("utf-8"))
if data["token"]["text"] != stop_token:
special = data["token"]["special"]
stop = data["token"]["text"] == stop_token
if not special and not stop:
delta = data["token"]["text"]
# trim the leading space for the first token if present
if first_token:
delta = delta.lstrip()
first_token = False
text += delta
yield CompletionResponse(delta=delta, text=text, raw=data)

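The streaming loop above now skips special tokens, never emits the stop token, and trims the leading space of the first emitted delta. A framework-free sketch of the same filtering, where `tokens` is an assumed stand-in for the parsed data["token"] payloads:

from collections.abc import Iterable, Iterator

def filter_stream(
    tokens: Iterable[dict], stop_token: str = "<|endoftext|>"
) -> Iterator[str]:
    """Yield printable text deltas from a raw token stream."""
    first_token = True
    for token in tokens:
        special = token.get("special", False)
        stop = token["text"] == stop_token
        if special or stop:
            continue  # never surface control tokens to the caller
        delta = token["text"]
        if first_token:
            # some endpoints emit a leading space on the first token
            delta = delta.lstrip()
            first_token = False
        yield delta

# e.g. "".join(filter_stream([{"text": " Hello", "special": False},
#                             {"text": " world", "special": False}])) == "Hello world"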
View File

@@ -1,10 +1,15 @@
import logging
from collections.abc import Callable
from typing import Any
from injector import inject, singleton
from llama_index.llms import MockLLM
from llama_index.llms.base import LLM
from llama_index.core.llms import LLM, MockLLM
from llama_index.core.settings import Settings as LlamaIndexSettings
from llama_index.core.utils import set_global_tokenizer
from transformers import AutoTokenizer # type: ignore
from private_gpt.paths import models_path
from private_gpt.components.llm.prompt_helper import get_prompt_style
from private_gpt.paths import models_cache_path, models_path
from private_gpt.settings.settings import Settings
logger = logging.getLogger(__name__)
@@ -17,48 +22,154 @@ class LLMComponent:
@inject
def __init__(self, settings: Settings) -> None:
llm_mode = settings.llm.mode
if settings.llm.tokenizer:
set_global_tokenizer(
AutoTokenizer.from_pretrained(
pretrained_model_name_or_path=settings.llm.tokenizer,
cache_dir=str(models_cache_path),
)
)
logger.info("Initializing the LLM in mode=%s", llm_mode)
match settings.llm.mode:
case "local":
from llama_index.llms import LlamaCPP
from private_gpt.components.llm.prompt.prompt_helper import (
get_prompt_style,
)
prompt_style = get_prompt_style(
prompt_style=settings.local.prompt_style,
template_name=settings.local.template_name,
default_system_prompt=settings.local.default_system_prompt,
)
case "llamacpp":
try:
from llama_index.llms.llama_cpp import LlamaCPP # type: ignore
except ImportError as e:
raise ImportError(
"Local dependencies not found, install with `poetry install --extras llms-llama-cpp`"
) from e
prompt_style = get_prompt_style(settings.llamacpp.prompt_style)
settings_kwargs = {
"tfs_z": settings.llamacpp.tfs_z, # ollama and llama-cpp
"top_k": settings.llamacpp.top_k, # ollama and llama-cpp
"top_p": settings.llamacpp.top_p, # ollama and llama-cpp
"repeat_penalty": settings.llamacpp.repeat_penalty, # ollama llama-cpp
"n_gpu_layers": -1,
"offload_kqv": True,
}
self.llm = LlamaCPP(
model_path=str(models_path / settings.local.llm_hf_model_file),
temperature=0.1,
model_path=str(models_path / settings.llamacpp.llm_hf_model_file),
temperature=settings.llm.temperature,
max_new_tokens=settings.llm.max_new_tokens,
# llama2 has a context window of 4096 tokens,
# but we set it lower to allow for some wiggle room
context_window=3900,
context_window=settings.llm.context_window,
generate_kwargs={},
callback_manager=LlamaIndexSettings.callback_manager,
# All to GPU
model_kwargs={"n_gpu_layers": -1},
model_kwargs=settings_kwargs,
# transform inputs into Llama2 format
messages_to_prompt=prompt_style.messages_to_prompt,
completion_to_prompt=prompt_style.completion_to_prompt,
verbose=True,
)
# prompt_style.improve_prompt_format(llm=cast(LlamaCPP, self.llm))
case "sagemaker":
from private_gpt.components.llm.custom.sagemaker import SagemakerLLM
try:
from private_gpt.components.llm.custom.sagemaker import SagemakerLLM
except ImportError as e:
raise ImportError(
"Sagemaker dependencies not found, install with `poetry install --extras llms-sagemaker`"
) from e
self.llm = SagemakerLLM(
endpoint_name=settings.sagemaker.llm_endpoint_name,
max_new_tokens=settings.llm.max_new_tokens,
context_window=settings.llm.context_window,
)
case "openai":
from llama_index.llms import OpenAI
try:
from llama_index.llms.openai import OpenAI # type: ignore
except ImportError as e:
raise ImportError(
"OpenAI dependencies not found, install with `poetry install --extras llms-openai`"
) from e
openai_settings = settings.openai.api_key
self.llm = OpenAI(api_key=openai_settings)
openai_settings = settings.openai
self.llm = OpenAI(
api_base=openai_settings.api_base,
api_key=openai_settings.api_key,
model=openai_settings.model,
)
case "openailike":
try:
from llama_index.llms.openai_like import OpenAILike # type: ignore
except ImportError as e:
raise ImportError(
"OpenAILike dependencies not found, install with `poetry install --extras llms-openai-like`"
) from e
openai_settings = settings.openai
self.llm = OpenAILike(
api_base=openai_settings.api_base,
api_key=openai_settings.api_key,
model=openai_settings.model,
is_chat_model=True,
max_tokens=None,
api_version="",
)
case "ollama":
try:
from llama_index.llms.ollama import Ollama # type: ignore
except ImportError as e:
raise ImportError(
"Ollama dependencies not found, install with `poetry install --extras llms-ollama`"
) from e
ollama_settings = settings.ollama
settings_kwargs = {
"tfs_z": ollama_settings.tfs_z, # ollama and llama-cpp
"num_predict": ollama_settings.num_predict, # ollama only
"top_k": ollama_settings.top_k, # ollama and llama-cpp
"top_p": ollama_settings.top_p, # ollama and llama-cpp
"repeat_last_n": ollama_settings.repeat_last_n, # ollama
"repeat_penalty": ollama_settings.repeat_penalty, # ollama llama-cpp
}
self.llm = Ollama(
model=ollama_settings.llm_model,
base_url=ollama_settings.api_base,
temperature=settings.llm.temperature,
context_window=settings.llm.context_window,
additional_kwargs=settings_kwargs,
request_timeout=ollama_settings.request_timeout,
)
if (
ollama_settings.keep_alive
!= ollama_settings.model_fields["keep_alive"].default
):
# Modify Ollama methods to use the "keep_alive" field.
def add_keep_alive(func: Callable[..., Any]) -> Callable[..., Any]:
def wrapper(*args: Any, **kwargs: Any) -> Any:
kwargs["keep_alive"] = ollama_settings.keep_alive
return func(*args, **kwargs)
return wrapper
Ollama.chat = add_keep_alive(Ollama.chat)
Ollama.stream_chat = add_keep_alive(Ollama.stream_chat)
Ollama.complete = add_keep_alive(Ollama.complete)
Ollama.stream_complete = add_keep_alive(Ollama.stream_complete)
case "azopenai":
try:
from llama_index.llms.azure_openai import ( # type: ignore
AzureOpenAI,
)
except ImportError as e:
raise ImportError(
"Azure OpenAI dependencies not found, install with `poetry install --extras llms-azopenai`"
) from e
azopenai_settings = settings.azopenai
self.llm = AzureOpenAI(
model=azopenai_settings.llm_model,
deployment_name=azopenai_settings.llm_deployment_name,
api_key=azopenai_settings.api_key,
azure_endpoint=azopenai_settings.azure_endpoint,
api_version=azopenai_settings.api_version,
)
case "mock":
self.llm = MockLLM()
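The keep_alive patch in the `ollama` branch above is one instance of a general pattern: wrapping a callable so that an extra keyword argument is always supplied. A generic sketch (the `client.chat` usage is hypothetical):

import functools
from collections.abc import Callable
from typing import Any

def with_default_kwarg(
    func: Callable[..., Any], name: str, value: Any
) -> Callable[..., Any]:
    """Return a wrapper that injects `name=value` unless the caller sets it."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        kwargs.setdefault(name, value)  # note: the patch above overrides unconditionally
        return func(*args, **kwargs)

    return wrapper

# Hypothetical usage:
# client.chat = with_default_kwarg(client.chat, "keep_alive", "5m")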

View File

@@ -1,446 +0,0 @@
# Ignoring the mypy check in this file, given that this file is imported only if
# running in local mode (and therefore the llama-cpp-python library is installed).
# type: ignore
"""Helper to get your llama_index messages correctly serialized into a prompt.
This set of classes and functions is used to format a series of
llama_index ChatMessage into a prompt (a unique string) that will be passed
as is to the LLM. The LLM will then use this prompt to generate a completion.
There are **MANY** formats for prompts; usually, each model has its own format.
Models posted on HuggingFace usually have a description of the format they use.
The original models, that are shipped through `transformers`, have their
format defined in the file `tokenizer_config.json` in the model's directory.
The prompt formats are usually defined as Jinja templates (with some custom
Jinja token definitions). These prompt templates can be applied using
`transformers.AutoTokenizer`, as described in
https://huggingface.co/docs/transformers/main/chat_templating
Examples of `tokenizer_config.json` files:
https://huggingface.co/bofenghuang/vigogne-2-7b-chat/blob/main/tokenizer_config.json
https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1/blob/main/tokenizer_config.json
https://huggingface.co/HuggingFaceH4/zephyr-7b-beta/blob/main/tokenizer_config.json
The format of the prompt is important: if the wrong one is used, the model
will produce "hallucinations" and other completions that are not relevant.
"""
import abc
import logging
from collections.abc import Sequence
from pathlib import Path
from typing import Any
from jinja2 import FileSystemLoader
from jinja2.exceptions import TemplateError
from jinja2.sandbox import ImmutableSandboxedEnvironment
from llama_cpp import llama_chat_format, llama_types
from llama_index.llms import ChatMessage, MessageRole
from llama_index.llms.llama_utils import (
DEFAULT_SYSTEM_PROMPT,
completion_to_prompt,
messages_to_prompt,
)
from private_gpt.constants import PROJECT_ROOT_PATH
logger = logging.getLogger(__name__)
THIS_DIRECTORY_RELATIVE = Path(__file__).parent.relative_to(PROJECT_ROOT_PATH)
_LLAMA_CPP_PYTHON_CHAT_FORMAT: dict[str, llama_chat_format.ChatFormatter] = {
"llama-2": llama_chat_format.format_llama2,
"alpaca": llama_chat_format.format_alpaca,
"vicuna": llama_chat_format.format,
"oasst_llama": llama_chat_format.format_oasst_llama,
"baichuan-2": llama_chat_format.format_baichuan2,
"baichuan": llama_chat_format.format_baichuan,
"openbuddy": llama_chat_format.format_openbuddy,
"redpajama-incite": llama_chat_format.format_redpajama_incite,
"snoozy": llama_chat_format.format_snoozy,
"phind": llama_chat_format.format_phind,
"intel": llama_chat_format.format_intel,
"open-orca": llama_chat_format.format_open_orca,
"mistrallite": llama_chat_format.format_mistrallite,
"zephyr": llama_chat_format.format_zephyr,
"chatml": llama_chat_format.format_chatml,
"openchat": llama_chat_format.format_openchat,
}
# FIXME partial support
def llama_index_to_llama_cpp_messages(
messages: Sequence[ChatMessage],
) -> list[llama_types.ChatCompletionRequestMessage]:
"""Convert messages from llama_index to llama_cpp format.
Convert a list of llama_index ChatMessage to a
list of llama_cpp ChatCompletionRequestMessage.
"""
llama_cpp_messages: list[llama_types.ChatCompletionRequestMessage] = []
l_msg: llama_types.ChatCompletionRequestMessage
for msg in messages:
if msg.role == MessageRole.SYSTEM:
l_msg = llama_types.ChatCompletionRequestSystemMessage(
content=msg.content, role=msg.role.value
)
elif msg.role == MessageRole.USER:
# FIXME partial support
l_msg = llama_types.ChatCompletionRequestUserMessage(
content=msg.content, role=msg.role.value
)
elif msg.role == MessageRole.ASSISTANT:
# FIXME partial support
l_msg = llama_types.ChatCompletionRequestAssistantMessage(
content=msg.content, role=msg.role.value
)
elif msg.role == MessageRole.TOOL:
# FIXME partial support
l_msg = llama_types.ChatCompletionRequestToolMessage(
content=msg.content, role=msg.role.value, tool_call_id=""
)
elif msg.role == MessageRole.FUNCTION:
# FIXME partial support
l_msg = llama_types.ChatCompletionRequestFunctionMessage(
content=msg.content, role=msg.role.value, name=""
)
else:
raise ValueError(f"Unknown role='{msg.role}'")
llama_cpp_messages.append(l_msg)
return llama_cpp_messages
def _get_llama_cpp_chat_format(name: str) -> llama_chat_format.ChatFormatter:
logger.debug("Getting llama_cpp_python prompt_format='%s'", name)
try:
return _LLAMA_CPP_PYTHON_CHAT_FORMAT[name]
except KeyError as err:
raise ValueError(f"Unknown llama_cpp_python prompt style '{name}'") from err
class AbstractPromptStyle(abc.ABC):
"""Abstract class for prompt styles.
This class is used to format a series of messages into a prompt that can be
understood by the models. A series of messages represents the interaction(s)
between a user and an assistant. This series of messages can be considered as a
session between a user X and an assistant Y. This session holds, through the
messages, the state of the conversation. This session, to be understood by the
model, needs to be formatted into a prompt (i.e. a string that the models
can understand). Prompts can be formatted in different ways,
depending on the model.
The implementations of this class represent the different ways to format a
series of messages into a prompt.
"""
@abc.abstractmethod
def __init__(self, *args: Any, **kwargs: Any) -> None:
logger.debug("Initializing prompt_style=%s", self.__class__.__name__)
self.bos_token = "<s>"
self.eos_token = "</s>"
self.nl_token = "\n"
@abc.abstractmethod
def _messages_to_prompt(self, messages: Sequence[ChatMessage]) -> str:
pass
@abc.abstractmethod
def _completion_to_prompt(self, completion: str) -> str:
pass
def messages_to_prompt(self, messages: Sequence[ChatMessage]) -> str:
logger.debug("Formatting messages='%s' to prompt", messages)
prompt = self._messages_to_prompt(messages)
logger.debug("Got for messages='%s' the prompt='%s'", messages, prompt)
return prompt
def completion_to_prompt(self, completion: str) -> str:
logger.debug("Formatting completion='%s' to prompt", completion)
prompt = self._completion_to_prompt(completion)
logger.debug("Got for completion='%s' the prompt='%s'", completion, prompt)
return prompt
# def improve_prompt_format(self, llm: LlamaCPP) -> None:
# """Improve the prompt format of the given LLM.
#
# Use the given metadata in the LLM to improve the prompt format.
# """
# # FIXME: we are getting IDs (1,2,13) from llama.cpp, and not actual strings
# llama_cpp_llm = cast(Llama, llm._model)
# self.bos_token = llama_cpp_llm.token_bos()
# self.eos_token = llama_cpp_llm.token_eos()
# self.nl_token = llama_cpp_llm.token_nl()
# print([self.bos_token, self.eos_token, self.nl_token])
# # (1,2,13) are the IDs of the tokens
class AbstractPromptStyleWithSystemPrompt(AbstractPromptStyle, abc.ABC):
_DEFAULT_SYSTEM_PROMPT = DEFAULT_SYSTEM_PROMPT
def __init__(
self, default_system_prompt: str | None, *args: Any, **kwargs: Any
) -> None:
super().__init__(*args, **kwargs)
logger.debug("Got default_system_prompt='%s'", default_system_prompt)
self.default_system_prompt = default_system_prompt
def _add_missing_system_prompt(
self, messages: Sequence[ChatMessage]
) -> Sequence[ChatMessage]:
if messages[0].role != MessageRole.SYSTEM:
logger.debug(
"Adding system_prompt='%s' to the given messages as none was given in the session",
self.default_system_prompt,
)
messages = [
ChatMessage(
content=self.default_system_prompt, role=MessageRole.SYSTEM
),
*messages,
]
return messages
class DefaultPromptStyle(AbstractPromptStyle):
"""Default prompt style that uses the defaults from llama_utils.
It basically passes None to the LLM, indicating it should use
the default functions.
"""
def __init__(self, *args: Any, **kwargs: Any) -> None:
super().__init__(*args, **kwargs)
# Hacky way to override the functions
# Override the functions to be None, and pass None to the LLM.
self.messages_to_prompt = None # type: ignore[method-assign, assignment]
self.completion_to_prompt = None # type: ignore[method-assign, assignment]
def _messages_to_prompt(self, messages: Sequence[ChatMessage]) -> str:
"""Dummy implementation."""
return ""
def _completion_to_prompt(self, completion: str) -> str:
"""Dummy implementation."""
return ""
class LlamaIndexPromptStyle(AbstractPromptStyleWithSystemPrompt):
"""Simple prompt style that just uses the default llama_utils functions.
It transforms the sequence of messages into a prompt that should look like:
```text
<s> [INST] <<SYS>> your system prompt here. <</SYS>>
user message here [/INST] assistant (model) response here </s>
```
"""
def __init__(
self, default_system_prompt: str | None = None, *args: Any, **kwargs: Any
) -> None:
# If no system prompt is given, the default one of the implementation is used.
# default_system_prompt can be None here
kwargs["default_system_prompt"] = default_system_prompt
super().__init__(*args, **kwargs)
def _messages_to_prompt(self, messages: Sequence[ChatMessage]) -> str:
return messages_to_prompt(messages, self.default_system_prompt)
def _completion_to_prompt(self, completion: str) -> str:
return completion_to_prompt(completion, self.default_system_prompt)
class VigognePromptStyle(AbstractPromptStyleWithSystemPrompt):
"""Tag prompt style (used by Vigogne) that formats messages with `<|ROLE|>` tags.
It transforms the sequence of messages into a prompt that should look like:
```text
<|system|>: your system prompt here.
<|user|>: user message here
(possibly with context and question)
<|assistant|>: assistant (model) response here.
```
FIXME: should we add surrounding `<s>` and `</s>` tags, like in llama2?
"""
def __init__(
self,
default_system_prompt: str | None = None,
add_generation_prompt: bool = True,
*args: Any,
**kwargs: Any,
) -> None:
# We have to define a default system prompt here as the LLM will not
# use the default llama_utils functions.
default_system_prompt = default_system_prompt or self._DEFAULT_SYSTEM_PROMPT
kwargs["default_system_prompt"] = default_system_prompt
super().__init__(*args, **kwargs)
self.add_generation_prompt = add_generation_prompt
def _messages_to_prompt(self, messages: Sequence[ChatMessage]) -> str:
messages = self._add_missing_system_prompt(messages)
return self._format_messages_to_prompt(messages)
def _completion_to_prompt(self, completion: str) -> str:
messages = [ChatMessage(content=completion, role=MessageRole.USER)]
return self._format_messages_to_prompt(messages)
def _format_messages_to_prompt(self, messages: Sequence[ChatMessage]) -> str:
# TODO add BOS and EOS TOKEN !!!!! (c.f. jinja template)
"""Format message to prompt with `<|ROLE|>: MSG` style."""
assert messages[0].role == MessageRole.SYSTEM
prompt = ""
# TODO enclose the interaction between self.token_bos and self.token_eos
for message in messages:
role = message.role
content = message.content or ""
message_from_user = f"<|{role.lower()}|>: {content.strip()}"
message_from_user += self.nl_token
prompt += message_from_user
if self.add_generation_prompt:
# we are missing the last <|assistant|> tag that will trigger a completion
prompt += "<|assistant|>: "
return prompt
class LlamaCppPromptStyle(AbstractPromptStyleWithSystemPrompt):
def __init__(
self,
prompt_style: str,
default_system_prompt: str | None = None,
*args: Any,
**kwargs: Any,
) -> None:
"""Wrapper for llama_cpp_python defined prompt format.
:param prompt_style:
:param default_system_prompt: Used if no system prompt is given in the messages.
"""
assert prompt_style.startswith("llama_cpp.")
default_system_prompt = default_system_prompt or self._DEFAULT_SYSTEM_PROMPT
kwargs["default_system_prompt"] = default_system_prompt
super().__init__(*args, **kwargs)
self.prompt_style = prompt_style[len("llama_cpp.") :]
if self.prompt_style is None:
return
self._llama_cpp_formatter = _get_llama_cpp_chat_format(self.prompt_style)
def _messages_to_prompt(self, messages: Sequence[ChatMessage]) -> str:
messages = self._add_missing_system_prompt(messages)
return self._llama_cpp_formatter(
messages=llama_index_to_llama_cpp_messages(messages)
).prompt
def _completion_to_prompt(self, completion: str) -> str:
messages = self._add_missing_system_prompt(
[ChatMessage(content=completion, role=MessageRole.USER)]
)
return self._llama_cpp_formatter(
messages=llama_index_to_llama_cpp_messages(messages)
).prompt
class TemplatePromptStyle(AbstractPromptStyleWithSystemPrompt):
def __init__(
self,
template_name: str,
template_dir: str | None = None,
add_generation_prompt: bool = True,
default_system_prompt: str | None = None,
*args: Any,
**kwargs: Any,
) -> None:
"""Prompt format using a Jinja template.
:param template_name: the filename of the template to use, must be in
the `./template/` directory.
:param template_dir: the directory where the template is located.
Defaults to `./template/`.
:param default_system_prompt: Used if no system prompt is
given in the messages.
"""
default_system_prompt = default_system_prompt or DEFAULT_SYSTEM_PROMPT
kwargs["default_system_prompt"] = default_system_prompt
super().__init__(*args, **kwargs)
self._add_generation_prompt = add_generation_prompt
def raise_exception(message: str) -> None:
raise TemplateError(message)
if template_dir is None:
self.template_dir = THIS_DIRECTORY_RELATIVE / "template"
else:
self.template_dir = Path(template_dir)
self._jinja_fs_loader = FileSystemLoader(searchpath=self.template_dir)
self._jinja_env = ImmutableSandboxedEnvironment(
loader=self._jinja_fs_loader, trim_blocks=True, lstrip_blocks=True
)
self._jinja_env.globals["raise_exception"] = raise_exception
self.template = self._jinja_env.get_template(template_name)
@property
def _extra_kwargs_render(self) -> dict[str, Any]:
return {
"eos_token": self.eos_token,
"bos_token": self.bos_token,
"nl_token": self.nl_token,
}
@staticmethod
def _j_raise_exception(x: str) -> None:
"""Helper method to let Jinja template raise exceptions."""
raise RuntimeError(x)
def _messages_to_prompt(self, messages: Sequence[ChatMessage]) -> str:
messages = self._add_missing_system_prompt(messages)
msgs = [{"role": msg.role.value, "content": msg.content} for msg in messages]
return self.template.render(
messages=msgs,
add_generation_prompt=self._add_generation_prompt,
**self._extra_kwargs_render,
)
def _completion_to_prompt(self, completion: str) -> str:
messages = self._add_missing_system_prompt(
[
ChatMessage(content=completion, role=MessageRole.USER),
]
)
return self._messages_to_prompt(messages)
# TODO Maybe implement an auto-prompt style?
# Pass all the arguments at once
def get_prompt_style(
prompt_style: str | None,
**kwargs: Any,
) -> AbstractPromptStyle:
"""Get the prompt style to use from the given string.
:param prompt_style: The prompt style to use.
:return: The prompt style to use.
"""
if prompt_style is None:
return DefaultPromptStyle(**kwargs)
if prompt_style.startswith("llama_cpp."):
return LlamaCppPromptStyle(prompt_style, **kwargs)
elif prompt_style == "llama2":
return LlamaIndexPromptStyle(**kwargs)
elif prompt_style == "vigogne":
return VigognePromptStyle(**kwargs)
elif prompt_style == "template":
return TemplatePromptStyle(**kwargs)
raise ValueError(f"Unknown prompt_style='{prompt_style}'")

View File

@@ -1,2 +0,0 @@
{# This template is coming from: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1/blob/main/tokenizer_config.json #}
{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token + ' ' }}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}

View File

@@ -1,2 +0,0 @@
{# This template is coming from: https://huggingface.co/bofenghuang/vigogne-2-7b-chat/blob/main/tokenizer_config.json #}
{{ bos_token }}{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% elif true == true %}{% set loop_messages = messages %}{% set system_message = 'Vous êtes Vigogne, un assistant IA créé par Zaion Lab. Vous suivez extrêmement bien les instructions. Aidez autant que vous le pouvez.' %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% if system_message != false %}{{ '<|system|>: ' + system_message + '\n' }}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '<|user|>: ' + message['content'].strip() + '\n' }}{% elif message['role'] == 'assistant' %}{{ '<|assistant|>: ' + message['content'].strip() + eos_token + '\n' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>:' }}{% endif %}

View File

@@ -1,2 +0,0 @@
{# This template is coming from: https://huggingface.co/HuggingFaceH4/zephyr-7b-beta/blob/main/tokenizer_config.json #}
{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n' + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}

View File

@@ -0,0 +1,235 @@
import abc
import logging
from collections.abc import Sequence
from typing import Any, Literal
from llama_index.core.llms import ChatMessage, MessageRole
logger = logging.getLogger(__name__)
class AbstractPromptStyle(abc.ABC):
"""Abstract class for prompt styles.
This class is used to format a series of messages into a prompt that can be
understood by the models. A series of messages represents the interaction(s)
between a user and an assistant. This series of messages can be considered as a
session between a user X and an assistant Y. This session holds, through the
messages, the state of the conversation. This session, to be understood by the
model, needs to be formatted into a prompt (i.e. a string that the models
can understand). Prompts can be formatted in different ways,
depending on the model.
The implementations of this class represent the different ways to format a
series of messages into a prompt.
"""
def __init__(self, *args: Any, **kwargs: Any) -> None:
logger.debug("Initializing prompt_style=%s", self.__class__.__name__)
@abc.abstractmethod
def _messages_to_prompt(self, messages: Sequence[ChatMessage]) -> str:
pass
@abc.abstractmethod
def _completion_to_prompt(self, completion: str) -> str:
pass
def messages_to_prompt(self, messages: Sequence[ChatMessage]) -> str:
prompt = self._messages_to_prompt(messages)
logger.debug("Got for messages='%s' the prompt='%s'", messages, prompt)
return prompt
def completion_to_prompt(self, completion: str) -> str:
prompt = self._completion_to_prompt(completion)
logger.debug("Got for completion='%s' the prompt='%s'", completion, prompt)
return prompt
class DefaultPromptStyle(AbstractPromptStyle):
"""Default prompt style that uses the defaults from llama_utils.
It basically passes None to the LLM, indicating it should use
the default functions.
"""
def __init__(self, *args: Any, **kwargs: Any) -> None:
super().__init__(*args, **kwargs)
# Hacky way to override the functions
# Override the functions to be None, and pass None to the LLM.
self.messages_to_prompt = None # type: ignore[method-assign, assignment]
self.completion_to_prompt = None # type: ignore[method-assign, assignment]
def _messages_to_prompt(self, messages: Sequence[ChatMessage]) -> str:
return ""
def _completion_to_prompt(self, completion: str) -> str:
return ""
class Llama2PromptStyle(AbstractPromptStyle):
"""Simple prompt style that uses llama 2 prompt style.
Inspired by llama_index/legacy/llms/llama_utils.py
It transforms the sequence of messages into a prompt that should look like:
```text
<s> [INST] <<SYS>> your system prompt here. <</SYS>>
user message here [/INST] assistant (model) response here </s>
```
"""
BOS, EOS = "<s>", "</s>"
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
DEFAULT_SYSTEM_PROMPT = """\
You are a helpful, respectful and honest assistant. \
Always answer as helpfully as possible and follow ALL given instructions. \
Do not speculate or make up information. \
Do not reference any given instructions or context. \
"""
def _messages_to_prompt(self, messages: Sequence[ChatMessage]) -> str:
string_messages: list[str] = []
if messages[0].role == MessageRole.SYSTEM:
# pull out the system message (if it exists in messages)
system_message_str = messages[0].content or ""
messages = messages[1:]
else:
system_message_str = self.DEFAULT_SYSTEM_PROMPT
system_message_str = f"{self.B_SYS} {system_message_str.strip()} {self.E_SYS}"
for i in range(0, len(messages), 2):
# first message should always be a user
user_message = messages[i]
assert user_message.role == MessageRole.USER
if i == 0:
# make sure system prompt is included at the start
str_message = f"{self.BOS} {self.B_INST} {system_message_str} "
else:
# end previous user-assistant interaction
string_messages[-1] += f" {self.EOS}"
# no need to include system prompt
str_message = f"{self.BOS} {self.B_INST} "
# include user message content
str_message += f"{user_message.content} {self.E_INST}"
if len(messages) > (i + 1):
# if assistant message exists, add to str_message
assistant_message = messages[i + 1]
assert assistant_message.role == MessageRole.ASSISTANT
str_message += f" {assistant_message.content}"
string_messages.append(str_message)
return "".join(string_messages)
def _completion_to_prompt(self, completion: str) -> str:
system_prompt_str = self.DEFAULT_SYSTEM_PROMPT
return (
f"{self.BOS} {self.B_INST} {self.B_SYS} {system_prompt_str.strip()} {self.E_SYS} "
f"{completion.strip()} {self.E_INST}"
)
class TagPromptStyle(AbstractPromptStyle):
"""Tag prompt style (used by Vigogne) that formats messages with `<|ROLE|>` tags.
It transforms the sequence of messages into a prompt that should look like:
```text
<|system|>: your system prompt here.
<|user|>: user message here
(possibly with context and question)
<|assistant|>: assistant (model) response here.
```
FIXME: should we add surrounding `<s>` and `</s>` tags, like in llama2?
"""
def _messages_to_prompt(self, messages: Sequence[ChatMessage]) -> str:
"""Format message to prompt with `<|ROLE|>: MSG` style."""
prompt = ""
for message in messages:
role = message.role
content = message.content or ""
message_from_user = f"<|{role.lower()}|>: {content.strip()}"
message_from_user += "\n"
prompt += message_from_user
# we are missing the last <|assistant|> tag that will trigger a completion
prompt += "<|assistant|>: "
return prompt
def _completion_to_prompt(self, completion: str) -> str:
return self._messages_to_prompt(
[ChatMessage(content=completion, role=MessageRole.USER)]
)
class MistralPromptStyle(AbstractPromptStyle):
def _messages_to_prompt(self, messages: Sequence[ChatMessage]) -> str:
prompt = "<s>"
for message in messages:
role = message.role
content = message.content or ""
if role.lower() == "system":
message_from_user = f"[INST] {content.strip()} [/INST]"
prompt += message_from_user
elif role.lower() == "user":
prompt += "</s>"
message_from_user = f"[INST] {content.strip()} [/INST]"
prompt += message_from_user
return prompt
def _completion_to_prompt(self, completion: str) -> str:
return self._messages_to_prompt(
[ChatMessage(content=completion, role=MessageRole.USER)]
)
class ChatMLPromptStyle(AbstractPromptStyle):
def _messages_to_prompt(self, messages: Sequence[ChatMessage]) -> str:
prompt = "<|im_start|>system\n"
for message in messages:
role = message.role
content = message.content or ""
if role.lower() == "system":
message_from_user = f"{content.strip()}"
prompt += message_from_user
elif role.lower() == "user":
prompt += "<|im_end|>\n<|im_start|>user\n"
message_from_user = f"{content.strip()}<|im_end|>\n"
prompt += message_from_user
prompt += "<|im_start|>assistant\n"
return prompt
def _completion_to_prompt(self, completion: str) -> str:
return self._messages_to_prompt(
[ChatMessage(content=completion, role=MessageRole.USER)]
)
def get_prompt_style(
prompt_style: Literal["default", "llama2", "tag", "mistral", "chatml"] | None
) -> AbstractPromptStyle:
"""Get the prompt style to use from the given string.
:param prompt_style: The prompt style to use.
:return: The prompt style to use.
"""
if prompt_style is None or prompt_style == "default":
return DefaultPromptStyle()
elif prompt_style == "llama2":
return Llama2PromptStyle()
elif prompt_style == "tag":
return TagPromptStyle()
elif prompt_style == "mistral":
return MistralPromptStyle()
elif prompt_style == "chatml":
return ChatMLPromptStyle()
raise ValueError(f"Unknown prompt_style='{prompt_style}'")
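For illustration, a short usage sketch of the styles defined above (output shown approximately; the extra whitespace comes from the B_SYS/E_SYS markers):

from llama_index.core.llms import ChatMessage, MessageRole

style = get_prompt_style("llama2")
prompt = style.messages_to_prompt(
    [
        ChatMessage(role=MessageRole.SYSTEM, content="Answer concisely."),
        ChatMessage(role=MessageRole.USER, content="What is PrivateGPT?"),
    ]
)
# roughly: "<s> [INST] <<SYS>>\n Answer concisely. \n<</SYS>>\n\n What is PrivateGPT? [/INST]"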

View File

@@ -1,11 +1,12 @@
import logging
from injector import inject, singleton
from llama_index.storage.docstore import BaseDocumentStore, SimpleDocumentStore
from llama_index.storage.index_store import SimpleIndexStore
from llama_index.storage.index_store.types import BaseIndexStore
from llama_index.core.storage.docstore import BaseDocumentStore, SimpleDocumentStore
from llama_index.core.storage.index_store import SimpleIndexStore
from llama_index.core.storage.index_store.types import BaseIndexStore
from private_gpt.paths import local_data_path
from private_gpt.settings.settings import Settings
logger = logging.getLogger(__name__)
@@ -16,19 +17,51 @@ class NodeStoreComponent:
doc_store: BaseDocumentStore
@inject
def __init__(self) -> None:
try:
self.index_store = SimpleIndexStore.from_persist_dir(
persist_dir=str(local_data_path)
)
except FileNotFoundError:
logger.debug("Local index store not found, creating a new one")
self.index_store = SimpleIndexStore()
def __init__(self, settings: Settings) -> None:
match settings.nodestore.database:
case "simple":
try:
self.index_store = SimpleIndexStore.from_persist_dir(
persist_dir=str(local_data_path)
)
except FileNotFoundError:
logger.debug("Local index store not found, creating a new one")
self.index_store = SimpleIndexStore()
try:
self.doc_store = SimpleDocumentStore.from_persist_dir(
persist_dir=str(local_data_path)
)
except FileNotFoundError:
logger.debug("Local document store not found, creating a new one")
self.doc_store = SimpleDocumentStore()
try:
self.doc_store = SimpleDocumentStore.from_persist_dir(
persist_dir=str(local_data_path)
)
except FileNotFoundError:
logger.debug("Local document store not found, creating a new one")
self.doc_store = SimpleDocumentStore()
case "postgres":
try:
from llama_index.core.storage.docstore.postgres_docstore import (
PostgresDocumentStore,
)
from llama_index.core.storage.index_store.postgres_index_store import (
PostgresIndexStore,
)
except ImportError:
raise ImportError(
"Postgres dependencies not found, install with `poetry install --extras storage-nodestore-postgres`"
) from None
if settings.postgres is None:
raise ValueError("Postgres index/doc store settings not found.")
self.index_store = PostgresIndexStore.from_params(
**settings.postgres.model_dump(exclude_none=True)
)
self.doc_store = PostgresDocumentStore.from_params(
**settings.postgres.model_dump(exclude_none=True)
)
case _:
# Should be unreachable
# The settings validator should have caught this
raise ValueError(
f"Database {settings.nodestore.database} not supported"
)

View File

@@ -1,12 +1,28 @@
from collections.abc import Generator
from typing import Any
from llama_index.schema import BaseNode, MetadataMode
from llama_index.vector_stores import ChromaVectorStore
from llama_index.vector_stores.chroma import chunk_list
from llama_index.vector_stores.utils import node_to_metadata_dict
from llama_index.core.schema import BaseNode, MetadataMode
from llama_index.core.vector_stores.utils import node_to_metadata_dict
from llama_index.vector_stores.chroma import ChromaVectorStore # type: ignore
class BatchedChromaVectorStore(ChromaVectorStore):
def chunk_list(
lst: list[BaseNode], max_chunk_size: int
) -> Generator[list[BaseNode], None, None]:
"""Yield successive max_chunk_size-sized chunks from lst.
Args:
lst (List[BaseNode]): list of nodes with embeddings
max_chunk_size (int): max chunk size
Yields:
Generator[List[BaseNode], None, None]: list of nodes with embeddings
"""
for i in range(0, len(lst), max_chunk_size):
yield lst[i : i + max_chunk_size]
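A quick usage note: chunk_list works on any list, yielding slices of at most `max_chunk_size` items.

batches = list(chunk_list(list(range(10)), max_chunk_size=4))  # ints stand in for BaseNode objects
assert batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]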
class BatchedChromaVectorStore(ChromaVectorStore): # type: ignore
"""Chroma vector store, batching additions to avoid reaching the max batch limit.
In this vector store, embeddings are stored within a ChromaDB collection.

View File

@@ -2,11 +2,14 @@ import logging
import typing
from injector import inject, singleton
from llama_index import VectorStoreIndex
from llama_index.indices.vector_store import VectorIndexRetriever
from llama_index.vector_stores.types import VectorStore
from llama_index.core.indices.vector_store import VectorIndexRetriever, VectorStoreIndex
from llama_index.core.vector_stores.types import (
FilterCondition,
MetadataFilter,
MetadataFilters,
VectorStore,
)
from private_gpt.components.vector_store.batched_chroma import BatchedChromaVectorStore
from private_gpt.open_ai.extensions.context_filter import ContextFilter
from private_gpt.paths import local_data_path
from private_gpt.settings.settings import Settings
@@ -14,43 +17,64 @@ from private_gpt.settings.settings import Settings
logger = logging.getLogger(__name__)
@typing.no_type_check
def _chromadb_doc_id_metadata_filter(
def _doc_id_metadata_filter(
context_filter: ContextFilter | None,
) -> dict | None:
if context_filter is None or context_filter.docs_ids is None:
return {} # No filter
elif len(context_filter.docs_ids) < 1:
return {"doc_id": "-"} # Effectively filtering out all docs
else:
doc_filter_items = []
if len(context_filter.docs_ids) > 1:
doc_filter = {"$or": doc_filter_items}
for doc_id in context_filter.docs_ids:
doc_filter_items.append({"doc_id": doc_id})
else:
doc_filter = {"doc_id": context_filter.docs_ids[0]}
return doc_filter
) -> MetadataFilters:
filters = MetadataFilters(filters=[], condition=FilterCondition.OR)
if context_filter is not None and context_filter.docs_ids is not None:
for doc_id in context_filter.docs_ids:
filters.filters.append(MetadataFilter(key="doc_id", value=doc_id))
return filters
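For clarity, the filter built for a hypothetical two-document context (assuming ContextFilter only requires `docs_ids`):

example = _doc_id_metadata_filter(ContextFilter(docs_ids=["doc-1", "doc-2"]))
# example == MetadataFilters(
#     filters=[
#         MetadataFilter(key="doc_id", value="doc-1"),
#         MetadataFilter(key="doc_id", value="doc-2"),
#     ],
#     condition=FilterCondition.OR,
# )
# A None context filter (or None docs_ids) yields MetadataFilters(filters=[]),
# i.e. no restriction on the retrieved chunks.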
@singleton
class VectorStoreComponent:
settings: Settings
vector_store: VectorStore
@inject
def __init__(self, settings: Settings) -> None:
self.settings = settings
match settings.vectorstore.database:
case "postgres":
try:
from llama_index.vector_stores.postgres import ( # type: ignore
PGVectorStore,
)
except ImportError as e:
raise ImportError(
"Postgres dependencies not found, install with `poetry install --extras vector-stores-postgres`"
) from e
if settings.postgres is None:
raise ValueError(
"Postgres settings not found. Please provide settings."
)
self.vector_store = typing.cast(
VectorStore,
PGVectorStore.from_params(
**settings.postgres.model_dump(exclude_none=True),
table_name="embeddings",
embed_dim=settings.embedding.embed_dim,
),
)
case "chroma":
try:
import chromadb # type: ignore
from chromadb.config import ( # type: ignore
Settings as ChromaSettings,
)
from private_gpt.components.vector_store.batched_chroma import (
BatchedChromaVectorStore,
)
except ImportError as e:
raise ImportError(
"'chromadb' is not installed."
"To use PrivateGPT with Chroma, install the 'chroma' extra."
"`poetry install --extras chroma`"
"ChromaDB dependencies not found, install with `poetry install --extras vector-stores-chroma`"
) from e
chroma_settings = ChromaSettings(anonymized_telemetry=False)
@@ -70,8 +94,15 @@ class VectorStoreComponent:
)
case "qdrant":
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
try:
from llama_index.vector_stores.qdrant import ( # type: ignore
QdrantVectorStore,
)
from qdrant_client import QdrantClient # type: ignore
except ImportError as e:
raise ImportError(
"Qdrant dependencies not found, install with `poetry install --extras vector-stores-qdrant`"
) from e
if settings.qdrant is None:
logger.info(
@@ -97,20 +128,22 @@ class VectorStoreComponent:
f"Vectorstore database {settings.vectorstore.database} not supported"
)
@staticmethod
def get_retriever(
self,
index: VectorStoreIndex,
context_filter: ContextFilter | None = None,
similarity_top_k: int = 2,
) -> VectorIndexRetriever:
# This way we support qdrant (using doc_ids) and chroma (using where clause)
# This way we support qdrant (using doc_ids) and the rest (using filters)
return VectorIndexRetriever(
index=index,
similarity_top_k=similarity_top_k,
doc_ids=context_filter.docs_ids if context_filter else None,
vector_store_kwargs={
"where": _chromadb_doc_id_metadata_filter(context_filter)
},
filters=(
_doc_id_metadata_filter(context_filter)
if self.settings.vectorstore.database != "qdrant"
else None
),
)
def close(self) -> None:

View File

@@ -1,13 +1,14 @@
"""FastAPI app creation, logger configuration and main API routes."""
import logging
from typing import Any
from fastapi import Depends, FastAPI, Request
from fastapi.middleware.cors import CORSMiddleware
from fastapi.openapi.utils import get_openapi
from injector import Injector
from llama_index.core.callbacks import CallbackManager
from llama_index.core.callbacks.global_handlers import create_global_handler
from llama_index.core.settings import Settings as LlamaIndexSettings
from private_gpt.paths import docs_path
from private_gpt.server.chat.chat_router import chat_router
from private_gpt.server.chunks.chunks_router import chunks_router
from private_gpt.server.completions.completions_router import completions_router
@@ -22,107 +23,44 @@ logger = logging.getLogger(__name__)
def create_app(root_injector: Injector) -> FastAPI:
# Start the API
with open(docs_path / "description.md") as description_file:
description = description_file.read()
async def bind_injector_to_request(request: Request) -> None:
request.state.injector = root_injector
tags_metadata = [
{
"name": "Ingestion",
"description": "High-level APIs covering document ingestion (internally "
"managing document parsing, splitting, "
"metadata extraction, embedding generation and storage) and ingested "
"documents CRUD. "
"Each ingested document is identified by an ID that can be used to filter the "
"context "
"used in *Contextual Completions* and *Context Chunks* APIs.",
},
{
"name": "Contextual Completions",
"description": "High-level APIs covering contextual Chat and Completions. They "
"follow OpenAI's format, extending it to "
"allow using the context coming from ingested documents to create the "
"response. Internally they "
"manage context retrieval, prompt engineering and the response generation.",
},
{
"name": "Context Chunks",
"description": "Low-level API that, given a query, returns relevant chunks of "
"text coming from the ingested "
"documents.",
},
{
"name": "Embeddings",
"description": "Low-level API to obtain the vector representation of a given "
"text, using an Embeddings model. "
"Follows OpenAI's embeddings API format.",
},
{
"name": "Health",
"description": "Simple health API to make sure the server is up and running.",
},
]
app = FastAPI(dependencies=[Depends(bind_injector_to_request)])
async def bind_injector_to_request(request: Request) -> None:
request.state.injector = root_injector
app.include_router(completions_router)
app.include_router(chat_router)
app.include_router(chunks_router)
app.include_router(ingest_router)
app.include_router(embeddings_router)
app.include_router(health_router)
app = FastAPI(dependencies=[Depends(bind_injector_to_request)])
# Add LlamaIndex simple observability
global_handler = create_global_handler("simple")
LlamaIndexSettings.callback_manager = CallbackManager([global_handler])
def custom_openapi() -> dict[str, Any]:
if app.openapi_schema:
return app.openapi_schema
openapi_schema = get_openapi(
title="PrivateGPT",
description=description,
version="0.1.0",
summary="PrivateGPT is a production-ready AI project that allows you to "
"ask questions to your documents using the power of Large Language "
"Models (LLMs), even in scenarios without Internet connection. "
"100% private, no data leaves your execution environment at any point.",
contact={
"url": "https://github.com/imartinez/privateGPT",
},
license_info={
"name": "Apache 2.0",
"url": "https://www.apache.org/licenses/LICENSE-2.0.html",
},
routes=app.routes,
tags=tags_metadata,
)
openapi_schema["info"]["x-logo"] = {
"url": "https://lh3.googleusercontent.com/drive-viewer"
"/AK7aPaD_iNlMoTquOBsw4boh4tIYxyEuhz6EtEs8nzq3yNkNAK00xGj"
"E1KUCmPJSk3TYOjcs6tReG6w_cLu1S7L_gPgT9z52iw=s2560"
}
settings = root_injector.get(Settings)
if settings.server.cors.enabled:
logger.debug("Setting up CORS middleware")
app.add_middleware(
CORSMiddleware,
allow_credentials=settings.server.cors.allow_credentials,
allow_origins=settings.server.cors.allow_origins,
allow_origin_regex=settings.server.cors.allow_origin_regex,
allow_methods=settings.server.cors.allow_methods,
allow_headers=settings.server.cors.allow_headers,
)
app.openapi_schema = openapi_schema
return app.openapi_schema
app.openapi = custom_openapi # type: ignore[method-assign]
app.include_router(completions_router)
app.include_router(chat_router)
app.include_router(chunks_router)
app.include_router(ingest_router)
app.include_router(embeddings_router)
app.include_router(health_router)
settings = root_injector.get(Settings)
if settings.server.cors.enabled:
logger.debug("Setting up CORS middleware")
app.add_middleware(
CORSMiddleware,
allow_credentials=settings.server.cors.allow_credentials,
allow_origins=settings.server.cors.allow_origins,
allow_origin_regex=settings.server.cors.allow_origin_regex,
allow_methods=settings.server.cors.allow_methods,
allow_headers=settings.server.cors.allow_headers,
)
if settings.ui.enabled:
logger.debug("Importing the UI module")
if settings.ui.enabled:
logger.debug("Importing the UI module")
try:
from private_gpt.ui.ui import PrivateGptUi
except ImportError as e:
raise ImportError(
"UI dependencies not found, install with `poetry install --extras ui`"
) from e
ui = root_injector.get(PrivateGptUi)
ui.mount_in_app(app, settings.ui.path)
ui = root_injector.get(PrivateGptUi)
ui.mount_in_app(app, settings.ui.path)
return app
return app

View File

@@ -1,11 +1,6 @@
"""FastAPI app creation, logger configuration and main API routes."""
import llama_index
from private_gpt.di import global_injector
from private_gpt.launcher import create_app
# Add LlamaIndex simple observability
llama_index.set_global_handler("simple")
app = create_app(global_injector)

View File

@@ -3,7 +3,7 @@ import uuid
from collections.abc import Iterator
from typing import Literal
from llama_index.llms import ChatResponse, CompletionResponse
from llama_index.core.llms import ChatResponse, CompletionResponse
from pydantic import BaseModel, Field
from private_gpt.server.chunks.chunks_service import Chunk
@@ -118,5 +118,5 @@ def to_openai_sse_stream(
yield f"data: {OpenAICompletion.json_from_delta(text=response.delta)}\n\n"
else:
yield f"data: {OpenAICompletion.json_from_delta(text=response, sources=sources)}\n\n"
yield f"data: {OpenAICompletion.json_from_delta(text=None, finish_reason='stop')}\n\n"
yield f"data: {OpenAICompletion.json_from_delta(text='', finish_reason='stop')}\n\n"
yield "data: [DONE]\n\n"

View File

@@ -1,5 +1,5 @@
from fastapi import APIRouter, Depends, Request
from llama_index.llms import ChatMessage, MessageRole
from llama_index.core.llms import ChatMessage, MessageRole
from pydantic import BaseModel
from starlette.responses import StreamingResponse
@@ -54,6 +54,13 @@ class ChatBody(BaseModel):
response_model=None,
responses={200: {"model": OpenAICompletion}},
tags=["Contextual Completions"],
openapi_extra={
"x-fern-streaming": {
"stream-condition": "stream",
"response": {"$ref": "#/components/schemas/OpenAICompletion"},
"response-stream": {"$ref": "#/components/schemas/OpenAICompletion"},
}
},
)
def chat_completion(
request: Request, body: ChatBody

View File

@@ -1,14 +1,19 @@
from dataclasses import dataclass
from injector import inject, singleton
from llama_index import ServiceContext, StorageContext, VectorStoreIndex
from llama_index.chat_engine import ContextChatEngine, SimpleChatEngine
from llama_index.chat_engine.types import (
from llama_index.core.chat_engine import ContextChatEngine, SimpleChatEngine
from llama_index.core.chat_engine.types import (
BaseChatEngine,
)
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor
from llama_index.llms import ChatMessage, MessageRole
from llama_index.types import TokenGen
from llama_index.core.indices import VectorStoreIndex
from llama_index.core.indices.postprocessor import MetadataReplacementPostProcessor
from llama_index.core.llms import ChatMessage, MessageRole
from llama_index.core.postprocessor import (
SentenceTransformerRerank,
SimilarityPostprocessor,
)
from llama_index.core.storage import StorageContext
from llama_index.core.types import TokenGen
from pydantic import BaseModel
from private_gpt.components.embedding.embedding_component import EmbeddingComponent
@@ -19,6 +24,7 @@ from private_gpt.components.vector_store.vector_store_component import (
)
from private_gpt.open_ai.extensions.context_filter import ContextFilter
from private_gpt.server.chunks.chunks_service import Chunk
from private_gpt.settings.settings import Settings
class Completion(BaseModel):
@@ -67,28 +73,31 @@ class ChatEngineInput:
@singleton
class ChatService:
settings: Settings
@inject
def __init__(
self,
settings: Settings,
llm_component: LLMComponent,
vector_store_component: VectorStoreComponent,
embedding_component: EmbeddingComponent,
node_store_component: NodeStoreComponent,
) -> None:
self.llm_service = llm_component
self.settings = settings
self.llm_component = llm_component
self.embedding_component = embedding_component
self.vector_store_component = vector_store_component
self.storage_context = StorageContext.from_defaults(
vector_store=vector_store_component.vector_store,
docstore=node_store_component.doc_store,
index_store=node_store_component.index_store,
)
self.service_context = ServiceContext.from_defaults(
llm=llm_component.llm, embed_model=embedding_component.embedding_model
)
self.index = VectorStoreIndex.from_vector_store(
vector_store_component.vector_store,
storage_context=self.storage_context,
service_context=self.service_context,
llm=llm_component.llm,
embed_model=embedding_component.embedding_model,
show_progress=True,
)
@@ -98,22 +107,36 @@ class ChatService:
use_context: bool = False,
context_filter: ContextFilter | None = None,
) -> BaseChatEngine:
settings = self.settings
if use_context:
vector_index_retriever = self.vector_store_component.get_retriever(
index=self.index, context_filter=context_filter
index=self.index,
context_filter=context_filter,
similarity_top_k=self.settings.rag.similarity_top_k,
)
node_postprocessors = [
MetadataReplacementPostProcessor(target_metadata_key="window"),
SimilarityPostprocessor(
similarity_cutoff=settings.rag.similarity_value
),
]
if settings.rag.rerank.enabled:
rerank_postprocessor = SentenceTransformerRerank(
model=settings.rag.rerank.model, top_n=settings.rag.rerank.top_n
)
node_postprocessors.append(rerank_postprocessor)
return ContextChatEngine.from_defaults(
system_prompt=system_prompt,
retriever=vector_index_retriever,
service_context=self.service_context,
node_postprocessors=[
MetadataReplacementPostProcessor(target_metadata_key="window"),
],
llm=self.llm_component.llm,  # Has no effect at the moment
node_postprocessors=node_postprocessors,
)
else:
return SimpleChatEngine.from_defaults(
system_prompt=system_prompt,
service_context=self.service_context,
llm=self.llm_component.llm,
)
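To make the new retrieval knobs concrete, here is a standalone sketch (not part of the changeset) of the postprocessor chain _chat_engine now assembles: a similarity cutoff followed by the optional SentenceTransformer cross-encoder reranker. The sample nodes and query are invented for illustration, the model name and top_n mirror the RerankSettings defaults shown further below, and the reranker needs the optional torch / sentence-transformers extras installed.

from llama_index.core.postprocessor import (
    SentenceTransformerRerank,
    SimilarityPostprocessor,
)
from llama_index.core.schema import NodeWithScore, QueryBundle, TextNode

nodes = [
    NodeWithScore(node=TextNode(text="PrivateGPT answers questions about ingested files."), score=0.82),
    NodeWithScore(node=TextNode(text="Completely unrelated boilerplate."), score=0.31),
]

# Step 1: drop nodes below the configured cutoff (rag.similarity_value).
nodes = SimilarityPostprocessor(similarity_cutoff=0.45).postprocess_nodes(nodes)

# Step 2: re-rank the survivors with the cross-encoder configured in rag.rerank.
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2", top_n=2
)
nodes = reranker.postprocess_nodes(
    nodes, query_bundle=QueryBundle("What does PrivateGPT do?")
)
for node_with_score in nodes:
    print(round(node_with_score.score or 0.0, 3), node_with_score.node.get_content())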
def stream_chat(

View File

@@ -1,8 +1,9 @@
from typing import TYPE_CHECKING, Literal
from injector import inject, singleton
from llama_index import ServiceContext, StorageContext, VectorStoreIndex
from llama_index.schema import NodeWithScore
from llama_index.core.indices import VectorStoreIndex
from llama_index.core.schema import NodeWithScore
from llama_index.core.storage import StorageContext
from pydantic import BaseModel, Field
from private_gpt.components.embedding.embedding_component import EmbeddingComponent
@@ -15,7 +16,7 @@ from private_gpt.open_ai.extensions.context_filter import ContextFilter
from private_gpt.server.ingest.model import IngestedDoc
if TYPE_CHECKING:
from llama_index.schema import RelatedNodeInfo
from llama_index.core.schema import RelatedNodeInfo
class Chunk(BaseModel):
@@ -63,14 +64,13 @@ class ChunksService:
node_store_component: NodeStoreComponent,
) -> None:
self.vector_store_component = vector_store_component
self.llm_component = llm_component
self.embedding_component = embedding_component
self.storage_context = StorageContext.from_defaults(
vector_store=vector_store_component.vector_store,
docstore=node_store_component.doc_store,
index_store=node_store_component.index_store,
)
self.query_service_context = ServiceContext.from_defaults(
llm=llm_component.llm, embed_model=embedding_component.embedding_model
)
def _get_sibling_nodes_text(
self, node_with_score: NodeWithScore, related_number: int, forward: bool = True
@@ -103,7 +103,8 @@ class ChunksService:
index = VectorStoreIndex.from_vector_store(
self.vector_store_component.vector_store,
storage_context=self.storage_context,
service_context=self.query_service_context,
llm=self.llm_component.llm,
embed_model=self.embedding_component.embedding_model,
show_progress=True,
)
vector_index_retriever = self.vector_store_component.get_retriever(

View File

@@ -42,6 +42,13 @@ class CompletionsBody(BaseModel):
summary="Completion",
responses={200: {"model": OpenAICompletion}},
tags=["Contextual Completions"],
openapi_extra={
"x-fern-streaming": {
"stream-condition": "stream",
"response": {"$ref": "#/components/schemas/OpenAICompletion"},
"response-stream": {"$ref": "#/components/schemas/OpenAICompletion"},
}
},
)
def prompt_completion(
request: Request, body: CompletionsBody
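A companion sketch for the plain completions route above, under the same assumptions as the chat example earlier (local server on port 8001; the CompletionsBody field names are taken from this changeset). The response follows the OpenAICompletion schema referenced in the decorator.

import requests

resp = requests.post(
    "http://localhost:8001/v1/completions",
    json={
        "prompt": "Summarize the ingested documents in one paragraph.",
        "use_context": True,
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json())  # an OpenAICompletion-shaped payload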

View File

@@ -1,7 +1,7 @@
from typing import Literal
from fastapi import APIRouter, Depends, HTTPException, Request, UploadFile
from pydantic import BaseModel
from pydantic import BaseModel, Field
from private_gpt.server.ingest.ingest_service import IngestService
from private_gpt.server.ingest.model import IngestedDoc
@@ -10,14 +10,35 @@ from private_gpt.server.utils.auth import authenticated
ingest_router = APIRouter(prefix="/v1", dependencies=[Depends(authenticated)])
class IngestTextBody(BaseModel):
file_name: str = Field(examples=["Avatar: The Last Airbender"])
text: str = Field(
examples=[
"Avatar is set in an Asian and Arctic-inspired world in which some "
"people can telekinetically manipulate one of the four elements—water, "
"earth, fire or air—through practices known as 'bending', inspired by "
"Chinese martial arts."
]
)
class IngestResponse(BaseModel):
object: Literal["list"]
model: Literal["private-gpt"]
data: list[IngestedDoc]
@ingest_router.post("/ingest", tags=["Ingestion"])
@ingest_router.post("/ingest", tags=["Ingestion"], deprecated=True)
def ingest(request: Request, file: UploadFile) -> IngestResponse:
"""Ingests and processes a file.
Deprecated. Use ingest/file instead.
"""
return ingest_file(request, file)
@ingest_router.post("/ingest/file", tags=["Ingestion"])
def ingest_file(request: Request, file: UploadFile) -> IngestResponse:
"""Ingests and processes a file, storing its chunks to be used as context.
The context obtained from files is later used in
@@ -40,6 +61,26 @@ def ingest(request: Request, file: UploadFile) -> IngestResponse:
return IngestResponse(object="list", model="private-gpt", data=ingested_documents)
@ingest_router.post("/ingest/text", tags=["Ingestion"])
def ingest_text(request: Request, body: IngestTextBody) -> IngestResponse:
"""Ingests and processes a text, storing its chunks to be used as context.
The context obtained from files is later used in
`/chat/completions`, `/completions`, and `/chunks` APIs.
A Document will be generated with the given text. The Document
ID is returned in the response, together with the
extracted Metadata (which is later used to improve context retrieval). That ID
can be used to filter the context used to create responses in
`/chat/completions`, `/completions`, and `/chunks` APIs.
"""
service = request.state.injector.get(IngestService)
if len(body.file_name) == 0:
raise HTTPException(400, "No file name provided")
ingested_documents = service.ingest_text(body.file_name, body.text)
return IngestResponse(object="list", model="private-gpt", data=ingested_documents)
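A hedged usage sketch for the new text-ingestion route (assumptions: a local server on port 8001; the request fields come from IngestTextBody above, and doc_id / doc_metadata from the IngestedDoc model used elsewhere in this changeset):

import requests

resp = requests.post(
    "http://localhost:8001/v1/ingest/text",
    json={
        "file_name": "Avatar: The Last Airbender",
        "text": "Avatar is set in an Asian and Arctic-inspired world ...",
    },
    timeout=60,
)
resp.raise_for_status()
for doc in resp.json()["data"]:
    print(doc["doc_id"], doc.get("doc_metadata"))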
@ingest_router.get("/ingest/list", tags=["Ingestion"])
def list_ingested(request: Request) -> IngestResponse:
"""Lists already ingested Documents including their Document ID and metadata.

View File

@@ -1,14 +1,11 @@
import logging
import tempfile
from pathlib import Path
from typing import BinaryIO
from typing import TYPE_CHECKING, AnyStr, BinaryIO
from injector import inject, singleton
from llama_index import (
ServiceContext,
StorageContext,
)
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.storage import StorageContext
from private_gpt.components.embedding.embedding_component import EmbeddingComponent
from private_gpt.components.ingest.ingest_component import get_ingestion_component
@@ -20,6 +17,9 @@ from private_gpt.components.vector_store.vector_store_component import (
from private_gpt.server.ingest.model import IngestedDoc
from private_gpt.settings.settings import settings
if TYPE_CHECKING:
from llama_index.core.storage.docstore.types import RefDocInfo
logger = logging.getLogger(__name__)
@@ -40,29 +40,15 @@ class IngestService:
index_store=node_store_component.index_store,
)
node_parser = SentenceWindowNodeParser.from_defaults()
self.ingest_service_context = ServiceContext.from_defaults(
llm=self.llm_service.llm,
embed_model=embedding_component.embedding_model,
node_parser=node_parser,
# Embeddings done early in the pipeline of node transformations, right
# after the node parsing
transformations=[node_parser, embedding_component.embedding_model],
)
self.ingest_component = get_ingestion_component(
self.storage_context, self.ingest_service_context, settings=settings()
self.storage_context,
embed_model=embedding_component.embedding_model,
transformations=[node_parser, embedding_component.embedding_model],
settings=settings(),
)
def ingest(self, file_name: str, file_data: Path) -> list[IngestedDoc]:
logger.info("Ingesting file_name=%s", file_name)
documents = self.ingest_component.ingest(file_name, file_data)
return [IngestedDoc.from_document(document) for document in documents]
def ingest_bin_data(
self, file_name: str, raw_file_data: BinaryIO
) -> list[IngestedDoc]:
logger.debug("Ingesting binary data with file_name=%s", file_name)
file_data = raw_file_data.read()
def _ingest_data(self, file_name: str, file_data: AnyStr) -> list[IngestedDoc]:
logger.debug("Got file data of size=%s to ingest", len(file_data))
# llama-index mainly supports reading from files, so
# we have to create a tmp file to read for it to work
@@ -74,28 +60,44 @@ class IngestService:
path_to_tmp.write_bytes(file_data)
else:
path_to_tmp.write_text(str(file_data))
return self.ingest(file_name, path_to_tmp)
return self.ingest_file(file_name, path_to_tmp)
finally:
tmp.close()
path_to_tmp.unlink()
def ingest_file(self, file_name: str, file_data: Path) -> list[IngestedDoc]:
logger.info("Ingesting file_name=%s", file_name)
documents = self.ingest_component.ingest(file_name, file_data)
logger.info("Finished ingestion file_name=%s", file_name)
return [IngestedDoc.from_document(document) for document in documents]
def ingest_text(self, file_name: str, text: str) -> list[IngestedDoc]:
logger.debug("Ingesting text data with file_name=%s", file_name)
return self._ingest_data(file_name, text)
def ingest_bin_data(
self, file_name: str, raw_file_data: BinaryIO
) -> list[IngestedDoc]:
logger.debug("Ingesting binary data with file_name=%s", file_name)
file_data = raw_file_data.read()
return self._ingest_data(file_name, file_data)
def bulk_ingest(self, files: list[tuple[str, Path]]) -> list[IngestedDoc]:
logger.info("Ingesting file_names=%s", [f[0] for f in files])
documents = self.ingest_component.bulk_ingest(files)
logger.info("Finished ingestion file_name=%s", [f[0] for f in files])
return [IngestedDoc.from_document(document) for document in documents]
def list_ingested(self) -> list[IngestedDoc]:
ingested_docs = []
ingested_docs: list[IngestedDoc] = []
try:
docstore = self.storage_context.docstore
ingested_docs_ids: set[str] = set()
ref_docs: dict[str, RefDocInfo] | None = docstore.get_all_ref_doc_info()
for node in docstore.docs.values():
if node.ref_doc_id is not None:
ingested_docs_ids.add(node.ref_doc_id)
if not ref_docs:
return ingested_docs
for doc_id in ingested_docs_ids:
ref_doc_info = docstore.get_ref_doc_info(ref_doc_id=doc_id)
for doc_id, ref_doc_info in ref_docs.items():
doc_metadata = None
if ref_doc_info is not None and ref_doc_info.metadata is not None:
doc_metadata = IngestedDoc.curate_metadata(ref_doc_info.metadata)
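For orientation, a small sketch (not in the diff) of driving the refactored service directly through the dependency injector; the file names and paths are hypothetical and a configured profile is assumed.

from pathlib import Path

from private_gpt.di import global_injector
from private_gpt.server.ingest.ingest_service import IngestService

service = global_injector.get(IngestService)
# New text entry point introduced in this changeset
docs = service.ingest_text("notes.txt", "PrivateGPT keeps your data local.")
# Renamed file entry point (formerly `ingest`)
docs += service.ingest_file("report.pdf", Path("docs/report.pdf"))  # hypothetical path
for doc in docs:
    print(doc.doc_id, doc.doc_metadata)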

View File

@@ -3,10 +3,9 @@ from pathlib import Path
from typing import Any
from watchdog.events import (
DirCreatedEvent,
DirModifiedEvent,
FileCreatedEvent,
FileModifiedEvent,
FileSystemEvent,
FileSystemEventHandler,
)
from watchdog.observers import Observer
@@ -20,11 +19,11 @@ class IngestWatcher:
self.on_file_changed = on_file_changed
class Handler(FileSystemEventHandler):
def on_modified(self, event: DirModifiedEvent | FileModifiedEvent) -> None:
def on_modified(self, event: FileSystemEvent) -> None:
if isinstance(event, FileModifiedEvent):
on_file_changed(Path(event.src_path))
def on_created(self, event: DirCreatedEvent | FileCreatedEvent) -> None:
def on_created(self, event: FileSystemEvent) -> None:
if isinstance(event, FileCreatedEvent):
on_file_changed(Path(event.src_path))

View File

@@ -1,6 +1,6 @@
from typing import Any, Literal
from llama_index import Document
from llama_index.core.schema import Document
from pydantic import BaseModel, Field

View File

@@ -12,6 +12,7 @@ Authorization can be done by following fastapi's guides:
* https://fastapi.tiangolo.com/tutorial/security/
* https://fastapi.tiangolo.com/tutorial/dependencies/dependencies-in-path-operation-decorators/
"""
# mypy: ignore-errors
# Disabled mypy error: All conditional function variants must have identical signatures
# We are changing the implementation of the authenticated method, based on

View File

@@ -81,80 +81,88 @@ class DataSettings(BaseModel):
class LLMSettings(BaseModel):
mode: Literal["local", "openai", "sagemaker", "mock"]
mode: Literal[
"llamacpp", "openai", "openailike", "azopenai", "sagemaker", "mock", "ollama"
]
max_new_tokens: int = Field(
256,
description="The maximum number of token that the LLM is authorized to generate in one completion.",
)
context_window: int = Field(
3900,
description="The maximum number of context tokens for the model.",
)
tokenizer: str = Field(
None,
description="The model id of a predefined tokenizer hosted inside a model repo on "
"huggingface.co. Valid model ids can be located at the root-level, like "
"`bert-base-uncased`, or namespaced under a user or organization name, "
"like `HuggingFaceH4/zephyr-7b-beta`. If not set, will load a tokenizer matching "
"gpt-3.5-turbo LLM.",
)
temperature: float = Field(
0.1,
description="The temperature of the model. Increasing the temperature will make the model answer more creatively. A value of 0.1 would be more factual.",
)
class VectorstoreSettings(BaseModel):
database: Literal["chroma", "qdrant"]
database: Literal["chroma", "qdrant", "postgres"]
class LocalSettings(BaseModel):
class NodeStoreSettings(BaseModel):
database: Literal["simple", "postgres"]
class LlamaCPPSettings(BaseModel):
llm_hf_repo_id: str
llm_hf_model_file: str
embedding_hf_model_name: str = Field(
description="Name of the HuggingFace model to use for embeddings"
)
prompt_style: Literal[
"llama_cpp.llama-2",
"llama_cpp.alpaca",
"llama_cpp.vicuna",
"llama_cpp.oasst_llama",
"llama_cpp.baichuan-2",
"llama_cpp.baichuan",
"llama_cpp.openbuddy",
"llama_cpp.redpajama-incite",
"llama_cpp.snoozy",
"llama_cpp.phind",
"llama_cpp.intel",
"llama_cpp.open-orca",
"llama_cpp.mistrallite",
"llama_cpp.zephyr",
"llama_cpp.chatml",
"llama_cpp.openchat",
prompt_style: Literal["default", "llama2", "tag", "mistral", "chatml"] = Field(
"llama2",
"vigogne",
"template",
] | None = Field(
None,
description=(
"The prompt style to use for the chat engine. "
"If None is given - use the default prompt style from the llama_index. It should look like `role: message`.\n"
"If `default` - use the default prompt style from the llama_index. It should look like `role: message`.\n"
"If `llama2` - use the llama2 prompt style from the llama_index. Based on `<s>`, `[INST]` and `<<SYS>>`.\n"
"If `llama_cpp.<name>` - use the `<name>` prompt style, implemented by `llama-cpp-python`. \n"
"If `tag` - use the `tag` prompt style. It should look like `<|role|>: message`. \n"
"If `mistral` - use the `mistral prompt style. It shoudl look like <s>[INST] {System Prompt} [/INST]</s>[INST] { UserInstructions } [/INST]"
"`llama2` is the historic behaviour. `default` might work better with your custom models."
),
)
default_system_prompt: str | None = Field(
None,
description=(
"The default system prompt to use for the chat engine. "
"If none is given - use the default system prompt (from the llama_index). "
"Please note that the default prompt might not be the same for all prompt styles. "
"Also note that this is only used if the first message is not a system message. "
),
tfs_z: float = Field(
1.0,
description="Tail free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting.",
)
top_k: int = Field(
40,
description="Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40)",
)
top_p: float = Field(
0.9,
description="Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9)",
)
repeat_penalty: float = Field(
1.1,
description="Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1)",
)
template_name: str | None = Field(
None,
description=(
"The name of the template to use for the chat engine, if the `prompt_style` is `template`."
),
class HuggingFaceSettings(BaseModel):
embedding_hf_model_name: str = Field(
description="Name of the HuggingFace model to use for embeddings"
)
class EmbeddingSettings(BaseModel):
mode: Literal["local", "openai", "sagemaker", "mock"]
ingest_mode: Literal["simple", "batch", "parallel"] = Field(
mode: Literal["huggingface", "openai", "azopenai", "sagemaker", "ollama", "mock"]
ingest_mode: Literal["simple", "batch", "parallel", "pipeline"] = Field(
"simple",
description=(
"The ingest mode to use for the embedding engine:\n"
"If `simple` - ingest files sequentially and one by one. It is the historic behaviour.\n"
"If `batch` - if multiple files, parse all the files in parallel, "
"and send them in batch to the embedding model.\n"
"In `pipeline` - The Embedding engine is kept as busy as possible\n"
"If `parallel` - parse the files in parallel using multiple cores, and embedd them in parallel.\n"
"`parallel` is the fastest mode for local setup, as it parallelize IO RW in the index.\n"
"For modes that leverage parallelization, you can specify the number of "
@@ -167,11 +175,16 @@ class EmbeddingSettings(BaseModel):
"The number of workers to use for file ingestion.\n"
"In `batch` mode, this is the number of workers used to parse the files.\n"
"In `parallel` mode, this is the number of workers used to parse the files and embed them.\n"
"In `pipeline` mode, this is the number of workers that can perform embeddings.\n"
"This is only used if `ingest_mode` is not `simple`.\n"
"Do not go too high with this number, as it might cause memory issues. (especially in `parallel` mode)\n"
"Do not set it higher than your number of threads of your CPU."
),
)
embed_dim: int = Field(
384,
description="The dimension of the embeddings stored in the Postgres database",
)
class SagemakerSettings(BaseModel):
@@ -180,12 +193,157 @@ class SagemakerSettings(BaseModel):
class OpenAISettings(BaseModel):
api_base: str = Field(
None,
description="Base URL of OpenAI API. Example: 'https://api.openai.com/v1'.",
)
api_key: str
model: str = Field(
"gpt-3.5-turbo",
description="OpenAI Model to use. Example: 'gpt-4'.",
)
class OllamaSettings(BaseModel):
api_base: str = Field(
"http://localhost:11434",
description="Base URL of Ollama API. Example: 'https://localhost:11434'.",
)
embedding_api_base: str = Field(
"http://localhost:11434",
description="Base URL of Ollama embedding API. Example: 'https://localhost:11434'.",
)
llm_model: str = Field(
None,
description="Model to use. Example: 'llama2-uncensored'.",
)
embedding_model: str = Field(
None,
description="Model to use. Example: 'nomic-embed-text'.",
)
keep_alive: str = Field(
"5m",
description="Time the model will stay loaded in memory after a request. examples: 5m, 5h, '-1' ",
)
tfs_z: float = Field(
1.0,
description="Tail free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting.",
)
num_predict: int = Field(
None,
description="Maximum number of tokens to predict when generating text. (Default: 128, -1 = infinite generation, -2 = fill context)",
)
top_k: int = Field(
40,
description="Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40)",
)
top_p: float = Field(
0.9,
description="Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9)",
)
repeat_last_n: int = Field(
64,
description="Sets how far back for the model to look back to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx)",
)
repeat_penalty: float = Field(
1.1,
description="Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1)",
)
request_timeout: float = Field(
120.0,
description="Time elapsed until ollama times out the request. Default is 120s. Format is float. ",
)
class AzureOpenAISettings(BaseModel):
api_key: str
azure_endpoint: str
api_version: str = Field(
"2023_05_15",
description="The API version to use for this operation. This follows the YYYY-MM-DD format.",
)
embedding_deployment_name: str
embedding_model: str = Field(
"text-embedding-ada-002",
description="OpenAI Model to use. Example: 'text-embedding-ada-002'.",
)
llm_deployment_name: str
llm_model: str = Field(
"gpt-35-turbo",
description="OpenAI Model to use. Example: 'gpt-4'.",
)
class UISettings(BaseModel):
enabled: bool
path: str
default_chat_system_prompt: str = Field(
None,
description="The default system prompt to use for the chat mode.",
)
default_query_system_prompt: str = Field(
None, description="The default system prompt to use for the query mode."
)
delete_file_button_enabled: bool = Field(
True, description="Whether the button to delete a file is enabled."
)
delete_all_files_button_enabled: bool = Field(
False, description="Whether the button to delete all files is enabled."
)
class RerankSettings(BaseModel):
enabled: bool = Field(
False,
description="This value controls whether a reranker should be included in the RAG pipeline.",
)
model: str = Field(
"cross-encoder/ms-marco-MiniLM-L-2-v2",
description="Rerank model to use. Limited to SentenceTransformer cross-encoder models.",
)
top_n: int = Field(
2,
description="This value controls the number of documents returned by the RAG pipeline.",
)
class RagSettings(BaseModel):
similarity_top_k: int = Field(
2,
description="This value controls the number of documents returned by the RAG pipeline or considered for reranking if enabled.",
)
similarity_value: float = Field(
None,
description="If set, any documents retrieved from the RAG must meet a certain match score. Acceptable values are between 0 and 1.",
)
rerank: RerankSettings
class PostgresSettings(BaseModel):
host: str = Field(
"localhost",
description="The server hosting the Postgres database",
)
port: int = Field(
5432,
description="The port on which the Postgres database is accessible",
)
user: str = Field(
"postgres",
description="The user to use to connect to the Postgres database",
)
password: str = Field(
"postgres",
description="The password to use to connect to the Postgres database",
)
database: str = Field(
"postgres",
description="The database to use to connect to the Postgres database",
)
schema_name: str = Field(
"public",
description="The name of the schema in the Postgres database to use",
)
class QdrantSettings(BaseModel):
@@ -248,11 +406,17 @@ class Settings(BaseModel):
ui: UISettings
llm: LLMSettings
embedding: EmbeddingSettings
local: LocalSettings
llamacpp: LlamaCPPSettings
huggingface: HuggingFaceSettings
sagemaker: SagemakerSettings
openai: OpenAISettings
ollama: OllamaSettings
azopenai: AzureOpenAISettings
vectorstore: VectorstoreSettings
nodestore: NodeStoreSettings
rag: RagSettings
qdrant: QdrantSettings | None = None
postgres: PostgresSettings | None = None
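A quick sketch of how the new sections are reached from application code (the field names are the ones declared above; the values depend on the active profile):

from private_gpt.settings.settings import settings

cfg = settings()
print(cfg.llm.mode, cfg.embedding.mode)
print(cfg.rag.similarity_top_k, cfg.rag.rerank.enabled, cfg.rag.rerank.model)
print(cfg.nodestore.database, cfg.vectorstore.database)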
"""

View File

@@ -16,7 +16,7 @@ logger = logging.getLogger(__name__)
_settings_folder = os.environ.get("PGPT_SETTINGS_FOLDER", PROJECT_ROOT_PATH)
# if running in unittest, use the test profile
_test_profile = ["test"] if "unittest" in sys.modules else []
_test_profile = ["test"] if "tests.fixtures" in sys.modules else []
active_profiles: list[str] = unique_list(
["default"]

View File

@@ -1,6 +1,8 @@
"""This file should be imported only and only if you want to run the UI locally."""
"""This file should be imported if and only if you want to run the UI locally."""
import itertools
import logging
import time
from collections.abc import Iterable
from pathlib import Path
from typing import Any
@@ -9,11 +11,12 @@ import gradio as gr # type: ignore
from fastapi import FastAPI
from gradio.themes.utils.colors import slate # type: ignore
from injector import inject, singleton
from llama_index.llms import ChatMessage, MessageRole
from llama_index.core.llms import ChatMessage, ChatResponse, MessageRole
from pydantic import BaseModel
from private_gpt.constants import PROJECT_ROOT_PATH
from private_gpt.di import global_injector
from private_gpt.open_ai.extensions.context_filter import ContextFilter
from private_gpt.server.chat.chat_service import ChatService, CompletionGen
from private_gpt.server.chunks.chunks_service import Chunk, ChunksService
from private_gpt.server.ingest.ingest_service import IngestService
@@ -30,6 +33,8 @@ UI_TAB_TITLE = "My Private GPT"
SOURCES_SEPARATOR = "\n\n Sources: \n"
MODES = ["Query Files", "Search Files", "LLM Chat (no context from files)"]
class Source(BaseModel):
file: str
@@ -40,8 +45,8 @@ class Source(BaseModel):
frozen = True
@staticmethod
def curate_sources(sources: list[Chunk]) -> set["Source"]:
curated_sources = set()
def curate_sources(sources: list[Chunk]) -> list["Source"]:
curated_sources = []
for chunk in sources:
doc_metadata = chunk.document.doc_metadata
@@ -50,32 +55,14 @@ class Source(BaseModel):
page_label = doc_metadata.get("page_label", "-") if doc_metadata else "-"
source = Source(file=file_name, page=page_label, text=chunk.text)
curated_sources.add(source)
curated_sources.append(source)
curated_sources = list(
dict.fromkeys(curated_sources).keys()
) # Unique sources only
return curated_sources
def yield_deltas(completion_gen: CompletionGen) -> Iterable[str]:
full_response: str = ""
stream = completion_gen.response
for delta in stream:
# if isinstance(delta, str):
full_response += str(delta)
# elif isinstance(delta, ChatResponse):
# full_response += delta.delta or ""
yield full_response
if completion_gen.sources:
full_response += SOURCES_SEPARATOR
cur_sources = Source.curate_sources(completion_gen.sources)
sources_text = "\n\n\n".join(
f"{index}. {source.file} (page {source.page})"
for index, source in enumerate(cur_sources, start=1)
)
full_response += sources_text
yield full_response
@singleton
class PrivateGptUi:
@inject
@@ -92,7 +79,39 @@ class PrivateGptUi:
# Cache the UI blocks
self._ui_block = None
self._selected_filename = None
# Initialize system prompt based on default mode
self.mode = MODES[0]
self._system_prompt = self._get_default_system_prompt(self.mode)
def _chat(self, message: str, history: list[list[str]], mode: str, *_: Any) -> Any:
def yield_deltas(completion_gen: CompletionGen) -> Iterable[str]:
full_response: str = ""
stream = completion_gen.response
for delta in stream:
if isinstance(delta, str):
full_response += str(delta)
elif isinstance(delta, ChatResponse):
full_response += delta.delta or ""
yield full_response
time.sleep(0.02)
if completion_gen.sources:
full_response += SOURCES_SEPARATOR
cur_sources = Source.curate_sources(completion_gen.sources)
sources_text = "\n\n\n"
used_files = set()
for index, source in enumerate(cur_sources, start=1):
if f"{source.file}-{source.page}" not in used_files:
sources_text = (
sources_text
+ f"{index}. {source.file} (page {source.page}) \n\n"
)
used_files.add(f"{source.file}-{source.page}")
full_response += sources_text
yield full_response
def build_history() -> list[ChatMessage]:
history_messages: list[ChatMessage] = list(
itertools.chain(
@@ -115,33 +134,44 @@ class PrivateGptUi:
new_message = ChatMessage(content=message, role=MessageRole.USER)
all_messages = [*build_history(), new_message]
# If a system prompt is set, add it as a system message
if self._system_prompt:
all_messages.insert(
0,
ChatMessage(
content=self._system_prompt,
role=MessageRole.SYSTEM,
),
)
match mode:
case "Query Docs":
# Add a system message to force the behaviour of the LLM
# to answer only questions about the provided context.
all_messages.insert(
0,
ChatMessage(
content="You can only answer questions about the provided context. If you know the answer "
"but it is not based in the provided context, don't provide the answer, just state "
"the answer is not in the context provided.",
role=MessageRole.SYSTEM,
),
)
case "Query Files":
# Use only the selected file for the query
context_filter = None
if self._selected_filename is not None:
docs_ids = []
for ingested_document in self._ingest_service.list_ingested():
if (
ingested_document.doc_metadata["file_name"]
== self._selected_filename
):
docs_ids.append(ingested_document.doc_id)
context_filter = ContextFilter(docs_ids=docs_ids)
query_stream = self._chat_service.stream_chat(
messages=all_messages,
use_context=True,
context_filter=context_filter,
)
yield from yield_deltas(query_stream)
case "LLM Chat":
case "LLM Chat (no context from files)":
llm_stream = self._chat_service.stream_chat(
messages=all_messages,
use_context=False,
)
yield from yield_deltas(llm_stream)
case "Search in Docs":
case "Search Files":
response = self._chunks_service.retrieve_relevant(
text=message, limit=4, prev_next_chunks=0
)
@@ -155,6 +185,37 @@ class PrivateGptUi:
for index, source in enumerate(sources, start=1)
)
# On initialization and on mode change, this function set the system prompt
# to the default prompt based on the mode (and user settings).
@staticmethod
def _get_default_system_prompt(mode: str) -> str:
p = ""
match mode:
# For query chat mode, obtain default system prompt from settings
case "Query Files":
p = settings().ui.default_query_system_prompt
# For chat mode, obtain default system prompt from settings
case "LLM Chat (no context from files)":
p = settings().ui.default_chat_system_prompt
# For any other mode, clear the system prompt
case _:
p = ""
return p
def _set_system_prompt(self, system_prompt_input: str) -> None:
logger.info(f"Setting system prompt to: {system_prompt_input}")
self._system_prompt = system_prompt_input
def _set_current_mode(self, mode: str) -> Any:
self.mode = mode
self._set_system_prompt(self._get_default_system_prompt(mode))
# Update placeholder and allow interaction if default system prompt is set
if self._system_prompt:
return gr.update(placeholder=self._system_prompt, interactive=True)
# Update placeholder and disable interaction if no default system prompt is set
else:
return gr.update(placeholder=self._system_prompt, interactive=False)
def _list_ingested_files(self) -> list[list[str]]:
files = set()
for ingested_document in self._ingest_service.list_ingested():
@@ -170,8 +231,71 @@ class PrivateGptUi:
def _upload_file(self, files: list[str]) -> None:
logger.debug("Loading count=%s files", len(files))
paths = [Path(file) for file in files]
# remove all existing Documents with name identical to a new file upload:
file_names = [path.name for path in paths]
doc_ids_to_delete = []
for ingested_document in self._ingest_service.list_ingested():
if (
ingested_document.doc_metadata
and ingested_document.doc_metadata["file_name"] in file_names
):
doc_ids_to_delete.append(ingested_document.doc_id)
if len(doc_ids_to_delete) > 0:
logger.info(
"Uploading file(s) which were already ingested: %s document(s) will be replaced.",
len(doc_ids_to_delete),
)
for doc_id in doc_ids_to_delete:
self._ingest_service.delete(doc_id)
self._ingest_service.bulk_ingest([(str(path.name), path) for path in paths])
def _delete_all_files(self) -> Any:
ingested_files = self._ingest_service.list_ingested()
logger.debug("Deleting count=%s files", len(ingested_files))
for ingested_document in ingested_files:
self._ingest_service.delete(ingested_document.doc_id)
return [
gr.List(self._list_ingested_files()),
gr.components.Button(interactive=False),
gr.components.Button(interactive=False),
gr.components.Textbox("All files"),
]
def _delete_selected_file(self) -> Any:
logger.debug("Deleting selected %s", self._selected_filename)
# Note: keep looping for PDFs (each page becomes a Document)
for ingested_document in self._ingest_service.list_ingested():
if (
ingested_document.doc_metadata
and ingested_document.doc_metadata["file_name"]
== self._selected_filename
):
self._ingest_service.delete(ingested_document.doc_id)
return [
gr.List(self._list_ingested_files()),
gr.components.Button(interactive=False),
gr.components.Button(interactive=False),
gr.components.Textbox("All files"),
]
def _deselect_selected_file(self) -> Any:
self._selected_filename = None
return [
gr.components.Button(interactive=False),
gr.components.Button(interactive=False),
gr.components.Textbox("All files"),
]
def _selected_a_file(self, select_data: gr.SelectData) -> Any:
self._selected_filename = select_data.value
return [
gr.components.Button(interactive=True),
gr.components.Button(interactive=True),
gr.components.Textbox(self._selected_filename),
]
def _build_ui_blocks(self) -> gr.Blocks:
logger.debug("Creating the UI blocks")
with gr.Blocks(
@@ -186,17 +310,21 @@ class PrivateGptUi:
"justify-content: center;"
"align-items: center;"
"}"
".logo img { height: 25% }",
".logo img { height: 25% }"
".contain { display: flex !important; flex-direction: column !important; }"
"#component-0, #component-3, #component-10, #component-8 { height: 100% !important; }"
"#chatbot { flex-grow: 1 !important; overflow: auto !important;}"
"#col { height: calc(100vh - 112px - 16px) !important; }",
) as blocks:
with gr.Row():
gr.HTML(f"<div class='logo'/><img src={logo_svg} alt=PrivateGPT></div")
with gr.Row():
with gr.Column(scale=3, variant="compact"):
with gr.Row(equal_height=False):
with gr.Column(scale=3):
mode = gr.Radio(
["Query Docs", "Search in Docs", "LLM Chat"],
MODES,
label="Mode",
value="Query Docs",
value="Query Files",
)
upload_button = gr.components.UploadButton(
"Upload File(s)",
@@ -208,6 +336,7 @@ class PrivateGptUi:
self._list_ingested_files,
headers=["File name"],
label="Ingested Files",
height=235,
interactive=False,
render=False, # Rendered under the button
)
@@ -221,19 +350,131 @@ class PrivateGptUi:
outputs=ingested_dataset,
)
ingested_dataset.render()
with gr.Column(scale=7):
deselect_file_button = gr.components.Button(
"De-select selected file", size="sm", interactive=False
)
selected_text = gr.components.Textbox(
"All files", label="Selected for Query or Deletion", max_lines=1
)
delete_file_button = gr.components.Button(
"🗑️ Delete selected file",
size="sm",
visible=settings().ui.delete_file_button_enabled,
interactive=False,
)
delete_files_button = gr.components.Button(
"⚠️ Delete ALL files",
size="sm",
visible=settings().ui.delete_all_files_button_enabled,
)
deselect_file_button.click(
self._deselect_selected_file,
outputs=[
delete_file_button,
deselect_file_button,
selected_text,
],
)
ingested_dataset.select(
fn=self._selected_a_file,
outputs=[
delete_file_button,
deselect_file_button,
selected_text,
],
)
delete_file_button.click(
self._delete_selected_file,
outputs=[
ingested_dataset,
delete_file_button,
deselect_file_button,
selected_text,
],
)
delete_files_button.click(
self._delete_all_files,
outputs=[
ingested_dataset,
delete_file_button,
deselect_file_button,
selected_text,
],
)
system_prompt_input = gr.Textbox(
placeholder=self._system_prompt,
label="System Prompt",
lines=2,
interactive=True,
render=False,
)
# When mode changes, set default system prompt
mode.change(
self._set_current_mode, inputs=mode, outputs=system_prompt_input
)
# On blur, set system prompt to use in queries
system_prompt_input.blur(
self._set_system_prompt,
inputs=system_prompt_input,
)
def get_model_label() -> str | None:
"""Get model label from llm mode setting YAML.
Raises:
ValueError: If settings are not configured.
Returns:
str | None: The corresponding model label, or None if 'llm_mode' is not recognized.
"""
# Get model label from llm mode setting YAML
# Labels: llamacpp, openai, openailike, sagemaker, mock, ollama
config_settings = settings()
if config_settings is None:
raise ValueError("Settings are not configured.")
# Get llm_mode from settings
llm_mode = config_settings.llm.mode
# Mapping of 'llm_mode' to corresponding model labels
model_mapping = {
"llamacpp": config_settings.llamacpp.llm_hf_model_file,
"openai": config_settings.openai.model,
"openailike": config_settings.openai.model,
"sagemaker": config_settings.sagemaker.llm_endpoint_name,
"mock": llm_mode,
"ollama": config_settings.ollama.llm_model,
}
if llm_mode not in model_mapping:
print(f"Invalid 'llm mode': {llm_mode}")
return None
return model_mapping[llm_mode]
with gr.Column(scale=7, elem_id="col"):
# Determine the model label based on the value of PGPT_PROFILES
model_label = get_model_label()
if model_label is not None:
label_text = (
f"LLM: {settings().llm.mode} | Model: {model_label}"
)
else:
label_text = f"LLM: {settings().llm.mode}"
_ = gr.ChatInterface(
self._chat,
chatbot=gr.Chatbot(
label=f"LLM: {settings().llm.mode}",
label=label_text,
show_copy_button=True,
elem_id="chatbot",
render=False,
avatar_images=(
None,
AVATAR_BOT,
),
),
additional_inputs=[mode, upload_button],
additional_inputs=[mode, upload_button, system_prompt_input],
)
return blocks

private_gpt/utils/eta.py Normal file
View File

@@ -0,0 +1,122 @@
import datetime
import logging
import math
import time
from collections import deque
from typing import Any
logger = logging.getLogger(__name__)
def human_time(*args: Any, **kwargs: Any) -> str:
def timedelta_total_seconds(timedelta: datetime.timedelta) -> float:
return (
timedelta.microseconds
+ 0.0
+ (timedelta.seconds + timedelta.days * 24 * 3600) * 10**6
) / 10**6
secs = float(timedelta_total_seconds(datetime.timedelta(*args, **kwargs)))
# We want (ms) precision below 2 seconds
if secs < 2:
return f"{secs * 1000}ms"
units = [("y", 86400 * 365), ("d", 86400), ("h", 3600), ("m", 60), ("s", 1)]
parts = []
for unit, mul in units:
if secs / mul >= 1 or mul == 1:
if mul > 1:
n = int(math.floor(secs / mul))
secs -= n * mul
else:
# >2s we drop the (ms) component.
n = int(secs)
if n:
parts.append(f"{n}{unit}")
return " ".join(parts)
def eta(iterator: list[Any]) -> Any:
"""Report an ETA after 30s and every 60s thereafter."""
total = len(iterator)
_eta = ETA(total)
_eta.needReport(30)
for processed, data in enumerate(iterator, start=1):
yield data
_eta.update(processed)
if _eta.needReport(60):
logger.info(f"{processed}/{total} - ETA {_eta.human_time()}")
class ETA:
"""Predict how long something will take to complete."""
def __init__(self, total: int):
self.total: int = total # Total expected records.
self.rate: float = 0.0 # per second
self._timing_data: deque[tuple[float, int]] = deque(maxlen=100)
self.secondsLeft: float = 0.0
self.nexttime: float = 0.0
def human_time(self) -> str:
if self._calc():
return f"{human_time(seconds=self.secondsLeft)} @ {int(self.rate * 60)}/min"
return "(computing)"
def update(self, count: int) -> None:
# count should be in the range 1 to self.total
assert count > 0
assert count <= self.total
self._timing_data.append((time.time(), count)) # (X,Y) for pearson
def needReport(self, whenSecs: int) -> bool:
now = time.time()
if now > self.nexttime:
self.nexttime = now + whenSecs
return True
return False
def _calc(self) -> bool:
# Need a few samples before predicting; at least two points are required to compute a slope.
if len(self._timing_data) < 3:
return False
# http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
# Calculate means and standard deviations.
samples = len(self._timing_data)
# column wise sum of the timing tuples to compute their mean.
mean_x, mean_y = (
sum(i) / samples for i in zip(*self._timing_data, strict=False)
)
std_x = math.sqrt(
sum(pow(i[0] - mean_x, 2) for i in self._timing_data) / (samples - 1)
)
std_y = math.sqrt(
sum(pow(i[1] - mean_y, 2) for i in self._timing_data) / (samples - 1)
)
# Calculate coefficient.
sum_xy, sum_sq_v_x, sum_sq_v_y = 0.0, 0.0, 0
for x, y in self._timing_data:
x -= mean_x
y -= mean_y
sum_xy += x * y
sum_sq_v_x += pow(x, 2)
sum_sq_v_y += pow(y, 2)
pearson_r = sum_xy / math.sqrt(sum_sq_v_x * sum_sq_v_y)
# Calculate regression line.
# y = mx + b where m is the slope and b is the y-intercept.
m = self.rate = pearson_r * (std_y / std_x)
y = self.total
b = mean_y - m * mean_x
x = (y - b) / m
# Calculate fitted line (transformed/shifted regression line horizontally).
fitted_b = self._timing_data[-1][1] - (m * self._timing_data[-1][0])
fitted_x = (y - fitted_b) / m
_, count = self._timing_data[-1] # adjust last data point progress count
adjusted_x = ((fitted_x - x) * (count / self.total)) + x
eta_epoch = adjusted_x
self.secondsLeft = max([eta_epoch - time.time(), 0])
return True
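A short usage sketch for the helper above (illustrative only; the sleep stands in for the real per-item work):

import time

from private_gpt.utils.eta import eta, human_time

items = list(range(1_000))
for item in eta(items):
    time.sleep(0.01)  # stand-in for per-document processing

print(human_time(seconds=3725))  # "1h 2m 5s"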

View File

@@ -1,21 +1,68 @@
[tool.poetry]
name = "private-gpt"
version = "0.1.0"
version = "0.5.0"
description = "Private GPT"
authors = ["Zylon <hi@zylon.ai>"]
[tool.poetry.dependencies]
python = ">=3.11,<3.12"
fastapi = { extras = ["all"], version = "^0.103.1" }
boto3 = "^1.28.56"
# PrivateGPT
fastapi = { extras = ["all"], version = "^0.110.0" }
python-multipart = "^0.0.9"
injector = "^0.21.0"
pyyaml = "^6.0.1"
python-multipart = "^0.0.6"
pypdf = "^3.16.2"
llama-index = { extras = ["local_models"], version = "0.9.10" }
watchdog = "^3.0.0"
qdrant-client = "^1.6.9"
chromadb = {version = "^0.4.13", optional = true}
watchdog = "^4.0.0"
transformers = "^4.38.2"
# LlamaIndex core libs
llama-index-core = "^0.10.14"
llama-index-readers-file = "^0.1.6"
# Optional LlamaIndex integration libs
llama-index-llms-llama-cpp = {version = "^0.1.3", optional = true}
llama-index-llms-openai = {version = "^0.1.6", optional = true}
llama-index-llms-openai-like = {version ="^0.1.3", optional = true}
llama-index-llms-ollama = {version ="^0.1.2", optional = true}
llama-index-llms-azure-openai = {version ="^0.1.5", optional = true}
llama-index-embeddings-ollama = {version ="^0.1.2", optional = true}
llama-index-embeddings-huggingface = {version ="^0.1.4", optional = true}
llama-index-embeddings-openai = {version ="^0.1.6", optional = true}
llama-index-embeddings-azure-openai = {version ="^0.1.6", optional = true}
llama-index-vector-stores-qdrant = {version ="^0.1.3", optional = true}
llama-index-vector-stores-chroma = {version ="^0.1.4", optional = true}
llama-index-vector-stores-postgres = {version ="^0.1.2", optional = true}
llama-index-storage-docstore-postgres = {version ="^0.1.2", optional = true}
llama-index-storage-index-store-postgres = {version ="^0.1.2", optional = true}
# Postgres
psycopg2-binary = {version ="^2.9.9", optional = true}
asyncpg = {version="^0.29.0", optional = true}
# Optional Sagemaker dependency
boto3 = {version ="^1.34.51", optional = true}
# Optional Reranker dependencies
torch = {version ="^2.1.2", optional = true}
sentence-transformers = {version ="^2.6.1", optional = true}
# Optional UI
gradio = {version ="^4.19.2", optional = true}
[tool.poetry.extras]
ui = ["gradio"]
llms-llama-cpp = ["llama-index-llms-llama-cpp"]
llms-openai = ["llama-index-llms-openai"]
llms-openai-like = ["llama-index-llms-openai-like"]
llms-ollama = ["llama-index-llms-ollama"]
llms-sagemaker = ["boto3"]
llms-azopenai = ["llama-index-llms-azure-openai"]
embeddings-ollama = ["llama-index-embeddings-ollama"]
embeddings-huggingface = ["llama-index-embeddings-huggingface"]
embeddings-openai = ["llama-index-embeddings-openai"]
embeddings-sagemaker = ["boto3"]
embeddings-azopenai = ["llama-index-embeddings-azure-openai"]
vector-stores-qdrant = ["llama-index-vector-stores-qdrant"]
vector-stores-chroma = ["llama-index-vector-stores-chroma"]
vector-stores-postgres = ["llama-index-vector-stores-postgres"]
storage-nodestore-postgres = ["llama-index-storage-docstore-postgres","llama-index-storage-index-store-postgres","psycopg2-binary","asyncpg"]
rerank-sentence-transformers = ["torch", "sentence-transformers"]
[tool.poetry.group.dev.dependencies]
black = "^22"
@@ -27,26 +74,6 @@ ruff = "^0"
pytest-asyncio = "^0.21.1"
types-pyyaml = "^6.0.12.12"
# Dependencies for gradio UI
[tool.poetry.group.ui]
optional = true
[tool.poetry.group.ui.dependencies]
gradio = "^4.7.1"
[tool.poetry.group.local]
optional = true
[tool.poetry.group.local.dependencies]
llama-cpp-python = "^0.2.20"
jinja2 = "^3.1.2"
# numpy = "1.26.0"
sentence-transformers = "^2.2.2"
# https://stackoverflow.com/questions/76327419/valueerror-libcublas-so-0-9-not-found-in-the-system-path
torch = ">=2.0.0, !=2.0.1, !=2.1.0"
transformers = "^4.35.2"
[tool.poetry.extras]
chroma = ["chromadb"]
[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
@@ -139,6 +166,9 @@ explicit_package_bases = true
warn_unused_ignores = false
exclude = ["tests"]
[tool.mypy-llama-index]
ignore_missing_imports = true
[tool.pytest.ini_options]
asyncio_mode = "auto"
testpaths = ["tests"]

View File

@@ -1,6 +1,7 @@
import argparse
import json
import sys
import yaml
from uvicorn.importer import import_from_string

View File

@@ -18,22 +18,23 @@ class LocalIngestWorker:
self.total_documents = 0
self.current_document_count = 0
self._files_under_root_folder: list[Path] = list()
self._files_under_root_folder: list[Path] = []
def _find_all_files_in_folder(self, root_path: Path) -> None:
def _find_all_files_in_folder(self, root_path: Path, ignored: list[str]) -> None:
"""Search all files under the root folder recursively.
Count them at the same time
"""
for file_path in root_path.iterdir():
if file_path.is_file():
if file_path.is_file() and file_path.name not in ignored:
self.total_documents += 1
self._files_under_root_folder.append(file_path)
elif file_path.is_dir():
self._find_all_files_in_folder(file_path)
elif file_path.is_dir() and file_path.name not in ignored:
self._find_all_files_in_folder(file_path, ignored)
def ingest_folder(self, folder_path: Path) -> None:
def ingest_folder(self, folder_path: Path, ignored: list[str]) -> None:
# Count total documents before ingestion
self._find_all_files_in_folder(folder_path)
self._find_all_files_in_folder(folder_path, ignored)
self._ingest_all(self._files_under_root_folder)
def _ingest_all(self, files_to_ingest: list[Path]) -> None:
@@ -48,7 +49,7 @@ class LocalIngestWorker:
try:
if changed_path.exists():
logger.info(f"Started ingesting file={changed_path}")
self.ingest_service.ingest(changed_path.name, changed_path)
self.ingest_service.ingest_file(changed_path.name, changed_path)
logger.info(f"Completed ingesting file={changed_path}")
except Exception:
logger.exception(
@@ -64,12 +65,19 @@ parser.add_argument(
action=argparse.BooleanOptionalAction,
default=False,
)
parser.add_argument(
"--ignored",
nargs="*",
help="List of files/directories to ignore",
default=[],
)
parser.add_argument(
"--log-file",
help="Optional path to a log file. If provided, logs will be written to this file.",
type=str,
default=None,
)
args = parser.parse_args()
# Set up logging to a file if a path is provided
@@ -91,9 +99,17 @@ if __name__ == "__main__":
ingest_service = global_injector.get(IngestService)
worker = LocalIngestWorker(ingest_service)
worker.ingest_folder(root_path)
worker.ingest_folder(root_path, args.ignored)
if args.ignored:
logger.info(f"Skipping following files and directories: {args.ignored}")
if args.watch:
logger.info(f"Watching {args.folder} for changes, press Ctrl+C to stop...")
directories_to_watch = [
dir
for dir in root_path.iterdir()
if dir.is_dir() and dir.name not in args.ignored
]
watcher = IngestWatcher(args.folder, worker.ingest_on_watch)
watcher.start()

View File

@@ -3,37 +3,47 @@ import os
import argparse
from huggingface_hub import hf_hub_download, snapshot_download
from transformers import AutoTokenizer
from private_gpt.paths import models_path, models_cache_path
from private_gpt.settings.settings import settings
resume_download = True
if __name__ == '__main__':
parser = argparse.ArgumentParser(prog='Setup: Download models from huggingface')
parser = argparse.ArgumentParser(prog='Setup: Download models from Hugging Face')
parser.add_argument('--resume', default=True, action=argparse.BooleanOptionalAction, help='Enable/disable the resume_download option to resume an interrupted download')
args = parser.parse_args()
resume_download = args.resume
os.makedirs(models_path, exist_ok=True)
embedding_path = models_path / "embedding"
print(f"Downloading embedding {settings().local.embedding_hf_model_name}")
# Download Embedding model
embedding_path = models_path / "embedding"
print(f"Downloading embedding {settings().huggingface.embedding_hf_model_name}")
snapshot_download(
repo_id=settings().local.embedding_hf_model_name,
repo_id=settings().huggingface.embedding_hf_model_name,
cache_dir=models_cache_path,
local_dir=embedding_path,
)
print("Embedding model downloaded!")
print("Downloading models for local execution...")
# Download LLM and create a symlink to the model file
print(f"Downloading LLM {settings().llamacpp.llm_hf_model_file}")
hf_hub_download(
repo_id=settings().local.llm_hf_repo_id,
filename=settings().local.llm_hf_model_file,
repo_id=settings().llamacpp.llm_hf_repo_id,
filename=settings().llamacpp.llm_hf_model_file,
cache_dir=models_cache_path,
local_dir=models_path,
resume_download=resume_download,
)
print("LLM model downloaded!")
# Download Tokenizer
print(f"Downloading tokenizer {settings().llm.tokenizer}")
AutoTokenizer.from_pretrained(
pretrained_model_name_or_path=settings().llm.tokenizer,
cache_dir=models_cache_path,
)
print("Tokenizer downloaded!")
print("Setup done")

View File

@@ -1,10 +1,22 @@
import argparse
import os
import shutil
from typing import Any, ClassVar
from private_gpt.paths import local_data_path
from private_gpt.settings.settings import settings
def wipe():
path = "local_data"
def wipe_file(file: str) -> None:
if os.path.isfile(file):
os.remove(file)
print(f" - Deleted {file}")
def wipe_tree(path: str) -> None:
if not os.path.exists(path):
print(f"Warning: Path not found {path}")
return
print(f"Wiping {path}...")
all_files = os.listdir(path)
@@ -24,14 +36,149 @@ def wipe():
continue
if __name__ == "__main__":
commands = {
"wipe": wipe,
class Postgres:
tables: ClassVar[dict[str, list[str]]] = {
"nodestore": ["data_docstore", "data_indexstore"],
"vectorstore": ["data_embeddings"],
}
parser = argparse.ArgumentParser()
parser.add_argument(
"mode", help="select a mode to run", choices=list(commands.keys())
def __init__(self) -> None:
try:
import psycopg2
except ModuleNotFoundError:
raise ModuleNotFoundError("Postgres dependencies not found") from None
connection = settings().postgres.model_dump(exclude_none=True)
self.schema = connection.pop("schema_name")
self.conn = psycopg2.connect(**connection)
def wipe(self, storetype: str) -> None:
cur = self.conn.cursor()
try:
for table in self.tables[storetype]:
sql = f"DROP TABLE IF EXISTS {self.schema}.{table}"
cur.execute(sql)
print(f"Table {self.schema}.{table} dropped.")
self.conn.commit()
finally:
cur.close()
def stats(self, store_type: str) -> None:
template = "SELECT '{table}', COUNT(*), pg_size_pretty(pg_total_relation_size('{table}')) FROM {table}"
sql = " UNION ALL ".join(
template.format(table=tbl) for tbl in self.tables[store_type]
)
cur = self.conn.cursor()
try:
print(f"Storage for Postgres {store_type}.")
print("{:<15} | {:>15} | {:>9}".format("Table", "Rows", "Size"))
print("-" * 45) # Print a line separator
cur.execute(sql)
for row in cur.fetchall():
formatted_row_count = f"{row[1]:,}"
print(f"{row[0]:<15} | {formatted_row_count:>15} | {row[2]:>9}")
print()
finally:
cur.close()
def __del__(self):
if hasattr(self, "conn") and self.conn:
self.conn.close()
class Simple:
def wipe(self, store_type: str) -> None:
assert store_type == "nodestore"
from llama_index.core.storage.docstore.types import (
DEFAULT_PERSIST_FNAME as DOCSTORE,
)
from llama_index.core.storage.index_store.types import (
DEFAULT_PERSIST_FNAME as INDEXSTORE,
)
for store in (DOCSTORE, INDEXSTORE):
wipe_file(str((local_data_path / store).absolute()))
class Chroma:
def wipe(self, store_type: str) -> None:
assert store_type == "vectorstore"
wipe_tree(str((local_data_path / "chroma_db").absolute()))
class Qdrant:
COLLECTION = (
"make_this_parameterizable_per_api_call" # ?! see vector_store_component.py
)
def __init__(self) -> None:
try:
from qdrant_client import QdrantClient # type: ignore
except ImportError:
raise ImportError("Qdrant dependencies not found") from None
self.client = QdrantClient(**settings().qdrant.model_dump(exclude_none=True))
def wipe(self, store_type: str) -> None:
assert store_type == "vectorstore"
try:
self.client.delete_collection(self.COLLECTION)
print("Collection dropped successfully.")
except Exception as e:
print("Error dropping collection:", e)
def stats(self, store_type: str) -> None:
print(f"Storage for Qdrant {store_type}.")
try:
collection_data = self.client.get_collection(self.COLLECTION)
if collection_data:
# Collection Info
# https://qdrant.tech/documentation/concepts/collections/
print(f"\tPoints: {collection_data.points_count:,}")
print(f"\tVectors: {collection_data.vectors_count:,}")
print(f"\tIndex Vectors: {collection_data.indexed_vectors_count:,}")
return
except ValueError:
pass
print("\t- Qdrant collection not found or empty")
class Command:
DB_HANDLERS: ClassVar[dict[str, Any]] = {
"simple": Simple, # node store
"chroma": Chroma, # vector store
"postgres": Postgres, # node, index and vector store
"qdrant": Qdrant, # vector store
}
def for_each_store(self, cmd: str):
for store_type in ("nodestore", "vectorstore"):
database = getattr(settings(), store_type).database
handler_class = self.DB_HANDLERS.get(database)
if handler_class is None:
print(f"No handler found for database '{database}'")
continue
handler_instance = handler_class() # Instantiate the class
# If the DB can handle this cmd dispatch it.
if hasattr(handler_instance, cmd) and callable(
func := getattr(handler_instance, cmd)
):
func(store_type)
else:
print(
f"Unable to execute command '{cmd}' on '{store_type}' in database '{database}'"
)
def execute(self, cmd: str) -> None:
if cmd in ("wipe", "stats"):
self.for_each_store(cmd)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("mode", help="select a mode to run", choices=["wipe", "stats"])
args = parser.parse_args()
commands[args.mode.lower()]()
Command().execute(args.mode.lower())

settings-azopenai.yaml Normal file
View File

@@ -0,0 +1,17 @@
server:
env_name: ${APP_ENV:azopenai}
llm:
mode: azopenai
embedding:
mode: azopenai
azopenai:
api_key: ${AZ_OPENAI_API_KEY:}
azure_endpoint: ${AZ_OPENAI_ENDPOINT:}
embedding_deployment_name: ${AZ_OPENAI_EMBEDDING_DEPLOYMENT_NAME:}
llm_deployment_name: ${AZ_OPENAI_LLM_DEPLOYMENT_NAME:}
api_version: "2023-05-15"
embedding_model: text-embedding-ada-002
llm_model: gpt-35-turbo

View File

@@ -5,15 +5,31 @@ server:
llm:
mode: ${PGPT_MODE:mock}
local:
embedding:
mode: ${PGPT_MODE:sagemaker}
llamacpp:
llm_hf_repo_id: ${PGPT_HF_REPO_ID:TheBloke/Mistral-7B-Instruct-v0.1-GGUF}
llm_hf_model_file: ${PGPT_HF_MODEL_FILE:mistral-7b-instruct-v0.1.Q4_K_M.gguf}
huggingface:
embedding_hf_model_name: ${PGPT_EMBEDDING_HF_MODEL_NAME:BAAI/bge-small-en-v1.5}
sagemaker:
llm_endpoint_name: ${PGPT_SAGEMAKER_LLM_ENDPOINT_NAME:}
embedding_endpoint_name: ${PGPT_SAGEMAKER_EMBEDDING_ENDPOINT_NAME:}
ollama:
llm_model: ${PGPT_OLLAMA_LLM_MODEL:mistral}
embedding_model: ${PGPT_OLLAMA_EMBEDDING_MODEL:nomic-embed-text}
api_base: ${PGPT_OLLAMA_API_BASE:http://ollama:11434}
tfs_z: ${PGPT_OLLAMA_TFS_Z:1.0}
top_k: ${PGPT_OLLAMA_TOP_K:40}
top_p: ${PGPT_OLLAMA_TOP_P:0.9}
repeat_last_n: ${PGPT_OLLAMA_REPEAT_LAST_N:64}
repeat_penalty: ${PGPT_OLLAMA_REPEAT_PENALTY:1.2}
request_timeout: ${PGPT_OLLAMA_REQUEST_TIMEOUT:600.0}
ui:
enabled: true
path: /
path: /

View File

@@ -1,5 +1,27 @@
# poetry install --extras "ui llms-llama-cpp vector-stores-qdrant embeddings-huggingface"
server:
env_name: ${APP_ENV:local}
llm:
mode: local
mode: llamacpp
# Should be matching the selected model
max_new_tokens: 512
context_window: 3900
tokenizer: mistralai/Mistral-7B-Instruct-v0.2
llamacpp:
prompt_style: "mistral"
llm_hf_repo_id: TheBloke/Mistral-7B-Instruct-v0.2-GGUF
llm_hf_model_file: mistral-7b-instruct-v0.2.Q4_K_M.gguf
embedding:
mode: huggingface
huggingface:
embedding_hf_model_name: BAAI/bge-small-en-v1.5
vectorstore:
database: qdrant
qdrant:
path: local_data/private_gpt/qdrant

View File

@@ -4,5 +4,6 @@ server:
# This configuration allows you to use GPU for creating embeddings while avoiding loading LLM into vRAM
llm:
mode: mock
embedding:
mode: local
mode: huggingface

settings-ollama-pg.yaml Normal file
View File

@@ -0,0 +1,34 @@
# Using ollama and postgres for the vector, doc and index store. Ollama is also used for embeddings.
# To use it, install these extras:
# poetry install --extras "llms-ollama ui vector-stores-postgres embeddings-ollama storage-nodestore-postgres"
server:
env_name: ${APP_ENV:ollama}
llm:
mode: ollama
max_new_tokens: 512
context_window: 3900
embedding:
mode: ollama
embed_dim: 768
ollama:
llm_model: mistral
embedding_model: nomic-embed-text
api_base: http://localhost:11434
nodestore:
database: postgres
vectorstore:
database: postgres
postgres:
host: localhost
port: 5432
database: postgres
user: postgres
password: admin
schema_name: private_gpt

settings-ollama.yaml Normal file
View File

@@ -0,0 +1,30 @@
server:
env_name: ${APP_ENV:ollama}
llm:
mode: ollama
max_new_tokens: 512
context_window: 3900
temperature: 0.1 #The temperature of the model. Increasing the temperature will make the model answer more creatively. A value of 0.1 would be more factual. (Default: 0.1)
embedding:
mode: ollama
ollama:
llm_model: mistral
embedding_model: nomic-embed-text
api_base: http://localhost:11434
embedding_api_base: http://localhost:11434 # change if your embedding model runs on another Ollama instance
keep_alive: 5m
tfs_z: 1.0 # Tail free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting.
top_k: 40 # Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40)
top_p: 0.9 # Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9)
repeat_last_n: 64 # Sets how far back the model looks to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx)
repeat_penalty: 1.2 # Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1)
request_timeout: 120.0 # Time elapsed until ollama times out the request. Default is 120s. Format is float.
vectorstore:
database: qdrant
qdrant:
path: local_data/private_gpt/qdrant
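The sampling options above are passed through to Ollama per request. A hedged sketch of the equivalent raw call against Ollama's /api/generate endpoint (option names follow Ollama's documented API; the project goes through its LLM abstraction rather than calling the endpoint like this):

import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",
        "prompt": "Why is the sky blue?",
        "stream": False,
        "keep_alive": "5m",
        "options": {
            "temperature": 0.1,
            "tfs_z": 1.0,
            "top_k": 40,
            "top_p": 0.9,
            "repeat_last_n": 64,
            "repeat_penalty": 1.2,
        },
    },
    timeout=120.0,  # request_timeout from the settings
)
print(resp.json()["response"])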

settings-openai.yaml (new file, 12 lines)

@@ -0,0 +1,12 @@
server:
env_name: ${APP_ENV:openai}
llm:
mode: openai
embedding:
mode: openai
openai:
api_key: ${OPENAI_API_KEY:}
model: gpt-3.5-turbo
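The OpenAI profile is the thinnest configuration: an API key and a model name. Roughly what those two values resolve to (a sketch with the openai package; privateGPT itself goes through its LLM and embedding components):

import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
reply = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "ping"}],
)
print(reply.choices[0].message.content)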

(modified file)

@@ -1,5 +1,5 @@
server:
env_name: ${APP_ENV:prod}
env_name: ${APP_ENV:sagemaker}
port: ${PORT:8001}
ui:
@@ -9,6 +9,9 @@ ui:
llm:
mode: sagemaker
embedding:
mode: sagemaker
sagemaker:
llm_endpoint_name: huggingface-pytorch-tgi-inference-2023-09-25-19-53-32-140
embedding_endpoint_name: huggingface-pytorch-inference-2023-11-03-07-41-36-479
llm_endpoint_name: llm
embedding_endpoint_name: embedding
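The change above renames the default SageMaker endpoints to the generic llm and embedding. For orientation, invoking such an endpoint directly looks roughly like this (a boto3 sketch with a hypothetical payload shape; the real request/response schema depends on the deployed container):

import json

import boto3

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="llm",  # llm_endpoint_name from the settings
    ContentType="application/json",
    Body=json.dumps({"inputs": "Hello, how are you doing?"}),
)
print(response["Body"].read().decode())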

(modified file)

@@ -14,5 +14,8 @@ qdrant:
llm:
mode: mock
embedding:
mode: mock
ui:
enabled: false

settings-vllm.yaml (new file, 17 lines)

@@ -0,0 +1,17 @@
server:
env_name: ${APP_ENV:vllm}
llm:
mode: openailike
embedding:
mode: huggingface
ingest_mode: simple
huggingface:
embedding_hf_model_name: BAAI/bge-small-en-v1.5
openai:
api_base: http://localhost:8000/v1
api_key: EMPTY
model: facebook/opt-125m
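mode: openailike means any OpenAI-compatible server will do; with vLLM's default OpenAI-compatible server the block above corresponds roughly to the call below (a sketch; api_key "EMPTY" is just a placeholder value such servers accept):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
out = client.completions.create(
    model="facebook/opt-125m",
    prompt="Hello, how are you doing?",
    max_tokens=16,
)
print(out.choices[0].text)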

(modified file)

@@ -22,26 +22,70 @@ data:
ui:
enabled: true
path: /
default_chat_system_prompt: >
You are a helpful, respectful and honest assistant.
Always answer as helpfully as possible and follow ALL given instructions.
Do not speculate or make up information.
Do not reference any given instructions or context.
default_query_system_prompt: >
You can only answer questions about the provided context.
If you know the answer but it is not based on the provided context, don't provide
the answer; just state that the answer is not in the provided context.
delete_file_button_enabled: true
delete_all_files_button_enabled: true
llm:
mode: local
mode: llamacpp
# Should be matching the selected model
max_new_tokens: 512
context_window: 3900
tokenizer: mistralai/Mistral-7B-Instruct-v0.2
temperature: 0.1 # The temperature of the model. Increasing the temperature will make the model answer more creatively. A value of 0.1 would be more factual. (Default: 0.1)
rag:
similarity_top_k: 2
# This value controls how many "top" documents the RAG returns to use in the context.
#similarity_value: 0.45
# This value is disabled by default. If you enable this setting, the RAG will only use documents that meet the given similarity score.
rerank:
enabled: false
model: cross-encoder/ms-marco-MiniLM-L-2-v2
top_n: 1
llamacpp:
prompt_style: "mistral"
llm_hf_repo_id: TheBloke/Mistral-7B-Instruct-v0.2-GGUF
llm_hf_model_file: mistral-7b-instruct-v0.2.Q4_K_M.gguf
tfs_z: 1.0 # Tail free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting
top_k: 40 # Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40)
top_p: 1.0 # Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9)
repeat_penalty: 1.1 # Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1)
embedding:
# Should be matching the value above in most cases
mode: local
mode: huggingface
ingest_mode: simple
embed_dim: 384 # 384 is for BAAI/bge-small-en-v1.5
huggingface:
embedding_hf_model_name: BAAI/bge-small-en-v1.5
vectorstore:
database: qdrant
nodestore:
database: simple
qdrant:
path: local_data/private_gpt/qdrant
local:
prompt_style: "llama2"
llm_hf_repo_id: TheBloke/Mistral-7B-Instruct-v0.1-GGUF
llm_hf_model_file: mistral-7b-instruct-v0.1.Q4_K_M.gguf
embedding_hf_model_name: BAAI/bge-small-en-v1.5
postgres:
host: localhost
port: 5432
database: postgres
user: postgres
password: postgres
schema_name: private_gpt
sagemaker:
llm_endpoint_name: huggingface-pytorch-tgi-inference-2023-09-25-19-53-32-140
@@ -49,3 +93,21 @@ sagemaker:
openai:
api_key: ${OPENAI_API_KEY:}
model: gpt-3.5-turbo
ollama:
llm_model: llama2
embedding_model: nomic-embed-text
api_base: http://localhost:11434
embedding_api_base: http://localhost:11434 # change if your embedding model runs on another Ollama instance
keep_alive: 5m
request_timeout: 120.0
azopenai:
api_key: ${AZ_OPENAI_API_KEY:}
azure_endpoint: ${AZ_OPENAI_ENDPOINT:}
embedding_deployment_name: ${AZ_OPENAI_EMBEDDING_DEPLOYMENT_NAME:}
llm_deployment_name: ${AZ_OPENAI_LLM_DEPLOYMENT_NAME:}
api_version: "2023-05-15"
embedding_model: text-embedding-ada-002
llm_model: gpt-35-turbo
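The new rag.rerank block enables a second-stage cross-encoder over the similarity_top_k retrieved chunks, keeping only top_n of them. A hedged sketch of what that reranking step does, using sentence-transformers directly (a standalone illustration, not the project's reranker wiring):

from sentence_transformers import CrossEncoder

query = "How do I enable the reranker?"
retrieved = [  # pretend these are the similarity_top_k=2 chunks from the vector store
    "Set rag.rerank.enabled to true in settings.yaml.",
    "The UI path is configured under ui.path.",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-2-v2")
scores = reranker.predict([(query, chunk) for chunk in retrieved])

# Keep only the top_n=1 best-scoring chunk for the final context
best = max(zip(scores, retrieved), key=lambda pair: pair[0])[1]
print(best)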

(modified file)

@@ -13,7 +13,7 @@ class IngestHelper:
def ingest_file(self, path: Path) -> IngestResponse:
files = {"file": (path.name, path.open("rb"))}
response = self.test_client.post("/v1/ingest", files=files)
response = self.test_client.post("/v1/ingest/file", files=files)
assert response.status_code == 200
ingest_result = IngestResponse.model_validate(response.json())
return ingest_result
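Outside the test suite, the renamed endpoint can be exercised the same way with plain requests (a sketch; it assumes a server listening on the default port 8001 shown in the settings above):

import requests

with open("example.txt", "rb") as f:
    resp = requests.post(
        "http://localhost:8001/v1/ingest/file",
        files={"file": ("example.txt", f)},
    )
resp.raise_for_status()
print(resp.json()["data"])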

(modified file)

@@ -3,6 +3,7 @@ from pathlib import Path
from fastapi.testclient import TestClient
from private_gpt.server.ingest.ingest_router import IngestResponse
from tests.fixtures.ingest_helper import IngestHelper
@@ -34,3 +35,12 @@ def test_ingest_list_returns_something_after_ingestion(
assert (
count_ingest_after == count_ingest_before + 1
), "The temp doc should be returned"
def test_ingest_plain_text(test_client: TestClient) -> None:
response = test_client.post(
"/v1/ingest/text", json={"file_name": "file_name", "text": "text"}
)
assert response.status_code == 200
ingest_result = IngestResponse.model_validate(response.json())
assert len(ingest_result.data) == 1

(modified file)

@@ -5,6 +5,7 @@ NOTE: We are not testing the switch based on the config in
is currently architected (it is hard to patch the `settings` and the app while
the tests are directly importing them).
"""
from typing import Annotated
import pytest

(modified file)

@@ -1,80 +1,30 @@
import sys
from pathlib import Path
from tempfile import NamedTemporaryFile
import pytest
from llama_index.llms import ChatMessage, MessageRole
from llama_index.core.llms import ChatMessage, MessageRole
try:
from private_gpt.components.llm.prompt.prompt_helper import (
DefaultPromptStyle,
LlamaCppPromptStyle,
LlamaIndexPromptStyle,
TemplatePromptStyle,
VigognePromptStyle,
get_prompt_style,
)
except ImportError:
DefaultPromptStyle = None
LlamaCppPromptStyle = None
LlamaIndexPromptStyle = None
TemplatePromptStyle = None
VigognePromptStyle = None
get_prompt_style = None
@pytest.mark.skipif(
"llama_cpp" not in sys.modules, reason="requires the llama-cpp-python library"
from private_gpt.components.llm.prompt_helper import (
ChatMLPromptStyle,
DefaultPromptStyle,
Llama2PromptStyle,
MistralPromptStyle,
TagPromptStyle,
get_prompt_style,
)
@pytest.mark.parametrize(
("prompt_style", "expected_prompt_style"),
[
(None, DefaultPromptStyle),
("llama2", LlamaIndexPromptStyle),
("vigogne", VigognePromptStyle),
("llama_cpp.alpaca", LlamaCppPromptStyle),
("llama_cpp.zephyr", LlamaCppPromptStyle),
("default", DefaultPromptStyle),
("llama2", Llama2PromptStyle),
("tag", TagPromptStyle),
("mistral", MistralPromptStyle),
("chatml", ChatMLPromptStyle),
],
)
def test_get_prompt_style_success(prompt_style, expected_prompt_style):
assert type(get_prompt_style(prompt_style)) == expected_prompt_style
assert isinstance(get_prompt_style(prompt_style), expected_prompt_style)
@pytest.mark.skipif(
"llama_cpp" not in sys.modules, reason="requires the llama-cpp-python library"
)
def test_get_prompt_style_template_success():
jinja_template = "{% for message in messages %}<|{{message['role']}}|>: {{message['content'].strip() + '\\n'}}{% endfor %}<|assistant|>: "
with NamedTemporaryFile("w") as tmp_file:
path = Path(tmp_file.name)
tmp_file.write(jinja_template)
tmp_file.flush()
tmp_file.seek(0)
prompt_style = get_prompt_style(
"template", template_name=path.name, template_dir=path.parent
)
assert type(prompt_style) == TemplatePromptStyle
prompt = prompt_style.messages_to_prompt(
[
ChatMessage(
content="You are an AI assistant.", role=MessageRole.SYSTEM
),
ChatMessage(content="Hello, how are you doing?", role=MessageRole.USER),
]
)
expected_prompt = (
"<|system|>: You are an AI assistant.\n"
"<|user|>: Hello, how are you doing?\n"
"<|assistant|>: "
)
assert prompt == expected_prompt
@pytest.mark.skipif(
"llama_cpp" not in sys.modules, reason="requires the llama-cpp-python library"
)
def test_get_prompt_style_failure():
prompt_style = "unknown"
with pytest.raises(ValueError) as exc_info:
@@ -82,11 +32,8 @@ def test_get_prompt_style_failure():
assert str(exc_info.value) == f"Unknown prompt_style='{prompt_style}'"
@pytest.mark.skipif(
"llama_cpp" not in sys.modules, reason="requires the llama-cpp-python library"
)
def test_tag_prompt_style_format():
prompt_style = VigognePromptStyle()
prompt_style = TagPromptStyle()
messages = [
ChatMessage(content="You are an AI assistant.", role=MessageRole.SYSTEM),
ChatMessage(content="Hello, how are you doing?", role=MessageRole.USER),
@@ -101,24 +48,8 @@ def test_tag_prompt_style_format():
assert prompt_style.messages_to_prompt(messages) == expected_prompt
@pytest.mark.skipif(
"llama_cpp" not in sys.modules, reason="requires the llama-cpp-python library"
)
def test_tag_prompt_style_format_with_system_prompt():
system_prompt = "This is a system prompt from configuration."
prompt_style = VigognePromptStyle(default_system_prompt=system_prompt)
messages = [
ChatMessage(content="Hello, how are you doing?", role=MessageRole.USER),
]
expected_prompt = (
f"<|system|>: {system_prompt}\n"
"<|user|>: Hello, how are you doing?\n"
"<|assistant|>: "
)
assert prompt_style.messages_to_prompt(messages) == expected_prompt
prompt_style = TagPromptStyle()
messages = [
ChatMessage(
content="FOO BAR Custom sys prompt from messages.", role=MessageRole.SYSTEM
@@ -135,11 +66,41 @@ def test_tag_prompt_style_format_with_system_prompt():
assert prompt_style.messages_to_prompt(messages) == expected_prompt
@pytest.mark.skipif(
"llama_cpp" not in sys.modules, reason="requires the llama-cpp-python library"
)
def test_mistral_prompt_style_format():
prompt_style = MistralPromptStyle()
messages = [
ChatMessage(content="You are an AI assistant.", role=MessageRole.SYSTEM),
ChatMessage(content="Hello, how are you doing?", role=MessageRole.USER),
]
expected_prompt = (
"<s>[INST] You are an AI assistant. [/INST]</s>"
"[INST] Hello, how are you doing? [/INST]"
)
assert prompt_style.messages_to_prompt(messages) == expected_prompt
def test_chatml_prompt_style_format():
prompt_style = ChatMLPromptStyle()
messages = [
ChatMessage(content="You are an AI assistant.", role=MessageRole.SYSTEM),
ChatMessage(content="Hello, how are you doing?", role=MessageRole.USER),
]
expected_prompt = (
"<|im_start|>system\n"
"You are an AI assistant.<|im_end|>\n"
"<|im_start|>user\n"
"Hello, how are you doing?<|im_end|>\n"
"<|im_start|>assistant\n"
)
assert prompt_style.messages_to_prompt(messages) == expected_prompt
def test_llama2_prompt_style_format():
prompt_style = LlamaIndexPromptStyle()
prompt_style = Llama2PromptStyle()
messages = [
ChatMessage(content="You are an AI assistant.", role=MessageRole.SYSTEM),
ChatMessage(content="Hello, how are you doing?", role=MessageRole.USER),
@@ -156,26 +117,8 @@ def test_llama2_prompt_style_format():
assert prompt_style.messages_to_prompt(messages) == expected_prompt
@pytest.mark.skipif(
"llama_cpp" not in sys.modules, reason="requires the llama-cpp-python library"
)
def test_llama2_prompt_style_with_system_prompt():
system_prompt = "This is a system prompt from configuration."
prompt_style = LlamaIndexPromptStyle(default_system_prompt=system_prompt)
messages = [
ChatMessage(content="Hello, how are you doing?", role=MessageRole.USER),
]
expected_prompt = (
"<s> [INST] <<SYS>>\n"
f" {system_prompt} \n"
"<</SYS>>\n"
"\n"
" Hello, how are you doing? [/INST]"
)
assert prompt_style.messages_to_prompt(messages) == expected_prompt
prompt_style = Llama2PromptStyle()
messages = [
ChatMessage(
content="FOO BAR Custom sys prompt from messages.", role=MessageRole.SYSTEM

tiktoken_cache/.gitignore (new vendored file, 2 lines)

@@ -0,0 +1,2 @@
*
!.gitignore

(modified file)

@@ -1 +1 @@
0.1.0
0.5.0