V3 docs max (#2488)

* new skeleton

Signed-off-by: Max Cembalest <max@nomic.ai>

* v3 docs

Signed-off-by: Max Cembalest <max@nomic.ai>

---------

Signed-off-by: Max Cembalest <max@nomic.ai>
mcembalest
2024-07-01 13:00:14 -04:00
committed by GitHub
parent bd307abfe6
commit 5306595176
57 changed files with 865 additions and 170 deletions


@@ -0,0 +1,140 @@
# GPT4All Chat UI
The [GPT4All Chat Client](https://gpt4all.io) lets you easily interact with any local large language model.
It is optimized to run 7-13B parameter LLMs on the CPUs of any computer running macOS/Windows/Linux.
## Running LLMs on CPU
The GPT4All Chat UI supports models from all newer versions of `llama.cpp` in `GGUF` format, including the `Mistral`, `LLaMA2`, `LLaMA`, `OpenLLaMa`, `Falcon`, `MPT`, `Replit`, `Starcoder`, and `Bert` architectures.
GPT4All maintains an official list of recommended models located in [models3.json](https://github.com/nomic-ai/gpt4all/blob/main/gpt4all-chat/metadata/models3.json). You can open a pull request to add new models; if accepted, they will show up in the official download dialog.
#### Sideloading any GGUF model
If a model is compatible with the gpt4all-backend, you can sideload it into GPT4All Chat by:
1. Downloading your model in GGUF format. It should be a 3-8 GB file similar to the ones [here](https://huggingface.co/TheBloke/Orca-2-7B-GGUF/tree/main).
2. Identifying your GPT4All model downloads folder. This is the path listed at the bottom of the downloads dialog.
3. Placing your downloaded model inside GPT4All's model downloads folder.
4. Restarting your GPT4All app. Your model should now appear in the model selection list.
## Plugins
GPT4All Chat Plugins allow you to expand the capabilities of Local LLMs.
### LocalDocs Plugin (Chat With Your Data)
LocalDocs is a GPT4All feature that allows you to chat with your local files and data.
It allows you to utilize powerful local LLMs to chat with private data without any data leaving your computer or server.
When using LocalDocs, your LLM will cite the sources that most likely contributed to a given output. Note that even an LLM equipped with LocalDocs can hallucinate. The LocalDocs plugin will utilize your documents to help answer prompts, and you will see references appear below the response.
<p align="center">
<img width="70%" src="https://github.com/nomic-ai/gpt4all/assets/10168/fe5dd3c0-b3cc-4701-98d3-0280dfbcf26f">
</p>
#### Enabling LocalDocs
1. Install the latest version of GPT4All Chat from [GPT4All Website](https://gpt4all.io).
2. Go to `Settings > LocalDocs tab`.
3. Download the SBert model.
4. Configure a collection (folder) on your computer that contains the files your LLM should have access to. You can alter the contents of the folder/directory at any time. As you add more files to your collection, your LLM will dynamically be able to access them.
5. Spin up a chat session with any LLM (including external ones like ChatGPT, but be aware that your data will leave your machine!).
6. At the top right, click the database icon and select which collection you want your LLM to know about during your chat session.
7. You can begin searching with LocalDocs even before the collection has finished indexing, but note that the search will not include the parts of the collection yet to be indexed.
#### LocalDocs Capabilities
LocalDocs allows your LLM to have context about the contents of your documentation collection.
LocalDocs **can**:
- Query your documents based upon your prompt / question. Your documents will be searched for snippets that can be used to provide context for an answer. The most relevant snippets will be inserted into your prompt's context, but it will be up to the underlying model to decide how best to use the provided context.
LocalDocs **cannot**:
- Answer general metadata queries (e.g. `What documents do you know about?`, `Tell me about my documents`)
- Summarize a single document (e.g. `Summarize my magna carta PDF.`)
See the Troubleshooting section for common issues.
#### How LocalDocs Works
LocalDocs works by maintaining an index of all data in the directory your collection is linked to. This index
consists of small chunks of each document that the LLM can receive as additional input when you ask it a question.
The general technique this plugin uses is called [Retrieval Augmented Generation](https://arxiv.org/abs/2005.11401).
These document chunks help your LLM respond to queries with knowledge about the contents of your data.
The number of chunks and the size of each chunk can be configured in the LocalDocs plugin settings tab.
LocalDocs currently supports plain text files (`.txt`, `.md`, and `.rst`) and PDF files (`.pdf`).
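To make the retrieval-augmented generation idea more concrete, below is a minimal, hypothetical Python sketch of the same technique using the `Embed4All` and `GPT4All` classes from the Python bindings. This is not the LocalDocs implementation itself; the file name, chunk size, and similarity helper are illustrative assumptions.
```python
from gpt4all import Embed4All, GPT4All

def chunk_words(text, size=256):
    # naive fixed-size chunks; LocalDocs' real chunking is configurable in its settings
    words = text.split()
    return [' '.join(words[i:i + size]) for i in range(0, len(words), size)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

# hypothetical local document
document = open('my_notes.txt', encoding='utf-8').read()
chunks = chunk_words(document)

embedder = Embed4All()
chunk_vectors = [embedder.embed(c) for c in chunks]

question = 'What did I write about quarterly planning?'
query_vector = embedder.embed(question)

# keep the few most similar chunks and hand them to the model as context
ranked = sorted(range(len(chunks)), key=lambda i: cosine(query_vector, chunk_vectors[i]), reverse=True)
context = '\n\n'.join(chunks[i] for i in ranked[:3])

model = GPT4All('orca-mini-3b-gguf2-q4_0.gguf')
with model.chat_session():
    print(model.generate(f'Using this context:\n{context}\n\nAnswer the question: {question}'))
```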
#### Troubleshooting and FAQ
*My LocalDocs plugin isn't using my documents*
- Make sure LocalDocs is enabled for your chat session (the DB icon on the top-right should have a border)
- If your document collection is large, wait 1-2 minutes for it to finish indexing.
#### LocalDocs Roadmap
- A customized model fine-tuned with retrieval in the loop.
- Plugin compatibility with chat client server mode.
## Server Mode
GPT4All Chat comes with a built-in server mode allowing you to programmatically interact
with any supported local LLM through a *very familiar* HTTP API. You can find the API documentation [here](https://platform.openai.com/docs/api-reference/completions).
Enabling server mode in the chat client will spin up an HTTP server running on `localhost`, port
`4891` (the reverse of 1984). You can enable the web server via `GPT4All Chat > Settings > Enable web server`.
Begin using local LLMs in your AI-powered apps by changing a single line of code: the base path for requests.
```python
import openai
openai.api_base = "http://localhost:4891/v1"
#openai.api_base = "https://api.openai.com/v1"
openai.api_key = "not needed for a local LLM"
# Set up the prompt and other parameters for the API request
prompt = "Who is Michael Jordan?"
# model = "gpt-3.5-turbo"
#model = "mpt-7b-chat"
model = "gpt4all-j-v1.3-groovy"
# Make the API request
response = openai.Completion.create(
    model=model,
    prompt=prompt,
    max_tokens=50,
    temperature=0.28,
    top_p=0.95,
    n=1,
    echo=True,
    stream=False
)
# Print the generated completion
print(response)
```
which gives the following response:
```json
{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "logprobs": null,
            "text": "Who is Michael Jordan?\nMichael Jordan is a former professional basketball player who played for the Chicago Bulls in the NBA. He was born on December 30, 1963, and retired from playing basketball in 1998."
        }
    ],
    "created": 1684260896,
    "id": "foobarbaz",
    "model": "gpt4all-j-v1.3-groovy",
    "object": "text_completion",
    "usage": {
        "completion_tokens": 35,
        "prompt_tokens": 39,
        "total_tokens": 74
    }
}
```


@@ -0,0 +1,198 @@
# GPT4All CLI
The GPT4All command-line interface (CLI) is a Python script which is built on top of the
[Python bindings][docs-bindings-python] ([repository][repo-bindings-python]) and the [typer]
package. The source code, README, and local build instructions can be found
[here][repo-bindings-cli].
[docs-bindings-python]: gpt4all_python.md
[repo-bindings-python]: https://github.com/nomic-ai/gpt4all/tree/main/gpt4all-bindings/python
[repo-bindings-cli]: https://github.com/nomic-ai/gpt4all/tree/main/gpt4all-bindings/cli
[typer]: https://typer.tiangolo.com/
## Installation
### The Short Version
The CLI is a Python script called [app.py]. If you're already familiar with Python best practices,
the short version is to [download app.py][app.py-download] into a folder of your choice, install
the two required dependencies with some variant of:
```shell
pip install gpt4all typer
```
Then run it with a variant of:
```shell
python app.py repl
```
In case you're wondering, _REPL_ is an acronym for [read-eval-print loop][wiki-repl].
[app.py]: https://github.com/nomic-ai/gpt4all/blob/main/gpt4all-bindings/cli/app.py
[app.py-download]: https://raw.githubusercontent.com/nomic-ai/gpt4all/main/gpt4all-bindings/cli/app.py
[wiki-repl]: https://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop
### Recommendations & The Long Version
Especially if you have several applications/libraries which depend on Python, to avoid descending
into dependency hell at some point, you should:
- Consider always installing into some kind of [_virtual environment_][venv].
- On a _Unix-like_ system, don't use `sudo` for anything other than packages provided by the system
package manager, i.e. never with `pip`.
[venv]: https://docs.python.org/3/library/venv.html
There are several ways and tools available to do this, so below are descriptions on how to install
with a _virtual environment_ (recommended) or a user installation on all three main platforms.
Different platforms can have slightly different ways to start the Python interpreter itself.
Note: _Typer_ has an optional dependency for more fanciful output. If you want that, replace `typer`
with `typer[all]` in the pip-install instructions below.
#### Virtual Environment Installation
You can name your _virtual environment_ folder for the CLI whatever you like. In the following,
`gpt4all-cli` is used throughout.
##### macOS
There are at least three ways to have a Python installation on _macOS_, and possibly not all of them
provide a full installation of Python and its tools. When in doubt, try the following:
```shell
python3 -m venv --help
python3 -m pip --help
```
Both should print the help for the `venv` and `pip` commands, respectively. If they don't, consult
the documentation of your Python installation on how to enable them, or download a separate Python
variant, for example try a [unified installer package from python.org][python.org-downloads].
[python.org-downloads]: https://www.python.org/downloads/
Once ready, do:
```shell
python3 -m venv gpt4all-cli
. gpt4all-cli/bin/activate
python3 -m pip install gpt4all typer
```
##### Windows
Download the [official installer from python.org][python.org-downloads] if Python isn't already
present on your system.
A _Windows_ installation should already provide all the components for a _virtual environment_. Run:
```shell
py -3 -m venv gpt4all-cli
gpt4all-cli\Scripts\activate
py -m pip install gpt4all typer
```
##### Linux
On Linux, a Python installation is often split into several packages and not all are necessarily
installed by default. For example, on Debian/Ubuntu and derived distros, you will want to ensure
their presence with the following:
```shell
sudo apt-get install python3-venv python3-pip
```
The next steps are similar to the other platforms:
```shell
python3 -m venv gpt4all-cli
. gpt4all-cli/bin/activate
python3 -m pip install gpt4all typer
```
On other distros, the situation might be different; in particular, the package names can vary a lot.
You'll have to look them up in the documentation, software directory, or package search.
#### User Installation
##### macOS
There are at least three ways to have a Python installation on _macOS_, and possibly not all of them
provide a full installation of Python and its tools. When in doubt, try the following:
```shell
python3 -m pip --help
```
That should print the help for the `pip` command. If it doesn't, consult the documentation of your
Python installation on how to enable it, or download a separate Python variant, for example try a
[unified installer package from python.org][python.org-downloads].
Once ready, do:
```shell
python3 -m pip install --user --upgrade gpt4all typer
```
##### Windows
Download the [official installer from python.org][python.org-downloads] if Python isn't already
present on your system. It includes all the necessary components. Run:
```shell
py -3 -m pip install --user --upgrade gpt4all typer
```
##### Linux
On Linux, a Python installation is often split into several packages and not all are necessarily
installed by default. For example, on Debian/Ubuntu and derived distros, you will want to ensure
their presence with the following:
```shell
sudo apt-get install python3-pip
```
The next steps are similar to the other platforms:
```shell
python3 -m pip install --user --upgrade gpt4all typer
```
On other distros, the situation might be different; in particular, the package names can vary a lot.
You'll have to look them up in the documentation, software directory, or package search.
## Running the CLI
The CLI is a self-contained script called [app.py]. As such, you can [download][app.py-download]
and save it anywhere you like, as long as the Python interpreter has access to the mentioned
dependencies.
Note: different platforms can have slightly different ways to start Python. Whereas below the
interpreter command is written as `python`, you typically want to type instead:
- On _Unix-like_ systems: `python3`
- On _Windows_: `py -3`
The simplest way to start the CLI is:
```shell
python app.py repl
```
This automatically selects the [groovy] model and downloads it into the `.cache/gpt4all/` folder
of your home directory, if not already present.
[groovy]: https://huggingface.co/nomic-ai/gpt4all-j#model-details
If you want to use a different model, you can do so with the `-m`/`--model` parameter. If only a
model file name is provided, it will again check in `.cache/gpt4all/` and might start downloading.
If instead given a path to an existing model, the command could for example look like this:
```shell
python app.py repl --model /home/user/my-gpt4all-models/gpt4all-13b-snoozy-q4_0.gguf
```
When you're done and want to end a session, simply type `/exit`.
To get help and information on all the available commands and options on the command-line, run:
```shell
python app.py --help
```
And while inside the running _REPL_, write `/help`.
Note that if you've installed the required packages into a _virtual environment_, you don't need
to activate that every time you want to run the CLI. Instead, you can just start it with the Python
interpreter in the folder `gpt4all-cli/bin/` (_Unix-like_) or `gpt4all-cli/Scripts/` (_Windows_).
That also makes it easy to set an alias e.g. in [Bash][bash-aliases] or [PowerShell][posh-aliases]:
- Bash: `alias gpt4all="'/full/path/to/gpt4all-cli/bin/python' '/full/path/to/app.py' repl"`
- PowerShell:
```posh
Function GPT4All-Venv-CLI {"C:\full\path\to\gpt4all-cli\Scripts\python.exe" "C:\full\path\to\app.py" repl}
Set-Alias -Name gpt4all -Value GPT4All-Venv-CLI
```
Don't forget to save these in the start-up file of your shell.
[bash-aliases]: https://www.gnu.org/software/bash/manual/html_node/Aliases.html
[posh-aliases]: https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/set-alias
Finally, if on _Windows_ you see a box instead of an arrow `⇢` as the prompt character, you should
change the console font to one which offers better Unicode support.


@@ -0,0 +1,100 @@
# GPT4All FAQ
## What models are supported by the GPT4All ecosystem?
Currently, there are six different model architectures that are supported:
1. GPT-J - Based off of the GPT-J architecture with examples found [here](https://huggingface.co/EleutherAI/gpt-j-6b)
2. LLaMA - Based off of the LLaMA architecture with examples found [here](https://huggingface.co/models?sort=downloads&search=llama)
3. MPT - Based off of Mosaic ML's MPT architecture with examples found [here](https://huggingface.co/mosaicml/mpt-7b)
4. Replit - Based off of Replit Inc.'s Replit architecture with examples found [here](https://huggingface.co/replit/replit-code-v1-3b)
5. Falcon - Based off of TII's Falcon architecture with examples found [here](https://huggingface.co/tiiuae/falcon-40b)
6. StarCoder - Based off of BigCode's StarCoder architecture with examples found [here](https://huggingface.co/bigcode/starcoder)
## Why so many different architectures? What differentiates them?
One of the major differences is license. Currently, the LLaMA-based models are subject to a non-commercial license, whereas the GPT-J and MPT base
models allow commercial usage. However, LLaMA's successor [Llama 2 is commercially licensable](https://ai.meta.com/llama/license/), too. In the early
advent of the recent explosion of activity in open source local models, the LLaMA models have generally been seen as performing better, but that is
changing quickly. Every week - even every day! - new models are released, with some of the GPT-J and MPT models competitive in performance/quality with
LLaMA. What's more, there are some very nice architectural innovations with the MPT models that could lead to new performance/quality gains.
## How does GPT4All make these models available for CPU inference?
By leveraging the ggml library written by Georgi Gerganov and a growing community of developers. There are currently multiple different versions of
this library. The original GitHub repo can be found [here](https://github.com/ggerganov/ggml), but the developer of the library has also created a
LLaMA based version [here](https://github.com/ggerganov/llama.cpp). Currently, this backend is using the latter as a submodule.
## Does that mean GPT4All is compatible with all llama.cpp models and vice versa?
Yes!
The upstream [llama.cpp](https://github.com/ggerganov/llama.cpp) project has introduced several [compatibility breaking] quantization methods recently.
These changes render all previous models (including the ones that GPT4All uses) inoperative with newer versions of llama.cpp.
Fortunately, we have engineered a submoduling system allowing us to dynamically load different versions of the underlying library so that
GPT4All just works.
[compatibility breaking]: https://github.com/ggerganov/llama.cpp/commit/b9fd7eee57df101d4a3e3eabc9fd6c2cb13c9ca1
## What are the system requirements?
Your CPU needs to support [AVX or AVX2 instructions](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions) and you need enough RAM to load a model into memory.
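On Linux, one quick (and purely illustrative) way to check for these instruction sets is to read the CPU flags from `/proc/cpuinfo`; macOS and Windows users can use their platform's own CPU information tools instead.
```python
# Linux-only sanity check for AVX/AVX2 support (illustrative sketch)
flags = set()
with open('/proc/cpuinfo') as f:
    for line in f:
        if line.startswith('flags'):
            flags.update(line.split(':', 1)[1].split())

print('AVX: ', 'avx' in flags)
print('AVX2:', 'avx2' in flags)
```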
## What about GPU inference?
In newer versions of llama.cpp, there has been some added support for NVIDIA GPUs for inference. We're investigating how to incorporate this into our downloadable installers.
## Ok, so bottom line... how do I make my model on Hugging Face compatible with the GPT4All ecosystem right now?
1. Check to make sure the Hugging Face model is available in one of our three supported architectures
2. If it is, then you can use the conversion script inside of our pinned llama.cpp submodule for GPTJ and LLaMA based models
3. Or if your model is an MPT model you can use the conversion script located directly in this backend directory under the scripts subdirectory
## Language Bindings
#### There's a problem with the download
Some bindings can download a model, if allowed to do so. For example, in Python or TypeScript if `allow_download=True`
or `allowDownload=true` (default), a model is automatically downloaded into `.cache/gpt4all/` in the user's home folder,
unless it already exists.
In case of connection issues or errors during the download, you might want to manually verify the model file's MD5
checksum by comparing it with the one listed in [models3.json].
As an alternative to the basic downloader built into the bindings, you can choose to download from the
<https://gpt4all.io/> website instead. Scroll down to 'Model Explorer' and pick your preferred model.
[models3.json]: https://github.com/nomic-ai/gpt4all/blob/main/gpt4all-chat/metadata/models3.json
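If you want to compute the checksum yourself, a small sketch using only the Python standard library might look like this (the model path is a placeholder; compare the printed value with the MD5 listed for that model in [models3.json]):
```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    # stream the file in chunks so multi-GB models don't need to fit in memory
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk_size), b''):
            digest.update(block)
    return digest.hexdigest()

# placeholder path; point this at the model file in your download folder
print(file_md5('/home/user/.cache/gpt4all/orca-mini-3b-gguf2-q4_0.gguf'))
```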
#### I need the chat GUI and bindings to behave the same
The chat GUI and bindings are based on the same backend. You can make them behave the same way by following these steps:
- First of all, ensure that all parameters in the chat GUI settings match those passed to the generating API, e.g.:
=== "Python"
``` py
from gpt4all import GPT4All
model = GPT4All(...)
model.generate("prompt text", temp=0, ...) # adjust parameters
```
=== "TypeScript"
``` ts
import { createCompletion, loadModel } from '../src/gpt4all.js'
const ll = await loadModel(...);
const messages = ...
const re = await createCompletion(ll, messages, { temp: 0, ... }); // adjust parameters
```
- To make comparing the output easier, set _Temperature_ in both to 0 for now. This will make the output deterministic.
- Next you'll have to compare the templates, adjusting them as necessary, based on how you're using the bindings.
- Specifically, in Python:
- With simple `generate()` calls, the input has to be surrounded with system and prompt templates.
- When using a chat session, it depends on whether the bindings are allowed to download [models3.json]. If yes,
and in the chat GUI the default templates are used, it'll be handled automatically. If no, use
`chat_session()` template parameters to customize them.
- Once you're done, remember to reset _Temperature_ to its previous value in both chat GUI and your custom code.


@@ -0,0 +1,70 @@
# Monitoring
Leverage OpenTelemetry to perform real-time monitoring of your LLM application and GPUs using [OpenLIT](https://github.com/openlit/openlit). This tool helps you easily collect data on user interactions and performance metrics, along with GPU performance metrics, which can assist in enhancing the functionality and dependability of your GPT4All-based LLM application.
## How it works
OpenLIT adds automatic OTel instrumentation to the GPT4All SDK. It covers the `generate` and `embedding` functions, helping to track LLM usage by gathering inputs and outputs. This allows users to monitor and evaluate the performance and behavior of their LLM application in different environments. OpenLIT also provides OTel auto-instrumentation for monitoring GPU metrics like utilization, temperature, power usage, and memory usage.
Additionally, you have the flexibility to view and analyze the generated traces and metrics either in the OpenLIT UI or by exporting them to widely used observability tools like Grafana and DataDog for more comprehensive analysis and visualization.
## Getting Started
Here's a straightforward guide to help you set up and start monitoring your application:
### 1. Install the OpenLIT SDK
Open your terminal and run:
```shell
pip install openlit
```
### 2. Setup Monitoring for your Application
In your application, initiate OpenLIT as outlined below:
```python
from gpt4all import GPT4All
import openlit
openlit.init() # Initialize OpenLIT monitoring
model = GPT4All(model_name='orca-mini-3b-gguf2-q4_0.gguf')
# Start a chat session and send queries
with model.chat_session():
    response1 = model.generate(prompt='hello', temp=0)
    response2 = model.generate(prompt='write me a short poem', temp=0)
    response3 = model.generate(prompt='thank you', temp=0)
    print(model.current_chat_session)
```
This setup wraps your gpt4all model interactions, capturing valuable data about each request and response.
### 3. (Optional) Enable GPU Monitoring
If your application runs on NVIDIA GPUs, you can enable GPU stats collection in the OpenLIT SDK by adding `collect_gpu_stats=True`. This collects GPU metrics like utilization, temperature, power usage, and memory-related performance metrics. The collected metrics are OpenTelemetry gauges.
```python
from gpt4all import GPT4All
import openlit
openlit.init(collect_gpu_stats=True) # Initialize OpenLIT monitoring
model = GPT4All(model_name='orca-mini-3b-gguf2-q4_0.gguf')
# Start a chat session and send queries
with model.chat_session():
    response1 = model.generate(prompt='hello', temp=0)
    response2 = model.generate(prompt='write me a short poem', temp=0)
    response3 = model.generate(prompt='thank you', temp=0)
    print(model.current_chat_session)
```
### Visualize
Once you've set up data collection with [OpenLIT](https://github.com/openlit/openlit), you can visualize and analyze this information to better understand your application's performance:
- **Using OpenLIT UI:** Connect to OpenLIT's UI to start exploring performance metrics. Visit the OpenLIT [Quickstart Guide](https://docs.openlit.io/latest/quickstart) for step-by-step details.
- **Integrate with existing Observability Tools:** If you use tools like Grafana or DataDog, you can integrate the data collected by OpenLIT. For instructions on setting up these connections, check the OpenLIT [Connections Guide](https://docs.openlit.io/latest/connections/intro).

File diff suppressed because it is too large.


@@ -0,0 +1,268 @@
# GPT4All Python Generation API
The `GPT4All` Python package provides bindings to our C/C++ model backend libraries.
The source code and local build instructions can be found [here](https://github.com/nomic-ai/gpt4all/tree/main/gpt4all-bindings/python).
## Quickstart
```bash
pip install gpt4all
```
``` py
from gpt4all import GPT4All
model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")
```
This will:
- Instantiate `GPT4All`, which is the primary public API to your large language model (LLM).
- Automatically download the given model to `~/.cache/gpt4all/` if not already present.
Read further to see how to chat with this model.
### Chatting with GPT4All
To start chatting with a local LLM, you will need to start a chat session. Within a chat session, the model will be
prompted with the appropriate template, and history will be preserved between successive calls to `generate()`.
=== "GPT4All Example"
``` py
model = GPT4All(model_name='orca-mini-3b-gguf2-q4_0.gguf')
with model.chat_session():
    response1 = model.generate(prompt='hello', temp=0)
    response2 = model.generate(prompt='write me a short poem', temp=0)
    response3 = model.generate(prompt='thank you', temp=0)
    print(model.current_chat_session)
```
=== "Output"
``` json
[
    {
        'role': 'user',
        'content': 'hello'
    },
    {
        'role': 'assistant',
        'content': 'What is your name?'
    },
    {
        'role': 'user',
        'content': 'write me a short poem'
    },
    {
        'role': 'assistant',
        'content': "I would love to help you with that! Here's a short poem I came up with:\nBeneath the autumn leaves,\nThe wind whispers through the trees.\nA gentle breeze, so at ease,\nAs if it were born to play.\nAnd as the sun sets in the sky,\nThe world around us grows still."
    },
    {
        'role': 'user',
        'content': 'thank you'
    },
    {
        'role': 'assistant',
        'content': "You're welcome! I hope this poem was helpful or inspiring for you. Let me know if there is anything else I can assist you with."
    }
]
```
When using GPT4All models in the `chat_session()` context:
- Consecutive chat exchanges are taken into account and not discarded until the session ends, as long as the model has capacity.
- A system prompt is inserted into the beginning of the model's context.
- Each prompt passed to `generate()` is wrapped in the appropriate prompt template. If you pass `allow_download=False`
to GPT4All or are using a model that is not from the official models list, you must pass a prompt template using the
`prompt_template` parameter of `chat_session()`.
NOTE: If you do not use `chat_session()`, calls to `generate()` will not be wrapped in a prompt template. This will
cause the model to *continue* the prompt instead of *answering* it. When in doubt, use a chat session, as many newer
models are designed to be used exclusively with a prompt template.
[models3.json]: https://github.com/nomic-ai/gpt4all/blob/main/gpt4all-chat/metadata/models3.json
### Streaming Generations
To interact with GPT4All responses as the model generates, use the `streaming=True` flag during generation.
=== "GPT4All Streaming Example"
``` py
from gpt4all import GPT4All
model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")
tokens = []
with model.chat_session():
    for token in model.generate("What is the capital of France?", streaming=True):
        tokens.append(token)
print(tokens)
```
=== "Output"
```
[' The', ' capital', ' of', ' France', ' is', ' Paris', '.']
```
### The Generate Method API
::: gpt4all.gpt4all.GPT4All.generate
## Examples & Explanations
### Influencing Generation
The three most influential parameters in generation are _Temperature_ (`temp`), _Top-p_ (`top_p`) and _Top-K_ (`top_k`).
In a nutshell, during the process of selecting the next token, not just one or a few are considered, but every single
token in the vocabulary is given a probability. The parameters can change the field of candidate tokens.
- **Temperature** makes the process either more or less random. A _Temperature_ above 1 increasingly "levels the playing
field", while at a _Temperature_ between 0 and 1 the likelihood of the best token candidates grows even more. A
_Temperature_ of 0 results in selecting the best token, making the output deterministic. A _Temperature_ of 1
represents a neutral setting with regard to randomness in the process.
- _Top-p_ and _Top-K_ both narrow the field:
- **Top-K** limits candidate tokens to a fixed number after sorting by probability. Setting it higher than the
vocabulary size deactivates this limit.
- **Top-p** selects tokens based on their total probabilities. For example, a value of 0.8 means "include the best
tokens, whose accumulated probabilities reach or just surpass 80%". Setting _Top-p_ to 1, which is 100%,
effectively disables it.
The recommendation is to keep at least one of _Top-K_ and _Top-p_ active. Other parameters can also influence
generation; be sure to review all their descriptions.
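As a concrete, illustrative example, these parameters can be passed directly to `generate()`; the values below are just reasonable starting points, not recommendations:
``` py
from gpt4all import GPT4All

model = GPT4All('orca-mini-3b-gguf2-q4_0.gguf')
with model.chat_session():
    # fairly conservative sampling: low-ish temperature, both Top-K and Top-p active
    response = model.generate(
        'Why is the sky blue?',
        max_tokens=200,
        temp=0.7,
        top_k=40,
        top_p=0.4,
    )
    print(response)
```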
### Specifying the Model Folder
The model folder can be set with the `model_path` parameter when creating a `GPT4All` instance. The example below
is the same as if it weren't provided; that is, `~/.cache/gpt4all/` is the default folder.
``` py
from pathlib import Path
from gpt4all import GPT4All
model = GPT4All(model_name='orca-mini-3b-gguf2-q4_0.gguf', model_path=Path.home() / '.cache' / 'gpt4all')
```
If you want to point it at the chat GUI's default folder, it should be:
=== "macOS"
``` py
from pathlib import Path
from gpt4all import GPT4All
model_name = 'orca-mini-3b-gguf2-q4_0.gguf'
model_path = Path.home() / 'Library' / 'Application Support' / 'nomic.ai' / 'GPT4All'
model = GPT4All(model_name, model_path)
```
=== "Windows"
``` py
from pathlib import Path
from gpt4all import GPT4All
import os
model_name = 'orca-mini-3b-gguf2-q4_0.gguf'
model_path = Path(os.environ['LOCALAPPDATA']) / 'nomic.ai' / 'GPT4All'
model = GPT4All(model_name, model_path)
```
=== "Linux"
``` py
from pathlib import Path
from gpt4all import GPT4All
model_name = 'orca-mini-3b-gguf2-q4_0.gguf'
model_path = Path.home() / '.local' / 'share' / 'nomic.ai' / 'GPT4All'
model = GPT4All(model_name, model_path)
```
Alternatively, you could also change the module's default model directory:
``` py
from pathlib import Path
from gpt4all import GPT4All, gpt4all
gpt4all.DEFAULT_MODEL_DIRECTORY = Path.home() / 'my' / 'models-directory'
model = GPT4All('orca-mini-3b-gguf2-q4_0.gguf')
```
### Managing Templates
When using a `chat_session()`, you may customize the system prompt, and set the prompt template if necessary:
=== "GPT4All Custom Session Templates Example"
``` py
from gpt4all import GPT4All
model = GPT4All('wizardlm-13b-v1.2.Q4_0.gguf')
system_template = 'A chat between a curious user and an artificial intelligence assistant.\n'
# many models use triple hash '###' for keywords, Vicunas are simpler:
prompt_template = 'USER: {0}\nASSISTANT: '
with model.chat_session(system_template, prompt_template):
    response1 = model.generate('why is the grass green?')
    print(response1)
    print()
    response2 = model.generate('why is the sky blue?')
    print(response2)
```
=== "Possible Output"
```
The color of grass can be attributed to its chlorophyll content, which allows it
to absorb light energy from sunlight through photosynthesis. Chlorophyll absorbs
blue and red wavelengths of light while reflecting other colors such as yellow
and green. This is why the leaves appear green to our eyes.
The color of the sky appears blue due to a phenomenon called Rayleigh scattering,
which occurs when sunlight enters Earth's atmosphere and interacts with air
molecules such as nitrogen and oxygen. Blue light has shorter wavelength than
other colors in the visible spectrum, so it is scattered more easily by these
particles, making the sky appear blue to our eyes.
```
### Without Online Connectivity
To prevent GPT4All from accessing online resources, instantiate it with `allow_download=False`. When using this flag,
there will be no default system prompt, and you must specify the prompt template yourself.
You can retrieve a model's default system prompt and prompt template with an online instance of GPT4All:
=== "Prompt Template Retrieval"
``` py
from gpt4all import GPT4All
model = GPT4All('orca-mini-3b-gguf2-q4_0.gguf')
print(repr(model.config['systemPrompt']))
print(repr(model.config['promptTemplate']))
```
=== "Output"
```py
'### System:\nYou are an AI assistant that follows instruction extremely well. Help as much as you can.\n\n'
'### User:\n{0}\n### Response:\n'
```
Then you can pass them explicitly when creating an offline instance:
``` py
from gpt4all import GPT4All
model = GPT4All('orca-mini-3b-gguf2-q4_0.gguf', allow_download=False)
system_prompt = '### System:\nYou are an AI assistant that follows instruction extremely well. Help as much as you can.\n\n'
prompt_template = '### User:\n{0}\n\n### Response:\n'
with model.chat_session(system_prompt=system_prompt, prompt_template=prompt_template):
    ...
```
### Interrupting Generation
The simplest way to stop generation is to set a fixed upper limit with the `max_tokens` parameter.
If you know exactly when a model should stop responding, you can add a custom callback, like so:
=== "GPT4All Custom Stop Callback"
``` py
from gpt4all import GPT4All
model = GPT4All('orca-mini-3b-gguf2-q4_0.gguf')
def stop_on_token_callback(token_id, token_string):
    # one sentence is enough:
    if '.' in token_string:
        return False
    else:
        return True

response = model.generate('Blue Whales are the biggest animal to ever inhabit the Earth.',
                          temp=0, callback=stop_on_token_callback)
print(response)
```
=== "Output"
```
They can grow up to 100 feet (30 meters) long and weigh as much as 20 tons (18 metric tons).
```
## API Documentation
::: gpt4all.gpt4all.GPT4All


@@ -0,0 +1,176 @@
# Embeddings
GPT4All supports generating high quality embeddings of arbitrary length text using any embedding model supported by llama.cpp.
An embedding is a vector representation of a piece of text. Embeddings are useful for tasks such as retrieval for
question answering (including retrieval augmented generation or *RAG*), semantic similarity search, classification, and
topic clustering.
## Supported Embedding Models
The following models have built-in support in Embed4All:
| Name | Embed4All `model_name` | Context Length | Embedding Length | File Size |
|--------------------|------------------------------------------------------|---------------:|-----------------:|----------:|
| [SBert] | all&#x2011;MiniLM&#x2011;L6&#x2011;v2.gguf2.f16.gguf | 512 | 384 | 44 MiB |
| [Nomic Embed v1] | nomic&#x2011;embed&#x2011;text&#x2011;v1.f16.gguf | 2048 | 768 | 262 MiB |
| [Nomic Embed v1.5] | nomic&#x2011;embed&#x2011;text&#x2011;v1.5.f16.gguf | 2048 | 64-768 | 262 MiB |
The context length is the maximum number of word pieces, or *tokens*, that a model can embed at once. Embedding texts
longer than a model's context length requires some kind of strategy; see [Embedding Longer Texts] for more information.
The embedding length is the size of the vector returned by `Embed4All.embed`.
[SBert]: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
[Nomic Embed v1]: https://huggingface.co/nomic-ai/nomic-embed-text-v1
[Nomic Embed v1.5]: https://huggingface.co/nomic-ai/nomic-embed-text-v1.5
[Embedding Longer Texts]: #embedding-longer-texts
## Quickstart
```bash
pip install gpt4all
```
### Generating Embeddings
By default, embeddings will be generated on the CPU using all-MiniLM-L6-v2.
=== "Embed4All Example"
```py
from gpt4all import Embed4All
text = 'The quick brown fox jumps over the lazy dog'
embedder = Embed4All()
output = embedder.embed(text)
print(output)
```
=== "Output"
```
[0.034696947783231735, -0.07192722707986832, 0.06923297047615051, ...]
```
You can also use the GPU to accelerate the embedding model by specifying the `device` parameter. See the [GPT4All
constructor] for more information.
=== "GPU Example"
```py
from gpt4all import Embed4All
text = 'The quick brown fox jumps over the lazy dog'
embedder = Embed4All(device='gpu')
output = embedder.embed(text)
print(output)
```
=== "Output"
```
[0.034696947783231735, -0.07192722707986832, 0.06923297047615051, ...]
```
[GPT4All constructor]: gpt4all_python.md#gpt4all.gpt4all.GPT4All.__init__
### Nomic Embed
Embed4All has built-in support for Nomic's open-source embedding model, [Nomic Embed]. When using this model, you must
specify the task type using the `prefix` argument. This may be one of `search_query`, `search_document`,
`classification`, or `clustering`. For retrieval applications, you should use the `search_document` prefix for all of your
documents and the `search_query` prefix for your queries. See the [Nomic Embedding Guide] for more info.
=== "Nomic Embed Example"
```py
from gpt4all import Embed4All
text = 'Who is Laurens van der Maaten?'
embedder = Embed4All('nomic-embed-text-v1.f16.gguf')
output = embedder.embed(text, prefix='search_query')
print(output)
```
=== "Output"
```
[-0.013357644900679588, 0.027070969343185425, -0.0232995692640543, ...]
```
[Nomic Embed]: https://blog.nomic.ai/posts/nomic-embed-text-v1
[Nomic Embedding Guide]: https://docs.nomic.ai/atlas/guides/embeddings#embedding-task-types
### Embedding Longer Texts
Embed4All accepts a parameter called `long_text_mode`. This controls the behavior of Embed4All for texts longer than the
context length of the embedding model.
In the default mode of "mean", Embed4All will break long inputs into chunks and average their embeddings to compute the
final result.
To change this behavior, you can set the `long_text_mode` parameter to "truncate", which will truncate the input to the
sequence length of the model before generating a single embedding.
=== "Truncation Example"
```py
from gpt4all import Embed4All
text = 'The ' * 512 + 'The quick brown fox jumps over the lazy dog'
embedder = Embed4All()
output = embedder.embed(text, long_text_mode="mean")
print(output)
print()
output = embedder.embed(text, long_text_mode="truncate")
print(output)
```
=== "Output"
```
[0.0039850445464253426, 0.04558328539133072, 0.0035536508075892925, ...]
[-0.009771130047738552, 0.034792833030223846, -0.013273917138576508, ...]
```
### Batching
You can send multiple texts to Embed4All in a single call. This can give faster results when individual texts are
significantly smaller than `n_ctx` tokens. (`n_ctx` defaults to 2048.)
=== "Batching Example"
```py
from gpt4all import Embed4All
texts = ['The quick brown fox jumps over the lazy dog', 'Foo bar baz']
embedder = Embed4All()
output = embedder.embed(texts)
print(output[0])
print()
print(output[1])
```
=== "Output"
```
[0.03551332652568817, 0.06137588247656822, 0.05281158909201622, ...]
[-0.03879690542817116, 0.00013223080895841122, 0.023148687556385994, ...]
```
The number of texts that can be embedded in one pass of the model is proportional to the `n_ctx` parameter of Embed4All.
Increasing it may increase batched embedding throughput if you have a fast GPU, at the cost of VRAM.
```py
embedder = Embed4All(n_ctx=4096, device='gpu')
```
### Resizable Dimensionality
The embedding dimension of Nomic Embed v1.5 can be resized using the `dimensionality` parameter. This parameter supports
any value between 64 and 768.
Shorter embeddings use less storage, memory, and bandwidth with a small performance cost. See the [blog post] for more
info.
[blog post]: https://blog.nomic.ai/posts/nomic-embed-matryoshka
=== "Matryoshka Example"
```py
from gpt4all import Embed4All
text = 'The quick brown fox jumps over the lazy dog'
embedder = Embed4All('nomic-embed-text-v1.5.f16.gguf')
output = embedder.embed(text, dimensionality=64)
print(len(output))
print(output)
```
=== "Output"
```
64
[-0.03567073494195938, 0.1301717758178711, -0.4333043396472931, ...]
```
### API documentation
::: gpt4all.gpt4all.Embed4All


@@ -0,0 +1,71 @@
# GPT4All
Welcome to the GPT4All documentation.
GPT4All is an open-source software ecosystem for anyone to run large language models (LLMs) **privately** on **everyday laptop & desktop computers**. No API calls or GPUs required.
The GPT4All Desktop Application is a touchpoint to interact with LLMs and integrate them with your local docs & local data for RAG (retrieval-augmented generation). No coding is required: just install the application, download the models of your choice, and you are ready to use your LLM.
Your local data is **yours**. GPT4All handles the retrieval privately and on-device to fetch relevant data to support your queries to your LLM.
Nomic AI oversees contributions to GPT4All to ensure quality, security, and maintainability. Additionally, Nomic AI has open-sourced code for training and deploying your own customized LLMs internally.
GPT4All software is optimized to run inference of 3-13 billion parameter large language models on the CPUs of laptops, desktops and servers.
=== "GPT4All Example"
``` py
from gpt4all import GPT4All
model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")
output = model.generate("The capital of France is ", max_tokens=3)
print(output)
```
=== "Output"
```
1. Paris
```
See [Python Bindings](gpt4all_python.md) to use GPT4All.
### Navigating the Documentation
In an effort to ensure cross-operating-system and cross-language compatibility, the [GPT4All software ecosystem](https://github.com/nomic-ai/gpt4all)
is organized as a monorepo with the following structure:
- **gpt4all-backend**: The GPT4All backend maintains and exposes a universal, performance optimized C API for running inference with multi-billion parameter Transformer Decoders.
This C API is then bound to any higher level programming language such as C++, Python, Go, etc.
- **gpt4all-bindings**: GPT4All bindings contain a variety of high-level programming languages that implement the C API. Each directory is a bound programming language. The [CLI](gpt4all_cli.md) is included here, as well.
- **gpt4all-chat**: GPT4All Chat is an OS native chat application that runs on macOS, Windows and Linux. It is the easiest way to run local, privacy aware chat assistants on everyday hardware. You can download it on the [GPT4All Website](https://gpt4all.io) and read its source code in the monorepo.
Explore detailed documentation for the backend, bindings and chat client in the sidebar.
## Models
The GPT4All software ecosystem is compatible with the following Transformer architectures:
- `Falcon`
- `LLaMA` (including `OpenLLaMA`)
- `MPT` (including `Replit`)
- `GPT-J`
You can find an exhaustive list of supported models on the [website](https://gpt4all.io) or in the [models directory](https://raw.githubusercontent.com/nomic-ai/gpt4all/main/gpt4all-chat/metadata/models3.json).
GPT4All models are artifacts produced through a process known as neural network quantization.
A multi-billion parameter Transformer Decoder usually takes 30+ GB of VRAM to execute a forward pass.
Most people do not have such a powerful computer or access to GPU hardware. By running trained LLMs through quantization algorithms,
some GPT4All models can run on your laptop using only 4-8 GB of RAM, enabling their widespread usage.
Bigger models might still require more RAM, however.
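As rough, back-of-the-envelope arithmetic (actual file sizes vary with the quantization scheme and format overhead):
``` py
# approximate memory needed just for the weights of a 7B-parameter model
params = 7e9
for precision, bytes_per_param in [('fp32', 4), ('fp16', 2), ('8-bit', 1), ('4-bit', 0.5)]:
    print(f'{precision}: ~{params * bytes_per_param / 1e9:.1f} GB')
# fp32: ~28.0 GB   fp16: ~14.0 GB   8-bit: ~7.0 GB   4-bit: ~3.5 GB
```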
Any model trained with one of these architectures can be quantized and run locally with all GPT4All bindings and in the
chat client. You can add new variants by contributing to the gpt4all-backend.
## Frequently Asked Questions
Find answers to frequently asked questions by searching the [Github issues](https://github.com/nomic-ai/gpt4all/issues) or in the [documentation FAQ](gpt4all_faq.md).
## Getting the most out of your local LLM
**Inference Speed**
of a local LLM depends on two factors: model size and the number of tokens given as input.
It is not advised to prompt local LLMs with large chunks of context as their inference speed will heavily degrade.
You will likely want to run GPT4All models on GPU if you would like to utilize context windows larger than 750 tokens. Native GPU support for GPT4All models is planned.
**Inference Performance:**
Which model is best? The answer depends on your use case. The ability of an LLM to faithfully follow instructions is conditioned
on the quantity and diversity of the pre-training data it trained on and the diversity, quality and factuality of the data the LLM
was fine-tuned on. A goal of GPT4All is to bring the most powerful local assistant model to your desktop and Nomic AI is actively
working on efforts to improve their performance and quality.