V3 docs max (#2488)

* new skeleton

Signed-off-by: Max Cembalest <max@nomic.ai>

* v3 docs

Signed-off-by: Max Cembalest <max@nomic.ai>

---------

Signed-off-by: Max Cembalest <max@nomic.ai>
mcembalest
2024-07-01 13:00:14 -04:00
committed by GitHub
parent bd307abfe6
commit 5306595176
57 changed files with 865 additions and 170 deletions


@@ -0,0 +1,140 @@
# GPT4All Chat UI
The [GPT4All Chat Client](https://gpt4all.io) lets you easily interact with any local large language model.
It is optimized to run 7-13B parameter LLMs on the CPUs of any computer running macOS/Windows/Linux.
## Running LLMs on CPU
The GPT4All Chat UI supports models from all newer versions of `llama.cpp` in `GGUF` format, including the `Mistral`, `LLaMA2`, `LLaMA`, `OpenLLaMa`, `Falcon`, `MPT`, `Replit`, `Starcoder`, and `Bert` architectures.
GPT4All maintains an official list of recommended models located in [models3.json](https://github.com/nomic-ai/gpt4all/blob/main/gpt4all-chat/metadata/models3.json). You can open a pull request to add new models; if accepted, they will show up in the official download dialog.
#### Sideloading any GGUF model
If a model is compatible with the gpt4all-backend, you can sideload it into GPT4All Chat by:
1. Downloading your model in GGUF format. It should be a 3-8 GB file similar to the ones [here](https://huggingface.co/TheBloke/Orca-2-7B-GGUF/tree/main).
2. Identifying your GPT4All model downloads folder. This is the path listed at the bottom of the downloads dialog.
3. Placing your downloaded model inside GPT4All's model downloads folder.
4. Restarting your GPT4All app. Your model should now appear in the model selection list.
## Plugins
GPT4All Chat Plugins allow you to expand the capabilities of Local LLMs.
### LocalDocs Plugin (Chat With Your Data)
LocalDocs is a GPT4All feature that allows you to chat with your local files and data.
It allows you to utilize powerful local LLMs to chat with private data without any data leaving your computer or server.
When using LocalDocs, your LLM will cite the sources that most likely contributed to a given output. Note that even an LLM equipped with LocalDocs can hallucinate. The LocalDocs plugin will utilize your documents to help answer prompts, and you will see references appear below the response.
<p align="center">
<img width="70%" src="https://github.com/nomic-ai/gpt4all/assets/10168/fe5dd3c0-b3cc-4701-98d3-0280dfbcf26f">
</p>
#### Enabling LocalDocs
1. Install the latest version of GPT4All Chat from [GPT4All Website](https://gpt4all.io).
2. Go to `Settings > LocalDocs tab`.
3. Download the SBert model.
4. Configure a collection (folder) on your computer that contains the files your LLM should have access to. You can alter the contents of the folder/directory at any time. As you add more files to your collection, your LLM will dynamically be able to access them.
5. Spin up a chat session with any LLM (including external ones like ChatGPT, but be aware that your data will leave your machine!).
6. At the top right, click the database icon and select which collection you want your LLM to know about during your chat session.
7. You can begin searching with LocalDocs even before the collection has finished indexing, but note that the search will not include the parts of the collection yet to be indexed.
#### LocalDocs Capabilities
LocalDocs allows your LLM to have context about the contents of your documentation collection.
LocalDocs **can**:
- Query your documents based upon your prompt / question. Your documents will be searched for snippets that can be used to provide context for an answer. The most relevant snippets will be inserted into your prompt's context, but it will be up to the underlying model to decide how best to use the provided context.
LocalDocs **cannot**:
- Answer general metadata queries (e.g. `What documents do you know about?`, `Tell me about my documents`)
- Summarize a single document (e.g. `Summarize my magna carta PDF.`)
See the Troubleshooting section for common issues.
#### How LocalDocs Works
LocalDocs works by maintaining an index of all data in the directory your collection is linked to. This index
consists of small chunks of each document that the LLM can receive as additional input when you ask it a question.
The general technique this plugin uses is called [Retrieval Augmented Generation](https://arxiv.org/abs/2005.11401).
These document chunks help your LLM respond to queries with knowledge about the contents of your data.
The number of chunks and the size of each chunk can be configured in the LocalDocs plugin settings tab.
LocalDocs currently supports plain text files (`.txt`, `.md`, and `.rst`) and PDF files (`.pdf`).
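To make the retrieval-augmented generation idea more concrete, below is a minimal, hypothetical Python sketch of the same technique using the `Embed4All` and `GPT4All` classes from the Python bindings. This is not the LocalDocs implementation itself; the file name, chunk size, and similarity helper are illustrative assumptions.
```python
from gpt4all import Embed4All, GPT4All

def chunk_words(text, size=256):
    # naive fixed-size chunks; LocalDocs' real chunking is configurable in its settings
    words = text.split()
    return [' '.join(words[i:i + size]) for i in range(0, len(words), size)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

# hypothetical local document
document = open('my_notes.txt', encoding='utf-8').read()
chunks = chunk_words(document)

embedder = Embed4All()
chunk_vectors = [embedder.embed(c) for c in chunks]

question = 'What did I write about quarterly planning?'
query_vector = embedder.embed(question)

# keep the few most similar chunks and hand them to the model as context
ranked = sorted(range(len(chunks)), key=lambda i: cosine(query_vector, chunk_vectors[i]), reverse=True)
context = '\n\n'.join(chunks[i] for i in ranked[:3])

model = GPT4All('orca-mini-3b-gguf2-q4_0.gguf')
with model.chat_session():
    print(model.generate(f'Using this context:\n{context}\n\nAnswer the question: {question}'))
```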
#### Troubleshooting and FAQ
*My LocalDocs plugin isn't using my documents*
- Make sure LocalDocs is enabled for your chat session (the DB icon on the top-right should have a border)
- If your document collection is large, wait 1-2 minutes for it to finish indexing.
#### LocalDocs Roadmap
- A customized model fine-tuned with retrieval in the loop.
- Plugin compatibility with chat client server mode.
## Server Mode
GPT4All Chat comes with a built-in server mode allowing you to programmatically interact
with any supported local LLM through a *very familiar* HTTP API. You can find the API documentation [here](https://platform.openai.com/docs/api-reference/completions).
Enabling server mode in the chat client will spin up an HTTP server running on `localhost`, port
`4891` (the reverse of 1984). You can enable the web server via `GPT4All Chat > Settings > Enable web server`.
Begin using local LLMs in your AI-powered apps by changing a single line of code: the base path for requests.
```python
import openai
openai.api_base = "http://localhost:4891/v1"
#openai.api_base = "https://api.openai.com/v1"
openai.api_key = "not needed for a local LLM"
# Set up the prompt and other parameters for the API request
prompt = "Who is Michael Jordan?"
# model = "gpt-3.5-turbo"
#model = "mpt-7b-chat"
model = "gpt4all-j-v1.3-groovy"
# Make the API request
response = openai.Completion.create(
    model=model,
    prompt=prompt,
    max_tokens=50,
    temperature=0.28,
    top_p=0.95,
    n=1,
    echo=True,
    stream=False
)
# Print the generated completion
print(response)
```
which gives the following response:
```json
{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "logprobs": null,
            "text": "Who is Michael Jordan?\nMichael Jordan is a former professional basketball player who played for the Chicago Bulls in the NBA. He was born on December 30, 1963, and retired from playing basketball in 1998."
        }
    ],
    "created": 1684260896,
    "id": "foobarbaz",
    "model": "gpt4all-j-v1.3-groovy",
    "object": "text_completion",
    "usage": {
        "completion_tokens": 35,
        "prompt_tokens": 39,
        "total_tokens": 74
    }
}
```


@@ -0,0 +1,198 @@
# GPT4All CLI
The GPT4All command-line interface (CLI) is a Python script which is built on top of the
[Python bindings][docs-bindings-python] ([repository][repo-bindings-python]) and the [typer]
package. The source code, README, and local build instructions can be found
[here][repo-bindings-cli].
[docs-bindings-python]: gpt4all_python.md
[repo-bindings-python]: https://github.com/nomic-ai/gpt4all/tree/main/gpt4all-bindings/python
[repo-bindings-cli]: https://github.com/nomic-ai/gpt4all/tree/main/gpt4all-bindings/cli
[typer]: https://typer.tiangolo.com/
## Installation
### The Short Version
The CLI is a Python script called [app.py]. If you're already familiar with Python best practices,
the short version is to [download app.py][app.py-download] into a folder of your choice, install
the two required dependencies with some variant of:
```shell
pip install gpt4all typer
```
Then run it with a variant of:
```shell
python app.py repl
```
In case you're wondering, _REPL_ is an acronym for [read-eval-print loop][wiki-repl].
[app.py]: https://github.com/nomic-ai/gpt4all/blob/main/gpt4all-bindings/cli/app.py
[app.py-download]: https://raw.githubusercontent.com/nomic-ai/gpt4all/main/gpt4all-bindings/cli/app.py
[wiki-repl]: https://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop
### Recommendations & The Long Version
Especially if you have several applications/libraries which depend on Python, to avoid descending
into dependency hell at some point, you should:
- Consider always installing into some kind of [_virtual environment_][venv].
- On a _Unix-like_ system, don't use `sudo` for anything other than packages provided by the system
package manager, i.e. never with `pip`.
[venv]: https://docs.python.org/3/library/venv.html
There are several ways and tools available to do this, so below are descriptions on how to install
with a _virtual environment_ (recommended) or a user installation on all three main platforms.
Different platforms can have slightly different ways to start the Python interpreter itself.
Note: _Typer_ has an optional dependency for more fanciful output. If you want that, replace `typer`
with `typer[all]` in the pip-install instructions below.
#### Virtual Environment Installation
You can name your _virtual environment_ folder for the CLI whatever you like. In the following,
`gpt4all-cli` is used throughout.
##### macOS
There are at least three ways to have a Python installation on _macOS_, and possibly not all of them
provide a full installation of Python and its tools. When in doubt, try the following:
```shell
python3 -m venv --help
python3 -m pip --help
```
Both should print the help for the `venv` and `pip` commands, respectively. If they don't, consult
the documentation of your Python installation on how to enable them, or download a separate Python
variant, for example try a [unified installer package from python.org][python.org-downloads].
[python.org-downloads]: https://www.python.org/downloads/
Once ready, do:
```shell
python3 -m venv gpt4all-cli
. gpt4all-cli/bin/activate
python3 -m pip install gpt4all typer
```
##### Windows
Download the [official installer from python.org][python.org-downloads] if Python isn't already
present on your system.
A _Windows_ installation should already provide all the components for a _virtual environment_. Run:
```shell
py -3 -m venv gpt4all-cli
gpt4all-cli\Scripts\activate
py -m pip install gpt4all typer
```
##### Linux
On Linux, a Python installation is often split into several packages and not all are necessarily
installed by default. For example, on Debian/Ubuntu and derived distros, you will want to ensure
their presence with the following:
```shell
sudo apt-get install python3-venv python3-pip
```
The next steps are similar to the other platforms:
```shell
python3 -m venv gpt4all-cli
. gpt4all-cli/bin/activate
python3 -m pip install gpt4all typer
```
On other distros, the situation might be different; in particular, the package names can vary a lot.
You'll have to look them up in the documentation, software directory, or package search.
#### User Installation
##### macOS
There are at least three ways to have a Python installation on _macOS_, and possibly not all of them
provide a full installation of Python and its tools. When in doubt, try the following:
```shell
python3 -m pip --help
```
That should print the help for the `pip` command. If it doesn't, consult the documentation of your
Python installation on how to enable it, or download a separate Python variant, for example try a
[unified installer package from python.org][python.org-downloads].
Once ready, do:
```shell
python3 -m pip install --user --upgrade gpt4all typer
```
##### Windows
Download the [official installer from python.org][python.org-downloads] if Python isn't already
present on your system. It includes all the necessary components. Run:
```shell
py -3 -m pip install --user --upgrade gpt4all typer
```
##### Linux
On Linux, a Python installation is often split into several packages and not all are necessarily
installed by default. For example, on Debian/Ubuntu and derived distros, you will want to ensure
their presence with the following:
```shell
sudo apt-get install python3-pip
```
The next steps are similar to the other platforms:
```shell
python3 -m pip install --user --upgrade gpt4all typer
```
On other distros, the situation might be different; in particular, the package names can vary a lot.
You'll have to look them up in the documentation, software directory, or package search.
## Running the CLI
The CLI is a self-contained script called [app.py]. As such, you can [download][app.py-download]
and save it anywhere you like, as long as the Python interpreter has access to the mentioned
dependencies.
Note: different platforms can have slightly different ways to start Python. Whereas below the
interpreter command is written as `python`, you typically want to type instead:
- On _Unix-like_ systems: `python3`
- On _Windows_: `py -3`
The simplest way to start the CLI is:
```shell
python app.py repl
```
This automatically selects the [groovy] model and downloads it into the `.cache/gpt4all/` folder
of your home directory, if not already present.
[groovy]: https://huggingface.co/nomic-ai/gpt4all-j#model-details
If you want to use a different model, you can do so with the `-m`/`--model` parameter. If only a
model file name is provided, it will again check in `.cache/gpt4all/` and might start downloading.
If instead given a path to an existing model, the command could for example look like this:
```shell
python app.py repl --model /home/user/my-gpt4all-models/gpt4all-13b-snoozy-q4_0.gguf
```
When you're done and want to end a session, simply type `/exit`.
To get help and information on all the available commands and options on the command-line, run:
```shell
python app.py --help
```
And while inside the running _REPL_, write `/help`.
Note that if you've installed the required packages into a _virtual environment_, you don't need
to activate that every time you want to run the CLI. Instead, you can just start it with the Python
interpreter in the folder `gpt4all-cli/bin/` (_Unix-like_) or `gpt4all-cli/Scripts/` (_Windows_).
That also makes it easy to set an alias e.g. in [Bash][bash-aliases] or [PowerShell][posh-aliases]:
- Bash: `alias gpt4all="'/full/path/to/gpt4all-cli/bin/python' '/full/path/to/app.py' repl"`
- PowerShell:
```posh
Function GPT4All-Venv-CLI {"C:\full\path\to\gpt4all-cli\Scripts\python.exe" "C:\full\path\to\app.py" repl}
Set-Alias -Name gpt4all -Value GPT4All-Venv-CLI
```
Don't forget to save these in the start-up file of your shell.
[bash-aliases]: https://www.gnu.org/software/bash/manual/html_node/Aliases.html
[posh-aliases]: https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/set-alias
Finally, if on _Windows_ you see a box instead of an arrow `⇢` as the prompt character, you should
change the console font to one which offers better Unicode support.


@@ -0,0 +1,100 @@
# GPT4All FAQ
## What models are supported by the GPT4All ecosystem?
Currently, there are six different model architectures that are supported:
1. GPT-J - Based off of the GPT-J architecture with examples found [here](https://huggingface.co/EleutherAI/gpt-j-6b)
2. LLaMA - Based off of the LLaMA architecture with examples found [here](https://huggingface.co/models?sort=downloads&search=llama)
3. MPT - Based off of Mosaic ML's MPT architecture with examples found [here](https://huggingface.co/mosaicml/mpt-7b)
4. Replit - Based off of Replit Inc.'s Replit architecture with examples found [here](https://huggingface.co/replit/replit-code-v1-3b)
5. Falcon - Based off of TII's Falcon architecture with examples found [here](https://huggingface.co/tiiuae/falcon-40b)
6. StarCoder - Based off of BigCode's StarCoder architecture with examples found [here](https://huggingface.co/bigcode/starcoder)
## Why so many different architectures? What differentiates them?
One of the major differences is license. Currently, the LLaMA-based models are subject to a non-commercial license, whereas the GPT-J and MPT base
models allow commercial usage. However, LLaMA's successor [Llama 2 is commercially licensable](https://ai.meta.com/llama/license/), too. In the early
advent of the recent explosion of activity in open source local models, the LLaMA models have generally been seen as performing better, but that is
changing quickly. Every week - even every day! - new models are released, with some of the GPT-J and MPT models competitive in performance/quality with
LLaMA. What's more, there are some very nice architectural innovations with the MPT models that could lead to new performance/quality gains.
## How does GPT4All make these models available for CPU inference?
By leveraging the ggml library written by Georgi Gerganov and a growing community of developers. There are currently multiple different versions of
this library. The original GitHub repo can be found [here](https://github.com/ggerganov/ggml), but the developer of the library has also created a
LLaMA based version [here](https://github.com/ggerganov/llama.cpp). Currently, this backend is using the latter as a submodule.
## Does that mean GPT4All is compatible with all llama.cpp models and vice versa?
Yes!
The upstream [llama.cpp](https://github.com/ggerganov/llama.cpp) project has introduced several [compatibility breaking] quantization methods recently.
These changes render all previous models (including the ones that GPT4All uses) inoperative with newer versions of llama.cpp.
Fortunately, we have engineered a submoduling system allowing us to dynamically load different versions of the underlying library so that
GPT4All just works.
[compatibility breaking]: https://github.com/ggerganov/llama.cpp/commit/b9fd7eee57df101d4a3e3eabc9fd6c2cb13c9ca1
## What are the system requirements?
Your CPU needs to support [AVX or AVX2 instructions](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions) and you need enough RAM to load a model into memory.
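On Linux, one quick (and purely illustrative) way to check for these instruction sets is to read the CPU flags from `/proc/cpuinfo`; macOS and Windows users can use their platform's own CPU information tools instead.
```python
# Linux-only sanity check for AVX/AVX2 support (illustrative sketch)
flags = set()
with open('/proc/cpuinfo') as f:
    for line in f:
        if line.startswith('flags'):
            flags.update(line.split(':', 1)[1].split())

print('AVX: ', 'avx' in flags)
print('AVX2:', 'avx2' in flags)
```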
## What about GPU inference?
In newer versions of llama.cpp, there has been some added support for NVIDIA GPUs for inference. We're investigating how to incorporate this into our downloadable installers.
## Ok, so bottom line... how do I make my model on Hugging Face compatible with the GPT4All ecosystem right now?
1. Check to make sure the Hugging Face model is available in one of our three supported architectures
2. If it is, then you can use the conversion script inside of our pinned llama.cpp submodule for GPTJ and LLaMA based models
3. Or if your model is an MPT model you can use the conversion script located directly in this backend directory under the scripts subdirectory
## Language Bindings
#### There's a problem with the download
Some bindings can download a model, if allowed to do so. For example, in Python or TypeScript if `allow_download=True`
or `allowDownload=true` (default), a model is automatically downloaded into `.cache/gpt4all/` in the user's home folder,
unless it already exists.
In case of connection issues or errors during the download, you might want to manually verify the model file's MD5
checksum by comparing it with the one listed in [models3.json].
As an alternative to the basic downloader built into the bindings, you can choose to download from the
<https://gpt4all.io/> website instead. Scroll down to 'Model Explorer' and pick your preferred model.
[models3.json]: https://github.com/nomic-ai/gpt4all/blob/main/gpt4all-chat/metadata/models3.json
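If you want to compute the checksum yourself, a small sketch using only the Python standard library might look like this (the model path is a placeholder; compare the printed value with the MD5 listed for that model in [models3.json]):
```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    # stream the file in chunks so multi-GB models don't need to fit in memory
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk_size), b''):
            digest.update(block)
    return digest.hexdigest()

# placeholder path; point this at the model file in your download folder
print(file_md5('/home/user/.cache/gpt4all/orca-mini-3b-gguf2-q4_0.gguf'))
```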
#### I need the chat GUI and bindings to behave the same
The chat GUI and bindings are based on the same backend. You can make them behave the same way by following these steps:
- First of all, ensure that all parameters in the chat GUI settings match those passed to the generating API, e.g.:
=== "Python"
``` py
from gpt4all import GPT4All
model = GPT4All(...)
model.generate("prompt text", temp=0, ...) # adjust parameters
```
=== "TypeScript"
``` ts
import { createCompletion, loadModel } from '../src/gpt4all.js'
const ll = await loadModel(...);
const messages = ...
const re = await createCompletion(ll, messages, { temp: 0, ... }); // adjust parameters
```
- To make comparing the output easier, set _Temperature_ in both to 0 for now. This will make the output deterministic.
- Next you'll have to compare the templates, adjusting them as necessary, based on how you're using the bindings.
- Specifically, in Python:
- With simple `generate()` calls, the input has to be surrounded with system and prompt templates.
- When using a chat session, it depends on whether the bindings are allowed to download [models3.json]. If yes,
and in the chat GUI the default templates are used, it'll be handled automatically. If no, use
`chat_session()` template parameters to customize them.
- Once you're done, remember to reset _Temperature_ to its previous value in both chat GUI and your custom code.


@@ -0,0 +1,70 @@
# Monitoring
Leverage OpenTelemetry to perform real-time monitoring of your LLM application and GPUs using [OpenLIT](https://github.com/openlit/openlit). This tool helps you easily collect data on user interactions and performance metrics, along with GPU performance metrics, which can assist in enhancing the functionality and dependability of your GPT4All-based LLM application.
## How it works
OpenLIT adds automatic OTel instrumentation to the GPT4All SDK. It covers the `generate` and `embedding` functions, helping to track LLM usage by gathering inputs and outputs. This allows users to monitor and evaluate the performance and behavior of their LLM application in different environments. OpenLIT also provides OTel auto-instrumentation for monitoring GPU metrics like utilization, temperature, power usage, and memory usage.
Additionally, you have the flexibility to view and analyze the generated traces and metrics either in the OpenLIT UI or by exporting them to widely used observability tools like Grafana and DataDog for more comprehensive analysis and visualization.
## Getting Started
Here's a straightforward guide to help you set up and start monitoring your application:
### 1. Install the OpenLIT SDK
Open your terminal and run:
```shell
pip install openlit
```
### 2. Setup Monitoring for your Application
In your application, initiate OpenLIT as outlined below:
```python
from gpt4all import GPT4All
import openlit
openlit.init() # Initialize OpenLIT monitoring
model = GPT4All(model_name='orca-mini-3b-gguf2-q4_0.gguf')
# Start a chat session and send queries
with model.chat_session():
    response1 = model.generate(prompt='hello', temp=0)
    response2 = model.generate(prompt='write me a short poem', temp=0)
    response3 = model.generate(prompt='thank you', temp=0)
    print(model.current_chat_session)
```
This setup wraps your gpt4all model interactions, capturing valuable data about each request and response.
### 3. (Optional) Enable GPU Monitoring
If your application runs on NVIDIA GPUs, you can enable GPU stats collection in the OpenLIT SDK by adding `collect_gpu_stats=True`. This collects GPU metrics like utilization, temperature, power usage, and memory-related performance metrics. The collected metrics are OpenTelemetry gauges.
```python
from gpt4all import GPT4All
import openlit
openlit.init(collect_gpu_stats=True) # Initialize OpenLIT monitoring
model = GPT4All(model_name='orca-mini-3b-gguf2-q4_0.gguf')
# Start a chat session and send queries
with model.chat_session():
    response1 = model.generate(prompt='hello', temp=0)
    response2 = model.generate(prompt='write me a short poem', temp=0)
    response3 = model.generate(prompt='thank you', temp=0)
    print(model.current_chat_session)
```
### Visualize
Once you've set up data collection with [OpenLIT](https://github.com/openlit/openlit), you can visualize and analyze this information to better understand your application's performance:
- **Using OpenLIT UI:** Connect to OpenLIT's UI to start exploring performance metrics. Visit the OpenLIT [Quickstart Guide](https://docs.openlit.io/latest/quickstart) for step-by-step details.
- **Integrate with existing Observability Tools:** If you use tools like Grafana or DataDog, you can integrate the data collected by OpenLIT. For instructions on setting up these connections, check the OpenLIT [Connections Guide](https://docs.openlit.io/latest/connections/intro).

File diff suppressed because it is too large.


@@ -0,0 +1,268 @@
# GPT4All Python Generation API
The `GPT4All` Python package provides bindings to our C/C++ model backend libraries.
The source code and local build instructions can be found [here](https://github.com/nomic-ai/gpt4all/tree/main/gpt4all-bindings/python).
## Quickstart
```bash
pip install gpt4all
```
``` py
from gpt4all import GPT4All
model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")
```
This will:
- Instantiate `GPT4All`, which is the primary public API to your large language model (LLM).
- Automatically download the given model to `~/.cache/gpt4all/` if not already present.
Read further to see how to chat with this model.
### Chatting with GPT4All
To start chatting with a local LLM, you will need to start a chat session. Within a chat session, the model will be
prompted with the appropriate template, and history will be preserved between successive calls to `generate()`.
=== "GPT4All Example"
``` py
model = GPT4All(model_name='orca-mini-3b-gguf2-q4_0.gguf')
with model.chat_session():
    response1 = model.generate(prompt='hello', temp=0)
    response2 = model.generate(prompt='write me a short poem', temp=0)
    response3 = model.generate(prompt='thank you', temp=0)
    print(model.current_chat_session)
```
=== "Output"
``` json
[
    {
        'role': 'user',
        'content': 'hello'
    },
    {
        'role': 'assistant',
        'content': 'What is your name?'
    },
    {
        'role': 'user',
        'content': 'write me a short poem'
    },
    {
        'role': 'assistant',
        'content': "I would love to help you with that! Here's a short poem I came up with:\nBeneath the autumn leaves,\nThe wind whispers through the trees.\nA gentle breeze, so at ease,\nAs if it were born to play.\nAnd as the sun sets in the sky,\nThe world around us grows still."
    },
    {
        'role': 'user',
        'content': 'thank you'
    },
    {
        'role': 'assistant',
        'content': "You're welcome! I hope this poem was helpful or inspiring for you. Let me know if there is anything else I can assist you with."
    }
]
```
When using GPT4All models in the `chat_session()` context:
- Consecutive chat exchanges are taken into account and not discarded until the session ends, as long as the model has capacity.
- A system prompt is inserted into the beginning of the model's context.
- Each prompt passed to `generate()` is wrapped in the appropriate prompt template. If you pass `allow_download=False`
to GPT4All or are using a model that is not from the official models list, you must pass a prompt template using the
`prompt_template` parameter of `chat_session()`.
NOTE: If you do not use `chat_session()`, calls to `generate()` will not be wrapped in a prompt template. This will
cause the model to *continue* the prompt instead of *answering* it. When in doubt, use a chat session, as many newer
models are designed to be used exclusively with a prompt template.
[models3.json]: https://github.com/nomic-ai/gpt4all/blob/main/gpt4all-chat/metadata/models3.json
### Streaming Generations
To interact with GPT4All responses as the model generates, use the `streaming=True` flag during generation.
=== "GPT4All Streaming Example"
``` py
from gpt4all import GPT4All
model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")
tokens = []
with model.chat_session():
    for token in model.generate("What is the capital of France?", streaming=True):
        tokens.append(token)
print(tokens)
```
=== "Output"
```
[' The', ' capital', ' of', ' France', ' is', ' Paris', '.']
```
### The Generate Method API
::: gpt4all.gpt4all.GPT4All.generate
## Examples & Explanations
### Influencing Generation
The three most influential parameters in generation are _Temperature_ (`temp`), _Top-p_ (`top_p`) and _Top-K_ (`top_k`).
In a nutshell, during the process of selecting the next token, not just one or a few are considered, but every single
token in the vocabulary is given a probability. The parameters can change the field of candidate tokens.
- **Temperature** makes the process either more or less random. A _Temperature_ above 1 increasingly "levels the playing
field", while at a _Temperature_ between 0 and 1 the likelihood of the best token candidates grows even more. A
_Temperature_ of 0 results in selecting the best token, making the output deterministic. A _Temperature_ of 1
represents a neutral setting with regard to randomness in the process.
- _Top-p_ and _Top-K_ both narrow the field:
- **Top-K** limits candidate tokens to a fixed number after sorting by probability. Setting it higher than the
vocabulary size deactivates this limit.
- **Top-p** selects tokens based on their total probabilities. For example, a value of 0.8 means "include the best
tokens, whose accumulated probabilities reach or just surpass 80%". Setting _Top-p_ to 1, which is 100%,
effectively disables it.
The recommendation is to keep at least one of _Top-K_ and _Top-p_ active. Other parameters can also influence
generation; be sure to review all their descriptions.
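As a concrete, illustrative example, these parameters can be passed directly to `generate()`; the values below are just reasonable starting points, not recommendations:
``` py
from gpt4all import GPT4All

model = GPT4All('orca-mini-3b-gguf2-q4_0.gguf')
with model.chat_session():
    # fairly conservative sampling: low-ish temperature, both Top-K and Top-p active
    response = model.generate(
        'Why is the sky blue?',
        max_tokens=200,
        temp=0.7,
        top_k=40,
        top_p=0.4,
    )
    print(response)
```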
### Specifying the Model Folder
The model folder can be set with the `model_path` parameter when creating a `GPT4All` instance. The example below
is the same as if it weren't provided; that is, `~/.cache/gpt4all/` is the default folder.
``` py
from pathlib import Path
from gpt4all import GPT4All
model = GPT4All(model_name='orca-mini-3b-gguf2-q4_0.gguf', model_path=Path.home() / '.cache' / 'gpt4all')
```
If you want to point it at the chat GUI's default folder, it should be:
=== "macOS"
``` py
from pathlib import Path
from gpt4all import GPT4All
model_name = 'orca-mini-3b-gguf2-q4_0.gguf'
model_path = Path.home() / 'Library' / 'Application Support' / 'nomic.ai' / 'GPT4All'
model = GPT4All(model_name, model_path)
```
=== "Windows"
``` py
from pathlib import Path
from gpt4all import GPT4All
import os
model_name = 'orca-mini-3b-gguf2-q4_0.gguf'
model_path = Path(os.environ['LOCALAPPDATA']) / 'nomic.ai' / 'GPT4All'
model = GPT4All(model_name, model_path)
```
=== "Linux"
``` py
from pathlib import Path
from gpt4all import GPT4All
model_name = 'orca-mini-3b-gguf2-q4_0.gguf'
model_path = Path.home() / '.local' / 'share' / 'nomic.ai' / 'GPT4All'
model = GPT4All(model_name, model_path)
```
Alternatively, you could also change the module's default model directory:
``` py
from pathlib import Path
from gpt4all import GPT4All, gpt4all
gpt4all.DEFAULT_MODEL_DIRECTORY = Path.home() / 'my' / 'models-directory'
model = GPT4All('orca-mini-3b-gguf2-q4_0.gguf')
```
### Managing Templates
When using a `chat_session()`, you may customize the system prompt, and set the prompt template if necessary:
=== "GPT4All Custom Session Templates Example"
``` py
from gpt4all import GPT4All
model = GPT4All('wizardlm-13b-v1.2.Q4_0.gguf')
system_template = 'A chat between a curious user and an artificial intelligence assistant.\n'
# many models use triple hash '###' for keywords, Vicunas are simpler:
prompt_template = 'USER: {0}\nASSISTANT: '
with model.chat_session(system_template, prompt_template):
    response1 = model.generate('why is the grass green?')
    print(response1)
    print()
    response2 = model.generate('why is the sky blue?')
    print(response2)
```
=== "Possible Output"
```
The color of grass can be attributed to its chlorophyll content, which allows it
to absorb light energy from sunlight through photosynthesis. Chlorophyll absorbs
blue and red wavelengths of light while reflecting other colors such as yellow
and green. This is why the leaves appear green to our eyes.
The color of the sky appears blue due to a phenomenon called Rayleigh scattering,
which occurs when sunlight enters Earth's atmosphere and interacts with air
molecules such as nitrogen and oxygen. Blue light has shorter wavelength than
other colors in the visible spectrum, so it is scattered more easily by these
particles, making the sky appear blue to our eyes.
```
### Without Online Connectivity
To prevent GPT4All from accessing online resources, instantiate it with `allow_download=False`. When using this flag,
there will be no default system prompt, and you must specify the prompt template yourself.
You can retrieve a model's default system prompt and prompt template with an online instance of GPT4All:
=== "Prompt Template Retrieval"
``` py
from gpt4all import GPT4All
model = GPT4All('orca-mini-3b-gguf2-q4_0.gguf')
print(repr(model.config['systemPrompt']))
print(repr(model.config['promptTemplate']))
```
=== "Output"
```py
'### System:\nYou are an AI assistant that follows instruction extremely well. Help as much as you can.\n\n'
'### User:\n{0}\n### Response:\n'
```
Then you can pass them explicitly when creating an offline instance:
``` py
from gpt4all import GPT4All
model = GPT4All('orca-mini-3b-gguf2-q4_0.gguf', allow_download=False)
system_prompt = '### System:\nYou are an AI assistant that follows instruction extremely well. Help as much as you can.\n\n'
prompt_template = '### User:\n{0}\n\n### Response:\n'
with model.chat_session(system_prompt=system_prompt, prompt_template=prompt_template):
    ...
```
### Interrupting Generation
The simplest way to stop generation is to set a fixed upper limit with the `max_tokens` parameter.
If you know exactly when a model should stop responding, you can add a custom callback, like so:
=== "GPT4All Custom Stop Callback"
``` py
from gpt4all import GPT4All
model = GPT4All('orca-mini-3b-gguf2-q4_0.gguf')
def stop_on_token_callback(token_id, token_string):
    # one sentence is enough:
    if '.' in token_string:
        return False
    else:
        return True

response = model.generate('Blue Whales are the biggest animal to ever inhabit the Earth.',
                          temp=0, callback=stop_on_token_callback)
print(response)
```
=== "Output"
```
They can grow up to 100 feet (30 meters) long and weigh as much as 20 tons (18 metric tons).
```
## API Documentation
::: gpt4all.gpt4all.GPT4All


@@ -0,0 +1,176 @@
# Embeddings
GPT4All supports generating high quality embeddings of arbitrary length text using any embedding model supported by llama.cpp.
An embedding is a vector representation of a piece of text. Embeddings are useful for tasks such as retrieval for
question answering (including retrieval augmented generation or *RAG*), semantic similarity search, classification, and
topic clustering.
## Supported Embedding Models
The following models have built-in support in Embed4All:
| Name | Embed4All `model_name` | Context Length | Embedding Length | File Size |
|--------------------|------------------------------------------------------|---------------:|-----------------:|----------:|
| [SBert] | all&#x2011;MiniLM&#x2011;L6&#x2011;v2.gguf2.f16.gguf | 512 | 384 | 44 MiB |
| [Nomic Embed v1] | nomic&#x2011;embed&#x2011;text&#x2011;v1.f16.gguf | 2048 | 768 | 262 MiB |
| [Nomic Embed v1.5] | nomic&#x2011;embed&#x2011;text&#x2011;v1.5.f16.gguf | 2048 | 64-768 | 262 MiB |
The context length is the maximum number of word pieces, or *tokens*, that a model can embed at once. Embedding texts
longer than a model's context length requires some kind of strategy; see [Embedding Longer Texts] for more information.
The embedding length is the size of the vector returned by `Embed4All.embed`.
[SBert]: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
[Nomic Embed v1]: https://huggingface.co/nomic-ai/nomic-embed-text-v1
[Nomic Embed v1.5]: https://huggingface.co/nomic-ai/nomic-embed-text-v1.5
[Embedding Longer Texts]: #embedding-longer-texts
## Quickstart
```bash
pip install gpt4all
```
### Generating Embeddings
By default, embeddings will be generated on the CPU using all-MiniLM-L6-v2.
=== "Embed4All Example"
```py
from gpt4all import Embed4All
text = 'The quick brown fox jumps over the lazy dog'
embedder = Embed4All()
output = embedder.embed(text)
print(output)
```
=== "Output"
```
[0.034696947783231735, -0.07192722707986832, 0.06923297047615051, ...]
```
You can also use the GPU to accelerate the embedding model by specifying the `device` parameter. See the [GPT4All
constructor] for more information.
=== "GPU Example"
```py
from gpt4all import Embed4All
text = 'The quick brown fox jumps over the lazy dog'
embedder = Embed4All(device='gpu')
output = embedder.embed(text)
print(output)
```
=== "Output"
```
[0.034696947783231735, -0.07192722707986832, 0.06923297047615051, ...]
```
[GPT4All constructor]: gpt4all_python.md#gpt4all.gpt4all.GPT4All.__init__
### Nomic Embed
Embed4All has built-in support for Nomic's open-source embedding model, [Nomic Embed]. When using this model, you must
specify the task type using the `prefix` argument. This may be one of `search_query`, `search_document`,
`classification`, or `clustering`. For retrieval applications, you should use the `search_document` prefix for all of your
documents and the `search_query` prefix for your queries. See the [Nomic Embedding Guide] for more info.
=== "Nomic Embed Example"
```py
from gpt4all import Embed4All
text = 'Who is Laurens van der Maaten?'
embedder = Embed4All('nomic-embed-text-v1.f16.gguf')
output = embedder.embed(text, prefix='search_query')
print(output)
```
=== "Output"
```
[-0.013357644900679588, 0.027070969343185425, -0.0232995692640543, ...]
```
[Nomic Embed]: https://blog.nomic.ai/posts/nomic-embed-text-v1
[Nomic Embedding Guide]: https://docs.nomic.ai/atlas/guides/embeddings#embedding-task-types
### Embedding Longer Texts
Embed4All accepts a parameter called `long_text_mode`. This controls the behavior of Embed4All for texts longer than the
context length of the embedding model.
In the default mode of "mean", Embed4All will break long inputs into chunks and average their embeddings to compute the
final result.
To change this behavior, you can set the `long_text_mode` parameter to "truncate", which will truncate the input to the
sequence length of the model before generating a single embedding.
=== "Truncation Example"
```py
from gpt4all import Embed4All
text = 'The ' * 512 + 'The quick brown fox jumps over the lazy dog'
embedder = Embed4All()
output = embedder.embed(text, long_text_mode="mean")
print(output)
print()
output = embedder.embed(text, long_text_mode="truncate")
print(output)
```
=== "Output"
```
[0.0039850445464253426, 0.04558328539133072, 0.0035536508075892925, ...]
[-0.009771130047738552, 0.034792833030223846, -0.013273917138576508, ...]
```
### Batching
You can send multiple texts to Embed4All in a single call. This can give faster results when individual texts are
significantly smaller than `n_ctx` tokens. (`n_ctx` defaults to 2048.)
=== "Batching Example"
```py
from gpt4all import Embed4All
texts = ['The quick brown fox jumps over the lazy dog', 'Foo bar baz']
embedder = Embed4All()
output = embedder.embed(texts)
print(output[0])
print()
print(output[1])
```
=== "Output"
```
[0.03551332652568817, 0.06137588247656822, 0.05281158909201622, ...]
[-0.03879690542817116, 0.00013223080895841122, 0.023148687556385994, ...]
```
The number of texts that can be embedded in one pass of the model is proportional to the `n_ctx` parameter of Embed4All.
Increasing it may increase batched embedding throughput if you have a fast GPU, at the cost of VRAM.
```py
embedder = Embed4All(n_ctx=4096, device='gpu')
```
### Resizable Dimensionality
The embedding dimension of Nomic Embed v1.5 can be resized using the `dimensionality` parameter. This parameter supports
any value between 64 and 768.
Shorter embeddings use less storage, memory, and bandwidth with a small performance cost. See the [blog post] for more
info.
[blog post]: https://blog.nomic.ai/posts/nomic-embed-matryoshka
=== "Matryoshka Example"
```py
from gpt4all import Embed4All
text = 'The quick brown fox jumps over the lazy dog'
embedder = Embed4All('nomic-embed-text-v1.5.f16.gguf')
output = embedder.embed(text, dimensionality=64)
print(len(output))
print(output)
```
=== "Output"
```
64
[-0.03567073494195938, 0.1301717758178711, -0.4333043396472931, ...]
```
### API documentation
::: gpt4all.gpt4all.Embed4All


@@ -0,0 +1,71 @@
# GPT4All
Welcome to the GPT4All documentation.
GPT4All is an open-source software ecosystem for anyone to run large language models (LLMs) **privately** on **everyday laptop & desktop computers**. No API calls or GPUs required.
The GPT4All Desktop Application is a touchpoint to interact with LLMs and integrate them with your local docs & local data for RAG (retrieval-augmented generation). No coding is required: just install the application, download the models of your choice, and you are ready to use your LLM.
Your local data is **yours**. GPT4All handles the retrieval privately and on-device to fetch relevant data to support your queries to your LLM.
Nomic AI oversees contributions to GPT4All to ensure quality, security, and maintainability. Additionally, Nomic AI has open-sourced code for training and deploying your own customized LLMs internally.
GPT4All software is optimized to run inference of 3-13 billion parameter large language models on the CPUs of laptops, desktops and servers.
=== "GPT4All Example"
``` py
from gpt4all import GPT4All
model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")
output = model.generate("The capital of France is ", max_tokens=3)
print(output)
```
=== "Output"
```
1. Paris
```
See [Python Bindings](gpt4all_python.md) to use GPT4All.
### Navigating the Documentation
In an effort to ensure cross-operating-system and cross-language compatibility, the [GPT4All software ecosystem](https://github.com/nomic-ai/gpt4all)
is organized as a monorepo with the following structure:
- **gpt4all-backend**: The GPT4All backend maintains and exposes a universal, performance optimized C API for running inference with multi-billion parameter Transformer Decoders.
This C API is then bound to any higher level programming language such as C++, Python, Go, etc.
- **gpt4all-bindings**: GPT4All bindings contain a variety of high-level programming languages that implement the C API. Each directory is a bound programming language. The [CLI](gpt4all_cli.md) is included here, as well.
- **gpt4all-chat**: GPT4All Chat is an OS native chat application that runs on macOS, Windows and Linux. It is the easiest way to run local, privacy aware chat assistants on everyday hardware. You can download it on the [GPT4All Website](https://gpt4all.io) and read its source code in the monorepo.
Explore detailed documentation for the backend, bindings and chat client in the sidebar.
## Models
The GPT4All software ecosystem is compatible with the following Transformer architectures:
- `Falcon`
- `LLaMA` (including `OpenLLaMA`)
- `MPT` (including `Replit`)
- `GPT-J`
You can find an exhaustive list of supported models on the [website](https://gpt4all.io) or in the [models directory](https://raw.githubusercontent.com/nomic-ai/gpt4all/main/gpt4all-chat/metadata/models3.json).
GPT4All models are artifacts produced through a process known as neural network quantization.
A multi-billion parameter Transformer Decoder usually takes 30+ GB of VRAM to execute a forward pass.
Most people do not have such a powerful computer or access to GPU hardware. By running trained LLMs through quantization algorithms,
some GPT4All models can run on your laptop using only 4-8 GB of RAM, enabling their widespread usage.
Bigger models might still require more RAM, however.
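As rough, back-of-the-envelope arithmetic (actual file sizes vary with the quantization scheme and format overhead):
``` py
# approximate memory needed just for the weights of a 7B-parameter model
params = 7e9
for precision, bytes_per_param in [('fp32', 4), ('fp16', 2), ('8-bit', 1), ('4-bit', 0.5)]:
    print(f'{precision}: ~{params * bytes_per_param / 1e9:.1f} GB')
# fp32: ~28.0 GB   fp16: ~14.0 GB   8-bit: ~7.0 GB   4-bit: ~3.5 GB
```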
Any model trained with one of these architectures can be quantized and run locally with all GPT4All bindings and in the
chat client. You can add new variants by contributing to the gpt4all-backend.
## Frequently Asked Questions
Find answers to frequently asked questions by searching the [Github issues](https://github.com/nomic-ai/gpt4all/issues) or in the [documentation FAQ](gpt4all_faq.md).
## Getting the most out of your local LLM
**Inference Speed**
of a local LLM depends on two factors: model size and the number of tokens given as input.
It is not advised to prompt local LLMs with large chunks of context as their inference speed will heavily degrade.
You will likely want to run GPT4All models on GPU if you would like to utilize context windows larger than 750 tokens. Native GPU support for GPT4All models is planned.
**Inference Performance:**
Which model is best? The answer depends on your use case. The ability of an LLM to faithfully follow instructions is conditioned
on the quantity and diversity of the pre-training data it trained on and the diversity, quality and factuality of the data the LLM
was fine-tuned on. A goal of GPT4All is to bring the most powerful local assistant model to your desktop and Nomic AI is actively
working on efforts to improve their performance and quality.