community[minor]: Adds Llamafile as an LLM (#17431)

* **Description:** Adds a simple LLM implementation for interacting with [llamafile](https://github.com/Mozilla-Ocho/llamafile)-based models.
* **Dependencies:** N/A
* **Issue:** N/A

**Detail**

[llamafile](https://github.com/Mozilla-Ocho/llamafile) lets you run LLMs locally from a single file on most computers without installing any dependencies.

To use the llamafile LLM implementation, the user needs to:

1. Download a llamafile, e.g. https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile?download=true
2. Make the file executable.
3. Run the llamafile in 'server mode'. (All llamafiles come packaged with a lightweight server; by default, the server listens at `http://localhost:8080`.)

```bash
wget https://url/of/model.llamafile
chmod +x model.llamafile
./model.llamafile --server --nobrowser
```

Now, the user can invoke the LLM via the LangChain client:

```python
from langchain_community.llms.llamafile import Llamafile

llm = Llamafile()
llm.invoke("Tell me a joke.")
```
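
For readers who want to see roughly what `invoke` sends to the server, the following is a minimal sketch of the request body assembled in `_get_parameters` and `_call` in the diff below. It runs without a server. `build_payload` and `DEFAULT_PARAMS` are hypothetical names used only for illustration, and the defaults shown are a small subset of the fields declared on the class.

```python
from typing import Any, Dict, List, Optional

# Hypothetical subset of the generation defaults declared on the Llamafile class.
DEFAULT_PARAMS: Dict[str, Any] = {"temperature": 0.8, "seed": -1, "n_predict": -1}


def build_payload(
    prompt: str,
    stop: Optional[List[str]] = None,
    streaming: bool = False,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Assemble a JSON body for a POST to {base_url}/completion (illustrative)."""
    params = dict(DEFAULT_PARAMS)
    # Only override keys that already exist, so unknown options are dropped
    # instead of being forwarded to the server.
    for key, value in kwargs.items():
        if key in params:
            params[key] = value
    if stop:
        params["stop"] = stop
    if streaming:
        params["stream"] = True
    return {"prompt": prompt, **params}


# 'seed' overrides the default; 'unknown' is silently ignored.
payload = build_payload("Tell me a joke.", seed=0, unknown="ignored")
print(payload)
```

Dropping unknown keys mirrors the behaviour exercised by `test_call_with_kwargs`: a caller can pass extra keyword arguments without them reaching the llamafile server.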
2025-07-21 12:01:47 +00:00 · 2024-02-14 14:15:24 -05:00 · 2024-02-14 14:15:24 -05:00 · 0bc4a9b3fc
commit 0bc4a9b3fc
parent 5ce1827d31
4 changed files with 655 additions and 0 deletions
--- a/docs/docs/integrations/llms/llamafile.ipynb
+++ b/docs/docs/integrations/llms/llamafile.ipynb
@@ -0,0 +1,133 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Llamafile\n",
    "\n",
    "[Llamafile](https://github.com/Mozilla-Ocho/llamafile) lets you distribute and run LLMs with a single file.\n",
    "\n",
    "Llamafile does this by combining [llama.cpp](https://github.com/ggerganov/llama.cpp) with [Cosmopolitan Libc](https://github.com/jart/cosmopolitan) into one framework that collapses all the complexity of LLMs down to a single-file executable (called a \"llamafile\") that runs locally on most computers, with no installation.\n",
    "\n",
    "## Setup\n",
    "\n",
    "1. Download a llamafile for the model you'd like to use. You can find many models in llamafile format on [HuggingFace](https://huggingface.co/models?other=llamafile). In this guide, we will download a small one, `TinyLlama-1.1B-Chat-v1.0.Q5_K_M`. Note: if you don't have `wget`, you can just download the model via this [link](https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile?download=true).\n",
    "\n",
    "```bash\n",
    "wget https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile\n",
    "```\n",
    "\n",
    "2. Make the llamafile executable. First, if you haven't done so already, open a terminal. **If you're using MacOS, Linux, or BSD,** you'll need to grant permission for your computer to execute this new file using `chmod` (see below). **If you're on Windows,** rename the file by adding \".exe\" to the end (model file should be named `TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile.exe`).\n",
    "\n",
    "\n",
    "```bash\n",
    "chmod +x TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile  # run if you're on MacOS, Linux, or BSD\n",
    "```\n",
    "\n",
    "3. Run the llamafile in \"server mode\":\n",
    "\n",
    "```bash\n",
    "./TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile --server --nobrowser\n",
    "```\n",
    "\n",
    "Now you can make calls to the llamafile's REST API. By default, the llamafile server listens at http://localhost:8080. You can find full server documentation [here](https://github.com/Mozilla-Ocho/llamafile/blob/main/llama.cpp/server/README.md#api-endpoints). You can interact with the llamafile directly via the REST API, but here we'll show how to interact with it using LangChain.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Usage"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'? \\nI\\'ve got a thing for pink, but you know that.\\n\"Can we not talk about work anymore?\" - What did she say?\\nI don\\'t want to be a burden on you.\\nIt\\'s hard to keep a good thing going.\\nYou can\\'t tell me what I want, I have a life too!'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain_community.llms.llamafile import Llamafile\n",
    "\n",
    "llm = Llamafile()\n",
    "\n",
    "llm.invoke(\"Tell me a joke\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To stream tokens, use the `.stream(...)` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ".\n",
      "- She said, \"I’m tired of my life. What should I do?\"\n",
      "- The man replied, \"I hear you. But don’t worry. Life is just like a joke. It has its funny parts too.\"\n",
      "- The woman looked at him, amazed and happy to hear his wise words. - \"Thank you for your wisdom,\" she said, smiling. - He replied, \"Any time. But it doesn't come easy. You have to laugh and keep moving forward in life.\"\n",
      "- She nodded, thanking him again. - The man smiled wryly. \"Life can be tough. Sometimes it seems like you’re never going to get out of your situation.\"\n",
      "- He said, \"I know that. But the key is not giving up. Life has many ups and downs, but in the end, it will turn out okay.\"\n",
      "- The woman's eyes softened. \"Thank you for your advice. It's so important to keep moving forward in life,\" she said. - He nodded once again. \"You’re welcome. I hope your journey is filled with laughter and joy.\"\n",
      "- They both smiled and left the bar, ready to embark on their respective adventures.\n"
     ]
    }
   ],
   "source": [
    "query = \"Tell me a joke\"\n",
    "\n",
    "for chunks in llm.stream(query):\n",
    "    print(chunks, end=\"\")\n",
    "\n",
    "print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To learn more about the LangChain Expression Language (LCEL) and the available methods on an LLM, see the [LCEL Interface](https://python.langchain.com/docs/expression_language/interface)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
 }
--- a/libs/community/langchain_community/llms/llamafile.py
+++ b/libs/community/langchain_community/llms/llamafile.py
@@ -0,0 +1,318 @@
 from __future__ import annotations
 import json
 from io import StringIO
 from typing import Any, Dict, Iterator, List, Optional
 import requests
 from langchain_core.callbacks.manager import CallbackManagerForLLMRun
 from langchain_core.language_models.llms import LLM
 from langchain_core.outputs import GenerationChunk
 from langchain_core.pydantic_v1 import Extra
 from langchain_core.utils import get_pydantic_field_names
 class Llamafile(LLM):
    """Llamafile lets you distribute and run large language models with a
    single file.
    To get started, see: https://github.com/Mozilla-Ocho/llamafile
    To use this class, you will need to first:
    1. Download a llamafile.
    2. Make the downloaded file executable: `chmod +x path/to/model.llamafile`
    3. Start the llamafile in server mode:
        `./path/to/model.llamafile --server --nobrowser`
    Example:
        .. code-block:: python
            from langchain_community.llms.llamafile import Llamafile
            llm = Llamafile()
            llm.invoke("Tell me a joke.")
    """
    base_url: str = "http://localhost:8080"
    """Base url where the llamafile server is listening."""
    request_timeout: Optional[int] = None
    """Timeout for server requests"""
    streaming: bool = False
    """Allows receiving each predicted token in real-time instead of
    waiting for the completion to finish. To enable this, set to True."""
    # Generation options
    seed: int = -1
    """Random Number Generator (RNG) seed. A random seed is used if this is 
    less than zero. Default: -1"""
    temperature: float = 0.8
    """Temperature. Default: 0.8"""
    top_k: int = 40
    """Limit the next token selection to the K most probable tokens. 
    Default: 40."""
    top_p: float = 0.95
    """Limit the next token selection to a subset of tokens with a cumulative 
    probability above a threshold P. Default: 0.95."""
    min_p: float = 0.05
    """The minimum probability for a token to be considered, relative to 
    the probability of the most likely token. Default: 0.05."""
    n_predict: int = -1
    """Set the maximum number of tokens to predict when generating text. 
    Note: May exceed the set limit slightly if the last token is a partial 
    multibyte character. When 0, no tokens will be generated but the prompt 
    is evaluated into the cache. Default: -1 = infinity."""
    n_keep: int = 0
    """Specify the number of tokens from the prompt to retain when the 
    context size is exceeded and tokens need to be discarded. By default, 
    this value is set to 0 (meaning no tokens are kept). Use -1 to retain all 
    tokens from the prompt."""
    tfs_z: float = 1.0
    """Enable tail free sampling with parameter z. Default: 1.0 = disabled."""
    typical_p: float = 1.0
    """Enable locally typical sampling with parameter p. 
    Default: 1.0 = disabled."""
    repeat_penalty: float = 1.1
    """Control the repetition of token sequences in the generated text. 
    Default: 1.1"""
    repeat_last_n: int = 64
    """Last n tokens to consider for penalizing repetition. Default: 64, 
    0 = disabled, -1 = ctx-size."""
    penalize_nl: bool = True
    """Penalize newline tokens when applying the repeat penalty. 
    Default: true."""
    presence_penalty: float = 0.0
    """Repeat alpha presence penalty. Default: 0.0 = disabled."""
    frequency_penalty: float = 0.0
    """Repeat alpha frequency penalty. Default: 0.0 = disabled"""
    mirostat: int = 0
    """Enable Mirostat sampling, controlling perplexity during text 
    generation. 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0. 
    Default: disabled."""
    mirostat_tau: float = 5.0
    """Set the Mirostat target entropy, parameter tau. Default: 5.0."""
    mirostat_eta: float = 0.1
    """Set the Mirostat learning rate, parameter eta. Default: 0.1."""
    class Config:
        """Configuration for this pydantic object."""
        extra = Extra.forbid
    @property
    def _llm_type(self) -> str:
        return "llamafile"
    @property
    def _param_fieldnames(self) -> List[str]:
        # Return the list of fieldnames that will be passed as configurable
        # generation options to the llamafile server. Exclude 'builtin' fields
        # from the BaseLLM class like 'metadata' as well as fields that should
        # not be passed in requests (base_url, request_timeout).
        ignore_keys = [
            "base_url",
            "cache",
            "callback_manager",
            "callbacks",
            "metadata",
            "name",
            "request_timeout",
            "streaming",
            "tags",
            "verbose",
        ]
        attrs = [
            k for k in get_pydantic_field_names(self.__class__) if k not in ignore_keys
        ]
        return attrs
    @property
    def _default_params(self) -> Dict[str, Any]:
        params = {}
        for fieldname in self._param_fieldnames:
            params[fieldname] = getattr(self, fieldname)
        return params
    def _get_parameters(
        self, stop: Optional[List[str]] = None, **kwargs: Any
    ) -> Dict[str, Any]:
        params = self._default_params
        # Only update keys that are already present in the default config.
        # This way, we don't accidentally post unknown/unhandled key/values
        # in the request to the llamafile server
        for k, v in kwargs.items():
            if k in params:
                params[k] = v
        if stop is not None and len(stop) > 0:
            params["stop"] = stop
        if self.streaming:
            params["stream"] = True
        return params
    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        """Request prompt completion from the llamafile server and return the
        output.
        Args:
            prompt: The prompt to use for generation.
            stop: A list of strings to stop generation when encountered.
            run_manager: Optional callback manager for the run.
            **kwargs: Any additional options to pass as part of the
            generation request.
        Returns:
            The string generated by the model.
        """
        if self.streaming:
            with StringIO() as buff:
                for chunk in self._stream(
                    prompt, stop=stop, run_manager=run_manager, **kwargs
                ):
                    buff.write(chunk.text)
                text = buff.getvalue()
            return text
        else:
            params = self._get_parameters(stop=stop, **kwargs)
            payload = {"prompt": prompt, **params}
            try:
                response = requests.post(
                    url=f"{self.base_url}/completion",
                    headers={
                        "Content-Type": "application/json",
                    },
                    json=payload,
                    stream=False,
                    timeout=self.request_timeout,
                )
            except requests.exceptions.ConnectionError:
                raise requests.exceptions.ConnectionError(
                    f"Could not connect to Llamafile server. Please make sure "
                    f"that a server is running at {self.base_url}."
                )
            response.raise_for_status()
            response.encoding = "utf-8"
            text = response.json()["content"]
            return text
    def _stream(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> Iterator[GenerationChunk]:
        """Yields results objects as they are generated in real time.
        It also calls the callback manager's on_llm_new_token event with
        similar parameters to the OpenAI LLM class method of the same name.
        Args:
            prompt: The prompt to pass into the model.
            stop: Optional list of stop words to use when generating.
            run_manager: Optional callback manager, notified of each new token.
            **kwargs: Any additional options to pass as part of the
            generation request.
        Yields:
            GenerationChunk objects, each containing the text of one token.
        Example:
        .. code-block:: python
            from langchain_community.llms.llamafile import Llamafile
            llm = Llamafile(
                temperature = 0.0
            )
            for chunk in llm.stream("Ask 'Hi, how are you?' like a pirate:'",
                    stop=["'","\\n"]):
                # llm.stream yields plain strings, one per token
                print(chunk, end='', flush=True)
        """
        params = self._get_parameters(stop=stop, **kwargs)
        if "stream" not in params:
            params["stream"] = True
        payload = {"prompt": prompt, **params}
        try:
            response = requests.post(
                url=f"{self.base_url}/completion",
                headers={
                    "Content-Type": "application/json",
                },
                json=payload,
                stream=True,
                timeout=self.request_timeout,
            )
        except requests.exceptions.ConnectionError:
            raise requests.exceptions.ConnectionError(
                f"Could not connect to Llamafile server. Please make sure "
                f"that a server is running at {self.base_url}."
            )
        response.encoding = "utf8"
        for raw_chunk in response.iter_lines(decode_unicode=True):
            content = self._get_chunk_content(raw_chunk)
            chunk = GenerationChunk(text=content)
            yield chunk
            if run_manager:
                run_manager.on_llm_new_token(token=chunk.text)
    def _get_chunk_content(self, chunk: str) -> str:
        """When streaming is turned on, llamafile server returns lines like:
        'data: {"content":" They","multimodal":true,"slot_id":0,"stop":false}'
        Here, we convert this to a dict and return the value of the 'content'
        field
        """
        if chunk.startswith("data:"):
            # Slice off the prefix; str.lstrip strips a character set, not a prefix.
            cleaned = chunk[len("data:") :].strip()
            data = json.loads(cleaned)
            return data["content"]
        else:
            return chunk
--- a/libs/community/tests/integration_tests/llms/test_llamafile.py
+++ b/libs/community/tests/integration_tests/llms/test_llamafile.py
@@ -0,0 +1,46 @@
 import os
 from typing import Generator
 import pytest
 import requests
 from requests.exceptions import ConnectionError, HTTPError
 from langchain_community.llms.llamafile import Llamafile
 LLAMAFILE_SERVER_BASE_URL = os.getenv(
    "LLAMAFILE_SERVER_BASE_URL", "http://localhost:8080"
 )
 def _ping_llamafile_server() -> bool:
    try:
        response = requests.get(LLAMAFILE_SERVER_BASE_URL)
        response.raise_for_status()
    except (ConnectionError, HTTPError):
        return False
    return True
@pytest.mark.skipif(
    not _ping_llamafile_server(),
    reason=f"unable to find llamafile server at {LLAMAFILE_SERVER_BASE_URL}, "
    f"please start one and re-run this test",
 )
 def test_llamafile_call() -> None:
    llm = Llamafile()
    output = llm.invoke("Say foo:")
    assert isinstance(output, str)
@pytest.mark.skipif(
    not _ping_llamafile_server(),
    reason=f"unable to find llamafile server at {LLAMAFILE_SERVER_BASE_URL}, "
    f"please start one and re-run this test",
 )
 def test_llamafile_streaming() -> None:
    llm = Llamafile(streaming=True)
    generator = llm.stream("Tell me about Roman dodecahedrons.")
    assert isinstance(generator, Generator)
    for token in generator:
        assert isinstance(token, str)
--- a/libs/community/tests/unit_tests/llms/test_llamafile.py
+++ b/libs/community/tests/unit_tests/llms/test_llamafile.py
@@ -0,0 +1,158 @@
 import json
 from collections import deque
 from typing import Any, Dict
 import pytest
 import requests
 from pytest import MonkeyPatch
 from langchain_community.llms.llamafile import Llamafile
 def default_generation_params() -> Dict[str, Any]:
    return {
        "temperature": 0.8,
        "seed": -1,
        "top_k": 40,
        "top_p": 0.95,
        "min_p": 0.05,
        "n_predict": -1,
        "n_keep": 0,
        "tfs_z": 1.0,
        "typical_p": 1.0,
        "repeat_penalty": 1.1,
        "repeat_last_n": 64,
        "penalize_nl": True,
        "presence_penalty": 0.0,
        "frequency_penalty": 0.0,
        "mirostat": 0,
        "mirostat_tau": 5.0,
        "mirostat_eta": 0.1,
    }
 def mock_response() -> requests.Response:
    contents = json.dumps({"content": "the quick brown fox"})
    response = requests.Response()
    response.status_code = 200
    response._content = str.encode(contents)
    return response
 def mock_response_stream():  # type: ignore[no-untyped-def]
    mock_response = deque(
        [
            b'data: {"content":"the","multimodal":false,"slot_id":0,"stop":false}\n\n',  # noqa
            b'data: {"content":" quick","multimodal":false,"slot_id":0,"stop":false}\n\n',  # noqa
        ]
    )
    class MockRaw:
        def read(self, chunk_size):  # type: ignore[no-untyped-def]
            try:
                return mock_response.popleft()
            except IndexError:
                return None
    response = requests.Response()
    response.status_code = 200
    response.raw = MockRaw()
    return response
 def test_call(monkeypatch: MonkeyPatch) -> None:
    """
    Test basic functionality of the `invoke` method
    """
    llm = Llamafile(
        base_url="http://llamafile-host:8080",
    )
    def mock_post(url, headers, json, stream, timeout):  # type: ignore[no-untyped-def]
        assert url == "http://llamafile-host:8080/completion"
        assert headers == {
            "Content-Type": "application/json",
        }
        # only the prompt and the default generation params should be sent
        assert json == {"prompt": "Test prompt", **default_generation_params()}
        assert stream is False
        assert timeout is None
        return mock_response()
    monkeypatch.setattr(requests, "post", mock_post)
    out = llm.invoke("Test prompt")
    assert out == "the quick brown fox"
 def test_call_with_kwargs(monkeypatch: MonkeyPatch) -> None:
    """
    Test kwargs passed to `invoke` override the default values and are passed
    to the endpoint correctly. Also test that any 'unknown' kwargs that are not
    present in the LLM class attrs are ignored.
    """
    llm = Llamafile(
        base_url="http://llamafile-host:8080",
    )
    def mock_post(url, headers, json, stream, timeout):  # type: ignore[no-untyped-def]
        assert url == "http://llamafile-host:8080/completion"
        assert headers == {
            "Content-Type": "application/json",
        }
        # 'unknown' kwarg should be ignored
        expected = {"prompt": "Test prompt", **default_generation_params()}
        expected["seed"] = 0
        assert json == expected
        assert stream is False
        assert timeout is None
        return mock_response()
    monkeypatch.setattr(requests, "post", mock_post)
    out = llm.invoke(
        "Test prompt",
        unknown="unknown option",  # should be ignored
        seed=0,  # should override the default
    )
    assert out == "the quick brown fox"
 def test_call_raises_exception_on_missing_server(monkeypatch: MonkeyPatch) -> None:
    """
    Test that the LLM raises a ConnectionError when no llamafile server is
    listening at the base_url.
    """
    llm = Llamafile(
        # invalid url, nothing should actually be running here
        base_url="http://llamafile-host:8080",
    )
    with pytest.raises(requests.exceptions.ConnectionError):
        llm.invoke("Test prompt")
 def test_streaming(monkeypatch: MonkeyPatch) -> None:
    """
    Test basic functionality of `invoke` with streaming enabled.
    """
    llm = Llamafile(
        base_url="http://llamafile-hostname:8080",
        streaming=True,
    )
    def mock_post(url, headers, json, stream, timeout):  # type: ignore[no-untyped-def]
        assert url == "http://llamafile-hostname:8080/completion"
        assert headers == {
            "Content-Type": "application/json",
        }
        # 'unknown' kwarg should be ignored
        assert "unknown" not in json
        expected = {"prompt": "Test prompt", **default_generation_params()}
        expected["stream"] = True
        assert json == expected
        assert stream is True
        assert timeout is None
        return mock_response_stream()
    monkeypatch.setattr(requests, "post", mock_post)
    out = llm.invoke("Test prompt")
    assert out == "the quick"
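
The streaming test above feeds the client two server-sent `data:` lines and expects the joined text `"the quick"`. As a standalone sketch of that parsing step (not the class's actual method; `parse_stream_line` is a hypothetical helper), the snippet below removes the `data:` prefix by slicing and also shows why `str.lstrip` is a risky way to remove a prefix: it strips any leading characters from the given set, not the literal string.

```python
import json


def parse_stream_line(line: str) -> str:
    """Return the 'content' field of one streamed line, or the line unchanged."""
    prefix = "data:"
    if line.startswith(prefix):
        # Slice off exactly the prefix, then parse the remaining JSON object.
        return json.loads(line[len(prefix):].strip())["content"]
    return line


# Lines shaped like the ones in mock_response_stream, plus the blank
# separator that iter_lines() produces between events.
lines = [
    'data: {"content":"the","multimodal":false,"slot_id":0,"stop":false}',
    "",
    'data: {"content":" quick","multimodal":false,"slot_id":0,"stop":false}',
]
text = "".join(parse_stream_line(line) for line in lines)
print(repr(text))

# lstrip treats its argument as a set of characters {d, a, t, :, ' '},
# so it keeps eating matching characters past the intended prefix.
naive = "data: attic".lstrip("data: ")
print(repr(naive))
```

With JSON payloads the character after the prefix is `{`, so the character-set behaviour rarely bites in practice, but slicing makes the intent explicit.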