Takeoff integration (#9045)
## Description:
This PR adds the Titan Takeoff Server to the available LLMs in LangChain. Titan Takeoff is an inference server created by [TitanML](https://www.titanml.co/) that allows you to deploy large language models locally on your hardware in a single command. Most generative model architectures are included, such as Falcon, Llama 2, GPT2, T5 and many more.

Read more about Titan Takeoff here:
- [Blog](https://medium.com/@TitanML/introducing-titan-takeoff-6c30e55a8e1e)
- [Docs](https://docs.titanml.co/docs/titan-takeoff/getting-started)

#### Testing
As Titan Takeoff runs locally on port 8000 by default, no network access is needed. Responses are mocked for testing.

- [x] Make Lint
- [x] Make Format
- [x] Make Test

#### Dependencies
No new dependencies are introduced. However, users will need to install the titan-iris package in their local environment and start the Titan Takeoff inference server in order to use the Titan Takeoff integration.

Thanks for your help and please let me know if you have any questions. cc: @hwchase17 @baskaryan
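For quick reference, a minimal sketch of how the new integration is used once the titan-iris package is installed and a Takeoff server is running locally (this mirrors the notebook added below; the model, port, and prompt are illustrative only):

```python
from langchain.llms import TitanTakeoff

# Assumes a Takeoff server was started beforehand, e.g.:
#   iris takeoff --model tiiuae/falcon-7b-instruct --device cpu
llm = TitanTakeoff(port=8000, generate_max_length=128)

print(llm("What is the largest planet in the solar system?"))
```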
This commit is contained in:
parent 3bdc273ab3
commit 8d351bfc20

169  docs/extras/integrations/llms/titan_takeoff.ipynb  Normal file
@@ -0,0 +1,169 @@
{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Titan Takeoff\n",
    "\n",
    "TitanML helps businesses build and deploy better, smaller, cheaper, and faster NLP models through our training, compression, and inference optimization platform.\n",
    "\n",
    "Our inference server, [Titan Takeoff](https://docs.titanml.co/docs/titan-takeoff/getting-started), enables deployment of LLMs locally on your hardware in a single command. Most generative model architectures are supported, such as Falcon, Llama 2, GPT2, T5 and many more."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installation\n",
    "\n",
    "To get started with Titan Takeoff, all you need is to have Docker and Python installed on your local system. If you wish to use the server with GPU support, you will need to install Docker with CUDA support.\n",
    "\n",
    "For Mac and Windows users, make sure you have the Docker daemon running! You can check this by running `docker ps` in your terminal. To start the daemon, open the Docker Desktop app.\n",
    "\n",
    "Run the following command to install the Iris CLI, which will enable you to run the Takeoff server:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "outputs": [],
   "source": [
    "pip install titan-iris"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Choose a Model\n",
    "Titan Takeoff supports many of the most powerful generative text models, such as Falcon, MPT, and Llama. See the [supported models](https://docs.titanml.co/docs/titan-takeoff/supported-models) for more information. For information about using your own models, see the [custom models](https://docs.titanml.co/docs/titan-takeoff/Advanced/custom-models).\n",
    "\n",
    "Going forward in this demo we will be using the Falcon 7B Instruct model. This is a good open-source model that is trained to follow instructions and is small enough to run inference even on CPUs.\n",
    "\n",
    "## Taking off\n",
    "Models are referred to by their model ID on Hugging Face. Takeoff uses port 8000 by default, but it can be configured to use another port. There is also support for using an Nvidia GPU by specifying `cuda` for the device flag.\n",
    "\n",
    "To start the Takeoff server, run:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "outputs": [],
   "source": [
    "iris takeoff --model tiiuae/falcon-7b-instruct --device cpu\n",
    "iris takeoff --model tiiuae/falcon-7b-instruct --device cuda # Nvidia GPU required\n",
    "iris takeoff --model tiiuae/falcon-7b-instruct --device cpu --port 5000 # run on port 5000 (default: 8000)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You will then be directed to a login page, where you will need to create an account to proceed.\n",
    "After logging in, run the command shown onscreen to check whether the server is ready. When it is ready, you can start using the Takeoff integration.\n",
    "\n",
    "## Inferencing your model\n",
    "To access your LLM, use the TitanTakeoff LLM wrapper:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.llms import TitanTakeoff\n",
    "\n",
    "llm = TitanTakeoff(\n",
    "    port=8000,\n",
    "    generate_max_length=128,\n",
    "    sampling_temperature=1.0\n",
    ")\n",
    "\n",
    "prompt = \"What is the largest planet in the solar system?\"\n",
    "\n",
    "llm(prompt)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "No parameters are needed by default, but a port can be specified and [generation parameters](https://docs.titanml.co/docs/titan-takeoff/Advanced/generation-parameters) can be supplied.\n",
    "\n",
    "### Streaming\n",
    "Streaming is also supported via the streaming flag:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler\n",
    "from langchain.callbacks.manager import CallbackManager\n",
    "\n",
    "llm = TitanTakeoff(port=8000, callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]), streaming=True)\n",
    "\n",
    "prompt = \"What is the capital of France?\"\n",
    "\n",
    "llm(prompt)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Integration with LLMChain"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain import PromptTemplate, LLMChain\n",
    "\n",
    "llm = TitanTakeoff()\n",
    "\n",
    "template = \"What is the capital of {country}\"\n",
    "\n",
    "prompt = PromptTemplate(template=template, input_variables=[\"country\"])\n",
    "\n",
    "llm_chain = LLMChain(llm=llm, prompt=prompt)\n",
    "\n",
    "generated = llm_chain.run(country=\"Belgium\")\n",
    "print(generated)"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
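One capability the notebook does not demonstrate is stop sequences, which the wrapper supports through LangChain's standard stop argument (applied via enforce_stop_tokens in titan_takeoff.py below). A minimal sketch, with an illustrative prompt and stop word:

```python
from langchain.llms import TitanTakeoff

llm = TitanTakeoff(port=8000, generate_max_length=64)

# Generation is truncated at the first occurrence of any stop sequence.
text = llm("List the planets of the solar system:", stop=["\n\n"])
print(text)
```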
libs/langchain/langchain/llms/__init__.py
@@ -76,6 +76,7 @@ from langchain.llms.self_hosted_hugging_face import SelfHostedHuggingFaceLLM
from langchain.llms.stochasticai import StochasticAI
from langchain.llms.symblai_nebula import Nebula
from langchain.llms.textgen import TextGen
from langchain.llms.titan_takeoff import TitanTakeoff
from langchain.llms.tongyi import Tongyi
from langchain.llms.vertexai import VertexAI
from langchain.llms.vllm import VLLM
@@ -142,6 +143,7 @@ __all__ = [
    "SelfHostedHuggingFaceLLM",
    "SelfHostedPipeline",
    "StochasticAI",
    "TitanTakeoff",
    "Tongyi",
    "VertexAI",
    "VLLM",
@@ -203,6 +205,7 @@ type_to_cls_dict: Dict[str, Type[BaseLLM]] = {
    "self_hosted_hugging_face": SelfHostedHuggingFaceLLM,
    "stochasticai": StochasticAI,
    "tongyi": Tongyi,
    "titan_takeoff": TitanTakeoff,
    "vertexai": VertexAI,
    "openllm": OpenLLM,
    "openllm_client": OpenLLM,
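The registrations above make the wrapper importable from the package root and discoverable through the LLM type registry. A rough illustration of what this enables (type_to_cls_dict is a module-level name in langchain.llms; how you would normally consume it, e.g. via config loading, is not shown in this diff, so the direct lookup here is only an assumption for illustration):

```python
from langchain.llms import TitanTakeoff, type_to_cls_dict

# Top-level import works because "TitanTakeoff" is listed in __all__.
llm = TitanTakeoff(port=8000)

# The "titan_takeoff" key maps back to the class, which is how serialized
# LLM configs whose type is "titan_takeoff" can be re-instantiated.
assert type_to_cls_dict["titan_takeoff"] is TitanTakeoff
```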
157  libs/langchain/langchain/llms/titan_takeoff.py  Normal file
@@ -0,0 +1,157 @@
from typing import Any, Iterator, List, Mapping, Optional

import requests
from requests.exceptions import ConnectionError

from langchain.callbacks.manager import CallbackManagerForLLMRun
from langchain.llms.base import LLM
from langchain.llms.utils import enforce_stop_tokens
from langchain.schema.output import GenerationChunk


class TitanTakeoff(LLM):
    """Wrapper around the Titan Takeoff inference server."""

    port: int = 8000
    """Specifies the port to use for the Titan Takeoff API. Default = 8000."""

    generate_max_length: int = 128
    """Maximum generation length. Default = 128."""

    sampling_topk: int = 1
    """Sample predictions from the top K most probable candidates. Default = 1."""

    sampling_topp: float = 1.0
    """Sample from predictions whose cumulative probability exceeds this value.
    Default = 1.0.
    """

    sampling_temperature: float = 1.0
    """Sample with randomness. Bigger temperatures are associated with
    more randomness and 'creativity'. Default = 1.0.
    """

    repetition_penalty: float = 1.0
    """Penalise the generation of tokens that have been generated before.
    Set to > 1 to penalize. Default = 1 (no penalty).
    """

    no_repeat_ngram_size: int = 0
    """Prevent repetitions of ngrams of this size. Default = 0 (turned off)."""

    streaming: bool = False
    """Whether to stream the output. Default = False."""

    @property
    def _default_params(self) -> Mapping[str, Any]:
        """Get the default parameters for calling the Titan Takeoff server."""
        params = {
            "generate_max_length": self.generate_max_length,
            "sampling_topk": self.sampling_topk,
            "sampling_topp": self.sampling_topp,
            "sampling_temperature": self.sampling_temperature,
            "repetition_penalty": self.repetition_penalty,
            "no_repeat_ngram_size": self.no_repeat_ngram_size,
        }
        return params

    @property
    def _llm_type(self) -> str:
        """Return type of llm."""
        return "titan_takeoff"

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        """Call out to Titan Takeoff generate endpoint.

        Args:
            prompt: The prompt to pass into the model.
            stop: Optional list of stop words to use when generating.

        Returns:
            The string generated by the model.

        Example:
            .. code-block:: python

                prompt = "What is the capital of the United Kingdom?"
                response = model(prompt)

        """
        try:
            if self.streaming:
                text_output = ""
                for chunk in self._stream(
                    prompt=prompt,
                    stop=stop,
                    run_manager=run_manager,
                ):
                    text_output += chunk.text
                return text_output

            url = f"http://localhost:{self.port}/generate"
            params = {"text": prompt, **self._default_params}

            response = requests.post(url, json=params)
            response.raise_for_status()
            response.encoding = "utf-8"
            text = ""

            if "message" in response.json():
                text = response.json()["message"]
            else:
                raise ValueError(
                    "Unexpected response from Titan Takeoff server: "
                    "no 'message' field in the response body."
                )
            if stop is not None:
                text = enforce_stop_tokens(text, stop)
            return text
        except ConnectionError:
            raise ConnectionError(
                "Could not connect to Titan Takeoff server. "
                "Please make sure that the server is running."
            )

    def _stream(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> Iterator[GenerationChunk]:
        """Call out to Titan Takeoff stream endpoint.

        Args:
            prompt: The prompt to pass into the model.
            stop: Optional list of stop words to use when generating.

        Yields:
            A dictionary like object containing a string token.

        Example:
            .. code-block:: python

                prompt = "What is the capital of the United Kingdom?"
                response = model(prompt)

        """
        url = f"http://localhost:{self.port}/generate_stream"
        params = {"text": prompt, **self._default_params}

        response = requests.post(url, json=params, stream=True)
        response.encoding = "utf-8"
        for text in response.iter_content(chunk_size=1, decode_unicode=True):
            if text:
                chunk = GenerationChunk(text=text)
                yield chunk
                if run_manager:
                    run_manager.on_llm_new_token(token=chunk.text)

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        """Get the identifying parameters."""
        return {"port": self.port, **self._default_params}
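For reference, the wrapper above talks to two local HTTP endpoints, /generate and /generate_stream. A minimal sketch of exercising the non-streaming endpoint directly with requests; the payload keys mirror _default_params (the prompt, port, and the choice to send only a subset of parameters are assumptions for illustration):

```python
import requests

# Assumes a Takeoff server is already running locally on the default port.
payload = {
    "text": "What is the capital of France?",
    "generate_max_length": 128,
    "sampling_temperature": 1.0,
}
resp = requests.post("http://localhost:8000/generate", json=payload)
resp.raise_for_status()

# The wrapper reads the generated text from the "message" field.
print(resp.json()["message"])
```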
@@ -0,0 +1,18 @@
"""Test Titan Takeoff wrapper."""


import responses

from langchain.llms.titan_takeoff import TitanTakeoff


@responses.activate
def test_titan_takeoff_call() -> None:
    """Test valid call to Titan Takeoff."""
    url = "http://localhost:8000/generate"
    responses.add(responses.POST, url, json={"message": "2 + 2 is 4"}, status=200)

    llm = TitanTakeoff()
    output = llm("What is 2 + 2?")
    assert isinstance(output, str)
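A companion test in the same style could exercise the stop-word path, where enforce_stop_tokens trims the mocked reply. This is a sketch only, not part of the PR; the test name and mocked message are made up:

```python
import responses

from langchain.llms.titan_takeoff import TitanTakeoff


@responses.activate
def test_titan_takeoff_stop() -> None:
    """Sketch: stop words should truncate the mocked generation."""
    url = "http://localhost:8000/generate"
    responses.add(
        responses.POST,
        url,
        json={"message": "2 + 2 is 4. And 3 + 3 is 6."},
        status=200,
    )

    llm = TitanTakeoff()
    output = llm("What is 2 + 2?", stop=["And"])
    # The text is cut at the first occurrence of the stop word.
    assert output.startswith("2 + 2 is 4")
    assert "3 + 3" not in output
```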