	DOCS: integrations/text_embeddings/ cleanup (#13476)
				
					
				
Updated several notebooks:
- fixed titles that were inconsistent or broke the ToC sorting order
- added missing source descriptions and links
- fixed formatting
		| @@ -4,9 +4,9 @@ | |||||||
|    "cell_type": "markdown", |    "cell_type": "markdown", | ||||||
|    "metadata": {}, |    "metadata": {}, | ||||||
|    "source": [ |    "source": [ | ||||||
|     "# ERNIE Embedding-V1\n", |     "# ERNIE\n", | ||||||
|     "\n", |     "\n", | ||||||
|     "[ERNIE Embedding-V1](https://cloud.baidu.com/doc/WENXINWORKSHOP/s/alj562vvu) is a text representation model based on Baidu Wenxin's large-scale model technology, \n", |     "[ERNIE Embedding-V1](https://cloud.baidu.com/doc/WENXINWORKSHOP/s/alj562vvu) is a text representation model based on `Baidu Wenxin` large-scale model technology, \n", | ||||||
|     "which converts text into a vector form represented by numerical values, and is used in text retrieval, information recommendation, knowledge mining and other scenarios." |     "which converts text into a vector form represented by numerical values, and is used in text retrieval, information recommendation, knowledge mining and other scenarios." | ||||||
|    ] |    ] | ||||||
|   }, |   }, | ||||||
| @@ -53,8 +53,19 @@ | |||||||
|    "language": "python", |    "language": "python", | ||||||
|    "name": "python3" |    "name": "python3" | ||||||
|   }, |   }, | ||||||
|   "orig_nbformat": 4 |   "language_info": { | ||||||
|  |    "codemirror_mode": { | ||||||
|  |     "name": "ipython", | ||||||
|  |     "version": 3 | ||||||
|  |    }, | ||||||
|  |    "file_extension": ".py", | ||||||
|  |    "mimetype": "text/x-python", | ||||||
|  |    "name": "python", | ||||||
|  |    "nbconvert_exporter": "python", | ||||||
|  |    "pygments_lexer": "ipython3", | ||||||
|  |    "version": "3.10.12" | ||||||
|  |   } | ||||||
|  }, |  }, | ||||||
|  "nbformat": 4, |  "nbformat": 4, | ||||||
|  "nbformat_minor": 2 |  "nbformat_minor": 4 | ||||||
| } | } | ||||||
|   | |||||||
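For orientation, here is a minimal usage sketch of the ERNIE integration this notebook documents. The `ErnieEmbeddings` class name and its credential parameters are assumptions (they do not appear in this hunk), so treat the snippet as illustrative rather than authoritative.

```python
# Minimal sketch, not from this commit: assumes langchain exposes ErnieEmbeddings
# and that Baidu Wenxin credentials are passed via the parameters below.
from langchain.embeddings import ErnieEmbeddings

embeddings = ErnieEmbeddings(
    ernie_client_id="YOUR_CLIENT_ID",          # assumed parameter name
    ernie_client_secret="YOUR_CLIENT_SECRET",  # assumed parameter name
)

doc_vectors = embeddings.embed_documents(["hello", "world"])  # one vector per document
query_vector = embeddings.embed_query("hello")                # single vector for a query
```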
| @@ -5,14 +5,14 @@ | |||||||
|    "id": "900fbd04-f6aa-4813-868f-1c54e3265385", |    "id": "900fbd04-f6aa-4813-868f-1c54e3265385", | ||||||
|    "metadata": {}, |    "metadata": {}, | ||||||
|    "source": [ |    "source": [ | ||||||
|     "# Qdrant FastEmbed\n", |     "# FastEmbed by Qdrant\n", | ||||||
|     "\n", |     "\n", | ||||||
|     "[FastEmbed](https://qdrant.github.io/fastembed/) is a lightweight, fast, Python library built for embedding generation. \n", |     ">[FastEmbed](https://qdrant.github.io/fastembed/) from [Qdrant](https://qdrant.tech) is a lightweight, fast, Python library built for embedding generation. \n", | ||||||
|     "\n", |     ">\n", | ||||||
|     "- Quantized model weights\n", |     ">- Quantized model weights\n", | ||||||
|     "- ONNX Runtime, no PyTorch dependency\n", |     ">- ONNX Runtime, no PyTorch dependency\n", | ||||||
|     "- CPU-first design\n", |     ">- CPU-first design\n", | ||||||
|     "- Data-parallelism for encoding of large datasets." |     ">- Data-parallelism for encoding of large datasets." | ||||||
|    ] |    ] | ||||||
|   }, |   }, | ||||||
|   { |   { | ||||||
| @@ -154,7 +154,7 @@ | |||||||
|    "name": "python", |    "name": "python", | ||||||
|    "nbconvert_exporter": "python", |    "nbconvert_exporter": "python", | ||||||
|    "pygments_lexer": "ipython3", |    "pygments_lexer": "ipython3", | ||||||
|    "version": "3.11.6" |    "version": "3.10.12" | ||||||
|   } |   } | ||||||
|  }, |  }, | ||||||
|  "nbformat": 4, |  "nbformat": 4, | ||||||
|   | |||||||
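A minimal sketch of the FastEmbed integration covered by the retitled notebook, assuming the `FastEmbedEmbeddings` class and an installed `fastembed` package (neither is shown in this hunk).

```python
# Minimal sketch, assuming `pip install fastembed` and the FastEmbedEmbeddings class.
from langchain.embeddings.fastembed import FastEmbedEmbeddings

# Defaults to a small quantized ONNX model; a model_name can be passed to override it.
embeddings = FastEmbedEmbeddings()

doc_vectors = embeddings.embed_documents(
    ["FastEmbed is designed to be CPU-first", "It runs on ONNX Runtime without PyTorch"]
)
query_vector = embeddings.embed_query("Which runtime does FastEmbed use?")
```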
| @@ -5,8 +5,10 @@ | |||||||
|    "id": "59428e05", |    "id": "59428e05", | ||||||
|    "metadata": {}, |    "metadata": {}, | ||||||
|    "source": [ |    "source": [ | ||||||
|     "# InstructEmbeddings\n", |     "# Instruct Embeddings on Hugging Face\n", | ||||||
|     "Let's load the HuggingFace instruct Embeddings class." |     "\n", | ||||||
|  |     ">[Hugging Face sentence-transformers](https://huggingface.co/sentence-transformers) is a Python framework for state-of-the-art sentence, text and image embeddings.\n", | ||||||
|  |     ">One of the instruct embedding models is used in the `HuggingFaceInstructEmbeddings` class.\n" | ||||||
|    ] |    ] | ||||||
|   }, |   }, | ||||||
|   { |   { | ||||||
| @@ -85,7 +87,7 @@ | |||||||
|    "name": "python", |    "name": "python", | ||||||
|    "nbconvert_exporter": "python", |    "nbconvert_exporter": "python", | ||||||
|    "pygments_lexer": "ipython3", |    "pygments_lexer": "ipython3", | ||||||
|    "version": "3.9.1" |    "version": "3.10.12" | ||||||
|   }, |   }, | ||||||
|   "vscode": { |   "vscode": { | ||||||
|    "interpreter": { |    "interpreter": { | ||||||
|   | |||||||
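A minimal sketch of the `HuggingFaceInstructEmbeddings` class that the retitled notebook documents, assuming the `InstructorEmbedding` and `sentence_transformers` packages are installed; the instruction string below is a placeholder example.

```python
# Minimal sketch; the instruction text is a placeholder, not taken from this commit.
from langchain.embeddings import HuggingFaceInstructEmbeddings

embeddings = HuggingFaceInstructEmbeddings(
    query_instruction="Represent the query for retrieval: ",  # assumed example instruction
)

query_vector = embeddings.embed_query("This is a test document.")
```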
| @@ -2,183 +2,207 @@ | |||||||
|  "cells": [ |  "cells": [ | ||||||
|   { |   { | ||||||
|    "cell_type": "markdown", |    "cell_type": "markdown", | ||||||
|    "source": [ |  | ||||||
|     "# Johnsnowlabs Embedding\n", |  | ||||||
|     "\n", |  | ||||||
|     "### Loading the Johnsnowlabs embedding class to generate and query embeddings\n", |  | ||||||
|     "\n", |  | ||||||
|     "Models are loaded with [nlp.load](https://nlp.johnsnowlabs.com/docs/en/jsl/load_api) and spark session is started with [nlp.start()](https://nlp.johnsnowlabs.com/docs/en/jsl/start-a-sparksession) under the hood.\n", |  | ||||||
|     "For all 24.000+ models, see the [John Snow Labs Model Models Hub](https://nlp.johnsnowlabs.com/models)\n" |  | ||||||
|    ], |  | ||||||
|    "metadata": { |    "metadata": { | ||||||
|     "collapsed": false |     "collapsed": false, | ||||||
|    } |     "jupyter": { | ||||||
|  |      "outputs_hidden": false | ||||||
|  |     } | ||||||
|  |    }, | ||||||
|  |    "source": [ | ||||||
|  |     "# John Snow Labs\n", | ||||||
|  |     "\n", | ||||||
|  |     ">[John Snow Labs](https://nlp.johnsnowlabs.com/) NLP & LLM ecosystem includes software libraries for state-of-the-art AI at scale, Responsible AI, No-Code AI, and access to over 20,000 models for Healthcare, Legal, Finance, etc.\n", | ||||||
|  |     ">\n", | ||||||
|  |     ">Models are loaded with [nlp.load](https://nlp.johnsnowlabs.com/docs/en/jsl/load_api) and spark session is started >with [nlp.start()](https://nlp.johnsnowlabs.com/docs/en/jsl/start-a-sparksession) under the hood.\n", | ||||||
|  |     ">For all 24.000+ models, see the [John Snow Labs Model Models Hub](https://nlp.johnsnowlabs.com/models)\n" | ||||||
|  |    ] | ||||||
|   }, |   }, | ||||||
|   { |   { | ||||||
|    "cell_type": "markdown", |    "cell_type": "markdown", | ||||||
|  |    "metadata": {}, | ||||||
|    "source": [ |    "source": [ | ||||||
|     "! pip install johnsnowlabs\n" |     "## Setting up" | ||||||
|    ], |    ] | ||||||
|    "metadata": { |  | ||||||
|     "collapsed": false |  | ||||||
|    } |  | ||||||
|   }, |   }, | ||||||
|   { |   { | ||||||
|    "cell_type": "code", |    "cell_type": "code", | ||||||
|    "execution_count": null, |    "execution_count": null, | ||||||
|  |    "metadata": { | ||||||
|  |     "collapsed": false, | ||||||
|  |     "jupyter": { | ||||||
|  |      "outputs_hidden": false | ||||||
|  |     } | ||||||
|  |    }, | ||||||
|  |    "outputs": [], | ||||||
|  |    "source": [ | ||||||
|  |     "! pip install johnsnowlabs" | ||||||
|  |    ] | ||||||
|  |   }, | ||||||
|  |   { | ||||||
|  |    "cell_type": "code", | ||||||
|  |    "execution_count": null, | ||||||
|  |    "metadata": { | ||||||
|  |     "collapsed": false, | ||||||
|  |     "jupyter": { | ||||||
|  |      "outputs_hidden": false | ||||||
|  |     } | ||||||
|  |    }, | ||||||
|    "outputs": [], |    "outputs": [], | ||||||
|    "source": [ |    "source": [ | ||||||
|     "# If you have a enterprise license, you can run this to install enterprise features\n", |     "# If you have a enterprise license, you can run this to install enterprise features\n", | ||||||
|     "# from johnsnowlabs import nlp\n", |     "# from johnsnowlabs import nlp\n", | ||||||
|     "# nlp.install()" |     "# nlp.install()" | ||||||
|    ], |  | ||||||
|    "metadata": { |  | ||||||
|     "collapsed": false |  | ||||||
|    } |  | ||||||
|   }, |  | ||||||
|   { |  | ||||||
|    "cell_type": "code", |  | ||||||
|    "source": [ |  | ||||||
|     "#### Import the necessary classes" |  | ||||||
|    ], |  | ||||||
|    "metadata": { |  | ||||||
|     "collapsed": false |  | ||||||
|    }, |  | ||||||
|    "execution_count": 1, |  | ||||||
|    "outputs": [ |  | ||||||
|     { |  | ||||||
|      "name": "stdout", |  | ||||||
|      "output_type": "stream", |  | ||||||
|      "text": [ |  | ||||||
|       "Found existing installation: langchain 0.0.189\n", |  | ||||||
|       "Uninstalling langchain-0.0.189:\n", |  | ||||||
|       "  Successfully uninstalled langchain-0.0.189\n" |  | ||||||
|      ] |  | ||||||
|     } |  | ||||||
|    ] |    ] | ||||||
|   }, |   }, | ||||||
|   { |   { | ||||||
|    "cell_type": "markdown", |    "cell_type": "markdown", | ||||||
|    "source": [], |    "metadata": {}, | ||||||
|    "metadata": { |    "source": [ | ||||||
|     "collapsed": false |     "## Example" | ||||||
|    } |    ] | ||||||
|   }, |   }, | ||||||
|   { |   { | ||||||
|    "cell_type": "code", |    "cell_type": "code", | ||||||
|    "execution_count": null, |    "execution_count": null, | ||||||
|  |    "metadata": { | ||||||
|  |     "collapsed": false, | ||||||
|  |     "jupyter": { | ||||||
|  |      "outputs_hidden": false | ||||||
|  |     } | ||||||
|  |    }, | ||||||
|    "outputs": [], |    "outputs": [], | ||||||
|    "source": [ |    "source": [ | ||||||
|     "from langchain.embeddings.johnsnowlabs import JohnSnowLabsEmbeddings" |     "from langchain.embeddings.johnsnowlabs import JohnSnowLabsEmbeddings" | ||||||
|    ], |    ] | ||||||
|    "metadata": { |  | ||||||
|     "collapsed": false |  | ||||||
|    } |  | ||||||
|   }, |   }, | ||||||
|   { |   { | ||||||
|    "cell_type": "markdown", |    "cell_type": "markdown", | ||||||
|    "source": [ |  | ||||||
|     "#### Initialize Johnsnowlabs Embeddings and Spark Session" |  | ||||||
|    ], |  | ||||||
|    "metadata": { |    "metadata": { | ||||||
|     "collapsed": false |     "collapsed": false, | ||||||
|    } |     "jupyter": { | ||||||
|  |      "outputs_hidden": false | ||||||
|  |     } | ||||||
|  |    }, | ||||||
|  |    "source": [ | ||||||
|  |     "Initialize Johnsnowlabs Embeddings and Spark Session" | ||||||
|  |    ] | ||||||
|   }, |   }, | ||||||
|   { |   { | ||||||
|    "cell_type": "code", |    "cell_type": "code", | ||||||
|    "execution_count": null, |    "execution_count": null, | ||||||
|  |    "metadata": { | ||||||
|  |     "collapsed": false, | ||||||
|  |     "jupyter": { | ||||||
|  |      "outputs_hidden": false | ||||||
|  |     } | ||||||
|  |    }, | ||||||
|    "outputs": [], |    "outputs": [], | ||||||
|    "source": [ |    "source": [ | ||||||
|     "embedder = JohnSnowLabsEmbeddings(\"en.embed_sentence.biobert.clinical_base_cased\")" |     "embedder = JohnSnowLabsEmbeddings(\"en.embed_sentence.biobert.clinical_base_cased\")" | ||||||
|    ], |    ] | ||||||
|    "metadata": { |  | ||||||
|     "collapsed": false |  | ||||||
|    } |  | ||||||
|   }, |   }, | ||||||
|   { |   { | ||||||
|    "cell_type": "markdown", |    "cell_type": "markdown", | ||||||
|    "source": [ |  | ||||||
|     "#### Define some example texts . These could be any documents that you want to analyze - for example, news articles, social media posts, or product reviews." |  | ||||||
|    ], |  | ||||||
|    "metadata": { |    "metadata": { | ||||||
|     "collapsed": false |     "collapsed": false, | ||||||
|    } |     "jupyter": { | ||||||
|  |      "outputs_hidden": false | ||||||
|  |     } | ||||||
|  |    }, | ||||||
|  |    "source": [ | ||||||
|  |     "Define some example texts . These could be any documents that you want to analyze - for example, news articles, social media posts, or product reviews." | ||||||
|  |    ] | ||||||
|   }, |   }, | ||||||
|   { |   { | ||||||
|    "cell_type": "code", |    "cell_type": "code", | ||||||
|    "execution_count": null, |    "execution_count": null, | ||||||
|  |    "metadata": { | ||||||
|  |     "collapsed": false, | ||||||
|  |     "jupyter": { | ||||||
|  |      "outputs_hidden": false | ||||||
|  |     } | ||||||
|  |    }, | ||||||
|    "outputs": [], |    "outputs": [], | ||||||
|    "source": [ |    "source": [ | ||||||
|     "texts = [\"Cancer is caused by smoking\", \"Antibiotics aren't painkiller\"]" |     "texts = [\"Cancer is caused by smoking\", \"Antibiotics aren't painkiller\"]" | ||||||
|    ], |    ] | ||||||
|    "metadata": { |  | ||||||
|     "collapsed": false |  | ||||||
|    } |  | ||||||
|   }, |   }, | ||||||
|   { |   { | ||||||
|    "cell_type": "markdown", |    "cell_type": "markdown", | ||||||
|    "source": [ |  | ||||||
|     "#### Generate and print embeddings for the texts . The JohnSnowLabsEmbeddings class generates an embedding for each document, which is a numerical representation of the document's content. These embeddings can be used for various natural language processing tasks, such as document similarity comparison or text classification." |  | ||||||
|    ], |  | ||||||
|    "metadata": { |    "metadata": { | ||||||
|     "collapsed": false |     "collapsed": false, | ||||||
|    } |     "jupyter": { | ||||||
|  |      "outputs_hidden": false | ||||||
|  |     } | ||||||
|  |    }, | ||||||
|  |    "source": [ | ||||||
|  |     "Generate and print embeddings for the texts . The JohnSnowLabsEmbeddings class generates an embedding for each document, which is a numerical representation of the document's content. These embeddings can be used for various natural language processing tasks, such as document similarity comparison or text classification." | ||||||
|  |    ] | ||||||
|   }, |   }, | ||||||
|   { |   { | ||||||
|    "cell_type": "code", |    "cell_type": "code", | ||||||
|    "execution_count": null, |    "execution_count": null, | ||||||
|  |    "metadata": { | ||||||
|  |     "collapsed": false, | ||||||
|  |     "jupyter": { | ||||||
|  |      "outputs_hidden": false | ||||||
|  |     } | ||||||
|  |    }, | ||||||
|    "outputs": [], |    "outputs": [], | ||||||
|    "source": [ |    "source": [ | ||||||
|     "embeddings = embedder.embed_documents(texts)\n", |     "embeddings = embedder.embed_documents(texts)\n", | ||||||
|     "for i, embedding in enumerate(embeddings):\n", |     "for i, embedding in enumerate(embeddings):\n", | ||||||
|     "    print(f\"Embedding for document {i+1}: {embedding}\")" |     "    print(f\"Embedding for document {i+1}: {embedding}\")" | ||||||
|    ], |    ] | ||||||
|    "metadata": { |  | ||||||
|     "collapsed": false |  | ||||||
|    } |  | ||||||
|   }, |   }, | ||||||
|   { |   { | ||||||
|    "cell_type": "markdown", |    "cell_type": "markdown", | ||||||
|    "source": [ |  | ||||||
|     "#### Generate and print an embedding for a single piece of text. You can also generate an embedding for a single piece of text, such as a search query. This can be useful for tasks like information retrieval, where you want to find documents that are similar to a given query." |  | ||||||
|    ], |  | ||||||
|    "metadata": { |    "metadata": { | ||||||
|     "collapsed": false |     "collapsed": false, | ||||||
|    } |     "jupyter": { | ||||||
|  |      "outputs_hidden": false | ||||||
|  |     } | ||||||
|  |    }, | ||||||
|  |    "source": [ | ||||||
|  |     "Generate and print an embedding for a single piece of text. You can also generate an embedding for a single piece of text, such as a search query. This can be useful for tasks like information retrieval, where you want to find documents that are similar to a given query." | ||||||
|  |    ] | ||||||
|   }, |   }, | ||||||
|   { |   { | ||||||
|    "cell_type": "code", |    "cell_type": "code", | ||||||
|    "execution_count": null, |    "execution_count": null, | ||||||
|  |    "metadata": { | ||||||
|  |     "collapsed": false, | ||||||
|  |     "jupyter": { | ||||||
|  |      "outputs_hidden": false | ||||||
|  |     } | ||||||
|  |    }, | ||||||
|    "outputs": [], |    "outputs": [], | ||||||
|    "source": [ |    "source": [ | ||||||
|     "query = \"Cancer is caused by smoking\"\n", |     "query = \"Cancer is caused by smoking\"\n", | ||||||
|     "query_embedding = embedder.embed_query(query)\n", |     "query_embedding = embedder.embed_query(query)\n", | ||||||
|     "print(f\"Embedding for query: {query_embedding}\")" |     "print(f\"Embedding for query: {query_embedding}\")" | ||||||
|    ], |    ] | ||||||
|    "metadata": { |  | ||||||
|     "collapsed": false |  | ||||||
|    } |  | ||||||
|   } |   } | ||||||
|  ], |  ], | ||||||
|  "metadata": { |  "metadata": { | ||||||
|   "kernelspec": { |   "kernelspec": { | ||||||
|    "display_name": "Python 3", |    "display_name": "Python 3 (ipykernel)", | ||||||
|    "language": "python", |    "language": "python", | ||||||
|    "name": "python3" |    "name": "python3" | ||||||
|   }, |   }, | ||||||
|   "language_info": { |   "language_info": { | ||||||
|    "codemirror_mode": { |    "codemirror_mode": { | ||||||
|     "name": "ipython", |     "name": "ipython", | ||||||
|     "version": 2 |     "version": 3 | ||||||
|    }, |    }, | ||||||
|    "file_extension": ".py", |    "file_extension": ".py", | ||||||
|    "mimetype": "text/x-python", |    "mimetype": "text/x-python", | ||||||
|    "name": "python", |    "name": "python", | ||||||
|    "nbconvert_exporter": "python", |    "nbconvert_exporter": "python", | ||||||
|    "pygments_lexer": "ipython2", |    "pygments_lexer": "ipython3", | ||||||
|    "version": "2.7.6" |    "version": "3.10.12" | ||||||
|   } |   } | ||||||
|  }, |  }, | ||||||
|  "nbformat": 4, |  "nbformat": 4, | ||||||
|  "nbformat_minor": 0 |  "nbformat_minor": 4 | ||||||
| } | } | ||||||
|   | |||||||
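Pulled together from the notebook cells in the hunk above, the John Snow Labs flow reduces to the following consolidated snippet (verbatim from the diffed cells, only stitched into one block).

```python
from langchain.embeddings.johnsnowlabs import JohnSnowLabsEmbeddings

# Loads the model with nlp.load and starts a Spark session via nlp.start() under the hood.
embedder = JohnSnowLabsEmbeddings("en.embed_sentence.biobert.clinical_base_cased")

texts = ["Cancer is caused by smoking", "Antibiotics aren't painkiller"]
embeddings = embedder.embed_documents(texts)
for i, embedding in enumerate(embeddings):
    print(f"Embedding for document {i+1}: {embedding}")

query = "Cancer is caused by smoking"
query_embedding = embedder.embed_query(query)
print(f"Embedding for query: {query_embedding}")
```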
| @@ -5,11 +5,13 @@ | |||||||
|    "id": "ed47bb62", |    "id": "ed47bb62", | ||||||
|    "metadata": {}, |    "metadata": {}, | ||||||
|    "source": [ |    "source": [ | ||||||
|     "# Sentence Transformers\n", |     "# Sentence Transformers on Hugging Face\n", | ||||||
|     "\n", |     "\n", | ||||||
|     ">[SentenceTransformers](https://www.sbert.net/) embeddings are called using the `HuggingFaceEmbeddings` integration. We have also added an alias for `SentenceTransformerEmbeddings` for users who are more familiar with directly using that package.\n", |     ">[Hugging Face sentence-transformers](https://huggingface.co/sentence-transformers) is a Python framework for state-of-the-art sentence, text and image embeddings.\n", | ||||||
|  |     ">One of the embedding models is used in the `HuggingFaceEmbeddings` class.\n", | ||||||
|  |     ">We have also added an alias for `SentenceTransformerEmbeddings` for users who are more familiar with directly using that package.\n", | ||||||
|     "\n", |     "\n", | ||||||
|     "`SentenceTransformers` is a python package that can generate text and image embeddings, originating from [Sentence-BERT](https://arxiv.org/abs/1908.10084)" |     "`sentence_transformers` package models are originating from [Sentence-BERT](https://arxiv.org/abs/1908.10084)" | ||||||
|    ] |    ] | ||||||
|   }, |   }, | ||||||
|   { |   { | ||||||
|   | |||||||
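A minimal sketch of the `HuggingFaceEmbeddings` usage the retitled notebook describes; the model name is an assumed example, and `SentenceTransformerEmbeddings` is the alias for the same class mentioned in the hunk.

```python
# Minimal sketch, assuming `pip install sentence-transformers`; the model name is an example.
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

doc_vectors = embeddings.embed_documents(["This is a test document."])
query_vector = embeddings.embed_query("What is being embedded?")
```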
| @@ -5,7 +5,11 @@ | |||||||
|    "id": "fff4734f", |    "id": "fff4734f", | ||||||
|    "metadata": {}, |    "metadata": {}, | ||||||
|    "source": [ |    "source": [ | ||||||
|     "# TensorflowHub\n", |     "# TensorFlow Hub\n", | ||||||
|  |     "\n", | ||||||
|  |     ">[TensorFlow Hub](https://www.tensorflow.org/hub) is a repository of trained machine learning models ready for fine-tuning and deployable anywhere. Reuse trained models like `BERT` and `Faster R-CNN` with just a few lines of code.\n", | ||||||
|  |     ">\n", | ||||||
|  |     ">\n", | ||||||
|     "Let's load the TensorflowHub Embedding class." |     "Let's load the TensorflowHub Embedding class." | ||||||
|    ] |    ] | ||||||
|   }, |   }, | ||||||
| @@ -105,7 +109,7 @@ | |||||||
|    "name": "python", |    "name": "python", | ||||||
|    "nbconvert_exporter": "python", |    "nbconvert_exporter": "python", | ||||||
|    "pygments_lexer": "ipython3", |    "pygments_lexer": "ipython3", | ||||||
|    "version": "3.9.1" |    "version": "3.10.12" | ||||||
|   }, |   }, | ||||||
|   "vscode": { |   "vscode": { | ||||||
|    "interpreter": { |    "interpreter": { | ||||||
|   | |||||||
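A minimal sketch of the `TensorflowHubEmbeddings` class the notebook loads, assuming `tensorflow` and `tensorflow-hub` are installed; the default model choice is not shown in this hunk.

```python
# Minimal sketch; requires tensorflow and tensorflow-hub to be installed.
from langchain.embeddings import TensorflowHubEmbeddings

embeddings = TensorflowHubEmbeddings()  # uses the integration's default TF Hub model

query_vector = embeddings.embed_query("This is a test document.")
doc_vectors = embeddings.embed_documents(["This is a test document."])
```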
| @@ -7,6 +7,8 @@ | |||||||
|    "source": [ |    "source": [ | ||||||
|     "# Voyage AI\n", |     "# Voyage AI\n", | ||||||
|     "\n", |     "\n", | ||||||
|  |     ">[Voyage AI](https://www.voyageai.com/) provides cutting-edge embedding/vectorizations models.\n", | ||||||
|  |     "\n", | ||||||
|     "Let's load the Voyage Embedding class." |     "Let's load the Voyage Embedding class." | ||||||
|    ] |    ] | ||||||
|   }, |   }, | ||||||
| @@ -215,7 +217,7 @@ | |||||||
|    "name": "python", |    "name": "python", | ||||||
|    "nbconvert_exporter": "python", |    "nbconvert_exporter": "python", | ||||||
|    "pygments_lexer": "ipython3", |    "pygments_lexer": "ipython3", | ||||||
|    "version": "3.9.18" |    "version": "3.10.12" | ||||||
|   }, |   }, | ||||||
|   "vscode": { |   "vscode": { | ||||||
|    "interpreter": { |    "interpreter": { | ||||||
|   | |||||||
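A minimal sketch of the `VoyageEmbeddings` class this notebook loads; the API key is a placeholder and the parameter name is an assumption, not something shown in this hunk.

```python
# Minimal sketch; the API key is a placeholder, not a value from this commit.
from langchain.embeddings import VoyageEmbeddings

embeddings = VoyageEmbeddings(voyage_api_key="YOUR_VOYAGE_API_KEY")  # assumed parameter name

doc_vectors = embeddings.embed_documents(["Voyage AI provides embedding models."])
query_vector = embeddings.embed_query("Which provider built these embeddings?")
```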