diff --git a/docs/docs/integrations/document_loaders/diffbot.ipynb b/docs/docs/integrations/document_loaders/diffbot.ipynb index 34e9eb9f788..b07f7413eed 100644 --- a/docs/docs/integrations/document_loaders/diffbot.ipynb +++ b/docs/docs/integrations/document_loaders/diffbot.ipynb @@ -3,74 +3,139 @@ { "cell_type": "markdown", "id": "2dfc4698", - "metadata": {}, + "metadata": { + "id": "2dfc4698" + }, "source": [ "# Diffbot\n", "\n", - ">Unlike traditional web scraping tools, [Diffbot](https://docs.diffbot.com/docs) doesn't require any rules to read the content on a page.\n", - ">It starts with computer vision, which classifies a page into one of 20 possible types. Content is then interpreted by a machine learning model trained to identify the key attributes on a page based on its type.\n", - ">The result is a website transformed into clean structured data (like JSON or CSV), ready for your application.\n", + ">[Diffbot](https://docs.diffbot.com/docs/getting-started-with-diffbot) is a suite of ML-based products that make it easy to structure web data.\n", "\n", - "This covers how to extract HTML documents from a list of URLs using the [Diffbot extract API](https://www.diffbot.com/products/extract/), into a document format that we can use downstream.\n" + ">Diffbot's [Extract API](https://docs.diffbot.com/reference/extract-introduction) is a service that structures and normalizes data from web pages.\n", + "\n", + ">Unlike traditional web scraping tools, `Diffbot Extract` doesn't require any rules to read the content on a page. It uses a computer vision model to classify a page into one of 20 possible types, and then transforms raw HTML markup into JSON. The resulting structured JSON follows a consistent [type-based ontology](https://docs.diffbot.com/docs/ontology), which makes it easy to extract data from multiple different web sources with the same schema.\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/docs/integrations/document_loaders/diffbot.ipynb)\n" ] }, { - "cell_type": "code", - "execution_count": 4, - "id": "836fbac1", - "metadata": {}, - "outputs": [], + "cell_type": "markdown", + "id": "weuw9JFG4q97", + "metadata": { + "id": "weuw9JFG4q97" + }, "source": [ - "urls = [\n", - " \"https://python.langchain.com/en/latest/index.html\",\n", - "]" + "## Overview\n", + "This guide covers how to extract data from a list of URLs using the [Diffbot Extract API](https://www.diffbot.com/products/extract/) into structured JSON that we can use downstream." ] }, { "cell_type": "markdown", "id": "6fffec88", - "metadata": {}, + "metadata": { + "id": "6fffec88" + }, "source": [ - "The Diffbot Extract API Requires an API token. Once you have it, you can extract the data.\n", + "## Setting up\n", "\n", - "Read [instructions](https://docs.diffbot.com/reference/authentication) how to get the Diffbot API Token." + "Start by installing the required packages." ] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 6, + "id": "ACzXAS352vRc", + "metadata": { + "id": "ACzXAS352vRc" + }, + "outputs": [], + "source": [ + "%pip install --upgrade --quiet langchain-community" + ] + }, + { + "cell_type": "markdown", + "id": "EaIggS702wUJ", + "metadata": { + "id": "EaIggS702wUJ" + }, + "source": [ + "Diffbot's Extract API requires an API token. 
Follow these instructions to [get a free API token](/docs/integrations/providers/diffbot#installation-and-setup) and then set an environment variable." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "836fbac1", + "metadata": { + "id": "836fbac1" + }, + "outputs": [], + "source": [ + "%env DIFFBOT_API_TOKEN REPLACE_WITH_YOUR_TOKEN" + ] + }, + { + "cell_type": "markdown", + "id": "qtQun216x6Wy", + "metadata": { + "id": "qtQun216x6Wy" + }, + "source": [ + "## Using the Document Loader\n", + "\n", + "Import the DiffbotLoader module and instantiate it with a list of URLs and your Diffbot token." + ] + }, + { + "cell_type": "code", + "execution_count": 10, "id": "00f46fda", - "metadata": {}, + "metadata": { + "id": "00f46fda" + }, "outputs": [], "source": [ "import os\n", "\n", "from langchain_community.document_loaders import DiffbotLoader\n", "\n", + "urls = [\n", + " \"https://python.langchain.com/\",\n", + "]\n", + "\n", "loader = DiffbotLoader(urls=urls, api_token=os.environ.get(\"DIFFBOT_API_TOKEN\"))" ] }, { "cell_type": "markdown", "id": "e0ce8c05", - "metadata": {}, + "metadata": { + "id": "e0ce8c05" + }, "source": [ "With the `.load()` method, you can see the documents loaded" ] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 11, "id": "b68a26b3", - "metadata": {}, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "b68a26b3", + "outputId": "b97ab582-1c99-4b6c-f3fa-8e583177c4f7" + }, "outputs": [ { "data": { "text/plain": [ - "[Document(page_content='LangChain is a framework for developing applications powered by language models. We believe that the most powerful and differentiated applications will not only call out to a language model via an API, but will also:\\nBe data-aware: connect a language model to other sources of data\\nBe agentic: allow a language model to interact with its environment\\nThe LangChain framework is designed with the above principles in mind.\\nThis is the Python specific portion of the documentation. For a purely conceptual guide to LangChain, see here. For the JavaScript documentation, see here.\\nGetting Started\\nCheckout the below guide for a walkthrough of how to get started using LangChain to create an Language Model application.\\nGetting Started Documentation\\nModules\\nThere are several main modules that LangChain provides support for. For each module we provide some examples to get started, how-to guides, reference docs, and conceptual guides. These modules are, in increasing order of complexity:\\nModels: The various model types and model integrations LangChain supports.\\nPrompts: This includes prompt management, prompt optimization, and prompt serialization.\\nMemory: Memory is the concept of persisting state between calls of a chain/agent. LangChain provides a standard interface for memory, a collection of memory implementations, and examples of chains/agents that use memory.\\nIndexes: Language models are often more powerful when combined with your own text data - this module covers best practices for doing exactly that.\\nChains: Chains go beyond just a single LLM call, and are sequences of calls (whether to an LLM or a different utility). LangChain provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications.\\nAgents: Agents involve an LLM making decisions about which Actions to take, taking that Action, seeing an Observation, and repeating that until done. 
LangChain provides a standard interface for agents, a selection of agents to choose from, and examples of end to end agents.\\nUse Cases\\nThe above modules can be used in a variety of ways. LangChain also provides guidance and assistance in this. Below are some of the common use cases LangChain supports.\\nPersonal Assistants: The main LangChain use case. Personal assistants need to take actions, remember interactions, and have knowledge about your data.\\nQuestion Answering: The second big LangChain use case. Answering questions over specific documents, only utilizing the information in those documents to construct an answer.\\nChatbots: Since language models are good at producing text, that makes them ideal for creating chatbots.\\nQuerying Tabular Data: If you want to understand how to use LLMs to query data that is stored in a tabular format (csvs, SQL, dataframes, etc) you should read this page.\\nInteracting with APIs: Enabling LLMs to interact with APIs is extremely powerful in order to give them more up-to-date information and allow them to take actions.\\nExtraction: Extract structured information from text.\\nSummarization: Summarizing longer documents into shorter, more condensed chunks of information. A type of Data Augmented Generation.\\nEvaluation: Generative models are notoriously hard to evaluate with traditional metrics. One new way of evaluating them is using language models themselves to do the evaluation. LangChain provides some prompts/chains for assisting in this.\\nReference Docs\\nAll of LangChain’s reference documentation, in one place. Full documentation on all methods, classes, installation methods, and integration setups for LangChain.\\nReference Documentation\\nLangChain Ecosystem\\nGuides for how other companies/products can be used with LangChain\\nLangChain Ecosystem\\nAdditional Resources\\nAdditional collection of resources we think may be useful as you develop your application!\\nLangChainHub: The LangChainHub is a place to share and explore other prompts, chains, and agents.\\nGlossary: A glossary of all related terms, papers, methods, etc. Whether implemented in LangChain or not!\\nGallery: A collection of our favorite projects that use LangChain. Useful for finding inspiration or seeing how things were done in other applications.\\nDeployments: A collection of instructions, code snippets, and template repositories for deploying LangChain apps.\\nTracing: A guide on using tracing in LangChain to visualize the execution of chains and agents.\\nModel Laboratory: Experimenting with different prompts, models, and chains is a big part of developing the best possible application. The ModelLaboratory makes it easy to do so.\\nDiscord: Join us on our Discord to discuss all things LangChain!\\nProduction Support: As you move your LangChains into production, we’d love to offer more comprehensive support. Please fill out this form and we’ll set up a dedicated support Slack channel.', metadata={'source': 'https://python.langchain.com/en/latest/index.html'})]" + "[Document(page_content=\"LangChain is a framework for developing applications powered by large language models (LLMs).\\nLangChain simplifies every stage of the LLM application lifecycle:\\nDevelopment: Build your applications using LangChain's open-source building blocks and components. 
Hit the ground running using third-party integrations and Templates.\\nProductionization: Use LangSmith to inspect, monitor and evaluate your chains, so that you can continuously optimize and deploy with confidence.\\nDeployment: Turn any chain into an API with LangServe.\\nlangchain-core: Base abstractions and LangChain Expression Language.\\nlangchain-community: Third party integrations.\\nPartner packages (e.g. langchain-openai, langchain-anthropic, etc.): Some integrations have been further split into their own lightweight packages that only depend on langchain-core.\\nlangchain: Chains, agents, and retrieval strategies that make up an application's cognitive architecture.\\nlanggraph: Build robust and stateful multi-actor applications with LLMs by modeling steps as edges and nodes in a graph.\\nlangserve: Deploy LangChain chains as REST APIs.\\nThe broader ecosystem includes:\\nLangSmith: A developer platform that lets you debug, test, evaluate, and monitor LLM applications and seamlessly integrates with LangChain.\\nGet started\\nWe recommend following our Quickstart guide to familiarize yourself with the framework by building your first LangChain application.\\nSee here for instructions on how to install LangChain, set up your environment, and start building.\\nnote\\nThese docs focus on the Python LangChain library. Head here for docs on the JavaScript LangChain library.\\nUse cases\\nIf you're looking to build something specific or are more of a hands-on learner, check out our use-cases. They're walkthroughs and techniques for common end-to-end tasks, such as:\\nQuestion answering with RAG\\nExtracting structured output\\nChatbots\\nand more!\\nExpression Language\\nLangChain Expression Language (LCEL) is the foundation of many of LangChain's components, and is a declarative way to compose chains. LCEL was designed from day 1 to support putting prototypes in production, with no code changes, from the simplest “prompt + LLM” chain to the most complex chains.\\nGet started: LCEL and its benefits\\nRunnable interface: The standard interface for LCEL objects\\nPrimitives: More on the primitives LCEL includes\\nand more!\\nEcosystem\\n🦜🛠️ LangSmith\\nTrace and evaluate your language model applications and intelligent agents to help you move from prototype to production.\\n🦜🕸️ LangGraph\\nBuild stateful, multi-actor applications with LLMs, built on top of (and intended to be used with) LangChain primitives.\\n🦜🏓 LangServe\\nDeploy LangChain runnables and chains as REST APIs.\\nSecurity\\nRead up on our Security best practices to make sure you're developing safely with LangChain.\\nAdditional resources\\nComponents\\nLangChain provides standard, extendable interfaces and integrations for many different components, including:\\nIntegrations\\nLangChain is part of a rich ecosystem of tools that integrate with our framework and build on top of it. 
Check out our growing list of integrations.\\nGuides\\nBest practices for developing with LangChain.\\nAPI reference\\nHead to the reference section for full documentation of all classes and methods in the LangChain and LangChain Experimental Python packages.\\nContributing\\nCheck out the developer's guide for guidelines on contributing and help getting your dev environment set up.\\nHelp us out by providing feedback on this documentation page:\", metadata={'source': 'https://python.langchain.com/'})]" ] }, - "execution_count": 6, + "execution_count": 11, "metadata": {}, "output_type": "execute_result" } @@ -78,9 +143,63 @@ "source": [ "loader.load()" ] + }, + { + "cell_type": "markdown", + "id": "c07U9jK45thF", + "metadata": { + "id": "c07U9jK45thF" + }, + "source": [ + "## Transform Extracted Text to a Graph Document\n", + "\n", + "Structured page content can be further processed with `DiffbotGraphTransformer` to extract entities and relationships into a graph." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "UtuZafDL6azi", + "metadata": { + "id": "UtuZafDL6azi" + }, + "outputs": [], + "source": [ + "%pip install --upgrade --quiet langchain-experimental" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "jS0bmVuE5FFJ", + "metadata": { + "id": "jS0bmVuE5FFJ" + }, + "outputs": [], + "source": [ + "from langchain_experimental.graph_transformers.diffbot import DiffbotGraphTransformer\n", + "\n", + "diffbot_nlp = DiffbotGraphTransformer(\n", + " diffbot_api_key=os.environ.get(\"DIFFBOT_API_TOKEN\")\n", + ")\n", + "graph_documents = diffbot_nlp.convert_to_graph_documents(loader.load())" + ] + }, + { + "cell_type": "markdown", + "id": "5nx2MXRe6_kr", + "metadata": { + "id": "5nx2MXRe6_kr" + }, + "source": [ + "To continue loading the data into a Knowledge Graph, follow the [`DiffbotGraphTransformer` guide](/docs/integrations/graphs/diffbot/#loading-the-data-into-a-knowledge-graph)." 
+ ] } ], "metadata": { + "colab": { + "provenance": [] + }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", diff --git a/docs/docs/integrations/graphs/diffbot.ipynb b/docs/docs/integrations/graphs/diffbot.ipynb index 6d1e56dbaa7..7beb4b026ab 100644 --- a/docs/docs/integrations/graphs/diffbot.ipynb +++ b/docs/docs/integrations/graphs/diffbot.ipynb @@ -7,24 +7,21 @@ "source": [ "# Diffbot\n", "\n", - "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/docs/use_cases/graph/diffbot_graphtransformer.ipynb)\n", - "\n", - ">[Diffbot](https://docs.diffbot.com/docs/getting-started-with-diffbot) is a suite of products that make it easy to integrate and research data on the web.\n", + ">[Diffbot](https://docs.diffbot.com/docs/getting-started-with-diffbot) is a suite of ML-based products that make it easy to structure web data.\n", ">\n", - ">[The Diffbot Knowledge Graph](https://docs.diffbot.com/docs/getting-started-with-diffbot-knowledge-graph) is a self-updating graph database of the public web.\n", + ">Diffbot's [Natural Language Processing API](https://www.diffbot.com/products/natural-language/) allows for the extraction of entities, relationships, and semantic meaning from unstructured text data.", "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/docs/integrations/graphs/diffbot.ipynb)\n", "\n", "## Use case\n", "\n", "Text data often contain rich relationships and insights used for various analytics, recommendation engines, or knowledge management applications.\n", "\n", - "`Diffbot's NLP API` allows for the extraction of entities, relationships, and semantic meaning from unstructured text data.\n", - "\n", "By coupling `Diffbot's NLP API` with `Neo4j`, a graph database, you can create powerful, dynamic graph structures based on the information extracted from text. These graph structures are fully queryable and can be integrated into various applications.\n", "\n", "This combination allows for use cases such as:\n", "\n", - "* Building knowledge graphs from textual documents, websites, or social media feeds.\n", + "* Building knowledge graphs (like [Diffbot's Knowledge Graph](https://www.diffbot.com/products/knowledge-graph/)) from textual documents, websites, or social media feeds.\n", "* Generating recommendations based on semantic relationships in the data.\n", "* Creating advanced search features that understand the relationships between entities.\n", "* Building analytics dashboards that allow users to explore the hidden relationships in data.\n", @@ -57,11 +54,11 @@ "id": "77718977-629e-46c2-b091-f9191b9ec569", "metadata": {}, "source": [ - "### Diffbot NLP Service\n", + "### Diffbot NLP API\n", "\n", - "`Diffbot's NLP` service is a tool for extracting entities, relationships, and semantic context from unstructured text data.\n", + "`Diffbot's NLP API` is a tool for extracting entities, relationships, and semantic context from unstructured text data.\n", "This extracted information can be used to construct a knowledge graph.\n", - "To use their service, you'll need to obtain an API key from [Diffbot](https://www.diffbot.com/products/natural-language/)." + "To use the API, you'll need to obtain a [free API token from Diffbot](https://app.diffbot.com/get-started/)." 
] }, { @@ -73,8 +70,8 @@ "source": [ "from langchain_experimental.graph_transformers.diffbot import DiffbotGraphTransformer\n", "\n", - "diffbot_api_key = \"DIFFBOT_API_KEY\"\n", - "diffbot_nlp = DiffbotGraphTransformer(diffbot_api_key=diffbot_api_key)" + "diffbot_api_token = \"DIFFBOT_API_TOKEN\"\n", + "diffbot_nlp = DiffbotGraphTransformer(diffbot_api_token=diffbot_api_token)" ] }, { diff --git a/docs/docs/integrations/providers/diffbot.mdx b/docs/docs/integrations/providers/diffbot.mdx index da130e3cc1d..1a9e9934642 100644 --- a/docs/docs/integrations/providers/diffbot.mdx +++ b/docs/docs/integrations/providers/diffbot.mdx @@ -1,18 +1,29 @@ # Diffbot ->[Diffbot](https://docs.diffbot.com/docs) is a service to read web pages. Unlike traditional web scraping tools, -> `Diffbot` doesn't require any rules to read the content on a page. ->It starts with computer vision, which classifies a page into one of 20 possible types. Content is then interpreted by a machine learning model trained to identify the key attributes on a page based on its type. ->The result is a website transformed into clean-structured data (like JSON or CSV), ready for your application. +> [Diffbot](https://docs.diffbot.com/docs) is a suite of ML-based products that make it easy to structure and integrate web data. ## Installation and Setup -Read [instructions](https://docs.diffbot.com/reference/authentication) how to get the Diffbot API Token. +[Get a free Diffbot API token](https://app.diffbot.com/get-started/) and [follow these instructions](https://docs.diffbot.com/reference/authentication) to authenticate your requests. ## Document Loader +Diffbot's [Extract API](https://docs.diffbot.com/reference/extract-introduction) is a service that structures and normalizes data from web pages. + +Unlike traditional web scraping tools, `Diffbot Extract` doesn't require any rules to read the content on a page. It uses a computer vision model to classify a page into one of 20 possible types, and then transforms raw HTML markup into JSON. The resulting structured JSON follows a consistent [type-based ontology](https://docs.diffbot.com/docs/ontology), which makes it easy to extract data from multiple different web sources with the same schema. + See a [usage example](/docs/integrations/document_loaders/diffbot). ```python from langchain_community.document_loaders import DiffbotLoader ``` + +## Graphs + +Diffbot's [Natural Language Processing API](https://www.diffbot.com/products/natural-language/) allows for the extraction of entities, relationships, and semantic meaning from unstructured text data. + +See a [usage example](/docs/integrations/graphs/diffbot). + +```python +from langchain_experimental.graph_transformers.diffbot import DiffbotGraphTransformer +```
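+
+A minimal sketch combining the two APIs, assuming the `DIFFBOT_API_TOKEN` environment variable is set and that `DiffbotGraphTransformer` accepts the token via the `diffbot_api_key` argument (as in the document loader notebook); adjust the keyword if your version of `langchain-experimental` expects a different name:
+
+```python
+import os
+
+from langchain_community.document_loaders import DiffbotLoader
+from langchain_experimental.graph_transformers.diffbot import DiffbotGraphTransformer
+
+# Assumption: a valid Diffbot token is exported as DIFFBOT_API_TOKEN.
+token = os.environ.get("DIFFBOT_API_TOKEN")
+
+# Extract clean page content with the Extract API.
+loader = DiffbotLoader(urls=["https://python.langchain.com/"], api_token=token)
+documents = loader.load()
+
+# Derive entities and relationships from that content with the NLP API.
+diffbot_nlp = DiffbotGraphTransformer(diffbot_api_key=token)
+graph_documents = diffbot_nlp.convert_to_graph_documents(documents)
+
+# Each resulting GraphDocument exposes the extracted nodes and relationships.
+print(graph_documents[0].nodes[:5])
+print(graph_documents[0].relationships[:5])
+```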