langchain/docs/extras/ecosystem/integrations
corranmac 20c6ade2fc
Grobid parser for Scientific Articles from PDF (#6729)
### Scientific Article PDF Parsing via Grobid

`Description:`
This change adds the GrobidParser class, which uses the Grobid library
to parse scientific articles into a universal XML format containing the
article title, references, sections, section text, etc. The GrobidParser
sends PDF documents to a local Grobid server, which returns them as XML,
and then parses the XML to produce documents of either individual
sentences or whole paragraphs. Metadata includes the text, paragraph
number, PDF-relative bounding boxes, pages (text may span two pages),
section title (Introduction, Methodology, etc.), section_number (e.g.
1.1, 2.3), the title of the paper, and finally the file path.
      
Grobid parsing is useful beyond standard PDF parsing because it
accurately outputs sections and the paragraphs within them. This allows
for post-filtering of results by section, e.g. limiting results to the
methodology or results section. While sections are split via headings,
ideally they could be classified specifically into introduction,
methodology, results, discussion, and conclusion. I'm currently
experimenting with chatgpt-3.5 for this function, which could later be
implemented as a textsplitter.
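
The post-filtering described above can be sketched in plain Python. This is
a minimal illustration, not the parser's own code: the `Doc` class is a
hypothetical stand-in for a parsed document, and the `section_title`
metadata key is an assumed name (only `section_number` is spelled out in
the description).

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a parsed document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_section(docs, keyword):
    """Keep documents whose section title contains `keyword` (case-insensitive)."""
    needle = keyword.lower()
    return [
        d for d in docs
        # "section_title" is an assumed key name; missing titles never match.
        if needle in str(d.metadata.get("section_title", "")).lower()
    ]


# Made-up sample data, shaped like paragraph-level output.
docs = [
    Doc("We study X.", {"section_title": "Introduction", "section_number": "1"}),
    Doc("We sampled Y.", {"section_title": "Methodology", "section_number": "2.1"}),
    Doc("Accuracy improved.", {"section_title": "Results", "section_number": "3"}),
]

methods = filter_by_section(docs, "method")
print([d.page_content for d in methods])
```

Matching on a substring of the heading is only a rough filter, since
headings vary between papers; a classifier of the kind mentioned above
would be more robust.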

`Dependencies:`
To use it, the grobid repo must be cloned and Java must be installed.
For Colab, this is:

```
# Colab notebook cell: lines starting with "!" are shell commands.
import os

!apt-get install -y openjdk-11-jdk -q
!update-alternatives --set java /usr/lib/jvm/java-11-openjdk-amd64/bin/java
!git clone https://github.com/kermitt2/grobid.git
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.chdir('grobid')
# Build Grobid from source.
!./gradlew clean install
```

Once installed, the server is run on localhost:8070 via:
```
get_ipython().system_raw('nohup ./gradlew run > grobid.log 2>&1 &')
```
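
Because the server is started in the background, it may not be ready when
the next cell runs. A minimal sketch of waiting for it is below. It assumes
the Grobid service exposes its liveness check at `/api/isalive` on the same
host and port; the function names and the retry settings are illustrative,
not part of this change.

```python
import time
import urllib.request


def grobid_is_alive(base_url="http://localhost:8070", timeout=5.0):
    """Return True if the server answers its liveness endpoint with HTTP 200."""
    try:
        # Assumed endpoint: /api/isalive on the Grobid service.
        with urllib.request.urlopen(f"{base_url}/api/isalive", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error responses.
        return False


def wait_for_grobid(probe=grobid_is_alive, attempts=30, delay=2.0):
    """Poll `probe` until it succeeds or `attempts` run out."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False


# Usage (performs network calls, so it is left commented out here):
# if not wait_for_grobid():
#     raise RuntimeError("Grobid did not start; check grobid.log")
```

Passing the probe in as a parameter keeps the retry loop easy to exercise
without a running server.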

@rlancemartin, @eyurtsev

Twitter Handle: @Corranmac

Grobid Demo Notebook is
[here](https://colab.research.google.com/drive/1X-St_mQRmmm8YWtct_tcJNtoktbdGBmd?usp=sharing).

---------

Co-authored-by: rlm <pexpresss31@gmail.com>
2023-06-29 14:29:29 -07:00
vectara docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
agent_with_wandb_tracing.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
ai21.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
aim_tracking.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
airbyte.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
aleph_alpha.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
alibabacloud_opensearch.md Add Alibaba Cloud OpenSearch as a new vector store (#6154) 2023-06-20 10:07:40 -07:00
amazon_api_gateway.mdx Amazon API Gateway hosted LLM (#6673) 2023-06-23 21:27:25 -07:00
analyticdb.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
annoy.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
anyscale.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
apify.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
argilla.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
arxiv.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
atlas.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
awadb.md docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
aws_s3.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
azlyrics.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
azure_blob_storage.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
azure_cognitive_search_.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
azure_openai.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
bananadev.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
baseten.md Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
beam.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
bedrock.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
bilibili.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
blackboard.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
cassandra.mdx Cassandra support for chat history using CassIO library (#6771) 2023-06-29 10:50:34 -07:00
cerebriumai.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
chroma.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
clearml_tracking.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
cohere.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
college_confidential.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
comet_tracking.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
confluence.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
ctransformers.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
databerry.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
databricks.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
databricks.md docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
deepinfra.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
deeplake.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
diffbot.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
discord.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
docugami.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
duckdb.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
elasticsearch.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
evernote.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
facebook_chat.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
figma.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
forefrontai.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
git.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
gitbook.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
google_bigquery.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
google_cloud_storage.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
google_drive.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
google_search.mdx docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
google_serper.mdx docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
gooseai.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
gpt4all.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
graphsignal.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
grobid.mdx Grobid parser for Scientific Articles from PDF (#6729) 2023-06-29 14:29:29 -07:00
gutenberg.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
hacker_news.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
hazy_research.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
helicone.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
hologres.mdx docs: vectorstore upgrades 2 (#6796) 2023-06-26 22:55:04 -07:00
huggingface.mdx docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
ifixit.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
imsdb.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
infino.mdx Infino integration for simplified logs, metrics & search across LLM data & token usage (#6218) 2023-06-21 01:38:20 -07:00
jina.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
lancedb.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
langchain_decorators.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
llamacpp.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
mediawikidump.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
metal.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
microsoft_onedrive.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
microsoft_powerpoint.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
microsoft_word.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
milvus.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
mlflow_tracking.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
modal.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
modelscope.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
modern_treasury.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
momento.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
motherduck.mdx add motherduck docs (#6572) 2023-06-21 23:13:45 -07:00
myscale.mdx Fix Typo in LangChain MyScale Integration Doc (#6705) 2023-06-25 11:54:00 -07:00
nlpcloud.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
notion.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
obsidian.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
openai.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
openllm.mdx fix(docs): broken link for OpenLLM (#6622) 2023-06-23 13:59:17 -07:00
opensearch.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
openweathermap.mdx docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
petals.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
pgvector.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
pinecone.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
pipelineai.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
predictionguard.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
promptlayer.mdx docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
psychic.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
qdrant.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
ray_serve.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
rebuff.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
reddit.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
redis.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
replicate.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
roam.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
rockset.mdx docs: vectorstore upgrades 2 (#6796) 2023-06-26 22:55:04 -07:00
runhouse.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
rwkv.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
sagemaker_endpoint.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
searx.mdx docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
serpapi.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
shaleprotocol.md Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
singlestoredb.mdx docs: vectorstore upgrades 2 (#6796) 2023-06-26 22:55:04 -07:00
sklearn.mdx docs: vectorstore upgrades 2 (#6796) 2023-06-26 22:55:04 -07:00
slack.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
spacy.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
spreedly.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
starrocks.mdx docs: vectorstore upgrades 2 (#6796) 2023-06-26 22:55:04 -07:00
stochasticai.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
stripe.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
tair.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
telegram.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
tigris.mdx docs: vectorstore upgrades 2 (#6796) 2023-06-26 22:55:04 -07:00
tomarkdown.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
trello.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
twitter.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
typesense.mdx docs: vectorstore upgrades 2 (#6796) 2023-06-26 22:55:04 -07:00
unstructured.mdx Docs/unstructured api key (#6781) 2023-06-27 16:54:15 -07:00
vespa.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
wandb_tracking.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
weather.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
weaviate.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
whatsapp.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
whylabs_profiling.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
wikipedia.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
wolfram_alpha.mdx docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
writer.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
yeagerai.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
youtube.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
zep.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
zilliz.mdx Doc refactor (#6300) 2023-06-16 11:52:56 -07:00