langchain/docs/docs/integrations/document_loaders/geopandas.ipynb
Erick Friis 7bc100fd43
docs: integration package pip installs (#15762)
More than 300 files - will fail check_diff. Will merge after Vercel
deploy succeeds

Still occurrences that need changing - will update more later
2024-01-09 11:13:10 -08:00

190 lines
6.2 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"id": "ca4c8c2a",
"metadata": {},
"source": [
"# Geopandas\n",
"\n",
"[Geopandas](https://geopandas.org/en/stable/index.html) is an open-source project to make working with geospatial data in python easier. \n",
"\n",
"GeoPandas extends the datatypes used by pandas to allow spatial operations on geometric types. \n",
"\n",
"Geometric operations are performed by shapely. Geopandas further depends on fiona for file access and matplotlib for plotting.\n",
"\n",
"LLM applications (chat, QA) that utilize geospatial data are an interesting area for exploration."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "00b3bf80",
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade --quiet sodapy\n",
"%pip install --upgrade --quiet pandas\n",
"%pip install --upgrade --quiet geopandas"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "cecc9320",
"metadata": {},
"outputs": [],
"source": [
"import ast\n",
"\n",
"import geopandas as gpd\n",
"import pandas as pd\n",
"from langchain_community.document_loaders import OpenCityDataLoader"
]
},
{
"cell_type": "markdown",
"id": "04981332",
"metadata": {},
"source": [
"Create a GeoPandas dataframe from [`Open City Data`](https://python.langchain.com/docs/integrations/document_loaders/open_city_data) as an example input."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5e7de46b",
"metadata": {},
"outputs": [],
"source": [
"# Load Open City Data\n",
"dataset = \"tmnf-yvry\" # San Francisco crime data\n",
"loader = OpenCityDataLoader(city_id=\"data.sfgov.org\", dataset_id=dataset, limit=5000)\n",
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "7cda2e38",
"metadata": {},
"outputs": [],
"source": [
"# Convert list of dictionaries to DataFrame\n",
"df = pd.DataFrame([ast.literal_eval(d.page_content) for d in docs])\n",
"\n",
"# Extract latitude and longitude\n",
"df[\"Latitude\"] = df[\"location\"].apply(lambda loc: loc[\"coordinates\"][1])\n",
"df[\"Longitude\"] = df[\"location\"].apply(lambda loc: loc[\"coordinates\"][0])\n",
"\n",
"# Create geopandas DF\n",
"gdf = gpd.GeoDataFrame(\n",
" df, geometry=gpd.points_from_xy(df.Longitude, df.Latitude), crs=\"EPSG:4326\"\n",
")\n",
"\n",
"# Only keep valid longitudes and latitudes for San Francisco\n",
"gdf = gdf[\n",
" (gdf[\"Longitude\"] >= -123.173825)\n",
" & (gdf[\"Longitude\"] <= -122.281780)\n",
" & (gdf[\"Latitude\"] >= 37.623983)\n",
" & (gdf[\"Latitude\"] <= 37.929824)\n",
"]"
]
},
{
"cell_type": "markdown",
"id": "030a535c",
"metadata": {},
"source": [
"Visualization of the sample of SF crime data. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8148a63e",
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"# Load San Francisco map data\n",
"sf = gpd.read_file(\"https://data.sfgov.org/resource/3psu-pn9h.geojson\")\n",
"\n",
"# Plot the San Francisco map and the points\n",
"fig, ax = plt.subplots(figsize=(10, 10))\n",
"sf.plot(ax=ax, color=\"white\", edgecolor=\"black\")\n",
"gdf.plot(ax=ax, color=\"red\", markersize=5)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "a081a9d1",
"metadata": {},
"source": [
"Load GeoPandas dataframe as a `Document` for downstream processing (embedding, chat, etc). \n",
"\n",
"The `geometry` will be the default `page_content` columns, and all other columns are placed in `metadata`.\n",
"\n",
"But, we can specify the `page_content_column`."
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "381a5f7b",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.document_loaders import GeoDataFrameLoader\n",
"\n",
"loader = GeoDataFrameLoader(data_frame=gdf, page_content_column=\"geometry\")\n",
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "74baf6ee",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='POINT (-122.420084075249 37.7083109744362)', metadata={'pdid': '4133422003074', 'incidntnum': '041334220', 'incident_code': '03074', 'category': 'ROBBERY', 'descript': 'ROBBERY, BODILY FORCE', 'dayofweek': 'Monday', 'date': '2004-11-22T00:00:00.000', 'time': '17:50', 'pddistrict': 'INGLESIDE', 'resolution': 'NONE', 'address': 'GENEVA AV / SANTOS ST', 'x': '-122.420084075249', 'y': '37.7083109744362', 'location': {'type': 'Point', 'coordinates': [-122.420084075249, 37.7083109744362]}, ':@computed_region_26cr_cadq': '9', ':@computed_region_rxqg_mtj9': '8', ':@computed_region_bh8s_q3mv': '309', ':@computed_region_6qbp_sg9q': nan, ':@computed_region_qgnn_b9vv': nan, ':@computed_region_ajp5_b2md': nan, ':@computed_region_yftq_j783': nan, ':@computed_region_p5aj_wyqh': nan, ':@computed_region_fyvs_ahh9': nan, ':@computed_region_6pnf_4xz7': nan, ':@computed_region_jwn9_ihcz': nan, ':@computed_region_9dfj_4gjx': nan, ':@computed_region_4isq_27mq': nan, ':@computed_region_pigm_ib2e': nan, ':@computed_region_9jxd_iqea': nan, ':@computed_region_6ezc_tdp2': nan, ':@computed_region_h4ep_8xdi': nan, ':@computed_region_n4xg_c4py': nan, ':@computed_region_fcz8_est8': nan, ':@computed_region_nqbw_i6c3': nan, ':@computed_region_2dwj_jsy4': nan, 'Latitude': 37.7083109744362, 'Longitude': -122.420084075249})"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[0]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}