I am no GIS expert. I am just a simple programmer with no actual, formal background in anything geospatial. All things that I write here are things that I am still grasping to understand myself. I guess I decided to write them down because I want to understand them better, and the best way to understand them better is to write them down the way I wish I had learned them in the first place.
What I’m going to talk about here is vector geospatial data. There are at least three kinds of vector geometry: point, line, and polygon. There are many file formats in the geospatial field. The shapefile (SHP) format is probably one of the best known and most common, but there are also GeoJSON, KML, KMZ, and many others, including some that I have probably never heard of (as I said, I am no expert). I’ll be showing you some of the ways I have learned to read these files in Python.
Fortunately, there are a lot of great libraries available in Python that can help you with this task. You can take a look at Shapely, Fiona, GeoPandas, and GeoTable, to name a few. Which library to use depends on the data format you have in hand. Is it a shapefile (SHP), Keyhole Markup Language (KML), zipped KML (KMZ), or simply an Excel file?
If your data are stored in CSV or another tabular format, the simplest way to read them is with Pandas. Pandas is a powerful library for data processing. Data loaded this way end up in a Pandas DataFrame. From there, you can convert them to whatever format you need.
As an example, I am going to use the same code from my other article (originally on Medium/TowardsDataScience which you can read here) to read a CSV file with Pandas.
import pandas as pd

df = pd.read_csv("Indonesian_volcanoes.csv")
df.head()
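If you don't have that CSV file at hand, here is a minimal sketch of the same idea using a tiny in-memory table. The column names (name, lat, lon) and the sample rows are assumptions made up for illustration, with approximate coordinates, and are not the contents of the actual file.

```python
import io

import pandas as pd

# Hypothetical sample rows; column names and values are illustrative only
csv_text = """name,lat,lon
Merapi,-7.54,110.446
Krakatau,-6.102,105.423
Bromo,-7.942,112.95
"""

# read_csv accepts any file-like object, so StringIO stands in for a file on disk
df = pd.read_csv(io.StringIO(csv_text))

print(df.dtypes)  # check that lat and lon were parsed as numbers
print(df.head())
```

Checking the dtypes early is worth the extra line: if the coordinates come in as text, later geometry steps will need a numeric conversion first.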
The data in variable df are in Pandas DataFrame format. Depending on what you aim to do, this format is enough for many tasks, including plotting the data on a map using Folium (which I explain further in my article on the link above). However, if you need to process the data further, you may want to convert them into a GeoDataFrame, which you can do as follows.
Converting a Pandas DataFrame to GeoDataFrame Format
import geopandas as gpd

# The class is GeoDataFrame; this assumes df already has a 'geometry' column
gdf = gpd.GeoDataFrame(df, crs="EPSG:4326")
CRS is short for Coordinate Reference System, which defines how the coordinates in your data map to locations on Earth. EPSG:4326 (WGS 84) is used here because the data in my document are simply in longitude and latitude. You may need a different CRS if your data use a different coordinate system. You can read more about it in the official documentation here.
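To make the role of the CRS a bit more concrete, here is a small sketch that reprojects a single point from EPSG:4326 to EPSG:3857 (Web Mercator, in metres). The point and the choice of target CRS are my own assumptions for illustration; your data may call for a different one.

```python
import geopandas as gpd
from shapely.geometry import Point

# One illustrative point given as (longitude, latitude) in EPSG:4326
gdf = gpd.GeoDataFrame(
    {"name": ["sample"]},
    geometry=[Point(110.446, -7.54)],
    crs="EPSG:4326",
)

# to_crs returns a new GeoDataFrame with the coordinates transformed
gdf_mercator = gdf.to_crs("EPSG:3857")

print(gdf.crs.to_epsg(), gdf.geometry.iloc[0])
print(gdf_mercator.crs.to_epsg(), gdf_mercator.geometry.iloc[0])
```

The same location is described by very different numbers in the two systems, which is why the CRS has to travel with the data.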
The GeoDataFrame conversion shown earlier only works if your data already have a geometry field. The geometry field is a column dedicated to storing geometry data (usually longitude and latitude as Shapely Point, LineString, or Polygon objects). If your coordinates are still stored as plain values, for example in separate lon and lat columns, you may need to do the following.
import geopandas as gpd

# Build Point geometries from the longitude (x) and latitude (y) columns
geometry = gpd.points_from_xy(df['lon'], df['lat'])
# EPSG:4326 because the coordinates are plain longitude/latitude
gdf = gpd.GeoDataFrame(df, geometry=geometry, crs="EPSG:4326")
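If the coordinates really are stored as text, one way to handle it is to convert them to numbers first. The sketch below assumes a hypothetical table with string-typed lon and lat columns; the names and values are made up for illustration.

```python
import geopandas as gpd
import pandas as pd

# Hypothetical table where the coordinates arrived as strings
df = pd.DataFrame(
    {
        "name": ["A", "B"],
        "lon": ["110.446", "105.423"],
        "lat": ["-7.54", "-6.102"],
    }
)

# Convert the text columns to floats before building geometries
df["lon"] = pd.to_numeric(df["lon"])
df["lat"] = pd.to_numeric(df["lat"])

# x is longitude, y is latitude
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["lon"], df["lat"]),
    crs="EPSG:4326",
)

print(gdf.geometry.iloc[0])
```

Note the argument order: points_from_xy takes x (longitude) first and y (latitude) second, which is easy to swap by accident.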
GeoPandas is an extension of Pandas made specifically for processing geospatial data. The data you load using GeoPandas will already be in GeoDataFrame format, so you don’t need to convert them.
If you have a file in Shapefile (SHP) format, you can get GeoPandas to read it. Make sure that the accompanying files are present in the same directory as the SHP. The SHP, SHX, and DBF files are the mandatory parts of a shapefile, PRJ carries the CRS, and others such as SBN, SBX, CPG, and XML are optional extras. You may not open them directly, but your SHP file needs them to complete the attributes and metadata.
import geopandas as gpd

gdf = gpd.read_file("data/my_file.shp")
Generally, a file in SHP format already has a column called geometry where the geometry data are stored. This means that you won’t need to define it again as you did when you converted a Pandas DataFrame to a GeoDataFrame in the previous example.
In the same way, you can use GeoPandas to load GeoJSON and GeoPackage (GPKG) files.
import geopandas as gpd

gdf = gpd.read_file("data/my_file.geojson")
# OR
gdf = gpd.read_file("data/my_file.gpkg")
You can read more about which file extensions can be read by GeoPandas on their documentation page here.
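If your GeoJSON arrives as text rather than as a file (from a web API, for instance), one possible approach is to parse it with the json module and pass the features to GeoDataFrame.from_features. This is a different route from read_file, shown here only as a sketch; the sample feature is made up for illustration.

```python
import json

import geopandas as gpd

# A hypothetical GeoJSON FeatureCollection held in a string
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"name": "sample point"},
      "geometry": {"type": "Point", "coordinates": [110.446, -7.54]}
    }
  ]
}
"""

collection = json.loads(geojson_text)

# GeoJSON coordinates are longitude/latitude, hence EPSG:4326
gdf = gpd.GeoDataFrame.from_features(collection["features"], crs="EPSG:4326")

print(gdf)
```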
Sometimes, the geometry column doesn’t exist, and instead, the geometries are stored as WKT (Well-Known Text) strings. Here’s one way to deal with it.
import geopandas as gpd
from shapely import wkt

# Parse each WKT string into a Shapely geometry
geometry = df['wkt'].apply(wkt.loads)
# Drop the raw text column now that it has been parsed
df = df.drop('wkt', axis=1)
gdf = gpd.GeoDataFrame(df, crs="EPSG:4326", geometry=geometry)
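As an alternative sketch, recent GeoPandas versions also offer GeoSeries.from_wkt, which parses a whole column of WKT strings in one call. The table below, including the column name wkt and its values, is a hypothetical example.

```python
import geopandas as gpd
import pandas as pd

# Hypothetical table with geometries stored as WKT text
df = pd.DataFrame(
    {
        "name": ["a point", "a line"],
        "wkt": ["POINT (110.446 -7.54)", "LINESTRING (0 0, 1 1)"],
    }
)

# Parse the whole column at once and attach the CRS
geometry = gpd.GeoSeries.from_wkt(df["wkt"], crs="EPSG:4326")

# Keep the attribute columns, replace the text column with real geometries
gdf = gpd.GeoDataFrame(df.drop(columns="wkt"), geometry=geometry)

print(gdf.geom_type)
```

Either route ends in the same place; this one just keeps the parsing inside GeoPandas.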
Last but not least, there are probably many other ways you can read/load your geospatial data in Python (using GeoTable for example), but I guess the methods I explained above are about enough. If I find other great methods to do so, I’ll make sure to update this article in the future. In the meantime, I hope you find this writing helpful.
Thanks for reading!