Today, a tremendous amount of geospatial knowledge is hidden in massive volumes of text data. To facilitate flexible and powerful geospatial analysis and applications, we introduce a new architecture: the geospatial knowledge hypercube, a multi-scale, multidimensional knowledge structure that integrates geospatial dimensions, thematic themes, and diverse application semantics extracted and computed from spatially related text data. To construct such a knowledge hypercube, weakly supervised language models are leveraged for automatic, dynamic, and incremental extraction of heterogeneous geospatial data, thematic themes, latent connections and relationships, and application semantics, combining a variety of information from unstructured text, structured tables, and maps. The hypercube lays a foundation for knowledge discovery, in-depth spatial analysis, and other advanced applications. We have deployed a prototype web application of the proposed geospatial knowledge hypercube for public access at https://hcwebapp.cigi.illinois.edu
This tutorial is designed for anyone interested in Natural Language Processing (NLP) in general, and the geospatial knowledge hypercube in particular, who wants to gain hands-on experience with a general-purpose NLP library as well as some state-of-the-art language models for solving information extraction (IE) and document classification tasks. The hypercube lays a foundation for knowledge discovery by combining text and geospatial data analytics. We hope this tutorial brings you not only a general understanding of the geospatial knowledge hypercube but also creative ideas for your own work.
The tutorial will guide you through the following steps:
BERT is an abbreviation for Bidirectional Encoder Representations from Transformers, originally introduced in a research paper by Google AI Language in 2018 [1]. It revolutionized a wide variety of NLP tasks, including question answering (QA) and natural language inference (NLI). Unlike the recently popular GPT (short for Generative Pre-trained Transformer) models, which operate in a unidirectional, autoregressive fashion, BERT is a bidirectional language model and is better suited to natural language understanding.
[1] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv'18
The figure below demonstrates how BERT handles natural language input: it tokenizes words into embeddings (vectors that deep learning models can process), encodes them, and makes predictions for the Named Entity Recognition (NER) task. As illustrated, BERT is trained to assign labels such as B-PER (Beginning of Person), I-PER (Inside of Person), B-LOC (Beginning of Location), and O (not a named entity).
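To make the tokenization step concrete, here is a minimal sketch using the Hugging Face transformers library (an extra dependency assumed only for this illustration; it is not used in the rest of the tutorial) to see how BERT's WordPiece tokenizer splits a sentence before BIO labels are assigned:
# Illustration only: inspect BERT's WordPiece tokenization
# (assumes `pip install transformers`; not required elsewhere in this tutorial)
from transformers import BertTokenizer
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
print(bert_tokenizer.tokenize("Sarah lives near the Sangamon River in Illinois."))
# Rare words may be split into sub-word pieces marked with '##';
# each piece is mapped to an embedding vector before BERT encodes the sequence.
Note that one word can map to several sub-word tokens, which is why the geoparser later in this tutorial tracks "valid positions" to align predictions back to whole words.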
Here we first show how to use SpaCy, a general-purpose NLP library, to perform general-purpose NER and visualize the results. For installation of SpaCy, please refer to SpaCy's official documentation.
# !pip install -U pip setuptools wheel
# !pip install -U spacy
# download trained pipeline in "efficiency" mode
!python -m spacy download en_core_web_sm
import os
import json
import spacy
from spacy import displacy # visualizer
from spacy.tokens import Span
from tqdm import tqdm
from collections import defaultdict
# nlp = spacy.load("en_core_web_trf") # accuracy
nlp = spacy.load("en_core_web_sm") # efficiency
# set up output dir
out_dir = './output'
os.makedirs(out_dir, exist_ok=True)
data_list = []
with open('./data/news_samples.txt', encoding='utf-8') as f:
    readin = f.readlines()
    for line in tqdm(readin):
        data_list.append(line.strip())
print(f'Length of data_list: {len(data_list)}')
for text in data_list:
    print(text)
for i, text in enumerate(tqdm(data_list)):
    doc = nlp(text)
    displacy.render(doc, style='ent', jupyter=True)
    entity_dict = defaultdict(int)
    for entity in doc.ents:
        if entity.label_ in ['LOC', 'GPE', 'FAC']:  # LOCation, GeoPolitical Entity (i.e. countries, cities, states), FACility
            entity_dict[entity.label_ + '_' + entity.text] += 1
            # text_seq = entity.text.split()
            # if text_seq[0].lower() in watershed_word or text_seq[-1].lower() in watershed_word:
            #     entity_dict[entity.text] += 1
    with open(os.path.join(out_dir, f'{i}_out.txt'), 'w') as fout:
        fout.write(json.dumps(entity_dict) + '\n')
import requests
import folium
import matplotlib.pyplot as plt
%matplotlib inline
# load target results
target_news = 1
with open(f'./output/{target_news}_out.txt') as f:
    ner = f.read()
ner_list = ner.split('\n')
ner_num = ner_list[0]
ner_js = json.loads(ner_num)
ner_js
ner_class = {}
for key in ner_js.keys():
    class_ = key[:3]
    if class_ not in ner_class.keys():
        ner_class[class_] = {}
# need Google Maps API key
my_Google_Maps_API_key = ''
for key in ner_js.keys():
    class_, place_name = key.split('_', 1)  # split only on the first '_' in case the place name itself contains one
    if place_name not in ner_class[class_].keys():
        response = requests.get(f'https://maps.googleapis.com/maps/api/geocode/json?address={place_name}&key={my_Google_Maps_API_key}')
        if response.json()['results']:
            ner_class[class_][place_name] = response.json()['results'][0]['geometry']['location']
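If you do not have a Google Maps API key, a free alternative is the Nominatim geocoder. The commented-out sketch below is only an illustration and assumes the geopy package, which is not part of this tutorial's requirements:
# Optional alternative: geocode with the free Nominatim service via geopy
# (assumes `pip install geopy`; respect Nominatim's usage policy and rate limits)
# from geopy.geocoders import Nominatim
# geolocator = Nominatim(user_agent='hypercube_tutorial')
# for key in ner_js.keys():
#     class_, place_name = key.split('_', 1)
#     if place_name not in ner_class[class_]:
#         location = geolocator.geocode(place_name)
#         if location:
#             ner_class[class_][place_name] = {'lat': location.latitude, 'lng': location.longitude}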
m = folium.Map(
    location=[38, -97],
    tiles="cartodbpositron",
    zoom_start=5,
)
# LOC
if 'LOC' in ner_class.keys():
    for key in ner_class['LOC']:
        lat, lon = ner_class['LOC'][key]['lat'], ner_class['LOC'][key]['lng']
        folium.Marker([lat, lon], popup=key, icon=folium.Icon(color='red')).add_to(m)
# FAC
if 'FAC' in ner_class.keys():
    for key in ner_class['FAC']:
        lat, lon = ner_class['FAC'][key]['lat'], ner_class['FAC'][key]['lng']
        folium.Marker([lat, lon], popup=key, icon=folium.Icon(color='green')).add_to(m)
# GPE
if 'GPE' in ner_class.keys():
    for key in ner_class['GPE']:
        lat, lon = ner_class['GPE'][key]['lat'], ner_class['GPE'][key]['lng']
        folium.Marker([lat, lon], popup=key, icon=folium.Icon(color='blue')).add_to(m)
m
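Besides rendering the map inline, you can also save it as a standalone HTML file so the result can be shared outside the notebook (the file name below is just an example):
# optionally save the interactive map for sharing
m.save(os.path.join(out_dir, 'ner_map.html'))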
Although SpaCy offers a generic tool for general-purpose Named Entity Recognition (NER), it may not be the optimal solution for extracting geographic information from a text corpus. Specifically, there is a line of research focusing on extracting location-related information, or toponyms, from raw text. This sub-task of information extraction (IE) is known as geoparsing. Unlike general NER, geoparsing must handle toponyms that are fine-grained and do not appear in gazetteers (location dictionaries), as well as locations described by phrases, abbreviations, or local landmarks. Here we implement a state-of-the-art (SOTA) geoparser, called TopoBERT [1], for geoparsing. By fine-tuning the pretrained language model (PLM) BERT, TopoBERT yields SOTA performance on the geoparsing task.
[1] TopoBERT: Plug and Play Toponym Recognition Module Harnessing Fine-tuned BERT. ArXiv'23
[2] Geospatial Knowledge Hypercube. SIGSPATIAL'23
Given the computational resource and time limitations, here we skip the training of the TopoBERT model and directly load the pre-trained model weights. If you are interested in the fine-tuning process, please refer to the source code in ./geoparser
import os
import json
import re
import queue
import nltk
import torch  # used below for tensors and inference
import torch.nn.functional as F  # used below for softmax over logits
from nltk import word_tokenize
from geoparser.backbone_models import *
from geoparser.dataset_process import *
nltk.download('punkt')
DEFAULT_TOPOBERT_PATH = './geoparser/topobert_cnn1d/'
class TopoBERT:
    def __init__(self, model_dir: str = DEFAULT_TOPOBERT_PATH):
        '''
        Args:
            model_dir (str): Locate the dir that stores all the model files, or simply use the default path.
        '''
        try:
            self.model, self.tokenizer, self.model_config, self.training_config = self.load_model(model_dir)
            self.label_map = {"1": "O", "2": "B-LOC", "3": "I-LOC", "4": "[CLS]", "5": "[SEP]"}
            self.max_seq_length = self.training_config["--max_seq_length"]
            self.label_map = {int(k): v for k, v in self.label_map.items()}
            # self.device = "cuda:3" if torch.cuda.is_available() else "cpu"
            self.device = 'cpu'
            # Add model to device and set eval mode:
            self.model = self.model.to(self.device)
            self.model.eval()
        except Exception as e:
            print(e)
    def load_model(self, model_dir: str, model_config: str = "model_config.json"):
        # Load model config:
        model_config_file = os.path.join(model_dir, model_config)
        current_model_config = json.load(open(model_config_file))
        # Init model and load pretrained params:
        # model = BertSimpleNer.from_pretrained(model_dir)
        model = BertCNN1DNer.from_pretrained(model_dir, model_config=current_model_config)
        # Get training config:
        current_training_config = os.path.join(model_dir, 'train_config.json')
        current_training_config = json.load(open(current_training_config))
        # Init tokenizer:
        tokenizer = BertTokenizer.from_pretrained(model_dir, do_lower_case=current_training_config['--do_lower_case'])
        # Return all params:
        return model, tokenizer, current_model_config, current_training_config
    def tokenize(self, text: str):
        """ tokenize input """
        words = word_tokenize(text)
        tokens = []
        valid_positions = []
        for i, word in enumerate(words):
            token = self.tokenizer.tokenize(word)
            tokens.extend(token)
            for i in range(len(token)):
                if i == 0:
                    valid_positions.append(1)
                else:
                    valid_positions.append(0)
        return tokens, valid_positions
    def preprocess(self, text: str):
        """ preprocess """
        tokens, valid_positions = self.tokenize(text)
        ## insert "[CLS]"
        tokens.insert(0, "[CLS]")
        valid_positions.insert(0, 1)
        ## insert "[SEP]"
        tokens.append("[SEP]")
        valid_positions.append(1)
        segment_ids = []
        for i in range(len(tokens)):
            segment_ids.append(0)
        input_ids = self.tokenizer.convert_tokens_to_ids(tokens)
        input_mask = [1] * len(input_ids)
        while len(input_ids) < self.max_seq_length:  # Padding
            input_ids.append(0)
            input_mask.append(0)
            segment_ids.append(0)
            valid_positions.append(0)
        return input_ids, input_mask, segment_ids, valid_positions
    def prettify_result(self, org_result):
        '''
        Take the original result list from TopoBERT and formulate the prettified output.
        Args:
            org_result (list): list of the predicted results for each token.
        Returns:
            Prettified result output with extra information
        '''
        combined_addresses = []  # A list of all addresses
        address_results = []  # A list of all predicted LOC tokens
        full_address = ''  # Links all addresses in combined_addresses
        tmp_queue = queue.Queue(maxsize=20)
        tmp_address = ''
        for index, content in enumerate(org_result):
            if content['tag'] == 'B-LOC':  # If B-LOC, clear, save and enqueue
                # If not empty, empty it and save data:
                if not tmp_queue.empty():
                    # Get all content out first:
                    while not tmp_queue.empty():
                        tmp_address += str(tmp_queue.get()) + ' '
                    tmp_address = tmp_address.strip()
                    combined_addresses.append(tmp_address)
                    tmp_address = ''
                # Enqueue
                tmp_queue.put(content['word'].strip())
                # Save location entity:
                address_results.append(content['word'].strip())
            elif content['tag'] == 'I-LOC':  # If I-LOC, enqueue directly
                # Enqueue
                tmp_queue.put(content['word'].strip())
                # Save location entity:
                address_results.append(content['word'].strip())
            else:  # Else, clear and save
                if not tmp_queue.empty():
                    # Get all content out first:
                    while not tmp_queue.empty():
                        tmp_address += str(tmp_queue.get()) + ' '
                    tmp_address = tmp_address.strip()
                    combined_addresses.append(tmp_address)
                    tmp_address = ''
        # Deal with remaining data:
        if not tmp_queue.empty():
            # Get all content out first:
            while not tmp_queue.empty():
                tmp_address += str(tmp_queue.get()) + ' '
            tmp_address = tmp_address.strip()
            combined_addresses.append(tmp_address)
            tmp_address = ''
        # Get full address:
        for add_content in combined_addresses:
            full_address += ' ' + add_content
        full_address = full_address.strip()
        # Construct output result:
        result_dict = {
            'combined_addresses': combined_addresses,
            'full_address': full_address,
            'address_results': address_results
        }
        return result_dict
    def predict(self, text: str):
        input_ids, input_mask, segment_ids, valid_ids = self.preprocess(text)
        input_ids = torch.tensor([input_ids], dtype=torch.long, device=self.device)
        input_mask = torch.tensor([input_mask], dtype=torch.long, device=self.device)
        segment_ids = torch.tensor([segment_ids], dtype=torch.long, device=self.device)
        valid_ids = torch.tensor([valid_ids], dtype=torch.long, device=self.device)
        with torch.no_grad():
            logits = self.model(input_ids, segment_ids, input_mask, None, valid_ids, None)
        logits = F.softmax(logits, dim=2)
        logits_label = torch.argmax(logits, dim=2)
        logits_label = logits_label.detach().cpu().numpy().tolist()[0]
        logits_confidence = [values[label].item() for values, label in zip(logits[0], logits_label)]
        logits = []
        pos = 0
        for index, mask in enumerate(valid_ids[0]):
            if index == 0:
                continue
            if mask == 1:
                logits.append((logits_label[index-pos], logits_confidence[index-pos]))
            else:
                pos += 1
        logits.pop()
        labels = [(self.label_map[label], confidence) for label, confidence in logits]
        words = word_tokenize(text)
        assert len(labels) == len(words)
        output = [{"word": word, "tag": label, "confidence": confidence} for word, (label, confidence) in zip(words, labels)]
        prettified_result = self.prettify_result(output)
        result_dict = prettified_result
        result_dict['org_result'] = output
        return result_dict
HarveyTweet2017 is a labeled Twitter dataset originally collected by the University of North Texas. Each tweet contains some location description relevant to disaster rescue. Below are some examples from HarveyTweet2017.
test_text = '''12 Y/O BOY NEEDs RESCUED! 8100 Cypresswood Dr Spring TX 77379 They are trapped on second story! #houstonflood'''
current_geoparser = TopoBERT()
result = current_geoparser.predict(test_text)
print(result)
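The returned dictionary contains combined_addresses, full_address, address_results, and the per-token org_result. As a rough sketch, the extracted addresses can be geocoded and mapped the same way as the SpaCy entities above (this reuses my_Google_Maps_API_key, requests, and folium from the earlier cells; the map center and marker color are arbitrary choices):
# Sketch: geocode the geoparsed addresses and plot them on a folium map
harvey_map = folium.Map(location=[29.76, -95.37], tiles='cartodbpositron', zoom_start=10)
for address in result['combined_addresses']:
    response = requests.get(f'https://maps.googleapis.com/maps/api/geocode/json?address={address}&key={my_Google_Maps_API_key}')
    if response.json()['results']:
        loc = response.json()['results'][0]['geometry']['location']
        folium.Marker([loc['lat'], loc['lng']], popup=address, icon=folium.Icon(color='purple')).add_to(harvey_map)
harvey_map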
Current text classification methods typically require a good number of human-labeled documents as training data, which can be costly and difficult to obtain in real applications. Humans, however, can perform classification without seeing any labeled examples, relying only on a small set of words describing the categories to be classified. Motivated by this, we implement a state-of-the-art (SOTA) weakly supervised document classification model, namely LOTClass [1], which uses only the label name of each class to train classification models on unlabeled data, without using any labeled documents. The pre-trained language model (PLM) BERT is used both as a source of general linguistic knowledge for category understanding and as a representation learning model for document classification.
[1] Text Classification using Label Names Only: A Language Model Self-training Approach. EMNLP'20
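To build some intuition for how a masked language model can understand a category from its label name alone, the sketch below uses the Hugging Face fill-mask pipeline (an extra dependency assumed only for this illustration; it is not the LOTClass implementation shipped with this tutorial) to ask BERT which words could fill a blank related to a label name. LOTClass relies on this kind of masked prediction to build a category vocabulary from unlabeled text:
# Illustration only: masked-word predictions hint at category-related vocabulary
# (assumes `pip install transformers`; this is not the LOTClass code itself)
from transformers import pipeline
unmasker = pipeline('fill-mask', model='bert-base-uncased')
for prediction in unmasker("The dam failure caused severe damage to the local [MASK]."):
    print(prediction['token_str'], round(prediction['score'], 3))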
We collected a corpus of 188 Google News documents about dam failures under the path ./data/new_aging_dam.txt. Our task here is to automatically discover the coherent topic of each chunk of a document, so we trained the LOTClass model with four labels given in natural language: ecology, human, economy, and infrastructure.
Given the computational resource and time limitations, here we skip the training of the LOTClass model and directly load the pre-trained model weights. If you are interested in the self-training process, please refer to the source code in ./classifier
# break document into sentences
with open('./data/news_samples.txt', 'r', encoding='utf-8') as f:
    for j, line in enumerate(f):
        # with open(out_file, 'a') as out:
        #     out.write(str(j))
        #     out.write('\n')
        doc = nlp(line)
        for sentence in doc.sents:
            with open('./data/news_sentences.txt', 'a', encoding='utf-8') as out:
                out.write(sentence.text)
                out.write('\n')
# check classification results
## load news - use first document as an example
with open('./data/news_samples.txt', 'r', encoding='utf-8') as f:
    for j, line in enumerate(f):
        print(line)
label_dict = {0: 'ecology',
              1: 'human',
              2: 'economy',
              3: 'infrastructure'}
with open('./data/news_sentences.txt', 'r', encoding='utf-8') as doc:
    with open('./classifier/out.txt', 'r', encoding='utf-8') as label:
        for j, (sentence, lab) in enumerate(zip(doc, label)):
            if sentence == '\n':
                continue
            print(j+1, sentence, label_dict[int(lab[0])], '\n')