Geospatial Knowledge Hypercube

Author: Zhaonan Wang, University of Illinois Urbana-Champaign

Summary

Today, a tremendous amount of geospatial knowledge is hidden in massive volumes of text data. To facilitate flexible and powerful geospatial analysis and applications, we introduce a new architecture: the geospatial knowledge hypercube, a multi-scale, multidimensional knowledge structure that integrates geospatial dimensions, thematic topics, and diverse application semantics, extracted and computed from spatially related text data. To construct such a knowledge hypercube, weakly supervised language models are leveraged for the automatic, dynamic, and incremental extraction of heterogeneous geospatial data, thematic topics, latent connections and relationships, and application semantics, combining information from unstructured text, structured tables, and maps. The hypercube lays a foundation for knowledge discovery, in-depth spatial analysis, and other advanced applications. We have deployed a prototype web application of the proposed geospatial knowledge hypercube for public access at https://hcwebapp.cigi.illinois.edu

./figure/Knowledge_Hypercube.png

Learning Objectives

This tutorial is designed for anyone interested in Natural Language Processing (NLP) in general, and the geospatial knowledge hypercube in particular, who wants hands-on experience with a general-purpose NLP library as well as some state-of-the-art language models for solving information extraction (IE) and document classification tasks. The hypercube lays a foundation for knowledge discovery by combining text and geospatial data analytics. We hope this tutorial will give you not only a general understanding of the geospatial knowledge hypercube but also ideas for applying these techniques creatively in your own work.

The tutorial will guide you through the following steps:

  • Import necessary libraries
  • Load sample corpus
  • Understand the pre-trained language model (PLM): BERT
  • Perform general-purpose Named Entity Recognition (NER)
  • Construct hypercube: geospatial dimension
  • Construct hypercube: topic dimension

Requirements

  • Python >= 3.7
  • NLTK (any version compatible with your Python)
  • SpaCy 3.4
  • PyTorch >= 1.7
  • Transformers (by Hugging Face) >= 4.0
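
If you are starting from a fresh environment, the cell below is one way to install these requirements from within a notebook (a sketch only; the exact versions and package index depend on your setup):

In [ ]:
# install the requirements listed above (PyTorch is distributed as "torch" on PyPI)
!pip install nltk "spacy==3.4.*" "torch>=1.7" "transformers>=4.0"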

Understand the Pre-trained Language Model (PLM): BERT

BERT is an abbreviation for Bidirectional Encoder Representations from Transformers, originally introduced in a research paper by Google AI Language in 2018 [1]. It revolutionized a wide variety of NLP tasks, including question answering (QA) and natural language inference (NLI). Unlike the recently popular GPT (short for Generative Pre-trained Transformer) models, which work in a unidirectional, autoregressive fashion, BERT is a bidirectional language model and is better suited to natural language understanding.

[1] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv'18

The figure below demonstrates how BERT handles natural language input: it tokenizes words into embeddings (vectors that deep learning models can process), encodes them, and makes predictions for the Named Entity Recognition (NER) task. As illustrated, BERT is trained to assign labels such as B-PER (Beginning of Person), I-PER (Inside of Person), B-LOC (Beginning of Location), and O (not a named entity).

BERT.png
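
To make the tokenize-then-encode step concrete, here is a minimal sketch (not part of the hypercube pipeline) that loads BERT through the Hugging Face transformers library, tokenizes a sentence into subword tokens, and inspects the contextual embeddings that a downstream NER head would classify:

In [ ]:
# a minimal sketch: tokenize a sentence with BERT and inspect its embeddings
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertModel.from_pretrained('bert-base-cased')
model.eval()

inputs = tokenizer("Jim bought 300 shares of Acme Corp. in 2006", return_tensors='pt')
print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0]))

with torch.no_grad():
    outputs = model(**inputs)
# one 768-dimensional contextual embedding per token, including [CLS] and [SEP]
print(outputs.last_hidden_state.shape)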

Perform General-purpose Named Entity Recognition (NER)

Here we first show how to use spaCy, a general-purpose NLP library, to perform NER and visualize the results. For installation instructions, please refer to spaCy's official documentation.

In [ ]:
# !pip install -U pip setuptools wheel
# !pip install -U spacy
# download trained pipeline in "efficiency" mode
!python -m spacy download en_core_web_sm
In [16]:
import os
import json
import spacy
from spacy import displacy    # visualizer
from spacy.tokens import Span
from tqdm import tqdm
from collections import defaultdict

# nlp = spacy.load("en_core_web_trf")     # accuracy
nlp = spacy.load("en_core_web_sm")    # efficiency

# set up output dir
out_dir = './output'
os.makedirs(out_dir, exist_ok=True)
In [58]:
data_list = []
with open('./data/news_samples.txt', encoding='utf-8') as f:
    readin = f.readlines()
    for line in tqdm(readin):
        data_list.append(line.strip())
print(f'Length of data_list: {len(data_list)}')
for text in data_list:
    print(text)
100%|████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<?, ?it/s]
Length of data_list: 3
Dozens, if not more than a hundred, Midland-area residents gathered to seek refuge within the walls of Midland High School Tuesday night after the Edenville Dam failed to hold back a deluge of water. Midland officials warned residents living near the Tittabawassee River to evacuate. They are concerned the Sanford Dam, located a few miles northwest of the city and downstream of the Edenville Dam, will also fail. Some drove to the school at 1301 Eastlawn Drive to seek shelter. Others were brought in by bus.
Videos and images captured by witnesses show just how much water was unleashed when Michigan's Edenville Dam failed. Officials had been warning nearby residents to evacuate all day Tuesday because of fears the hydroelectric dam holding back Wixom Lake would break. It was announced on Facebook around 6 p.m. Tuesday that the dam had failed -- and a torrent of water was rushing down the Tittabawassee River. The water's unrelenting flow continued overnight and daylight on Wednesday showed how little was left of the lake. An aerial image taken by a drone shows the Edenville dam breach on Wednesday.
Soaking rains from the remnants of Hurricane Ida prompted the evacuations of thousands of people Wednesday after water reached dangerous levels at a dam near Johnstown, PA. The storm moved east in the evening, with the National Weather Service confirming at least one tornado and social media posts showing homes blown to rubble and roofs torn from buildings in a southern New Jersey county just outside Philadelphia. Pennsylvania was blanketed with rain after high water drove some from their homes in Maryland and Virginia. The storm killed one person, two people were not accounted for, and a tornado was believed to have touched down along the Chesapeake Bay in Maryland. Ida caused countless school and business closures in Pennsylvania. About 150 roadways maintained by the Pennsylvania Department of Transportation were closed and many smaller roadways also were impassable.

In [59]:
for i, text in enumerate(tqdm(data_list)):
    doc = nlp(text)
    displacy.render(doc, style='ent', jupyter=True)
    
    entity_dict = defaultdict(int)
    for entity in doc.ents:
        if entity.label_ in ['LOC', 'GPE', 'FAC']:    # LOCation, GeoPolitical Entity (i.e. countries, cities, states), FACility
            entity_dict[entity.label_ + '_' + entity.text] += 1
            # text_seq = entity.text.split()
            # if text_seq[0].lower() in watershed_word or text_seq[-1].lower() in watershed_word:
            #     entity_dict[entity.text] += 1
    with open(os.path.join(out_dir, f'{i}_out.txt'), 'w') as fout:
        fout.write(json.dumps(entity_dict) + '\n')
  0%|                                                                                            | 0/3 [00:00<?, ?it/s]
Dozens CARDINAL , if not more than a hundred CARDINAL , Midland GPE -area residents gathered to seek refuge within the walls of Midland High School ORG Tuesday night TIME after the Edenville Dam ORG failed to hold back a deluge of water. Midland GPE officials warned residents living near the Tittabawassee River LOC to evacuate. They are concerned the Sanford Dam FAC , located a few miles QUANTITY northwest of the city and downstream of the Edenville Dam ORG , will also fail. Some drove to the school at 1301 DATE Eastlawn Drive FAC to seek shelter. Others were brought in by bus.
 33%|████████████████████████████                                                        | 1/3 [00:00<00:00,  8.80it/s]
Videos GPE and images captured by witnesses show just how much water was unleashed when Michigan GPE 's Edenville Dam FAC failed. Officials had been warning nearby residents to evacuate all day Tuesday DATE because of fears the hydroelectric dam holding back Wixom Lake PERSON would break. It was announced on Facebook around 6 p.m. TIME Tuesday DATE that the dam had failed -- and a torrent of water was rushing down the Tittabawassee River LOC . The water's unrelenting flow continued overnight and daylight on Wednesday DATE showed how little was left of the lake. An aerial image taken by a drone shows the Edenville ORG dam breach on Wednesday DATE .
 67%|████████████████████████████████████████████████████████                            | 2/3 [00:00<00:00,  9.26it/s]
Soaking rains from the remnants of Hurricane Ida ORG prompted the evacuations of thousands CARDINAL of people Wednesday DATE after water reached dangerous levels at a dam near Johnstown GPE , PA GPE . The storm moved east in the evening TIME , with the National Weather Service ORG confirming at least one CARDINAL tornado and social media posts showing homes blown to rubble and roofs torn from buildings in a southern New Jersey county GPE just outside Philadelphia GPE . Pennsylvania GPE was blanketed with rain after high water drove some from their homes in Maryland GPE and Virginia GPE . The storm killed one CARDINAL person, two CARDINAL people were not accounted for, and a tornado was believed to have touched down along the Chesapeake Bay LOC in Maryland GPE . Ida caused countless school and business closures in Pennsylvania GPE . About 150 CARDINAL roadways maintained by the Pennsylvania Department of Transportation ORG were closed and many smaller roadways also were impassable.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  8.59it/s]

Visualize extracted geo-entities

In [27]:
import requests
import folium
import matplotlib.pyplot as plt
%matplotlib inline
In [54]:
# load target results
target_news = 1
with open(f'./output/{target_news}_out.txt') as f:
    ner = f.read()
ner_list = ner.split('\n')
ner_num = ner_list[0]
ner_js = json.loads(ner_num)
ner_js
Out[54]:
{'GPE_Videos': 1,
 'GPE_Michigan': 1,
 'FAC_Edenville Dam': 1,
 'LOC_the Tittabawassee River': 1}
In [55]:
ner_class = {}
for key in ner_js.keys():
    class_ = key[:3]
    if class_ not in ner_class.keys():
        ner_class[class_] = {}

# need Google Maps API key
my_Google_Maps_API_key = ''
for key in ner_js.keys():
    class_, place_name = key.split('_', 1)    # split on the first underscore only, in case a place name contains one
    if place_name not in ner_class[class_].keys():
        response = requests.get(f'https://maps.googleapis.com/maps/api/geocode/json?address={place_name}&key={my_Google_Maps_API_key}')
        if response.json()['results']:
            ner_class[class_][place_name] = response.json()['results'][0]['geometry']['location']
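
If you do not have a Google Maps API key, one keyless alternative is the free Nominatim geocoder, shown in the sketch below (this assumes the geopy package is installed; the user agent string is arbitrary):

In [ ]:
# a sketch of a keyless alternative to the Google Maps Geocoding API
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent='hypercube-tutorial')    # hypothetical user agent
location = geolocator.geocode('Tittabawassee River')
if location is not None:
    print(location.latitude, location.longitude)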
In [60]:
m = folium.Map(
    location=[38, -97],
    tiles="cartodbpositron",
    zoom_start=5,
)

# add one marker per geocoded entity, color-coded by class: LOC = red, FAC = green, GPE = blue
class_colors = {'LOC': 'red', 'FAC': 'green', 'GPE': 'blue'}
for class_, color in class_colors.items():
    for key, loc in ner_class.get(class_, {}).items():
        folium.Marker([loc['lat'], loc['lng']], popup=key, icon=folium.Icon(color=color)).add_to(m)
m
Out[60]:
[Interactive folium map with the geocoded LOC, FAC, and GPE entities shown as color-coded markers]

Construct Hypercube: Geospatial Dimension

Although spaCy offers a generic tool for general-purpose Named Entity Recognition (NER), it may not be the optimal solution for extracting geographic information from a text corpus. Specifically, there is a line of research focused on extracting location-related information, or toponyms, from raw text. This sub-task of information extraction (IE) is known as geoparsing. Unlike general NER, geoparsing must handle toponyms that are fine-grained and do not appear in gazetteers (location dictionaries), as well as locations described by phrases, abbreviations, or local landmarks. Here we implement a state-of-the-art (SOTA) geoparser called TopoBERT [1]. By fine-tuning the pre-trained language model (PLM) BERT, TopoBERT achieves SOTA performance on the geoparsing task.

[1] TopoBERT: Plug and Play Toponym Recognition Module Harnessing Fine-tuned BERT. ArXiv'23

[2] Geospatial Knowledge Hypercube. SIGSPATIAL'23
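
To make the tagging scheme concrete, the cell below shows the kind of BIO output a toponym recognizer is expected to produce; the tokens and tags are illustrative, not model output:

In [ ]:
# illustrative BIO tags for fine-grained toponyms unlikely to appear in a gazetteer
tokens = ['The', 'Sanford', 'Dam', 'is', 'downstream', 'of', 'the', 'Edenville', 'Dam']
tags   = ['O',   'B-LOC',   'I-LOC', 'O', 'O',         'O',  'O',   'B-LOC',    'I-LOC']
for token, tag in zip(tokens, tags):
    print(f'{token}\t{tag}')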

(Optional) Fine-tuning Geoparser

Given computational resource and time limitations, we skip the training of the TopoBERT model here and directly load the pre-trained model weights. If you are interested in the fine-tuning process, please refer to the source code in ./geoparser. A simplified sketch of what such fine-tuning involves follows below.
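
The cell below is a minimal sketch of fine-tuning BERT for token classification with the Hugging Face API. It is a simplified stand-in for illustration, not the actual TopoBERT training code in ./geoparser; the example sentence and labels are made up:

In [ ]:
# a minimal sketch of BERT token-classification fine-tuning (one toy step)
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

labels = ['O', 'B-LOC', 'I-LOC']
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
model = BertForTokenClassification.from_pretrained('bert-base-cased', num_labels=len(labels))
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# one toy labeled sentence; a real run loops over batches and epochs
words = ['Flooding', 'near', 'Sanford', 'Dam']
word_labels = [0, 0, 1, 2]    # O O B-LOC I-LOC

enc = tokenizer(words, is_split_into_words=True, return_tensors='pt')
# align word-level labels to subword tokens; -100 is ignored by the loss
label_ids = [-100 if wid is None else word_labels[wid] for wid in enc.word_ids()]
enc['labels'] = torch.tensor([label_ids])

model.train()
loss = model(**enc).loss
loss.backward()
optimizer.step()
print(float(loss))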

In [5]:
import os
import json
import re
import queue
import nltk
from nltk import word_tokenize
from geoparser.backbone_models import *
from geoparser.dataset_process import *
In [6]:
nltk.download('punkt')
DEFAULT_TOPOBERT_PATH = './geoparser/topobert_cnn1d/'
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\znwang\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
In [7]:
class TopoBERT:
    def __init__(self, model_dir: str = DEFAULT_TOPOBERT_PATH):
        '''
        Args:
            model_dir (str): Directory that stores all the model files; defaults to DEFAULT_TOPOBERT_PATH.
        '''
        try:
            self.model , self.tokenizer, self.model_config, self.training_config = self.load_model(model_dir)
            self.label_map = {"1": "O", "2": "B-LOC", "3": "I-LOC", "4": "[CLS]", "5": "[SEP]"}
            self.max_seq_length = self.training_config["--max_seq_length"]
            self.label_map = {int(k): v for k, v in self.label_map.items()}
            #self.device = "cuda:3" if torch.cuda.is_available() else "cpu"
            self.device = 'cpu'
            # Add model to device and set eval mode:
            self.model = self.model.to(self.device)
            self.model.eval()
        except Exception as e:
            print(e)

    def load_model(self, model_dir: str, model_config: str = "model_config.json"):
        # Load model config:
        model_config_file = os.path.join(model_dir, model_config)
        current_model_config = json.load(open(model_config_file))
        # Init model and load pretrained params:
        #model = BertSimpleNer.from_pretrained(model_dir)
        model = BertCNN1DNer.from_pretrained(model_dir, model_config=current_model_config)
        # Get training config:
        current_training_config = os.path.join(model_dir, 'train_config.json')
        current_training_config = json.load(open(current_training_config))
        # Init tokenizer:
        tokenizer = BertTokenizer.from_pretrained(model_dir, do_lower_case=current_training_config['--do_lower_case'])

        # Return all params:
        return model, tokenizer, current_model_config, current_training_config


    def tokenize(self, text: str):
        """ tokenize input"""
        words = word_tokenize(text)
        tokens = []
        valid_positions = []
        for word in words:
            token = self.tokenizer.tokenize(word)
            tokens.extend(token)
            for i in range(len(token)):
                if i == 0:
                    valid_positions.append(1)
                else:
                    valid_positions.append(0)
        return tokens, valid_positions


    def preprocess(self, text: str):
        """ preprocess """
        tokens, valid_positions = self.tokenize(text)
        ## insert "[CLS]"
        tokens.insert(0, "[CLS]")
        valid_positions.insert(0, 1)
        ## insert "[SEP]"
        tokens.append("[SEP]")
        valid_positions.append(1)
        segment_ids = []
        for i in range(len(tokens)):
            segment_ids.append(0)
        input_ids = self.tokenizer.convert_tokens_to_ids(tokens)
        input_mask = [1] * len(input_ids)
        while len(input_ids) < self.max_seq_length: # Padding
            input_ids.append(0)
            input_mask.append(0)
            segment_ids.append(0)
            valid_positions.append(0)
        return input_ids,input_mask,segment_ids,valid_positions


    def prettify_result(self, org_result):
        '''
        Take the original result list from TopoBERT and produce a prettified output.
        Args:
            org_result (list): list of predicted results for each token.

        Returns:
            Prettified result output with extra information.
        '''
        combined_addresses = []  # A list of all address
        address_results = []  # A list of all predicted LOC
        full_address = ''  # Link all addresses in combined_addresses
        tmp_queue = queue.Queue(maxsize=20)
        tmp_address = ''
        for index, content in enumerate(org_result):
            if content['tag'] == 'B-LOC':  # If B-LOC, clear, save and enqueue
                # If not empty, empty it and save data:
                if not tmp_queue.empty():
                    # Get all content out first:
                    while not tmp_queue.empty():
                        tmp_address += str(tmp_queue.get()) + ' '
                    tmp_address = tmp_address.strip()
                    combined_addresses.append(tmp_address)
                    tmp_address = ''
                # Enqueue
                tmp_queue.put(content['word'].strip())
                # Save location entity:
                address_results.append(content['word'].strip())
            elif content['tag'] == 'I-LOC':  # If I-LOC, enqueue directly
                # Enqueue
                tmp_queue.put(content['word'].strip())
                # Save location entity:
                address_results.append(content['word'].strip())
            else:  # Else, clear and save
                if not tmp_queue.empty():
                    # Get all content out first:
                    while not tmp_queue.empty():
                        tmp_address += str(tmp_queue.get()) + ' '
                    tmp_address = tmp_address.strip()
                    combined_addresses.append(tmp_address)
                    tmp_address = ''
        # Deal with remaining data:
        if not tmp_queue.empty():
            # Get all content out first:
            while not tmp_queue.empty():
                tmp_address += str(tmp_queue.get()) + ' '
            tmp_address = tmp_address.strip()
            combined_addresses.append(tmp_address)
            tmp_address = ''

        # Get Full address:
        for add_content in combined_addresses:
            full_address += ' ' + add_content
        full_address = full_address.strip()

        # Construct output result:
        result_dict = {
            'combined_addresses': combined_addresses,
            'full_address': full_address,
            'address_results': address_results
        }

        return result_dict


    def predict(self, text: str):
        input_ids, input_mask, segment_ids, valid_ids = self.preprocess(text)
        input_ids = torch.tensor([input_ids], dtype=torch.long, device=self.device)
        input_mask = torch.tensor([input_mask], dtype=torch.long, device=self.device)
        segment_ids = torch.tensor([segment_ids], dtype=torch.long, device=self.device)
        valid_ids = torch.tensor([valid_ids], dtype=torch.long, device=self.device)
        with torch.no_grad():
            logits = self.model(input_ids, segment_ids, input_mask, None, valid_ids, None)
        logits = F.softmax(logits,dim=2)
        logits_label = torch.argmax(logits,dim=2)
        logits_label = logits_label.detach().cpu().numpy().tolist()[0]

        logits_confidence = [values[label].item() for values, label in zip(logits[0], logits_label)]

        logits = []

        pos = 0
        for index, mask in enumerate(valid_ids[0]):
            if index == 0:
                continue
            if mask == 1:
                logits.append((logits_label[index-pos], logits_confidence[index-pos]))
            else:
                pos += 1
        logits.pop()

        labels = [(self.label_map[label], confidence) for label, confidence in logits]
        words = word_tokenize(text)
        assert len(labels) == len(words)
        output = [{"word":word, "tag":label, "confidence":confidence} for word, (label, confidence) in zip(words, labels)]
        prettified_result = self.prettify_result(output)
        result_dict = prettified_result
        result_dict['org_result'] = output

        return result_dict

HarveyTweet2017

HarveyTweet2017 is a labeled Twitter dataset originally collected by the University of North Texas. Each tweet contains a location description relevant to disaster rescue. Below are some examples from HarveyTweet2017.

  • “12 Y/O BOY NEEDs RESCUED! 8100 Cypresswood Dr Spring TX 77379 They are trapped on second story! #houstonflood”
  • “80 people stranded in a church!! 5547 Cavalcade St, Houston, TX 77026 #harveyrescue #hurricaneharvey”
  • “Rescue needed: 2907 Trinity Drive, Pearland, Tx. Need boat rescue 3 people, 2 elderly one is 90 not steady in her feet & cant swim. #Harvey”
  • “Community is responding at shelters in College Park High School and Magnolia High School #TheWoodlands #Harvey...”
  • “#Houston #HoustonFlood the intersection of I-45 & N. Main Street...”
In [17]:
test_text = '''12 Y/O BOY NEEDs RESCUED! 8100 Cypresswood Dr Spring TX 77379 They are trapped on second story! #houstonflood'''
current_geoparser = TopoBERT()
result = current_geoparser.predict(test_text)
print(result)
Some weights of the model checkpoint at bert-large-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.weight', 'bert.pooler.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
{'combined_addresses': [], 'full_address': '', 'address_results': [], 'org_result': [{'word': '12', 'tag': '[CLS]', 'confidence': 0.4642878472805023}, {'word': 'Y/O', 'tag': '[CLS]', 'confidence': 0.4642651379108429}, {'word': 'BOY', 'tag': '[CLS]', 'confidence': 0.46427053213119507}, {'word': 'NEEDs', 'tag': '[CLS]', 'confidence': 0.4642644226551056}, {'word': 'RESCUED', 'tag': '[CLS]', 'confidence': 0.4642713963985443}, {'word': '!', 'tag': '[CLS]', 'confidence': 0.46427756547927856}, {'word': '8100', 'tag': '[CLS]', 'confidence': 0.46427905559539795}, {'word': 'Cypresswood', 'tag': '[CLS]', 'confidence': 0.46427053213119507}, {'word': 'Dr', 'tag': '[CLS]', 'confidence': 0.4642679989337921}, {'word': 'Spring', 'tag': '[CLS]', 'confidence': 0.4642948508262634}, {'word': 'TX', 'tag': '[CLS]', 'confidence': 0.4642651379108429}, {'word': '77379', 'tag': '[CLS]', 'confidence': 0.464277982711792}, {'word': 'They', 'tag': '[CLS]', 'confidence': 0.4642740786075592}, {'word': 'are', 'tag': '[CLS]', 'confidence': 0.4642642140388489}, {'word': 'trapped', 'tag': '[CLS]', 'confidence': 0.46429166197776794}, {'word': 'on', 'tag': '[CLS]', 'confidence': 0.4642685353755951}, {'word': 'second', 'tag': '[CLS]', 'confidence': 0.46426165103912354}, {'word': 'story', 'tag': '[CLS]', 'confidence': 0.46427714824676514}, {'word': '!', 'tag': '[CLS]', 'confidence': 0.46427789330482483}, {'word': '#', 'tag': '[CLS]', 'confidence': 0.4642692506313324}, {'word': 'houstonflood', 'tag': '[CLS]', 'confidence': 0.4642810523509979}]}

Construct Hypercube: Topic Dimension

Current text classification methods typically require a large number of human-labeled documents as training data, which can be costly and difficult to obtain in real applications. Humans, by contrast, can perform classification without seeing any labeled examples, based only on a small set of words describing the categories to be classified. Motivated by this, we implement a state-of-the-art (SOTA) weakly supervised document classification model, LOTClass [1], which uses only the label name of each class to train classification models on unlabeled data, without using any labeled documents. The pre-trained language model (PLM) BERT serves both as a general linguistic knowledge source for category understanding and as a representation learning model for document classification.

[1] Text Classification using Label Names Only: A Language Model Self-training Approach. EMNLP'20

LOTClass.png
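
The key ingredient behind category understanding is BERT's masked language model (MLM) head: given a label name (or a mask) in context, the MLM predicts words that could fill the slot, which seeds a category vocabulary. The cell below is a minimal sketch of that idea, a simplified stand-in rather than the LOTClass implementation; the example sentence is made up:

In [ ]:
# a minimal sketch: use BERT's MLM head to suggest category-indicative words
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
mlm = BertForMaskedLM.from_pretrained('bert-base-uncased')
mlm.eval()

text = 'The dam failure damaged local [MASK] such as roads and bridges.'
inputs = tokenizer(text, return_tensors='pt')
mask_pos = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = mlm(**inputs).logits
top_ids = logits[0, mask_pos].topk(10).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))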

(Optional) Training Document Classifier on Aging Dam

We collected a corpus of 188 Google News documents about dam failures, stored under the path ./data/new_aging_dam.txt. Our task here is to automatically discover the coherent topic of each chunk of a document, so we trained the LOTClass model with four labels given in natural language:

  • ecology
  • human
  • economy
  • infrastructure

Given computational resource and time limitations, we skip the training of the LOTClass model here and directly load the pre-trained model weights. If you are interested in the self-training process, please refer to the source code in ./classifier.

Word Expansion Results

  • ecology: 'ecosystem', 'ecosystems', 'wildlife', 'vegetation', 'forest', 'global', 'nature', 'park', 'climate', 'canopy', 'native', 'ecology', 'biodiversity', 'habitats', 'eco', 'fauna', 'rainforest', 'plant', 'indigenous', 'fragile', 'conservation', 'tree', 'society' ...
  • human: 'human', 'man', 'humans', 'individual', 'people', 'person', 'mankind', 'animal', 'normal', 'woman', 'civilian', 'humanity', 'civil', 'serious', 'player', 'moral', 'domestic', 'single', 'professional', 'mortal', 'live', 'american', 'male', 'internal', 'us', 'persons' ...
  • economy: 'economic', 'economically', 'financial', 'economical', 'economy', 'economics', 'commercial', 'agricultural', 'industrial', 'economies', 'business', 'uneven', 'international', 'entertainment', 'employment', 'impact', 'annual', 'income' ...
  • infrastructure: 'infrastructure', 'heritage', 'development', 'network', 'structure', 'equipment', 'technology', 'system', 'property', 'structures', 'architecture', 'construction', 'dam', 'education', 'embankment', 'grid', 'facilities', 'security', 'powerhouse', 'sanitation' ...
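
As a crude illustration of how these expanded vocabularies relate to topic assignment (simple keyword counting, not the LOTClass classifier), one could score a sentence by its overlap with each category's word set:

In [ ]:
# crude keyword-overlap scoring using a few words from the expansions above
category_vocab = {
    'ecology': {'ecosystem', 'wildlife', 'vegetation', 'habitats'},
    'human': {'people', 'person', 'civilian', 'humanity'},
    'economy': {'economic', 'business', 'income', 'employment'},
    'infrastructure': {'infrastructure', 'dam', 'grid', 'construction'},
}

def crude_topic(sentence):
    tokens = set(sentence.lower().split())
    scores = {c: len(tokens & vocab) for c, vocab in category_vocab.items()}
    return max(scores, key=scores.get)

print(crude_topic('The dam breach destroyed local infrastructure'))    # -> infrastructure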
In [80]:
# break each document into sentences
# note: the output file is opened in append mode, so re-running this cell
# appends duplicate sentences (which is why sentences repeat in the
# classification output further below)

with open('./data/news_samples.txt', 'r', encoding='utf-8') as f, \
     open('./data/news_sentences.txt', 'a', encoding='utf-8') as out:
    for line in f:
        doc = nlp(line)
        for sentence in doc.sents:
            out.write(sentence.text)
            out.write('\n')
In [81]:
# check classification results

## load and print the news documents for reference
with open('./data/news_samples.txt', 'r', encoding='utf-8') as f:
    for j, line in enumerate(f):
        print(line)
Dozens, if not more than a hundred, Midland-area residents gathered to seek refuge within the walls of Midland High School Tuesday night after the Edenville Dam failed to hold back a deluge of water. Midland officials warned residents living near the Tittabawassee River to evacuate. They are concerned the Sanford Dam, located a few miles northwest of the city and downstream of the Edenville Dam, will also fail. Some drove to the school at 1301 Eastlawn Drive to seek shelter. Others were brought in by bus.

Videos and images captured by witnesses show just how much water was unleashed when Michigan's Edenville Dam failed. Officials had been warning nearby residents to evacuate all day Tuesday because of fears the hydroelectric dam holding back Wixom Lake would break. It was announced on Facebook around 6 p.m. Tuesday that the dam had failed -- and a torrent of water was rushing down the Tittabawassee River. The water's unrelenting flow continued overnight and daylight on Wednesday showed how little was left of the lake. An aerial image taken by a drone shows the Edenville dam breach on Wednesday.

Soaking rains from the remnants of Hurricane Ida prompted the evacuations of thousands of people Wednesday after water reached dangerous levels at a dam near Johnstown, PA. The storm moved east in the evening, with the National Weather Service confirming at least one tornado and social media posts showing homes blown to rubble and roofs torn from buildings in a southern New Jersey county just outside Philadelphia. Pennsylvania was blanketed with rain after high water drove some from their homes in Maryland and Virginia. The storm killed one person, two people were not accounted for, and a tornado was believed to have touched down along the Chesapeake Bay in Maryland. Ida caused countless school and business closures in Pennsylvania. About 150 roadways maintained by the Pennsylvania Department of Transportation were closed and many smaller roadways also were impassable.

In [82]:
label_dict = {0: 'ecology',
              1: 'human',
              2: 'economy',
              3: 'infrastructure'}

with open('./data/news_sentences.txt', 'r', encoding='utf-8') as doc:
    with open('./classifier/out.txt', 'r', encoding='utf-8') as label:
        for j, (sentence, lab) in enumerate(zip(doc, label)):
            if sentence == '\n':    # skip blank lines between documents
                continue
            print(j+1, sentence, label_dict[int(lab[0])], '\n')
1 Dozens, if not more than a hundred, Midland-area residents gathered to seek refuge within the walls of Midland High School Tuesday night after the Edenville Dam failed to hold back a deluge of water.
 infrastructure 

2 Midland officials warned residents living near the Tittabawassee River to evacuate.
 human 

3 They are concerned the Sanford Dam, located a few miles northwest of the city and downstream of the Edenville Dam, will also fail.
 infrastructure 

4 Some drove to the school at 1301 Eastlawn Drive to seek shelter.
 infrastructure 

5 Others were brought in by bus.
 infrastructure 

7 Videos and images captured by witnesses show just how much water was unleashed when Michigan's Edenville Dam failed.
 infrastructure 

8 Officials had been warning nearby residents to evacuate all day Tuesday because of fears the hydroelectric dam holding back Wixom Lake would break.
 infrastructure 

9 It was announced on Facebook around 6 p.m. Tuesday that the dam had failed -- and a torrent of water was rushing down the Tittabawassee River.
 infrastructure 

10 The water's unrelenting flow continued overnight and daylight on Wednesday showed how little was left of the lake.
 ecology 

11 An aerial image taken by a drone shows the Edenville dam breach on Wednesday.
 infrastructure 

13 Soaking rains from the remnants of Hurricane Ida prompted the evacuations of thousands of people Wednesday after water reached dangerous levels at a dam near Johnstown, PA.
 human 

14 The storm moved east in the evening, with the National Weather Service confirming at least one tornado and social media posts showing homes blown to rubble and roofs torn from buildings in a southern New Jersey county just outside Philadelphia.
 human 

15 Pennsylvania was blanketed with rain after high water drove some from their homes in Maryland and Virginia.
 human 

16 The storm killed one person, two people were not accounted for, and a tornado was believed to have touched down along the Chesapeake Bay in Maryland.
 human 

17 Ida caused countless school and business closures in Pennsylvania.
 economy 

18 About 150 roadways maintained by the Pennsylvania Department of Transportation were closed and many smaller roadways also were impassable.
 infrastructure 

20 Dozens, if not more than a hundred, Midland-area residents gathered to seek refuge within the walls of Midland High School Tuesday night after the Edenville Dam failed to hold back a deluge of water.
 human 

21 Midland officials warned residents living near the Tittabawassee River to evacuate.
 human 

22 They are concerned the Sanford Dam, located a few miles northwest of the city and downstream of the Edenville Dam, will also fail.
 human 

23 Some drove to the school at 1301 Eastlawn Drive to seek shelter.
 human 

24 Others were brought in by bus.
 human 

26 Videos and images captured by witnesses show just how much water was unleashed when Michigan's Edenville Dam failed.
 human 

27 Officials had been warning nearby residents to evacuate all day Tuesday because of fears the hydroelectric dam holding back Wixom Lake would break.
 infrastructure 

28 It was announced on Facebook around 6 p.m. Tuesday that the dam had failed -- and a torrent of water was rushing down the Tittabawassee River.
 human 

29 The water's unrelenting flow continued overnight and daylight on Wednesday showed how little was left of the lake.
 human 

30 An aerial image taken by a drone shows the Edenville dam breach on Wednesday.
 human 

32 Soaking rains from the remnants of Hurricane Ida prompted the evacuations of thousands of people Wednesday after water reached dangerous levels at a dam near Johnstown, PA.
 human 

33 The storm moved east in the evening, with the National Weather Service confirming at least one tornado and social media posts showing homes blown to rubble and roofs torn from buildings in a southern New Jersey county just outside Philadelphia.
 human 

34 Pennsylvania was blanketed with rain after high water drove some from their homes in Maryland and Virginia.
 human 

35 The storm killed one person, two people were not accounted for, and a tornado was believed to have touched down along the Chesapeake Bay in Maryland.
 human 

36 Ida caused countless school and business closures in Pennsylvania.
 human 

37 About 150 roadways maintained by the Pennsylvania Department of Transportation were closed and many smaller roadways also were impassable.
 human 
