In this notebook, we will go through how to use Google BigQuery to access the GDELT project data, along with some simple usage examples.
Useful References:
!pip install --upgrade 'google-cloud-bigquery[bqstorage,pandas]'
!pip install db-dtypes
!pip install basemap
!pip install networkx
import os
import sys
import math
import numpy as np
import geopandas as gpd
from google.cloud import bigquery
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import networkx as nx
%load_ext google.cloud.bigquery
Authenticate with Google Cloud:
You need to authenticate your Python code to access Google Cloud resources. You can do this by creating a service account and downloading the JSON key file.
# Point the BigQuery client at a service-account key file.
# GOOGLE_APPLICATION_CREDENTIALS must hold the path to the JSON key
# downloaded from the Google Cloud console — replace the value below
# with the path to your own key file.
# (The original cell set a "path-to-CREDENTIALS.json" placeholder and then
# immediately overwrote it; a single assignment is enough. `os` is already
# imported at the top of the notebook.)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "iguide-gdelt.json"

# Construct a BigQuery client object; it picks up the credentials
# from the environment variable set above.
client = bigquery.Client()
# Count multiple births (plurality > 1) per year in the public natality
# sample table, most frequent combinations first.
natality_sql = """
SELECT
plurality,
COUNT(1) AS count,
year
FROM
`bigquery-public-data.samples.natality`
WHERE
NOT IS_NAN(plurality) AND plurality > 1
GROUP BY
plurality, year
ORDER BY
count DESC
"""

# Run the query and materialize the result as a pandas DataFrame.
df = client.query(natality_sql).to_dataframe()
df.head()
The database structure is project --> dataset --> table. The project id for the GDELT project in Google BigQuery is "gdelt-bq".
First, we investigate what datasets and tables are accessible.
# Project id of the GDELT project on BigQuery
project_id = "gdelt-bq"

# Client scoped to the GDELT project so listing calls resolve against it
client = bigquery.Client(project=project_id)

# Enumerate every dataset in the project, then the tables inside each one
datasets = list(client.list_datasets())
for dataset in datasets:
    print(f"Dataset: {dataset.dataset_id}")
    table_names = [tbl.table_id for tbl in client.list_tables(dataset)]
    print(f"Tables: {', '.join(table_names)}\n--------")
There are two ways to check the tables in a given dataset: you can either use the BigQuery client library directly, or the magic function %%bigquery. The magic function %%bigquery takes SQL as input and generates a pandas DataFrame directly.
a. BigQuery client library
# Inspect the "full" dataset (resolved against the client's project, gdelt-bq)
# NOTE(review): Client.dataset() is deprecated in newer google-cloud-bigquery
# releases; bigquery.DatasetReference("gdelt-bq", "full") is the modern
# spelling — confirm against the installed library version.
dataset_ref = client.dataset("full")
dataset = client.get_dataset(dataset_ref)

# Print every table id in the dataset, one per line
tables = list(client.list_tables(dataset))
print("\n".join(tbl.table_id for tbl in tables))
# Use a default Client() here rather than project="gdelt-bq" — the client's
# project is the one the query job runs (and is billed) under; we only need
# read access to the public gdelt-bq tables named in the SQL itself.
# (Presumably this is what the original warning about "creating a new table"
# meant — confirm against google-cloud-bigquery job semantics.)
client = bigquery.Client()
# List all tables in the `full` dataset via its INFORMATION_SCHEMA view.
sql = """
SELECT * FROM `gdelt-bq`.full.INFORMATION_SCHEMA.TABLES;
"""
table_df_1 = client.query(sql).to_dataframe()
table_df_1.head()
b. magic function %%bigquery
%%bigquery
SELECT * FROM `gdelt-bq.full.INFORMATION_SCHEMA.TABLES`
%%bigquery table_df_2
SELECT * FROM `gdelt-bq.full.INFORMATION_SCHEMA.TABLES`
# bigquery result saved to pandas table
table_df_2.head()
Bigquery magic functions can be used to check dataset schema.
%%bigquery
SELECT * FROM `gdelt-bq.full.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'events';
%%bigquery
SELECT * FROM `gdelt-bq.full.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
WHERE table_name = 'events';
%%bigquery
SELECT *
FROM `gdelt-bq.full.events`
LIMIT 5
%%bigquery usa_event_df
SELECT Actor1CountryCode, Actor2CountryCode, count(*) as count
FROM `gdelt-bq.full.events`
WHERE ((Actor1CountryCode = 'USA' and Actor2CountryCode IS NOT NULL)
or (Actor2CountryCode = 'USA' and Actor1CountryCode IS NOT NULL))
and Actor1CountryCode != Actor2CountryCode
and Year = 2022
GROUP BY 1,2
usa_event_df
# Collapse directed country pairs into undirected ones: (A, B) and (B, A)
# map to the same sorted tuple, so their event counts can be summed.
usa_event_df['sorted_pair'] = [
    tuple(sorted(pair))
    for pair in usa_event_df[['Actor1CountryCode', 'Actor2CountryCode']].itertuples(index=False, name=None)
]

# Total event count per undirected pair, as a flat DataFrame
grouped_df = usa_event_df.groupby('sorted_pair', as_index=False)['count'].sum()
# Largest pair count on a log scale; used to normalize line transparency.
max_val = np.log10(grouped_df['count'].max())

def get_alpha(count):
    '''
    Convert an event count to a matplotlib alpha value.

    Log-scaled so heavily-reported pairs do not wash out the rest;
    the result lies in [0, 0.25].
    '''
    # Guard: if every pair has count 1, max_val is log10(1) == 0 and the
    # division below would produce NaN / a zero-division — in that case
    # every line is equally (fully) weighted.
    if max_val <= 0:
        return 0.25
    scale = np.log10(count)
    return (scale / max_val) * 0.25
# Draw great-circle arcs between the USA and each partner country, with line
# transparency proportional to (log-scaled) event count.
plt.figure(figsize=(18,18))
# Load the world shape map
# NOTE(review): gpd.datasets.get_path was deprecated and removed in
# geopandas 1.0 — on newer versions the naturalearth_lowres file must be
# downloaded separately; confirm the installed geopandas version.
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
# Country centroids, keyed by ISO-3 country code so they can be looked up
# from the Actor*CountryCode values in sorted_pair.
# NOTE(review): .centroid on a geographic (lat/lon) CRS triggers a shapely
# warning; the values are presumably close enough for this visualization.
world['centroid'] = world.centroid
centroids = world.set_index('iso_a3')['centroid']
# Mollweide projection covering (most of) the globe
event_map = Basemap(projection='moll',
llcrnrlat=-80,urcrnrlat=80,
llcrnrlon=-180,urcrnrlon=180,
lon_0=0, lat_0=0)
event_map.drawcoastlines()
event_map.drawcountries()
event_map.fillcontinents(color='0.6')
event_map.drawmapboundary()
# One red arc per undirected country pair; pairs whose code is missing from
# the shapefile index are silently skipped.
for index, row in grouped_df.iterrows():
    country1, country2 = row['sorted_pair']
    if country1 in centroids.index and country2 in centroids.index:
        point1 = centroids.loc[country1]
        point2 = centroids.loc[country2]
        # Convert centroids to lat-long for Basemap
        # (Point.xy returns a pair of coordinate arrays, hence the [0] below)
        lon1, lat1 = point1.xy
        lon2, lat2 = point2.xy
        # log-scaled alpha in [0, 0.25] from the pair's event count
        line_alpha = get_alpha(row['count'])
        event_map.drawgreatcircle(lon1[0], lat1[0], lon2[0], lat2[0],
                                  linewidth=2,
                                  color='r',
                                  alpha=line_alpha)
plt.show()
%%bigquery
SELECT * FROM `gdelt-bq.gdeltv2.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
WHERE table_name = 'geg_gcnlapi';
%%bigquery entity_graph
SELECT Source, Target, Count RawCount, "Undirected" Type, ( Count/SUM(Count) OVER () ) Weight FROM (
SELECT a.entity Source, b.entity Target, COUNT(*) as Count
FROM (
(SELECT url, entities.name entity FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entities
where entities.mid is not null and entities.type = 'PERSON'
and date >= "2019-02-05 00:00:00" AND date < "2019-02-06 00:00:00")
) a
JOIN (
(SELECT url, entities.name entity FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entities
where entities.mid is not null and entities.type = 'PERSON'
and date >= "2019-02-05 00:00:00" AND date < "2019-02-06 00:00:00")
) b
ON a.url=b.url
WHERE a.entity<b.entity
GROUP BY 1,2
ORDER BY 3 DESC
LIMIT 1500
)
order by Count Desc
# Peek at the raw person co-occurrence edges
entity_graph.head()
# Keep only edges in which one endpoint is Donald Trump
entity_graph_filtered = entity_graph[
    entity_graph['Source'].eq('Donald Trump') | entity_graph['Target'].eq('Donald Trump')
]
entity_graph_filtered
# Build an undirected weighted graph from the filtered edge list;
# each (Source, Target, Weight) row becomes one edge with a 'weight' attribute.
g = nx.Graph()
g.add_weighted_edges_from(
    entity_graph_filtered[['Source', 'Target', 'Weight']].itertuples(index=False, name=None)
)
# Set up the figure with a white background and no axes
plt.figure(figsize=(20, 15))
plt.title('People most associated in global news with Donald Trump', y=0.97, fontsize=20, fontweight='bold')
ax = plt.gca()
ax.set_axis_off()

# Restrict the drawing to the largest connected component
gmax = max((g.subgraph(c) for c in nx.connected_components(g)), key=len)

# Per-node degree (node sizes) and betweenness centrality (node colors).
# Both views iterate in graph node order, so the lists built from them align.
d = nx.degree(gmax)
b = nx.betweenness_centrality(gmax).items()

# Arrange all nodes on a circle
pos = nx.circular_layout(gmax)

# Edge widths: normalize weights to [0, 1], then scale by 20 per edge.
# BUG FIX: the original `normalized_weights * 20` replicated the Python
# list 20 times instead of scaling each width.
edge_weights = nx.get_edge_attributes(gmax, 'weight')
max_weight = max(edge_weights.values())
edge_widths = [20 * w / max_weight for w in edge_weights.values()]

# (the invalid nodelist= argument to draw_networkx_edges was dropped;
# edge width order matches the default edgelist = gmax.edges())
nx.draw_networkx_edges(gmax, pos, alpha=0.9, width=edge_widths)
nx.draw_networkx_nodes(gmax, pos, nodelist=[v[0] for v in b],
                       node_size=[v[1] * 20 for v in d],
                       node_color=[math.log(1 + v[1] * 20) for v in b],
                       cmap=plt.cm.Blues)
nx.draw_networkx_labels(gmax, pos, labels={v[0]: v[0] for v in b})

# Display the graph
plt.show()