RIF Data Wrangling - GDELT

In this notebook, We will go through how to use Google BigQuery to access the GDELT project data, and some simple usage cases.

Useful References:

In [2]:
!pip install --upgrade 'google-cloud-bigquery[bqstorage,pandas]'
!pip install db-dtypes
!pip install basemap
!pip install networkx
Defaulting to user installation because normal site-packages is not writeable
Requirement already up-to-date: google-cloud-bigquery[bqstorage,pandas] in /home/jovyan/.local/geoai-py3-0.8.0/lib/python3.6/site-packages (3.2.0)
Requirement already satisfied, skipping upgrade: protobuf<4.0.0dev,>=3.12.0 in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from google-cloud-bigquery[bqstorage,pandas]) (3.12.2)
Requirement already satisfied, skipping upgrade: python-dateutil<3.0dev,>=2.7.2 in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from google-cloud-bigquery[bqstorage,pandas]) (2.8.1)
Requirement already satisfied, skipping upgrade: grpcio<2.0dev,>=1.38.1 in /home/jovyan/.local/geoai-py3-0.8.0/lib/python3.6/site-packages (from google-cloud-bigquery[bqstorage,pandas]) (1.48.2)
Requirement already satisfied, skipping upgrade: google-cloud-bigquery-storage<3.0.0dev,>=2.0.0 in /home/jovyan/.local/geoai-py3-0.8.0/lib/python3.6/site-packages (from google-cloud-bigquery[bqstorage,pandas]) (2.13.2)
Requirement already satisfied, skipping upgrade: google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0dev,>=1.31.5 in /home/jovyan/.local/geoai-py3-0.8.0/lib/python3.6/site-packages (from google-cloud-bigquery[bqstorage,pandas]) (2.8.2)
Requirement already satisfied, skipping upgrade: pyarrow<9.0dev,>=3.0.0 in /home/jovyan/.local/geoai-py3-0.8.0/lib/python3.6/site-packages (from google-cloud-bigquery[bqstorage,pandas]) (6.0.1)
Requirement already satisfied, skipping upgrade: google-cloud-core<3.0.0dev,>=1.4.1 in /home/jovyan/.local/geoai-py3-0.8.0/lib/python3.6/site-packages (from google-cloud-bigquery[bqstorage,pandas]) (2.3.1)
Requirement already satisfied, skipping upgrade: packaging<22.0.0dev,>=14.3 in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from google-cloud-bigquery[bqstorage,pandas]) (20.4)
Requirement already satisfied, skipping upgrade: proto-plus<2.0.0dev,>=1.15.0 in /home/jovyan/.local/geoai-py3-0.8.0/lib/python3.6/site-packages (from google-cloud-bigquery[bqstorage,pandas]) (1.22.3)
Requirement already satisfied, skipping upgrade: requests<3.0.0dev,>=2.18.0 in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from google-cloud-bigquery[bqstorage,pandas]) (2.24.0)
Requirement already satisfied, skipping upgrade: google-resumable-media<3.0dev,>=0.6.0 in /home/jovyan/.local/geoai-py3-0.8.0/lib/python3.6/site-packages (from google-cloud-bigquery[bqstorage,pandas]) (2.3.3)
Requirement already satisfied, skipping upgrade: db-dtypes<2.0.0dev,>=0.3.0; extra == "pandas" in /home/jovyan/.local/geoai-py3-0.8.0/lib/python3.6/site-packages (from google-cloud-bigquery[bqstorage,pandas]) (1.0.2)
Requirement already satisfied, skipping upgrade: pandas>=1.0.0; extra == "pandas" in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from google-cloud-bigquery[bqstorage,pandas]) (1.0.5)
Requirement already satisfied, skipping upgrade: setuptools in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from protobuf<4.0.0dev,>=3.12.0->google-cloud-bigquery[bqstorage,pandas]) (49.1.0.post20200704)
Requirement already satisfied, skipping upgrade: six>=1.9 in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from protobuf<4.0.0dev,>=3.12.0->google-cloud-bigquery[bqstorage,pandas]) (1.15.0)
Requirement already satisfied, skipping upgrade: google-auth<3.0dev,>=1.25.0 in /home/jovyan/.local/geoai-py3-0.8.0/lib/python3.6/site-packages (from google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0dev,>=1.31.5->google-cloud-bigquery[bqstorage,pandas]) (2.22.0)
Requirement already satisfied, skipping upgrade: googleapis-common-protos<2.0dev,>=1.56.2 in /home/jovyan/.local/geoai-py3-0.8.0/lib/python3.6/site-packages (from google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0dev,>=1.31.5->google-cloud-bigquery[bqstorage,pandas]) (1.56.3)
Requirement already satisfied, skipping upgrade: grpcio-status<2.0dev,>=1.33.2; extra == "grpc" in /home/jovyan/.local/geoai-py3-0.8.0/lib/python3.6/site-packages (from google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0dev,>=1.31.5->google-cloud-bigquery[bqstorage,pandas]) (1.59.0)
Requirement already satisfied, skipping upgrade: numpy>=1.16.6 in /home/jovyan/.local/geoai-py3-0.8.0/lib/python3.6/site-packages (from pyarrow<9.0dev,>=3.0.0->google-cloud-bigquery[bqstorage,pandas]) (1.19.5)
Requirement already satisfied, skipping upgrade: pyparsing>=2.0.2 in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from packaging<22.0.0dev,>=14.3->google-cloud-bigquery[bqstorage,pandas]) (2.4.7)
Requirement already satisfied, skipping upgrade: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from requests<3.0.0dev,>=2.18.0->google-cloud-bigquery[bqstorage,pandas]) (1.25.9)
Requirement already satisfied, skipping upgrade: idna<3,>=2.5 in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from requests<3.0.0dev,>=2.18.0->google-cloud-bigquery[bqstorage,pandas]) (2.10)
Requirement already satisfied, skipping upgrade: certifi>=2017.4.17 in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from requests<3.0.0dev,>=2.18.0->google-cloud-bigquery[bqstorage,pandas]) (2020.6.20)
Requirement already satisfied, skipping upgrade: chardet<4,>=3.0.2 in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from requests<3.0.0dev,>=2.18.0->google-cloud-bigquery[bqstorage,pandas]) (3.0.4)
Requirement already satisfied, skipping upgrade: google-crc32c<2.0dev,>=1.0 in /home/jovyan/.local/geoai-py3-0.8.0/lib/python3.6/site-packages (from google-resumable-media<3.0dev,>=0.6.0->google-cloud-bigquery[bqstorage,pandas]) (1.3.0)
Requirement already satisfied, skipping upgrade: pytz>=2017.2 in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from pandas>=1.0.0; extra == "pandas"->google-cloud-bigquery[bqstorage,pandas]) (2020.1)
Requirement already satisfied, skipping upgrade: rsa<5,>=3.1.4 in /home/jovyan/.local/geoai-py3-0.8.0/lib/python3.6/site-packages (from google-auth<3.0dev,>=1.25.0->google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0dev,>=1.31.5->google-cloud-bigquery[bqstorage,pandas]) (4.9)
Requirement already satisfied, skipping upgrade: cachetools<6.0,>=2.0.0 in /home/jovyan/.local/geoai-py3-0.8.0/lib/python3.6/site-packages (from google-auth<3.0dev,>=1.25.0->google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0dev,>=1.31.5->google-cloud-bigquery[bqstorage,pandas]) (4.2.4)
Requirement already satisfied, skipping upgrade: pyasn1-modules>=0.2.1 in /home/jovyan/.local/geoai-py3-0.8.0/lib/python3.6/site-packages (from google-auth<3.0dev,>=1.25.0->google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0dev,>=1.31.5->google-cloud-bigquery[bqstorage,pandas]) (0.3.0)
Requirement already satisfied, skipping upgrade: pyasn1>=0.1.3 in /home/jovyan/.local/geoai-py3-0.8.0/lib/python3.6/site-packages (from rsa<5,>=3.1.4->google-auth<3.0dev,>=1.25.0->google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0dev,>=1.31.5->google-cloud-bigquery[bqstorage,pandas]) (0.5.0)
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: db-dtypes in /home/jovyan/.local/geoai-py3-0.8.0/lib/python3.6/site-packages (1.0.2)
Requirement already satisfied: packaging>=17.0 in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from db-dtypes) (20.4)
Requirement already satisfied: pandas<2.0dev,>=0.24.2 in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from db-dtypes) (1.0.5)
Requirement already satisfied: pyarrow<9.0dev,>=3.0.0 in /home/jovyan/.local/geoai-py3-0.8.0/lib/python3.6/site-packages (from db-dtypes) (6.0.1)
Requirement already satisfied: numpy<2.0dev,>=1.16.6 in /home/jovyan/.local/geoai-py3-0.8.0/lib/python3.6/site-packages (from db-dtypes) (1.19.5)
Requirement already satisfied: pyparsing>=2.0.2 in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from packaging>=17.0->db-dtypes) (2.4.7)
Requirement already satisfied: six in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from packaging>=17.0->db-dtypes) (1.15.0)
Requirement already satisfied: pytz>=2017.2 in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from pandas<2.0dev,>=0.24.2->db-dtypes) (2020.1)
Requirement already satisfied: python-dateutil>=2.6.1 in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from pandas<2.0dev,>=0.24.2->db-dtypes) (2.8.1)
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: basemap in /home/jovyan/.local/geoai-py3-0.8.0/lib/python3.6/site-packages (1.3.8)
Requirement already satisfied: numpy<1.20,>=1.19; python_version == "3.6" in /home/jovyan/.local/geoai-py3-0.8.0/lib/python3.6/site-packages (from basemap) (1.19.5)
Requirement already satisfied: basemap-data<1.4,>=1.3.2 in /home/jovyan/.local/geoai-py3-0.8.0/lib/python3.6/site-packages (from basemap) (1.3.2)
Requirement already satisfied: pyproj<3.7.0,>=1.9.3; python_version >= "3.5" in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from basemap) (2.6.1.post1)
Requirement already satisfied: pyshp<2.4,>=1.2; python_version >= "2.7" in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from basemap) (2.1.0)
Requirement already satisfied: matplotlib<3.8,>=1.5; python_version >= "3.5" in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from basemap) (3.2.2)
Requirement already satisfied: kiwisolver>=1.0.1 in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from matplotlib<3.8,>=1.5; python_version >= "3.5"->basemap) (1.2.0)
Requirement already satisfied: cycler>=0.10 in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from matplotlib<3.8,>=1.5; python_version >= "3.5"->basemap) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from matplotlib<3.8,>=1.5; python_version >= "3.5"->basemap) (2.4.7)
Requirement already satisfied: python-dateutil>=2.1 in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from matplotlib<3.8,>=1.5; python_version >= "3.5"->basemap) (2.8.1)
Requirement already satisfied: six in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from cycler>=0.10->matplotlib<3.8,>=1.5; python_version >= "3.5"->basemap) (1.15.0)
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: networkx in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (2.4)
Requirement already satisfied: decorator>=4.3.0 in /cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages (from networkx) (4.4.2)
In [3]:
import os
import sys
import math

import numpy as np
import geopandas as gpd

from google.cloud import bigquery

import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

import networkx as nx
In [4]:
%load_ext google.cloud.bigquery

Step 0: Google BigQuery Account

Authenticate with Google Cloud:
You need to authenticate your Python code to access Google Cloud resources. You can do this by creating a service account and downloading the JSON key file.

  • Create a service account and download the JSON key file (reference: https://hevodata.com/learn/bigquery-jupyter-notebook/#w5):
    • Go to the Google Cloud Console.
    • Navigate to "IAM & Admin" > "Service accounts."
    • Click "Create Service Account," fill out the details, and grant the necessary permissions (BigQuery User or Editor role).
    • Create and download a JSON key file for the service account.
  • Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to your JSON key file:
    import os
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path-to-CREDENTIALS.json"
    
In [5]:
# Setting your Google application credentials
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "iguide-gdelt.json"
  • Simple test (good to continue if you see a table output)
In [6]:
# Construct a BigQuery client object.
client = bigquery.Client()
sql = """
SELECT
    plurality,
    COUNT(1) AS count,
    year
FROM
    `bigquery-public-data.samples.natality`
WHERE
    NOT IS_NAN(plurality) AND plurality > 1
GROUP BY
    plurality, year
ORDER BY
    count DESC
"""
df = client.query(sql).to_dataframe()
df.head()
Out[6]:
plurality count year
0 2 139209 2007
1 2 138866 2008
2 2 137239 2006
3 2 133285 2005
4 2 132344 2004

BigQuery GDELT Dataset Overview

The database structure is project --> dataset --> table. The project id for the GDELT project in Google BigQuery is "gdelt-bq".

1. Datasets and tables.

First, we investigate what datasets and tables are accessible.

In [7]:
# Project id of the GDELT project
project_id = "gdelt-bq"

# Create a BigQuery client object
client = bigquery.Client(project=project_id)
In [8]:
# List all datasets and tables in the GDELT project
datasets = list(client.list_datasets())

# Iterate through each dataset and list its tables
for dataset in datasets:
    print("Dataset: {}".format(dataset.dataset_id))
    tables = list(client.list_tables(dataset))
    table_names = [table.table_id for table in tables]
    print("Tables: {}\n--------".format(", ".join(table_names)))
Dataset: covid19
Tables: onlinenews, onlinenewsgeo, tvnews
--------
Dataset: extra
Tables: countries_by_media_50pct, countrygeolookup, countryinfo, countryinfo2, sourcesbycountry, toytonelookup
--------
Dataset: full
Tables: crosswalk_geocountrycodetohuman, events, events_partitioned
--------
Dataset: gdeltv2
Tables: academicliteraturegkg, cloudvision, cloudvision_partitioned, cloudvision_webentitygraph_20191020, domainsbycountry_alllangs_april2015, eventmentions, eventmentions_partitioned, events, events_partitioned, gal, gdg_partitioned, geg_g1, geg_gcnlapi, geg_gcnlapisent, geg_iatv, gegv2_iatv, gemg, gfg_linguistic1grams_20190831, gfg_linguistic2grams_20190831, gfg_linguisticinventory_20190901, gfg_outlet1grams_20190831, gfg_outlet2grams_20190831, gfg_partitioned, ggg, gkg, gkg_partitioned, gkg_socialoutlinks, gng, gqg, grg_dtn, grg_vcn, grg_vcn_nlpapi, gsg, gsg_docembed, gsg_iatvsentembed, iaradio_1grams, iaradio_2grams, iaradio_3grams, iaradio_4grams, iaradio_5grams, iatv_1grams, iatv_1gramsv2, iatv_2grams, iatv_2gramsv2, iatv_3gramsv2, iatv_4gramsv2, iatv_5gramsv2, iatv_aif_captime, iatv_aif_vidtime, iatv_lowerthird, iatv_lowerthirdclusters, iatv_showinventory, iatv_timeinventory, iatvgkg, readability, readability_iatv, vgeg_iatv, vgegv2_iatv, web_1grams, web_2grams, web_pos, webngrams
--------
Dataset: gdeltv2_ngrams
Tables: arabic_trigram
--------
Dataset: hathitrustbooks
Tables: 1800, 1801, 1802, 1803, 1804, 1805, 1806, 1807, 1808, 1809, 1810, 1811, 1812, 1813, 1814, 1815, 1816, 1817, 1818, 1819, 1820, 1821, 1822, 1823, 1824, 1825, 1826, 1827, 1828, 1829, 1830, 1831, 1832, 1833, 1834, 1835, 1836, 1837, 1838, 1839, 1840, 1841, 1842, 1843, 1844, 1845, 1846, 1847, 1848, 1849, 1850, 1851, 1852, 1853, 1854, 1855, 1856, 1857, 1858, 1859, 1860, 1861, 1862, 1863, 1864, 1865, 1866, 1867, 1868, 1869, 1870, 1871, 1872, 1873, 1874, 1875, 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894, 1895, 1896, 1897, 1898, 1899, 1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012
--------
Dataset: internetarchivebooks
Tables: 1800, 1801, 1802, 1803, 1804, 1805, 1806, 1807, 1808, 1809, 1810, 1811, 1812, 1813, 1814, 1815, 1816, 1817, 1818, 1819, 1820, 1821, 1822, 1823, 1824, 1825, 1826, 1827, 1828, 1829, 1830, 1831, 1832, 1833, 1834, 1835, 1836, 1837, 1838, 1839, 1840, 1841, 1842, 1843, 1844, 1845, 1846, 1847, 1848, 1849, 1850, 1851, 1852, 1853, 1854, 1855, 1856, 1857, 1858, 1859, 1860, 1861, 1862, 1863, 1864, 1865, 1866, 1867, 1868, 1869, 1870, 1871, 1872, 1873, 1874, 1875, 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894, 1895, 1896, 1897, 1898, 1899, 1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920, 1921, 1922, 1923notxt, 1924notxt, 1925notxt, 1926notxt, 1927notxt, 1928notxt, 1929notxt, 1930notxt, 1931notxt, 1932notxt, 1933notxt, 1934notxt, 1935notxt, 1936notxt, 1937notxt, 1938notxt, 1939notxt, 1940notxt, 1941notxt, 1942notxt, 1943notxt, 1944notxt, 1945notxt, 1946notxt, 1947notxt, 1948notxt, 1949notxt, 1950notxt, 1951notxt, 1952notxt, 1953notxt, 1954notxt, 1955notxt, 1956notxt, 1957notxt, 1958notxt, 1959notxt, 1960notxt, 1961notxt, 1962notxt, 1963notxt, 1964notxt, 1965notxt, 1966notxt, 1967notxt, 1968notxt, 1969notxt, 1970notxt, 1971notxt, 1972notxt, 1973notxt, 1974notxt, 1975notxt, 1976notxt, 1977noftxt, 1978notxt, 1979notxt, 1980notxt, 1981notxt, 1982notxt, 1983notxt, 1984notxt, 1985notxt, 1986notxt, 1987notxt, 1988notxt, 1989notxt, 1990notxt, 1991notxt, 1992notxt, 1993notxt, 1994notxt, 1995notxt, 1996notxt, 1997notxt, 1998notxt, 1999notxt, 2000notxt, 2001notxt, 2002notxt, 2003notxt, 2004notxt, 2005notxt, 2006notxt, 2007notxt, 2008notxt, 2009notxt, 2010notxt, 2011notxt, 2012notxt, 2013notxt, 2014notxt
--------
Dataset: sample_views
Tables: actor_combinations_by_year, actor_combinations_top1_by_year, country_date_count, country_date_matconf_numarts, country_date_matconfbyactiongeo, country_date_matconfbyactionhumangeo, country_date_protest_numarts, export7, export8, urls1
--------

2. Tables in the dataset.

There are two ways to check tables in a certain dataset, you can use either use the BigQuery client library directly or bigquery magic function %%bigquery. The magic function %%bigquery takes SQL as input and generate the pandas table directly.

a. BigQuery client library

In [9]:
# Check specific dataset
dataset_ref = client.dataset("full")
dataset = client.get_dataset(dataset_ref)
tables = list(client.list_tables(dataset))
for table in tables:  
    print(table.table_id)
crosswalk_geocountrycodetohuman
events
events_partitioned
In [10]:
# Do not put project="gdelt-bq" in since this create a new table
client = bigquery.Client()
sql = """
SELECT * FROM `gdelt-bq`.full.INFORMATION_SCHEMA.TABLES;
"""
table_df_1 = client.query(sql).to_dataframe()
table_df_1.head()
Out[10]:
table_catalog table_schema table_name table_type is_insertable_into is_typed creation_time base_table_catalog base_table_schema base_table_name snapshot_time_ms ddl default_collation_name upsert_stream_apply_watermark
0 gdelt-bq full events BASE TABLE YES NO 2016-05-19 16:21:47.091000+00:00 None None None NaT CREATE TABLE `gdelt-bq.full.events`\n(\n GLOB... NULL NaT
1 gdelt-bq full crosswalk_geocountrycodetohuman BASE TABLE YES NO 2014-07-18 00:35:19.687000+00:00 None None None NaT CREATE TABLE `gdelt-bq.full.crosswalk_geocount... NULL NaT
2 gdelt-bq full events_partitioned BASE TABLE YES NO 2016-05-20 15:56:23.559000+00:00 None None None NaT CREATE TABLE `gdelt-bq.full.events_partitioned... NULL NaT

b. magic function %%bigquery

In [11]:
%%bigquery
SELECT * FROM `gdelt-bq.full.INFORMATION_SCHEMA.TABLES`
Query complete after 0.01s: 100%|██████████| 1/1 [00:00<00:00, 396.29query/s]                          
Downloading: 100%|██████████| 3/3 [00:01<00:00,  2.37rows/s]
Out[11]:
table_catalog table_schema table_name table_type is_insertable_into is_typed creation_time base_table_catalog base_table_schema base_table_name snapshot_time_ms ddl default_collation_name upsert_stream_apply_watermark
0 gdelt-bq full events BASE TABLE YES NO 2016-05-19 16:21:47.091000+00:00 None None None NaT CREATE TABLE `gdelt-bq.full.events`\n(\n GLOB... NULL NaT
1 gdelt-bq full crosswalk_geocountrycodetohuman BASE TABLE YES NO 2014-07-18 00:35:19.687000+00:00 None None None NaT CREATE TABLE `gdelt-bq.full.crosswalk_geocount... NULL NaT
2 gdelt-bq full events_partitioned BASE TABLE YES NO 2016-05-20 15:56:23.559000+00:00 None None None NaT CREATE TABLE `gdelt-bq.full.events_partitioned... NULL NaT
In [12]:
%%bigquery table_df_2
SELECT * FROM `gdelt-bq.full.INFORMATION_SCHEMA.TABLES`
Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 336.06query/s]                          
Downloading: 100%|██████████| 3/3 [00:01<00:00,  2.61rows/s]
In [13]:
# bigquery result saved to pandas table
table_df_2.head()
Out[13]:
table_catalog table_schema table_name table_type is_insertable_into is_typed creation_time base_table_catalog base_table_schema base_table_name snapshot_time_ms ddl default_collation_name upsert_stream_apply_watermark
0 gdelt-bq full events BASE TABLE YES NO 2016-05-19 16:21:47.091000+00:00 None None None NaT CREATE TABLE `gdelt-bq.full.events`\n(\n GLOB... NULL NaT
1 gdelt-bq full crosswalk_geocountrycodetohuman BASE TABLE YES NO 2014-07-18 00:35:19.687000+00:00 None None None NaT CREATE TABLE `gdelt-bq.full.crosswalk_geocount... NULL NaT
2 gdelt-bq full events_partitioned BASE TABLE YES NO 2016-05-20 15:56:23.559000+00:00 None None None NaT CREATE TABLE `gdelt-bq.full.events_partitioned... NULL NaT

3. Columns in the table.

Bigquery magic functions can be used to check dataset schema.

In [14]:
%%bigquery
SELECT * FROM `gdelt-bq.full.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'events';
Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 426.90query/s]                          
Downloading: 100%|██████████| 58/58 [00:01<00:00, 43.76rows/s]
Out[14]:
table_catalog table_schema table_name column_name ordinal_position is_nullable data_type is_generated generation_expression is_stored is_hidden is_updatable is_system_defined is_partitioning_column clustering_ordinal_position collation_name column_default rounding_mode
0 gdelt-bq full events GLOBALEVENTID 1 YES INT64 NEVER None None NO None NO NO <NA> NULL NULL None
1 gdelt-bq full events SQLDATE 2 YES INT64 NEVER None None NO None NO NO <NA> NULL NULL None
2 gdelt-bq full events MonthYear 3 YES INT64 NEVER None None NO None NO NO <NA> NULL NULL None
3 gdelt-bq full events Year 4 YES INT64 NEVER None None NO None NO NO <NA> NULL NULL None
4 gdelt-bq full events FractionDate 5 YES FLOAT64 NEVER None None NO None NO NO <NA> NULL NULL None
5 gdelt-bq full events Actor1Code 6 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
6 gdelt-bq full events Actor1Name 7 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
7 gdelt-bq full events Actor1CountryCode 8 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
8 gdelt-bq full events Actor1KnownGroupCode 9 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
9 gdelt-bq full events Actor1EthnicCode 10 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
10 gdelt-bq full events Actor1Religion1Code 11 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
11 gdelt-bq full events Actor1Religion2Code 12 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
12 gdelt-bq full events Actor1Type1Code 13 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
13 gdelt-bq full events Actor1Type2Code 14 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
14 gdelt-bq full events Actor1Type3Code 15 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
15 gdelt-bq full events Actor2Code 16 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
16 gdelt-bq full events Actor2Name 17 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
17 gdelt-bq full events Actor2CountryCode 18 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
18 gdelt-bq full events Actor2KnownGroupCode 19 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
19 gdelt-bq full events Actor2EthnicCode 20 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
20 gdelt-bq full events Actor2Religion1Code 21 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
21 gdelt-bq full events Actor2Religion2Code 22 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
22 gdelt-bq full events Actor2Type1Code 23 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
23 gdelt-bq full events Actor2Type2Code 24 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
24 gdelt-bq full events Actor2Type3Code 25 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
25 gdelt-bq full events IsRootEvent 26 YES INT64 NEVER None None NO None NO NO <NA> NULL NULL None
26 gdelt-bq full events EventCode 27 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
27 gdelt-bq full events EventBaseCode 28 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
28 gdelt-bq full events EventRootCode 29 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
29 gdelt-bq full events QuadClass 30 YES INT64 NEVER None None NO None NO NO <NA> NULL NULL None
30 gdelt-bq full events GoldsteinScale 31 YES FLOAT64 NEVER None None NO None NO NO <NA> NULL NULL None
31 gdelt-bq full events NumMentions 32 YES INT64 NEVER None None NO None NO NO <NA> NULL NULL None
32 gdelt-bq full events NumSources 33 YES INT64 NEVER None None NO None NO NO <NA> NULL NULL None
33 gdelt-bq full events NumArticles 34 YES INT64 NEVER None None NO None NO NO <NA> NULL NULL None
34 gdelt-bq full events AvgTone 35 YES FLOAT64 NEVER None None NO None NO NO <NA> NULL NULL None
35 gdelt-bq full events Actor1Geo_Type 36 YES INT64 NEVER None None NO None NO NO <NA> NULL NULL None
36 gdelt-bq full events Actor1Geo_FullName 37 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
37 gdelt-bq full events Actor1Geo_CountryCode 38 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
38 gdelt-bq full events Actor1Geo_ADM1Code 39 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
39 gdelt-bq full events Actor1Geo_Lat 40 YES FLOAT64 NEVER None None NO None NO NO <NA> NULL NULL None
40 gdelt-bq full events Actor1Geo_Long 41 YES FLOAT64 NEVER None None NO None NO NO <NA> NULL NULL None
41 gdelt-bq full events Actor1Geo_FeatureID 42 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
42 gdelt-bq full events Actor2Geo_Type 43 YES INT64 NEVER None None NO None NO NO <NA> NULL NULL None
43 gdelt-bq full events Actor2Geo_FullName 44 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
44 gdelt-bq full events Actor2Geo_CountryCode 45 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
45 gdelt-bq full events Actor2Geo_ADM1Code 46 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
46 gdelt-bq full events Actor2Geo_Lat 47 YES FLOAT64 NEVER None None NO None NO NO <NA> NULL NULL None
47 gdelt-bq full events Actor2Geo_Long 48 YES FLOAT64 NEVER None None NO None NO NO <NA> NULL NULL None
48 gdelt-bq full events Actor2Geo_FeatureID 49 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
49 gdelt-bq full events ActionGeo_Type 50 YES INT64 NEVER None None NO None NO NO <NA> NULL NULL None
50 gdelt-bq full events ActionGeo_FullName 51 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
51 gdelt-bq full events ActionGeo_CountryCode 52 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
52 gdelt-bq full events ActionGeo_ADM1Code 53 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
53 gdelt-bq full events ActionGeo_Lat 54 YES FLOAT64 NEVER None None NO None NO NO <NA> NULL NULL None
54 gdelt-bq full events ActionGeo_Long 55 YES FLOAT64 NEVER None None NO None NO NO <NA> NULL NULL None
55 gdelt-bq full events ActionGeo_FeatureID 56 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
56 gdelt-bq full events DATEADDED 57 YES INT64 NEVER None None NO None NO NO <NA> NULL NULL None
57 gdelt-bq full events SOURCEURL 58 YES STRING NEVER None None NO None NO NO <NA> NULL NULL None
In [15]:
%%bigquery
SELECT * FROM `gdelt-bq.full.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
WHERE table_name = 'events';
Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 395.61query/s]                          
Downloading: 100%|██████████| 58/58 [00:01<00:00, 42.96rows/s]
Out[15]:
table_catalog table_schema table_name column_name field_path data_type description collation_name rounding_mode
0 gdelt-bq full events GLOBALEVENTID GLOBALEVENTID INT64 Unique ID for each event NULL None
1 gdelt-bq full events SQLDATE SQLDATE INT64 Date the event took place in YYYYMMDD format NULL None
2 gdelt-bq full events MonthYear MonthYear INT64 Alternative formatting of the event date, in Y... NULL None
3 gdelt-bq full events Year Year INT64 Alternative formatting of the event date, in Y... NULL None
4 gdelt-bq full events FractionDate FractionDate FLOAT64 Alternative formatting of the event date, comp... NULL None
5 gdelt-bq full events Actor1Code Actor1Code STRING The complete raw CAMEO code for Actor1 (includ... NULL None
6 gdelt-bq full events Actor1Name Actor1Name STRING The actual name of the Actor 1. In the case of... NULL None
7 gdelt-bq full events Actor1CountryCode Actor1CountryCode STRING The 3-character CAMEO code for the country aff... NULL None
8 gdelt-bq full events Actor1KnownGroupCode Actor1KnownGroupCode STRING If Actor1 is a known IGO/NGO/rebel organizatio... NULL None
9 gdelt-bq full events Actor1EthnicCode Actor1EthnicCode STRING If the source document specifies the ethnic af... NULL None
10 gdelt-bq full events Actor1Religion1Code Actor1Religion1Code STRING If the source document specifies the religious... NULL None
11 gdelt-bq full events Actor1Religion2Code Actor1Religion2Code STRING If multiple religious codes are specified for ... NULL None
12 gdelt-bq full events Actor1Type1Code Actor1Type1Code STRING The 3-character CAMEO code of the CAMEO “type”... NULL None
13 gdelt-bq full events Actor1Type2Code Actor1Type2Code STRING If multiple type/role codes are specified for ... NULL None
14 gdelt-bq full events Actor1Type3Code Actor1Type3Code STRING If multiple type/role codes are specified for ... NULL None
15 gdelt-bq full events Actor2Code Actor2Code STRING The complete raw CAMEO code for Actor2 (includ... NULL None
16 gdelt-bq full events Actor2Name Actor2Name STRING The actual name of the Actor2. In the case of ... NULL None
17 gdelt-bq full events Actor2CountryCode Actor2CountryCode STRING The 3-character CAMEO code for the country aff... NULL None
18 gdelt-bq full events Actor2KnownGroupCode Actor2KnownGroupCode STRING If Actor2 is a known IGO/NGO/rebel organizatio... NULL None
19 gdelt-bq full events Actor2EthnicCode Actor2EthnicCode STRING If the source document specifies the ethnic af... NULL None
20 gdelt-bq full events Actor2Religion1Code Actor2Religion1Code STRING If the source document specifies the religious... NULL None
21 gdelt-bq full events Actor2Religion2Code Actor2Religion2Code STRING If multiple religious codes are specified for ... NULL None
22 gdelt-bq full events Actor2Type1Code Actor2Type1Code STRING The 3-character CAMEO code of the CAMEO “type”... NULL None
23 gdelt-bq full events Actor2Type2Code Actor2Type2Code STRING If multiple type/role codes are specified for ... NULL None
24 gdelt-bq full events Actor2Type3Code Actor2Type3Code STRING If multiple type/role codes are specified for ... NULL None
25 gdelt-bq full events IsRootEvent IsRootEvent INT64 The system codes every event found in an entir... NULL None
26 gdelt-bq full events EventCode EventCode STRING This is the raw CAMEO action code describing t... NULL None
27 gdelt-bq full events EventBaseCode EventBaseCode STRING CAMEO event codes are defined in a three-level... NULL None
28 gdelt-bq full events EventRootCode EventRootCode STRING Similar to EventBaseCode, this defines the roo... NULL None
29 gdelt-bq full events QuadClass QuadClass INT64 The entire CAMEO event taxonomy is ultimately ... NULL None
30 gdelt-bq full events GoldsteinScale GoldsteinScale FLOAT64 Each CAMEO event code is assigned a numeric sc... NULL None
31 gdelt-bq full events NumMentions NumMentions INT64 This is the total number of mentions of this e... NULL None
32 gdelt-bq full events NumSources NumSources INT64 This is the total number of information source... NULL None
33 gdelt-bq full events NumArticles NumArticles INT64 This is the total number of source documents c... NULL None
34 gdelt-bq full events AvgTone AvgTone FLOAT64 This is the average “tone” of all documents co... NULL None
35 gdelt-bq full events Actor1Geo_Type Actor1Geo_Type INT64 Location of Actor1. This field specifies the ... NULL None
36 gdelt-bq full events Actor1Geo_FullName Actor1Geo_FullName STRING Location of Actor1. This is the full human-re... NULL None
37 gdelt-bq full events Actor1Geo_CountryCode Actor1Geo_CountryCode STRING Location of Actor1. This is the 2-character F... NULL None
38 gdelt-bq full events Actor1Geo_ADM1Code Actor1Geo_ADM1Code STRING Location of Actor1. This is the 2-character F... NULL None
39 gdelt-bq full events Actor1Geo_Lat Actor1Geo_Lat FLOAT64 Location of Actor1. This is the centroid lati... NULL None
40 gdelt-bq full events Actor1Geo_Long Actor1Geo_Long FLOAT64 Location of Actor1. This is the centroid long... NULL None
41 gdelt-bq full events Actor1Geo_FeatureID Actor1Geo_FeatureID STRING Location of Actor1. This is the GNS or GNIS F... NULL None
42 gdelt-bq full events Actor2Geo_Type Actor2Geo_Type INT64 Location of Actor2. This field specifies the ... NULL None
43 gdelt-bq full events Actor2Geo_FullName Actor2Geo_FullName STRING Location of Actor2. This is the full human-re... NULL None
44 gdelt-bq full events Actor2Geo_CountryCode Actor2Geo_CountryCode STRING Location of Actor2. This is the 2-character F... NULL None
45 gdelt-bq full events Actor2Geo_ADM1Code Actor2Geo_ADM1Code STRING Location of Actor2. This is the 2-character F... NULL None
46 gdelt-bq full events Actor2Geo_Lat Actor2Geo_Lat FLOAT64 Location of Actor2. This is the centroid lati... NULL None
47 gdelt-bq full events Actor2Geo_Long Actor2Geo_Long FLOAT64 Location of Actor2. This is the centroid long... NULL None
48 gdelt-bq full events Actor2Geo_FeatureID Actor2Geo_FeatureID STRING Location of Actor2. This is the GNS or GNIS F... NULL None
49 gdelt-bq full events ActionGeo_Type ActionGeo_Type INT64 Location of Event. This field specifies the g... NULL None
50 gdelt-bq full events ActionGeo_FullName ActionGeo_FullName STRING Location of Event. This is the full human-rea... NULL None
51 gdelt-bq full events ActionGeo_CountryCode ActionGeo_CountryCode STRING Location of Event. This is the 2-character FI... NULL None
52 gdelt-bq full events ActionGeo_ADM1Code ActionGeo_ADM1Code STRING Location of Event. This is the 2-character FI... NULL None
53 gdelt-bq full events ActionGeo_Lat ActionGeo_Lat FLOAT64 Location of Event. This is the centroid latit... NULL None
54 gdelt-bq full events ActionGeo_Long ActionGeo_Long FLOAT64 Location of Event. This is the centroid longi... NULL None
55 gdelt-bq full events ActionGeo_FeatureID ActionGeo_FeatureID STRING Location of Event. This is the GNS or GNIS Fe... NULL None
56 gdelt-bq full events DATEADDED DATEADDED INT64 This field stores the date the event was added... NULL None
57 gdelt-bq full events SOURCEURL SOURCEURL STRING This field is only present in the daily event ... NULL None
In [ ]:
%%bigquery
SELECT *
FROM `gdelt-bq.full.events`
LIMIT 5
In [16]:
%%bigquery usa_event_df
SELECT Actor1CountryCode, Actor2CountryCode, count(*) as count
FROM `gdelt-bq.full.events`
WHERE ((Actor1CountryCode = 'USA' and Actor2CountryCode IS NOT NULL) 
       or (Actor2CountryCode = 'USA' and Actor1CountryCode IS NOT NULL)) 
       and Actor1CountryCode != Actor2CountryCode
       and Year = 2022
GROUP BY 1,2
Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 602.46query/s] 
Downloading: 100%|██████████| 431/431 [00:01<00:00, 384.29rows/s]
In [17]:
usa_event_df
Out[17]:
Actor1CountryCode Actor2CountryCode count
0 USA GRC 8819
1 PHL USA 7808
2 VCT USA 570
3 SAM USA 4
4 USA SWE 3219
... ... ... ...
426 USA GNB 30
427 SAU USA 10874
428 LVA USA 707
429 ASA USA 104
430 BGR USA 854

431 rows × 3 columns

In [18]:
# Create a new column with the sorted Actor1CountryCode and Actor2CountryCode
usa_event_df['sorted_pair'] = usa_event_df.apply(lambda row: tuple(sorted([row['Actor1CountryCode'], row['Actor2CountryCode']])), axis=1)

# Group by the sorted pair and sum the counts
grouped_df = usa_event_df.groupby('sorted_pair').agg({'count': 'sum'}).reset_index()
In [19]:
max_val = np.log10(grouped_df['count'].max())

def get_alpha(count):
    '''
    Convert a count to an alpha val.
    Log-scaled
    '''
    scale = np.log10(count)
    return (scale/max_val) * 0.25
In [20]:
plt.figure(figsize=(18,18))

# Load the world shape map
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

# Country centroids
world['centroid'] = world.centroid
centroids = world.set_index('iso_a3')['centroid']

event_map = Basemap(projection='moll',
                    llcrnrlat=-80,urcrnrlat=80,
                    llcrnrlon=-180,urcrnrlon=180, 
                    lon_0=0, lat_0=0)

event_map.drawcoastlines()
event_map.drawcountries()
event_map.fillcontinents(color='0.6')
event_map.drawmapboundary()

# Assuming grouped_df is defined somewhere before this
for index, row in grouped_df.iterrows():
    country1, country2 = row['sorted_pair']
    if country1 in centroids.index and country2 in centroids.index:
        point1 = centroids.loc[country1]
        point2 = centroids.loc[country2]
        
        # Convert centroids to lat-long for Basemap
        lon1, lat1 = point1.xy
        lon2, lat2 = point2.xy
        
        line_alpha = get_alpha(row['count'])
        
        event_map.drawgreatcircle(lon1[0], lat1[0], lon2[0], lat2[0], 
                                  linewidth=2,
                                  color='r',
                                  alpha=line_alpha)
plt.show()
/cvmfs/cybergis.illinois.edu/software/conda/cybergisx/geoai-0.8.0/lib/python3.6/site-packages/ipykernel_launcher.py:7: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  import sys

Case2: Entity Connection Graph

In [26]:
%%bigquery
SELECT * FROM `gdelt-bq.gdeltv2.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
WHERE table_name = 'geg_gcnlapi';
Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 382.34query/s]                          
Downloading: 100%|██████████| 13/13 [00:01<00:00, 10.71rows/s]
Out[26]:
table_catalog table_schema table_name column_name field_path data_type description collation_name rounding_mode
0 gdelt-bq gdeltv2 geg_gcnlapi date date TIMESTAMP None NULL None
1 gdelt-bq gdeltv2 geg_gcnlapi url url STRING None NULL None
2 gdelt-bq gdeltv2 geg_gcnlapi lang lang STRING None NULL None
3 gdelt-bq gdeltv2 geg_gcnlapi polarity polarity FLOAT64 None NULL None
4 gdelt-bq gdeltv2 geg_gcnlapi magnitude magnitude FLOAT64 None NULL None
5 gdelt-bq gdeltv2 geg_gcnlapi score score FLOAT64 None NULL None
6 gdelt-bq gdeltv2 geg_gcnlapi entities entities ARRAY<STRUCT<name STRING, type STRING, mid STR... None NULL None
7 gdelt-bq gdeltv2 geg_gcnlapi entities entities.name STRING None NULL None
8 gdelt-bq gdeltv2 geg_gcnlapi entities entities.type STRING None NULL None
9 gdelt-bq gdeltv2 geg_gcnlapi entities entities.mid STRING None NULL None
10 gdelt-bq gdeltv2 geg_gcnlapi entities entities.wikipediaUrl STRING None NULL None
11 gdelt-bq gdeltv2 geg_gcnlapi entities entities.numMentions INT64 None NULL None
12 gdelt-bq gdeltv2 geg_gcnlapi entities entities.avgSalience FLOAT64 None NULL None
In [27]:
%%bigquery entity_graph
SELECT Source, Target, Count RawCount, "Undirected" Type, ( Count/SUM(Count) OVER () ) Weight FROM (
SELECT a.entity Source, b.entity Target, COUNT(*) as Count
FROM (
 (SELECT url, entities.name entity FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entities 
  where entities.mid is not null and entities.type = 'PERSON'
  and date >= "2019-02-05 00:00:00" AND date < "2019-02-06 00:00:00")
) a
JOIN (
 (SELECT url, entities.name entity FROM `gdelt-bq.gdeltv2.geg_gcnlapi`, unnest(entities) entities 
  where entities.mid is not null and entities.type = 'PERSON'
  and date >= "2019-02-05 00:00:00" AND date < "2019-02-06 00:00:00")
) b
ON a.url=b.url
WHERE a.entity<b.entity
GROUP BY 1,2
ORDER BY 3 DESC
LIMIT 1500
)
order by Count Desc
Query complete after 0.00s: 100%|██████████| 10/10 [00:00<00:00, 3087.45query/s]                       
Downloading: 100%|██████████| 1500/1500 [00:01<00:00, 1263.98rows/s]
In [47]:
entity_graph.head()
Out[47]:
Source Target RawCount Type Weight
0 Nancy Pelosi Stacey Abrams 1120 Undirected 0.005300
1 Elizabeth Warren Kamala Harris 1030 Undirected 0.004875
2 Cory Booker Kamala Harris 976 Undirected 0.004619
3 Cory Booker Elizabeth Warren 909 Undirected 0.004302
4 Kamala Harris Kirsten Gillibrand 888 Undirected 0.004203
In [52]:
entity_graph_filtered = entity_graph[(entity_graph['Source'] == 'Donald Trump') | (entity_graph['Target'] == 'Donald Trump')]
In [53]:
entity_graph_filtered
Out[53]:
Source Target RawCount Type Weight
7 Donald Trump Nancy Pelosi 799 Undirected 0.003781
13 Donald Trump Kamala Harris 727 Undirected 0.003441
17 Donald Trump Elizabeth Warren 682 Undirected 0.003228
18 Donald Trump Stacey Abrams 682 Undirected 0.003228
24 Donald Trump Kirsten Gillibrand 635 Undirected 0.003005
... ... ... ... ... ...
757 Angela Merkel Donald Trump 97 Undirected 0.000459
814 Donald Trump God 95 Undirected 0.000450
863 Donald Trump Juan Guaido 93 Undirected 0.000440
1284 Bush Donald Trump 86 Undirected 0.000407
1433 Donald Trump Marco Rubio 83 Undirected 0.000393

61 rows × 5 columns

In [71]:
g = nx.Graph()

for index, row in entity_graph_filtered.iterrows():
    g.add_edge(row['Source'], row['Target'], weight=row['Weight'])

# Set up with a white background and no grid lines
plt.figure(figsize=(20, 15))
plt.title('People most associated in global news with Donald Trump', y=0.97, fontsize=20, fontweight='bold')
cf = plt.gcf()
ax = cf.gca()
ax.set_axis_off()

# Getting the largest connected component
connected_components = (g.subgraph(c) for c in nx.connected_components(g))
gmax = max(connected_components, key=len)

d = nx.degree(gmax)
b = nx.betweenness_centrality(gmax).items()

# # Use a circular layout
pos = nx.circular_layout(gmax)

edge_weights = nx.get_edge_attributes(gmax, 'weight')
normalized_weights = [weight/max(edge_weights.values()) for weight in edge_weights.values()]

nx.draw_networkx_edges(gmax, pos, nodelist=[v[0] for v in b], alpha=0.9, width=normalized_weights*20)
nx.draw_networkx_nodes(gmax, pos, nodelist=[v[0] for v in b], node_size=[v[1] * 20 for v in d],
                       node_color=[math.log(1 + v[1] * 20) for v in b], cmap=plt.cm.Blues)
nx.draw_networkx_labels(gmax, pos, labels={v[0]: v[0] for v in b})

# Display the graph
plt.show()
In [ ]: