Authors: Yunfan Kang
The expressive-max-p (EMP) problem involves clustering a set of geographic areas into the maximum number of homogeneous regions that satisfy a set of user-defined constraints. Unlike the max-p regions problem (Duque, Anselin, and Rey, 2012), the EMP formulation supports five aggregates: MIN, MAX, AVG, SUM, and COUNT. Each aggregate can be paired with a range operator, and each query can contain any subset of the five aggregates.
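To make the constraint model concrete, the sketch below evaluates range constraints on a single candidate region in plain Python. It is only an illustration and not the Pyneapple implementation: the helper names `aggregate` and `satisfies` and the toy population values are hypothetical, and the ranges are assumed to be closed intervals.

```python
from statistics import mean

def aggregate(values, kind):
    """Compute one of the five EMP aggregates over a region's attribute values."""
    funcs = {"MIN": min, "MAX": max, "AVG": mean, "SUM": sum, "COUNT": len}
    return funcs[kind](values)

def satisfies(values, kind, low, high):
    """Return True if the aggregate lies in the range [low, high] (assumed closed)."""
    return low <= aggregate(values, kind) <= high

# Hypothetical region made of three areas with these populations.
region_pop = [1200, 800, 1500]
inf = float("inf")

# A query may use any subset of the aggregates; each gets its own range.
checks = {
    "SUM >= 3000": satisfies(region_pop, "SUM", 3000, inf),
    "MIN <= 1000": satisfies(region_pop, "MIN", -inf, 1000),
    "AVG in [1000, 1100]": satisfies(region_pop, "AVG", 1000, 1100),
    "COUNT in [2, 5]": satisfies(region_pop, "COUNT", 2, 5),
}
for name, ok in checks.items():
    print(name, ok)
```

A region is feasible for a query only if every constraint in the query holds, so this toy region would be rejected because of the AVG constraint.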
This notebook demonstrates the algorithm proposed in Kang and Magdy (2021).
The pyneapple package can be installed using the following command:
!pip install pyneapple-lib
#!pip install git+https://github.com/MagdyLab/Pyneapple.git@main
import pyneapple.regionalization.expressive_maxp as emp
import pyneapple.weight.rook as rook
import libpysal
import time
from libpysal.weights import Queen, Rook, KNN, Kernel, DistanceBand
import numpy as np
import geopandas
import pandas
import matplotlib.pyplot as plt
import jpype
from jpype import java
from jpype import javax
To demonstrate expressive-max-p, we combine the census tracts of the city of Los Angeles with data on population and employment status. The plot shows the dataset with each area colored by its 2010 population.
path = "data/LACity/LACity.shp"
lacity = geopandas.read_file(path)
lacity.plot(column = 'pop2010', figsize = (12, 8), edgecolor = 'w')
lacity.head()
To formulate an expressive-max-p problem, a number of parameters need to be specified.
First, a spatial weights object needs to be computed. The Pyneapple package provides a module, pyneapple.weight.rook, that is intended to give the same result as libpysal.weights.Rook but runs faster, especially when the number of areas is large. The cell below builds the weights with libpysal's Rook.
w = Rook.from_dataframe(lacity)
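For intuition about what rook contiguity means, the sketch below builds the neighbor lists for a regular grid, where two cells are rook neighbors when they share an edge (not just a corner). This is a toy illustration in plain Python; the function `rook_neighbors` is hypothetical and is not how libpysal or Pyneapple compute weights for real polygons.

```python
def rook_neighbors(nrows, ncols):
    """Map each cell id of an nrows x ncols grid to the ids of cells sharing an edge with it."""
    neighbors = {}
    for r in range(nrows):
        for c in range(ncols):
            cell = r * ncols + c
            adjacent = []
            # Only the four edge-sharing directions count for rook contiguity.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < nrows and 0 <= cc < ncols:
                    adjacent.append(rr * ncols + cc)
            neighbors[cell] = sorted(adjacent)
    return neighbors

grid = rook_neighbors(3, 3)
print(grid[4])  # center cell
print(grid[0])  # corner cell
```

For the weights object built above, `w.neighbors` exposes the same kind of mapping from each area to its contiguous areas.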
Then, we can formulate the query by specifying the constraints and the dissimilarity attribute.
For the first example query, we show how to express a max-p query as an expressive-max-p query. We use the model to aggregate the census tracts in the city of Los Angeles into regions with population >= 200000 while minimizing the heterogeneity, measured by the number of households in each area.
sum_attr = 'pop2010'
sum_low = 200000.0
dis_attr = 'households'
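The notebook does not spell out how heterogeneity is computed from the dissimilarity attribute. A common definition in the max-p literature, assumed here, sums the pairwise absolute differences of the attribute within each region. The sketch below applies that assumed definition to toy data; the function `heterogeneity` and the values are hypothetical.

```python
from itertools import combinations

def heterogeneity(labels, values):
    """Sum of pairwise absolute differences within each region (assumed definition)."""
    groups = {}
    for label, value in zip(labels, values):
        groups.setdefault(label, []).append(value)
    return sum(
        abs(a - b)
        for members in groups.values()
        for a, b in combinations(members, 2)
    )

# Hypothetical household counts for five areas.
households = [10, 12, 15, 40, 44]

two_regions = heterogeneity([0, 0, 0, 1, 1], households)
one_region = heterogeneity([0, 0, 0, 0, 0], households)
print(two_regions, one_region)
```

Under this definition, grouping similar areas together lowers the objective, which is why splitting the toy data into two regions scores better than keeping it as one.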
We then pass the parameters to the function expressive_maxp in the module pyneapple.regionalization.expressive_maxp (imported above as emp). For each unused constraint, the attribute can be set to any arbitrary attribute and the range is set to (-infinity, infinity).
non_attr = 'pop_16up'
inf = java.lang.Double.POSITIVE_INFINITY
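# Arguments after dis_attr, as used in this notebook: (attribute, low, high) for MIN, MAX, AVG, SUM, then (low, high) for COUNT.
p, regions = emp.expressive_maxp(lacity, w, dis_attr, non_attr, -inf, inf, non_attr, -inf, inf, non_attr, -inf, inf, sum_attr, sum_low, inf, -inf, inf)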
The number of regions (the p value) and the region label of each area are returned after the regionalization computation.
p
regions
lacity["regionLabel"] = regions
lacity.plot(column = "regionLabel", cmap="tab20", edgecolor="w", figsize = (12, 8))
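One way to sanity-check a labeling like the one returned above is to recompute the constrained aggregate per region and list the regions that fall outside the range. The sketch below does this for a SUM constraint on toy data; the helpers `region_sums` and `violations` and all values are hypothetical, and the range is assumed to be closed.

```python
from collections import defaultdict

def region_sums(labels, values):
    """Total of the attribute over the areas assigned to each region label."""
    totals = defaultdict(float)
    for label, value in zip(labels, values):
        totals[label] += value
    return dict(totals)

def violations(labels, values, low, high=float("inf")):
    """Region labels whose total falls outside [low, high]."""
    sums = region_sums(labels, values)
    return sorted(label for label, total in sums.items() if not low <= total <= high)

# Hypothetical labels and populations for five areas.
labels = [0, 0, 1, 1, 2]
pop = [120000.0, 90000.0, 150000.0, 60000.0, 180000.0]

p_check = len(set(labels))
bad = violations(labels, pop, low=200000.0)
print(p_check, region_sums(labels, pop), bad)
```

With the notebook's data, the same totals can be obtained with `lacity.groupby("regionLabel")["pop2010"].sum()`.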
Next, we formulate a query with multiple constraints: aggregate the census tracts in the city of Los Angeles into regions with population >= 20000, minimum labor force per area <= 3000, and average employed population between 1000 and 4000, while minimizing the heterogeneity, measured by the number of households in each area.
Similar to the previous example, we just need to specify the attribute for each constraint and the corresponding range.
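Because only the constrained aggregates change between queries, the positional arguments can be assembled by a small helper that fills every unused constraint with an arbitrary attribute and an unbounded range. The sketch below is a hypothetical convenience, not part of Pyneapple: `build_constraint_args` is an invented name, and the argument order (MIN, MAX, AVG, SUM, then the COUNT range) is inferred from the calls in this notebook rather than from the library documentation.

```python
import math

# Aggregates that take an (attribute, low, high) triple, in the order
# inferred from the calls in this notebook.
AGGREGATES = ("MIN", "MAX", "AVG", "SUM")

def build_constraint_args(constraints, filler_attr, count_range=None, inf=math.inf):
    """Flatten {aggregate: (attr, low, high)} into a positional argument list.

    Unused aggregates get (filler_attr, -inf, inf); COUNT has no attribute.
    """
    args = []
    for agg in AGGREGATES:
        attr, low, high = constraints.get(agg, (filler_attr, -inf, inf))
        args.extend([attr, low, high])
    low, high = count_range if count_range is not None else (-inf, inf)
    args.extend([low, high])
    return args

# The multi-constraint query described above, with hypothetical use of the helper.
constraints = {
    "MIN": ("pop_16up", -math.inf, 3000.0),
    "AVG": ("employed", 1000.0, 4000.0),
    "SUM": ("pop2010", 20000.0, math.inf),
}
args = build_constraint_args(constraints, filler_attr="pop_16up")
print(args)
```

The resulting list could then be unpacked after the dissimilarity attribute, for example `emp.expressive_maxp(lacity, w, dis_attr, *args)`. The notebook takes its infinity from `java.lang.Double.POSITIVE_INFINITY`; that value can be supplied through the `inf` parameter to stay consistent with the cells below.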
min_attr = 'pop_16up'
min_high = 3000.0
avg_attr = 'employed'
avg_low = 1000.0
avg_high = 4000.0
sum_attr = 'pop2010'
sum_low = 20000.0
dis_attr = 'households'
non_attr = 'pop_16up'
inf = java.lang.Double.POSITIVE_INFINITY
p, regions = emp.expressive_maxp(lacity, w, dis_attr, min_attr, -inf, min_high, non_attr, -inf, inf, avg_attr, avg_low, avg_high, sum_attr, sum_low, inf, -inf, inf)
p
regions
lacity["regionLabel"] = regions
lacity.plot(column = "regionLabel", cmap="tab20", edgecolor="w", figsize = (12, 8))