To Dependency hell and back with Python

Posted by StuffonmyMind on March 19, 2021

Python is mostly awesome, but sometimes it gives me splitting migraines. One of those scenarios is when I get sucked into a rabbit hole of debugging and utter confusion, only to discover that a dependency of another dependency has changed its API a wee bit.

The first time it happened was with pandas, where numpy, a sub-dependency I don't use directly, is not pinned (https://github.com/pandas-dev/pandas/blob/00a622401e06bd7afaaa508707a46f3dcc494fe4/setup.cfg#L34). Installing pandas therefore upgraded numpy and broke a bunch of stuff on the staging servers, while everything worked perfectly fine locally because numpy was already installed in my environment. I had to test with docker-compose after wiping every single cache in existence.

The second time around, an API that writes a dataframe to Postgres started failing with this error in the log:

ImportError: cannot import name 'RowProxy' from 'sqlalchemy.engine'

This was because SQLAlchemy==1.4.4 had been installed by a sub-dependency 😞, so simply downgrading to SQLAlchemy==1.3.20 solved the whole thing. I initially started going around stripping brackets before wondering how it worked for me locally and checking the versions.
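A quick way to settle the "works locally, fails on the server" question is to print the installed versions in both environments and compare them. This is a minimal sketch, assuming Python 3.8+ for importlib.metadata; the helper name installed_versions and the package list are only illustrative and not part of any library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version (None if absent)."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    # Run this locally and on the server, then diff the two outputs.
    names = ["pandas", "numpy", "SQLAlchemy", "pyproj"]
    for name, ver in installed_versions(names).items():
        print(f"{name}=={ver}" if ver else f"{name} is not installed")
```

Any line that differs between the two outputs is a good first suspect.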

Now it happened a third time today, but thankfully I caught it in about 5 minutes, because I have started expecting this from Python every time I am hit with the old "But it works on my machine and not on the server!"

I have a utility function that calculates the area of a polygon. Its output populates several data sinks in the system, and people use it for analysis and data science work.

It is a very simple function that reprojects the polygon and returns the area in sq km:

import pyproj
import shapely.ops as ops
import geopandas as gpd
from functools import partial
from shapely.geometry import Polygon

world = gpd.read_file(
    gpd.datasets.get_path('naturalearth_lowres')
)
polygon = world['geometry'][1]

def polygon_area(geom):
    """Polygon area."""
    geom_area = ops.transform(
        partial(
            pyproj.transform,
            pyproj.Proj("EPSG:4326"),
            pyproj.Proj(
                proj="aea", 
                lat_1=geom.bounds[1], 
                lat_2=geom.bounds[3]
            ),
        ),
        geom,
    )
    # Get the area in km^2
    return geom_area.area / 1000000

polygon_area(polygon)

This returns 773852 sq km, which was very, very wrong, and the people doing analysis and running data science experiments with this data were quick to spot it.

The fixed function is the one below:

def polygon_area(geom):
    """Polygon area."""
    geom_area = ops.transform(
        partial(
            pyproj.transform,
            pyproj.Proj(init="EPSG:4326"),
            pyproj.Proj(proj="aea", lat_1=geom.bounds[1], lat_2=geom.bounds[3]),
        ),
        geom,
    )
    # Get the area in km^2
    return geom_area.area / 1000000
polygon_area(polygon)

This correctly returns 932590 sq km as the area.

For reference, I use geopandas==0.7.0, and pyproj is a sub-dependency of geopandas, so it gets upgraded automatically. In my environment it was pyproj==2.5.0.

All I did was add init= to the initialization of the source projection.


Without the init argument, the defaults switch to something else (judging by the pyproj page linked below, the axis order of EPSG:4326), and the polygons end up wildly different, as you can see from the notebook above.

Blink and you miss it. That is how dumb the fix was: just an init param. However, using the init parameter in the function above actually raised a FutureWarning that linked to https://pyproj4.github.io/pyproj/stable/gotchas.html#axis-order-changes-in-proj-6 and said it was going to be deprecated, so it looks like I should use CRS rather than Proj here.
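A rough sketch of what a CRS-based version might look like, for comparison. It assumes pyproj >= 2.2 (for Transformer and its always_xy option) and that the geometry stores coordinates as (lon, lat); the function name polygon_area_crs is made up for illustration, and this has not been checked against the numbers above.

```python
import pyproj
import shapely.ops as ops
from shapely.geometry import Polygon


def polygon_area_crs(geom):
    """Area of a lon/lat polygon in km^2 via an Albers equal-area projection."""
    # Same projection parameters as the original function.
    aea = pyproj.CRS(proj="aea", lat_1=geom.bounds[1], lat_2=geom.bounds[3])
    # always_xy=True makes the transformer take and return (x, y) = (lon, lat),
    # regardless of the axis order declared by the CRS definition.
    transformer = pyproj.Transformer.from_crs("EPSG:4326", aea, always_xy=True)
    projected = ops.transform(transformer.transform, geom)
    # Projected units are metres, so convert m^2 to km^2.
    return projected.area / 1_000_000


if __name__ == "__main__":
    # A 1 degree by 1 degree square at the equator, as a sanity check.
    square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
    print(polygon_area_crs(square))
```

Stating the axis order explicitly means the result no longer depends on how a particular pyproj version interprets "EPSG:4326".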

Note: Poetry manages these sub-dependencies pretty well and creates a .lock file to pin the working environment. It has been a lifesaver.