Describe the bug
Since v0.27.0, importing pandera permanently sets SPARK_LOCAL_IP=127.0.0.1 in os.environ. This was introduced in PR #2123, which removed the finally cleanup block from _set_pyspark_environment_variables() in pandera/external_config.py. Prior to v0.27.0, the env var was set temporarily to import pyspark.pandas and then cleaned up. Now it persists for the lifetime of the process.
This breaks any Spark setup where SPARK_LOCAL_IP should not be 127.0.0.1. For example, on standalone or pseudo-distributed Spark clusters, executors need to connect back to the actual driver IP. With the variable forced to 127.0.0.1, executors instead try to connect to localhost, resulting in Connection refused errors.
Error output
When starting a Spark session on a distributed cluster (that is not YARN managed) after importing pandera:
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException:
Connection refused: localhost/127.0.0.1:36987
Caused by: java.net.ConnectException: Connection refused
Code Sample, a copy-pastable example
```python
import os

print("SPARK_LOCAL_IP" in os.environ)  # False

from pandera.pandas import DataFrameSchema

print(os.environ.get("SPARK_LOCAL_IP"))  # 127.0.0.1
```
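Until this is fixed, a possible user-side workaround is to snapshot the variable before the import and restore it afterwards. The helper below is only a sketch: `import_preserving_env` is a hypothetical name, not part of pandera, and I have not verified whether clearing the variable after import affects pandera's pyspark.pandas support.

```python
import importlib
import os


def import_preserving_env(module_name, var="SPARK_LOCAL_IP"):
    """Import a module, then restore `var` to whatever it was before the import.

    Hypothetical helper for illustration; not a pandera API.
    """
    sentinel = object()
    previous = os.environ.get(var, sentinel)
    try:
        return importlib.import_module(module_name)
    finally:
        if previous is sentinel:
            # The variable was unset before the import, so remove any value
            # the import left behind.
            os.environ.pop(var, None)
        else:
            # The variable had a user-provided value; put it back.
            os.environ[var] = previous


# Intended usage (assumes pandera is installed), before creating the Spark session:
# pa = import_preserving_env("pandera.pandas")
```

The restore has to run before the Spark session is created, since that is when SPARK_LOCAL_IP is read.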
Expected behavior
Importing pandera should not permanently mutate the process environment. The pre-v0.27.0 behaviour (set temporarily, clean up in finally) was correct.
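To make the expected behaviour concrete, here is a minimal sketch of the "set temporarily, clean up in finally" pattern. It is not the actual code from pandera/external_config.py; the name `temporary_env_var` is hypothetical, and the choice to leave an already-set value untouched is my assumption about the desired semantics.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env_var(name, value):
    """Set `name` to `value` for the duration of the block if it is unset.

    Hypothetical helper for illustration; not pandera's implementation.
    """
    was_missing = name not in os.environ
    if was_missing:
        os.environ[name] = value
    try:
        yield
    finally:
        # Only undo what this helper set; a user-provided value is never touched.
        if was_missing:
            os.environ.pop(name, None)


# Intended usage inside the library (assumes pyspark is installed):
# with temporary_env_var("SPARK_LOCAL_IP", "127.0.0.1"):
#     import pyspark.pandas
```

With this shape, the variable exists only while the guarded import runs, and the process environment is left as the user configured it.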
Desktop:
- OS: Linux (also reproducible on macOS)
- Python version: 3.11
- Pandera version: 0.27.0+
- pyspark version: 3.x