SPARK_LOCAL_IP environment variable permanently set to 127.0.0.1 after import (regression in v0.27.0) #2344

@mabunassar

Describe the bug

Since v0.27.0, importing pandera permanently sets SPARK_LOCAL_IP=127.0.0.1 in os.environ. This was introduced in PR #2123, which removed the finally cleanup block from _set_pyspark_environment_variables() in pandera/external_config.py. Prior to v0.27.0, the env var was set temporarily to import pyspark.pandas and then cleaned up. Now it persists for the lifetime of the process.
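The "set temporarily, then clean up" behaviour described above can be sketched roughly as follows. This is a minimal illustration, not pandera's actual code: `temporary_env_var` is a hypothetical helper name, and the sketch assumes the variable is always overridden for the duration of the block.

```python
import os
from contextlib import contextmanager

_MISSING = object()  # sentinel: the variable was not set before we touched it


@contextmanager
def temporary_env_var(name, value):
    """Set os.environ[name] for the duration of the block, then restore it.

    Hypothetical helper for illustration; not part of pandera.
    """
    previous = os.environ.get(name, _MISSING)
    os.environ[name] = value
    try:
        yield
    finally:
        # The cleanup step: put the environment back exactly as it was.
        if previous is _MISSING:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


with temporary_env_var("SPARK_LOCAL_IP", "127.0.0.1"):
    pass  # the import that needs the variable would go here
```

Without the `finally` branch, the assignment would outlive the block, which matches the behaviour reported here.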

This breaks any Spark setup where SPARK_LOCAL_IP should not be 127.0.0.1, for example standalone or pseudo-distributed Spark clusters where executors need to connect back to the actual driver IP. Executors instead try to connect to localhost, resulting in Connection refused errors.

Error output

When starting a Spark session on a distributed cluster (that is not YARN managed) after importing pandera:

Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException:
   Connection refused: localhost/127.0.0.1:36987
Caused by: java.net.ConnectException: Connection refused

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the main branch of pandera.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

```python
import os

print("SPARK_LOCAL_IP" in os.environ)  # False

from pandera.pandas import DataFrameSchema

print(os.environ.get("SPARK_LOCAL_IP"))  # 127.0.0.1
```
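Until this is fixed, one possible user-side workaround is to snapshot the variable before the import and put it back afterwards. The sketch below is an assumption-laden illustration: `import_preserving_env` is a hypothetical helper, and it only helps if the Spark session is created after the restore.

```python
import importlib
import os


def import_preserving_env(module_name, env_names=("SPARK_LOCAL_IP",)):
    """Import a module, then restore the listed environment variables.

    Hypothetical helper for illustration; not part of pandera.
    """
    # None means "was not set before the import".
    saved = {name: os.environ.get(name) for name in env_names}
    try:
        return importlib.import_module(module_name)
    finally:
        for name, value in saved.items():
            if value is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = value
```

With pandera installed, this could be used as `pa = import_preserving_env("pandera.pandas")`, after which `SPARK_LOCAL_IP` should be in the same state as before the import.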

Expected behavior

Importing pandera should not permanently mutate the process environment. The pre-v0.27.0 behaviour (set temporarily, clean up in finally) was correct.
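The expected behaviour could be checked with a regression test along these lines. This is only a sketch: `env_changes_on_import` is a hypothetical helper that imports a module in a fresh interpreter and reports which environment variables differ afterwards.

```python
import json
import subprocess
import sys

# Runs in a child interpreter so the parent's environment is never touched.
_PROBE = """
import importlib, json, os, sys
before = dict(os.environ)
importlib.import_module(sys.argv[1])
after = dict(os.environ)
changed = {k: after.get(k)
           for k in set(before) | set(after)
           if before.get(k) != after.get(k)}
print(json.dumps(changed))
"""


def env_changes_on_import(module_name):
    """Return {variable: new value or None} for env vars changed by the import."""
    result = subprocess.run(
        [sys.executable, "-c", _PROBE, module_name],
        capture_output=True, text=True, check=True,
    )
    # The module may print during import; the probe's JSON is the last line.
    return json.loads(result.stdout.strip().splitlines()[-1])
```

Under the expected behaviour, `env_changes_on_import("pandera.pandas")` would not contain `SPARK_LOCAL_IP`; on affected versions, according to this report, it would.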

Desktop (please complete the following information):

  • OS: Linux (also reproducible on macOS)
  • Python version: 3.11
  • Pandera version: 0.27.0+
  • pyspark version: 3.x
