Category: apache-spark

In my Spark job, each executor initializes a C++ library that segments a text line into a word list. When two or more executors are allocated on the same machine, initializing the C++ library multiple times causes a core dump.

public class TextSegment implements Serializable {
    private SparkSession spark = SparkSession.builder()
        .appName("TextSegment")
        .getOrCreate();
    public ..

Read more
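The usual remedy for this class of problem is to make the native initialization idempotent per process. A minimal sketch in Python (hypothetical names — the question's code is Java, but the pattern is the same): a module-level guard ensures the library's init routine runs at most once per worker process, no matter how many tasks land on that machine.

```python
# Hypothetical sketch: guard native-library initialization so it happens at
# most once per worker process, even when several tasks share the machine.
_initialized = False

def init_segmenter_once(init_fn):
    """Run init_fn (e.g. the C++ library's init call) only on first use."""
    global _initialized
    if not _initialized:
        init_fn()
        _initialized = True

# In a PySpark job this would be called at the top of a mapPartitions
# function, so every partition checks the flag before touching the library.
```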

Error: AnalysisException: Path does not exist: file:/D:/sk – Add/Main/2020/Main_2012-01.txt

Code 1, with error:

for i in os.listdir():
    if i.endswith('.txt'):
        print(i)
        df = spark.read.text(i)

Code 2, with the same error:

path = r"D:\sk – Add\Main\2020"
for i in os.listdir():
    if i.endswith('.txt'):
        print(i)
        df = spark.read.text(path + "\\" + i)

Code 3, without error:

df1 = spark.read.text(r'D:\sk – Add\Main\2020\Main_2020-12.txt')

Why is it adding a file:/ prefix to my file name and causing ..

Read more
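The `file:/` prefix appears because Spark resolves a bare file name against the driver's current working directory, and `os.listdir()` called with no argument also lists that current directory rather than the data folder. A hedged sketch of the fixed listing step (the directory name is taken from the question; the Spark call stays as in the original):

```python
import os

def txt_files(directory):
    """Absolute paths of the .txt files directly under `directory`."""
    return [os.path.join(directory, name)
            for name in sorted(os.listdir(directory))  # listdir(directory), not listdir()
            if name.endswith(".txt")]

# Usage with the question's folder, one spark.read.text per file:
# for f in txt_files(r"D:\sk – Add\Main\2020"):
#     df = spark.read.text(f)   # absolute path, so no surprise cwd resolution
```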

I keep getting this error during installation of Spark on Windows 10: "'spark-shell' is not recognized as an internal or external command, operable program or batch file." I checked several previous questions and tried everything, but still have the same issue. I then tried installing the Java JRE and JDK, tried both (I am not sure if ..

Read more
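"Not recognized as an internal or external command" means cmd cannot resolve the executable through the PATH variable — typically because %SPARK_HOME%\bin was never added to it. A small diagnostic sketch (assumes nothing about the install location; the helper name is made up):

```python
import os
import shutil

def on_path(command):
    """True if `command` resolves to an executable via the PATH variable."""
    return shutil.which(command) is not None

# on_path("spark-shell") returning False means the Spark bin directory is
# missing from PATH; os.environ.get("SPARK_HOME") shows whether the
# variable itself was ever set.
```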

Jupyter Notebook was installed via Anaconda on Windows 10. I want to run PySpark directly from the notebook. For that, I used these two instructions:

set PYSPARK_DRIVER_PYTHON=jupyter
set PYSPARK_DRIVER_PYTHON_OPTS=notebook

But when I typed pyspark on the command line, the error message I got was: 'jupyter' is not recognized as an internal or external command. I couldn't ..

Read more
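The same two variables can also be set from inside Python before pyspark is launched, which sidesteps the cmd session entirely. A sketch, assuming jupyter is installed in the same Anaconda environment and resolvable on PATH:

```python
import os

# In-process equivalent of the two `set` commands from the question:
os.environ["PYSPARK_DRIVER_PYTHON"] = "jupyter"
os.environ["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
```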

I have not faced this problem with any other software on my system; I am able to install and run everything in Windows Terminal/Command Prompt and Git Bash. Recently, I started learning Spark. I installed Spark, setting everything up: JAVA_HOME, SCALA_HOME, the Hadoop winutils file. spark-shell and pyspark-shell both run perfectly in Command Prompt/Windows Terminal and in Jupyter through pyspark ..

Read more

Dear users and coders, PySpark 3.0.1 and Python 3.6. Error when running a simple standalone script:

Traceback (most recent call last):
  File "C:\Users\bru\eclipse-workspace\SPARK\test.py", line 8, in <module>
    sc = SparkContext(conf=conf)
  File "C:\Users\bru\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\context.py", line 133, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "C:\Users\bru\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\context.py", line 325, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "C:\Users\bru\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\java_gateway.py", line 105, in launch_gateway ..

Read more
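A failure inside launch_gateway usually means PySpark cannot start the Java gateway process, most often because Java (and, for a full install, Spark itself) cannot be located. A minimal diagnostic sketch to run before creating the SparkContext (the helper name is hypothetical):

```python
import os

def missing_spark_env():
    """Names of the commonly required env vars that are not set."""
    return [v for v in ("JAVA_HOME", "SPARK_HOME") if not os.environ.get(v)]

# An empty list means both variables are at least present; a non-empty list
# is the first thing to fix before retrying SparkContext(conf=conf).
```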

I am installing PySpark in Ubuntu WSL on Windows 10. These are the commands I used after installing WSL from the Microsoft Store:

# install Java runtime environment (JRE)
sudo apt-get install openjdk-8-jre-headless
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64/jre

# download Spark; visit https://spark.apache.org/downloads.html if you want a different version
wget https://apache.osuosl.org/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz

# untar and set a symlink
sudo tar -xvzf spark-2.4.7-bin-hadoop2.7.tgz -C ..

Read more