pyspark.SparkContext.addArchive

SparkContext.addArchive(path)
Add an archive to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.

To access the file in Spark jobs, use SparkFiles.get() with the filename to find its download/unpacked location. The given path should be one of .zip, .tar, .tar.gz, .tgz and .jar.

New in version 3.3.0.

Parameters
path : str
    can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use SparkFiles.get() to find its download location.
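For instance, a minimal sketch of the typical pattern. The archive name deps.zip and the file config.txt inside it are hypothetical, and an active SparkContext named sc is assumed:

>>> import os
>>> from pyspark import SparkFiles
>>> sc.addArchive("deps.zip")  # hypothetical archive containing config.txt
>>> def with_prefix(iterator):
...     # the archive is unpacked into the directory returned by SparkFiles.get()
...     unpacked = SparkFiles.get("deps.zip")
...     with open(os.path.join(unpacked, "config.txt")) as f:
...         prefix = f.readline().strip()
...     return [prefix + s for s in iterator]
>>> sc.parallelize(["a", "b"]).mapPartitions(with_prefix).collect()

Note that SparkFiles.get() returns the directory the archive was unpacked into on each node, not the archive itself; the full doctest in Examples below shows the same pattern end to end.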
 
Notes

A path can be added only once. Subsequent additions of the same path are ignored.

This API is experimental.

Examples

Create a zip file that contains a text file with '100' written in it.

>>> import os
>>> import tempfile
>>> import zipfile
>>> from pyspark import SparkFiles

>>> with tempfile.TemporaryDirectory(prefix="addArchive") as d:
...     path = os.path.join(d, "test.txt")
...     with open(path, "w") as f:
...         _ = f.write("100")
...
...     zip_path1 = os.path.join(d, "test1.zip")
...     with zipfile.ZipFile(zip_path1, "w", zipfile.ZIP_DEFLATED) as z:
...         z.write(path, os.path.basename(path))
...
...     zip_path2 = os.path.join(d, "test2.zip")
...     with zipfile.ZipFile(zip_path2, "w", zipfile.ZIP_DEFLATED) as z:
...         z.write(path, os.path.basename(path))
...
...     sc.addArchive(zip_path1)
...     arch_list1 = sorted(sc.listArchives)
...
...     sc.addArchive(zip_path2)
...     arch_list2 = sorted(sc.listArchives)
...
...     # adding zip_path2 a second time is ignored
...     sc.addArchive(zip_path2)
...     arch_list3 = sorted(sc.listArchives)
...
...     def func(iterator):
...         with open("%s/test.txt" % SparkFiles.get("test1.zip")) as f:
...             mul = int(f.readline())
...         return [x * mul for x in iterator]
...
...     collected = sc.parallelize([1, 2, 3, 4]).mapPartitions(func).collect()

>>> arch_list1
['file:/.../test1.zip']
>>> arch_list2
['file:/.../test1.zip', 'file:/.../test2.zip']
>>> arch_list3
['file:/.../test1.zip', 'file:/.../test2.zip']
>>> collected
[100, 200, 300, 400]