pyspark.sql.functions.from_xml

pyspark.sql.functions.from_xml(col, schema, options=None)

Parses a column containing an XML string into a row with the specified schema. Returns null if the string cannot be parsed.

New in version 4.0.0.

Parameters

col (Column or str): a column, or the name of a column, containing XML strings
schema (StructType, Column or str): a StructType, a Column, or a Python string literal with a DDL-formatted string to use when parsing the XML column
options (dict, optional): options to control parsing. Accepts the same options as the XML data source. See Data Source Option for the version you use.

Returns

Column: a new column of complex type parsed from the given XML string.

Examples

Example 1: Parsing XML with a DDL-formatted string schema

>>> import pyspark.sql.functions as sf
>>> data = [(1, '''<p><a>1</a></p>''')]
>>> df = spark.createDataFrame(data, ("key", "value"))
... # Define the schema using a DDL-formatted string
>>> schema = "STRUCT<a: BIGINT>"
... # Parse the XML column using the DDL-formatted schema
>>> df.select(sf.from_xml(df.value, schema).alias("xml")).collect()
[Row(xml=Row(a=1))]

Example 2: Parsing XML with a StructType schema

>>> import pyspark.sql.functions as sf
>>> from pyspark.sql.types import StructType, LongType
>>> data = [(1, '''<p><a>1</a></p>''')]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> schema = StructType().add("a", LongType())
>>> df.select(sf.from_xml(df.value, schema)).show()
+---------------+
|from_xml(value)|
+---------------+
|            {1}|
+---------------+

Example 3: Parsing XML with ArrayType in schema

>>> import pyspark.sql.functions as sf
>>> data = [(1, '<p><a>1</a><a>2</a></p>')]
>>> df = spark.createDataFrame(data, ("key", "value"))
... # Define the schema with an Array type
>>> schema = "STRUCT<a: ARRAY<BIGINT>>"
... # Parse the XML column using the schema with an Array
>>> df.select(sf.from_xml(df.value, schema).alias("xml")).collect()
[Row(xml=Row(a=[1, 2]))]

Example 4: Parsing XML using pyspark.sql.functions.schema_of_xml()

>>> import pyspark.sql.functions as sf
>>> # Sample data with an XML column
... data = [(1, '<p><a>1</a><a>2</a></p>')]
>>> df = spark.createDataFrame(data, ("key", "value"))
... # Generate the schema from an example XML value
>>> schema = sf.schema_of_xml(sf.lit(data[0][1]))
... # Parse the XML column using the generated schema
>>> df.select(sf.from_xml(df.value, schema).alias("xml")).collect()
[Row(xml=Row(a=[1, 2]))]