Class QuantileDiscretizer
All Implemented Interfaces:
Serializable, org.apache.spark.internal.Logging, QuantileDiscretizerBase, Params, HasHandleInvalid, HasInputCol, HasInputCols, HasOutputCol, HasOutputCols, HasRelativeError, DefaultParamsWritable, Identifiable, MLWritable
QuantileDiscretizer takes a column with continuous features and outputs a column with binned
categorical features. The number of bins can be set using the numBuckets parameter. The number
of buckets actually used may be smaller than this value, for example, if there are too few
distinct values of the input to create enough distinct quantiles.

Since 2.3.0, QuantileDiscretizer can map multiple columns at once by setting the inputCols
parameter. If both the inputCol and inputCols parameters are set, an Exception will be thrown.
To specify the number of buckets for each column, set the numBucketsArray parameter; if the
number of buckets should be the same across columns, numBuckets can be set as a convenience.
Note that in the multiple-column case, the relative error is applied to all columns.
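A minimal usage sketch in Scala (the SparkSession setup, data, and column names below are illustrative assumptions, not part of this API):

  import org.apache.spark.ml.feature.QuantileDiscretizer
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("quantile-discretizer-sketch").getOrCreate()

  // Illustrative data: two continuous columns.
  val df = spark.createDataFrame(Seq(
    (1.0, 10.0), (2.5, 20.0), (3.7, 30.0), (8.1, 40.0), (9.9, 50.0)
  )).toDF("hour", "temp")

  // Multi-column mode (2.3.0+): one bucket count per input column via numBucketsArray.
  val discretizer = new QuantileDiscretizer()
    .setInputCols(Array("hour", "temp"))
    .setOutputCols(Array("hourBin", "tempBin"))
    .setNumBucketsArray(Array(3, 2))

  // fit() learns quantile-based splits; transform() appends the binned columns.
  val binned = discretizer.fit(df).transform(df)
  binned.show()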
 
NaN handling:
null and NaN values are ignored during QuantileDiscretizer fitting. Fitting produces a
Bucketizer model for making predictions. During the transformation, Bucketizer raises an error
when it finds NaN values in the dataset, but the user can also choose to either keep or remove
NaN values in the dataset by setting handleInvalid. If the user chooses to keep NaN values,
they are handled specially and placed into their own bucket: for example, if 4 buckets are
used, non-NaN data is put into buckets[0-3], while NaNs are counted in a special bucket[4].
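A short Scala sketch of the "keep" behavior, reusing the spark session from the sketch above (data and names are illustrative; the exact split points depend on the fitted quantiles):

  // Illustrative single-column data containing a NaN.
  val withNaN = spark.createDataFrame(
    Seq(1.0, 2.0, 3.0, 5.0, 8.0, 13.0, Double.NaN).map(Tuple1(_))
  ).toDF("feature")

  val keeper = new QuantileDiscretizer()
    .setInputCol("feature")
    .setOutputCol("bucket")
    .setNumBuckets(4)
    .setHandleInvalid("keep")   // default is "error"; "skip" drops NaN rows instead

  // NaN is ignored while fitting, then placed in its own bucket,
  // one past the last regular bucket.
  keeper.fit(withNaN).transform(withNaN).show()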
 
 Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for
 org.apache.spark.sql.DataFrameStatFunctions.approxQuantile
 for a detailed description). The precision of the approximation can be controlled with the
 relativeError parameter. The lower and upper bin bounds will be -Infinity and +Infinity,
 covering all real values.
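For intuition, the same approximate-quantile machinery is exposed directly on DataFrames; a sketch reusing the illustrative df from the first example (the probabilities and error value are arbitrary):

  // A smaller relativeError yields more precise cut points at higher cost;
  // 0.0 requests exact quantiles.
  val cuts = df.stat.approxQuantile("hour", Array(0.25, 0.5, 0.75), 0.001)
  println(cuts.mkString(", "))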
Nested Class Summary

Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging: org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter
Constructor Summary
- QuantileDiscretizer()

Method Summary
- QuantileDiscretizer copy(ParamMap extra) - Creates a copy of this instance with the same UID and some extra params.
- Bucketizer fit(Dataset<?> dataset) - Fits a model to the input data.
- Param<String> handleInvalid() - Param for how to handle invalid entries.
- Param<String> inputCol() - Param for input column name.
- final StringArrayParam inputCols() - Param for input column names.
- static QuantileDiscretizer load(String path)
- static org.apache.spark.internal.Logging.LogStringContext LogStringContext(scala.StringContext sc)
- IntParam numBuckets() - Number of buckets (quantiles, or categories) into which data points are grouped.
- IntArrayParam numBucketsArray() - Array of number of buckets (quantiles, or categories) into which data points are grouped.
- static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()
- static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)
- Param<String> outputCol() - Param for output column name.
- final StringArrayParam outputCols() - Param for output column names.
- static MLReader<T> read()
- final DoubleParam relativeError() - Param for the relative target precision for the approximate quantile algorithm.
- QuantileDiscretizer setHandleInvalid(String value)
- QuantileDiscretizer setInputCol(String value)
- QuantileDiscretizer setInputCols(String[] value)
- QuantileDiscretizer setNumBuckets(int value)
- QuantileDiscretizer setNumBucketsArray(int[] value)
- QuantileDiscretizer setOutputCol(String value)
- QuantileDiscretizer setOutputCols(String[] value)
- QuantileDiscretizer setRelativeError(double value)
- StructType transformSchema(StructType schema) - Check transform validity and derive the output schema from the input schema.
- String uid() - An immutable unique ID for the object and its derivatives.

Methods inherited from class org.apache.spark.ml.PipelineStage: params
Methods inherited from class java.lang.Object: equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.spark.ml.util.DefaultParamsWritable: write
Methods inherited from interface org.apache.spark.ml.param.shared.HasHandleInvalid: getHandleInvalid
Methods inherited from interface org.apache.spark.ml.param.shared.HasInputCol: getInputCol
Methods inherited from interface org.apache.spark.ml.param.shared.HasInputCols: getInputCols
Methods inherited from interface org.apache.spark.ml.param.shared.HasOutputCol: getOutputCol
Methods inherited from interface org.apache.spark.ml.param.shared.HasOutputCols: getOutputCols
Methods inherited from interface org.apache.spark.ml.param.shared.HasRelativeError: getRelativeError
Methods inherited from interface org.apache.spark.ml.util.Identifiable: toString
Methods inherited from interface org.apache.spark.internal.Logging: initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext
Methods inherited from interface org.apache.spark.ml.util.MLWritable: save
Methods inherited from interface org.apache.spark.ml.param.Params: clear, copyValues, defaultCopy, defaultParamMap, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, onParamChange, paramMap, params, set, set, set, setDefault, setDefault, shouldOwn
Methods inherited from interface org.apache.spark.ml.feature.QuantileDiscretizerBase: getNumBuckets, getNumBucketsArray
Constructor Details
- QuantileDiscretizer
public QuantileDiscretizer()
Method Details
- load
public static QuantileDiscretizer load(String path)
- read
public static MLReader<T> read()
- org$apache$spark$internal$Logging$$log_
public static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()
- org$apache$spark$internal$Logging$$log__$eq
public static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)
- LogStringContext
public static org.apache.spark.internal.Logging.LogStringContext LogStringContext(scala.StringContext sc)
- numBuckets
public IntParam numBuckets()
Description copied from interface: QuantileDiscretizerBase
Number of buckets (quantiles, or categories) into which data points are grouped. Must be greater than or equal to 2.
See also QuantileDiscretizerBase.handleInvalid(), which can optionally create an additional bucket for NaN values.
Default: 2
- Specified by:
- numBuckets in interface QuantileDiscretizerBase
- Returns:
- (undocumented)
- numBucketsArray
public IntArrayParam numBucketsArray()
Description copied from interface: QuantileDiscretizerBase
Array of number of buckets (quantiles, or categories) into which data points are grouped. Each value must be greater than or equal to 2.
See also QuantileDiscretizerBase.handleInvalid(), which can optionally create an additional bucket for NaN values.
- Specified by:
- numBucketsArray in interface QuantileDiscretizerBase
- Returns:
- (undocumented)
- handleInvalid
public Param<String> handleInvalid()
Description copied from interface: QuantileDiscretizerBase
Param for how to handle invalid entries. Options are 'skip' (filter out rows with invalid values), 'error' (throw an error), or 'keep' (keep invalid values in a special additional bucket). Note that in the multiple-column case, invalid handling is applied to all columns: with 'error', an error is thrown if invalid values are found in any column; with 'skip', a row is skipped if it contains invalid values in any column; and so on.
Default: "error"
- Specified by:
- handleInvalid in interface HasHandleInvalid
- Specified by:
- handleInvalid in interface QuantileDiscretizerBase
- Returns:
- (undocumented)
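A sketch of these multi-column semantics in Scala, reusing the illustrative df and spark session from the class description (with 'skip', a NaN in either input column would drop the whole row):

  val skipper = new QuantileDiscretizer()
    .setInputCols(Array("hour", "temp"))
    .setOutputCols(Array("hourBin", "tempBin"))
    .setNumBuckets(3)           // one shared bucket count applied to every column
    .setHandleInvalid("skip")   // rows with an invalid value in any column are skipped

  skipper.fit(df).transform(df).show()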
- relativeError
public final DoubleParam relativeError()
Description copied from interface: HasRelativeError
Param for the relative target precision for the approximate quantile algorithm. Must be in the range [0, 1].
- Specified by:
- relativeError in interface HasRelativeError
- Returns:
- (undocumented)
- outputCols
public final StringArrayParam outputCols()
Description copied from interface: HasOutputCols
Param for output column names.
- Specified by:
- outputCols in interface HasOutputCols
- Returns:
- (undocumented)
- inputCols
public final StringArrayParam inputCols()
Description copied from interface: HasInputCols
Param for input column names.
- Specified by:
- inputCols in interface HasInputCols
- Returns:
- (undocumented)
- outputCol
public Param<String> outputCol()
Description copied from interface: HasOutputCol
Param for output column name.
- Specified by:
- outputCol in interface HasOutputCol
- Returns:
- (undocumented)
- inputCol
public Param<String> inputCol()
Description copied from interface: HasInputCol
Param for input column name.
- Specified by:
- inputCol in interface HasInputCol
- Returns:
- (undocumented)
- uid
public String uid()
Description copied from interface: Identifiable
An immutable unique ID for the object and its derivatives.
- Specified by:
- uid in interface Identifiable
- Returns:
- (undocumented)
- setRelativeError
public QuantileDiscretizer setRelativeError(double value)
- setNumBuckets
public QuantileDiscretizer setNumBuckets(int value)
- setInputCol
public QuantileDiscretizer setInputCol(String value)
- setOutputCol
public QuantileDiscretizer setOutputCol(String value)
- setHandleInvalid
public QuantileDiscretizer setHandleInvalid(String value)
- setNumBucketsArray
public QuantileDiscretizer setNumBucketsArray(int[] value)
- setInputCols
public QuantileDiscretizer setInputCols(String[] value)
- setOutputCols
public QuantileDiscretizer setOutputCols(String[] value)
- transformSchema
public StructType transformSchema(StructType schema)
Description copied from class: PipelineStage
Check transform validity and derive the output schema from the input schema.
We check validity for interactions between parameters during transformSchema and raise an exception if any parameter value is invalid. Parameter value checks which do not depend on other parameters are handled by Param.validate().
A typical implementation should first verify schema changes and parameter validity, including complex parameter interaction checks.
- Specified by:
- transformSchema in class PipelineStage
- Parameters:
- schema - (undocumented)
- Returns:
- (undocumented)
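A Scala sketch of this schema-only validation (the column names are illustrative):

  import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

  val qd = new QuantileDiscretizer().setInputCol("x").setOutputCol("xBin")
  val inSchema = StructType(Seq(StructField("x", DoubleType)))
  // Validates the parameters against the schema without touching any data,
  // then returns the input schema extended with the output column.
  val outSchema = qd.transformSchema(inSchema)
  println(outSchema.treeString)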
- fit
public Bucketizer fit(Dataset<?> dataset)
Description copied from class: Estimator
Fits a model to the input data.
- Specified by:
- fit in class Estimator<Bucketizer>
- Parameters:
- dataset - (undocumented)
- Returns:
- (undocumented)
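The returned Bucketizer's learned splits can be inspected; a sketch reusing the multi-column discretizer and df from the class description:

  // One split array per input column; each starts at -Infinity and ends at +Infinity.
  val bucketizer = discretizer.fit(df)
  bucketizer.getSplitsArray.foreach(splits => println(splits.mkString(", ")))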
- copy
public QuantileDiscretizer copy(ParamMap extra)
Description copied from interface: Params
Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See defaultCopy().
- Specified by:
- copy in interface Params
- Specified by:
- copy in class Estimator<Bucketizer>
- Parameters:
- extra - (undocumented)
- Returns:
- (undocumented)