pyspark.RDD.countApprox
RDD.countApprox(timeout, confidence=0.95)
Approximate version of count() that returns a potentially incomplete result within a timeout, even if not all tasks have finished.

New in version 1.2.0.

Parameters
timeout : int
    maximum time to wait for the job, in milliseconds
confidence : float
    the desired statistical confidence in the result
Returns

int
    a potentially incomplete result, with error bounds
See also

RDD.count()

Examples

>>> rdd = sc.parallelize(range(1000), 10)
>>> rdd.countApprox(1000, 1.0)
1000
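The timeout semantics above can be sketched without a Spark cluster: the job is split into per-partition tasks, and whatever tasks finish before the deadline contribute to the returned count. The following pure-Python analogue is illustrative only; `count_approx`, `count_partition`, and the partition setup are hypothetical names, not part of the PySpark API, and the confidence-interval machinery is omitted.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

# Stand-in for an RDD's partitions: 10 chunks of 100 elements each,
# mirroring sc.parallelize(range(1000), 10) from the doctest above.
partitions = [range(100) for _ in range(10)]

def count_partition(part, delay=0.0):
    # Simulate per-task work; counting the partition is the real job.
    time.sleep(delay)
    return len(part)

def count_approx(parts, timeout_ms):
    """Sum the counts from tasks that finish within timeout_ms,
    mimicking countApprox's 'potentially incomplete result' behavior."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(count_partition, p) for p in parts]
        # wait() takes seconds; countApprox's timeout is in milliseconds.
        done, _not_done = wait(futures, timeout=timeout_ms / 1000.0)
        return sum(f.result() for f in done)

print(count_approx(partitions, timeout_ms=1000))  # all tasks finish: 1000
```

If some tasks were still running at the deadline (e.g. with a nonzero `delay`), the sum would cover only the completed partitions, which is exactly the "potentially incomplete result" the Returns section describes.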