Hello Friends,
In this video I demonstrate how correct use of the repartition() function in Apache Spark can reduce processing time by more than 95%.
If we repartition() the data before running join or aggregation queries, it reduces the amount of shuffle read / write, so processing finishes much faster.
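As a rough illustration of the idea, here is a minimal PySpark sketch that repartitions on the grouping column before aggregating. It is not the code from the video: the helper names (aggregate_with_repartition, run_demo), the CSV input format, and the default of 200 partitions are assumptions made for the example.

```python
def aggregate_with_repartition(df, key_col, value_col, num_partitions):
    """Repartition df on key_col, then sum value_col for each key.

    Rows that share a key land in the same partition first, so the
    aggregation that follows has less data to move between executors.
    """
    repartitioned = df.repartition(num_partitions, key_col)
    return repartitioned.groupBy(key_col).sum(value_col)


def run_demo(path, key_col, value_col, num_partitions=200):
    """Hypothetical driver: read a CSV file and run the aggregation.

    path may be an hdfs:// or file:// location; the file layout and the
    column names are assumptions, not details taken from the demo data.
    """
    # Imported here so the helper above can be read without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("repartition-demo").getOrCreate()
    df = spark.read.csv(path, header=True, inferSchema=True)
    result = aggregate_with_repartition(df, key_col, value_col, num_partitions)
    result.show()
    spark.stop()
```

To try it, call run_demo() with your own file path and column names. Comparing the shuffle read / write figures in the Spark UI with and without the repartition() step is the simplest way to check the effect on your own data.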
Also, by increasing the number of partitions, we make each aggregation task smaller and more manageable for the processor, which further reduces the processing time.
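To see why more partitions means smaller tasks, here is a small plain-Python sketch (no Spark needed) that spreads one million integer keys over 4 and then 200 partitions. The row count and the plain modulo assignment are assumptions for illustration; Spark hashes the key before assigning a partition, and real data is rarely this evenly spread.

```python
from collections import Counter


def rows_per_partition(keys, num_partitions):
    """Count how many rows each partition would receive."""
    # Simplified stand-in for Spark's hash partitioning: key modulo partitions.
    counts = Counter(key % num_partitions for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


for n in (4, 200):
    sizes = rows_per_partition(range(1_000_000), n)
    print(f"{n} partitions -> largest task handles {max(sizes)} rows")
# 4 partitions -> largest task handles 250000 rows
# 200 partitions -> largest task handles 5000 rows
```

Each task works on one partition, so the largest partition sets how much a single task has to hold and process at once.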
The data file used in the demo can be downloaded from our website https://k2analytics.co.in under the Resource tab. Within Resources, the file is in the Complimentary Resources section. If you are running the code on a standalone cluster, you may have to change the HDFS file path to a local file system path.
Thanks.