Summary:
In this paper, we present a distributed classification technique for big data that makes efficient use of the distributed storage architecture and data processing units of a cluster. When handling such large data, existing approaches rely on specific data partitioning techniques that require the complete dataset to be processed before partitioning. This incurs excessive computation and data communication overhead. The proposed method does not require any pre-structured data partitioning technique and is also adaptable to big data mining tools. We hypothesize that effectively aggregating the information generated from data partitions by subprocesses of the complete learning process can yield accurate predictions while reducing the overall time complexity. We build three SVM-based classifiers: one-phase voting SVM (1PVSVM), two-phase voting SVM (2PVSVM), and similarity-based SVM (SIMSVM). Each classifier uses support vectors as the local information from which the synthesized learner is constructed, reducing the training time and keeping communication between processing units minimal. An extensive empirical analysis on several benchmark datasets demonstrates the effectiveness of our classifiers compared with existing approaches. Among the existing methods and our three proposed methods (1PVSVM, 2PVSVM, and SIMSVM), SIMSVM is the most efficient. On the MNIST dataset, SIMSVM achieves an average speedup ratio of 0.78 and a minimum scalability of 73% when the data size is scaled up to 10 times. It also retains high accuracy (99%), similar to centralized approaches.
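Illustrative sketch (not taken from the paper): the general idea of training one classifier per data partition and combining their predictions by majority vote can be shown in a few lines of plain Python. This is not the authors' 1PVSVM, 2PVSVM, or SIMSVM. The toy data, the function names (train_linear_svm, predict, vote_predict), the fixed learning rate, and the simple hinge-loss subgradient trainer are all assumptions made for illustration. The sketch votes over per-partition models and does not exchange support vectors.

```python
# Hypothetical sketch: per-partition linear SVMs combined by majority vote.
# All names, data, and hyperparameters are assumptions for illustration.

def train_linear_svm(points, labels, eta=0.1, lam=0.01, epochs=20):
    """Subgradient descent on the regularized hinge loss; labels are +1 or -1."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # L2 regularization step: shrink the weights slightly.
            w = [(1.0 - eta * lam) * wi for wi in w]
            if margin < 1.0:
                # Margin violated: move the hyperplane toward this sample.
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b

def predict(model, x):
    """Return +1 or -1 from a single (w, b) model."""
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

def vote_predict(models, x):
    """Majority vote over the per-partition models."""
    return 1 if sum(predict(m, x) for m in models) >= 0 else -1

# Toy, linearly separable data, interleaved so every partition sees both classes.
positives = [(2, 2), (3, 2), (2, 3), (3, 3), (4, 2), (2, 4)]
points, labels = [], []
for px, py in positives:
    points += [(float(px), float(py)), (float(-px), float(-py))]
    labels += [1, -1]

# Contiguous partitions stand in for data blocks stored on different nodes.
num_partitions = 3
size = len(points) // num_partitions
models = [
    train_linear_svm(points[i * size:(i + 1) * size], labels[i * size:(i + 1) * size])
    for i in range(num_partitions)
]

print(vote_predict(models, (4.0, 4.0)), vote_predict(models, (-4.0, -4.0)))
```

In a real cluster each call to train_linear_svm would run on the node that holds the partition, so only the small (w, b) models, rather than the raw data, would need to be communicated.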
Publication Type: Journal Article
Publication Date: September 5th, 2022
Publisher: IEEE
Author(s): Amrit Pal; Abishi Chowdhury; Satakshi; Husnu S. Narman; Arkabandhu Chowdhury; Manish Kumar