Benavides, Sean Lester C.; Masapol, Cid Antonio F.

An enhancement of Jiang, Z., et al.,’s compression-based classification algorithm applied to news article categorization - Undergraduate Thesis: (Bachelor of Science in Computer Science) - Pamantasan ng Lungsod ng Maynila, 2025

ABSTRACT: This study enhances the compression-based classification algorithm proposed by Jiang et al., applied specifically to news article categorization, by improving its classification accuracy and computational efficiency. The original algorithm faces three challenges: limited resilience against stopwords, limited ability to detect semantic similarities, and inefficient classification due to the computational cost of k-nearest neighbors (k-NN). Online news sites, for example, may miscategorize articles, reducing their searchability and hindering users’ ability to find relevant content. To address these issues, the study applies preprocessing techniques such as stopword removal and unigram extraction to refine feature selection and reduce redundancy. The gzip compression method is optimized to detect textual patterns more efficiently, improving classification performance. Additionally, the k-NN algorithm is replaced with Approximate Nearest Neighbors Oh Yeah (ANNOY), which improves scalability and reduces execution time. Experiments conducted on multiple datasets show classification accuracy increasing by an average of 6.47% and processing time decreasing by an average of 2.45% on large datasets. These results highlight gzip’s effectiveness as a lightweight, training-free method for text classification. The proposed enhancements offer a practical and computationally efficient approach to news article categorization, particularly in resource-constrained environments.
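
The pipeline the abstract describes can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the normalized compression distance (NCD) used in Jiang et al.'s gzip classifier, a small hypothetical stopword list, whitespace tokenization for unigrams, and an exact k-NN vote. The ANNOY replacement is not shown, because the abstract does not say how texts are turned into the fixed-length vectors that an approximate index requires. The names `STOPWORDS`, `preprocess`, `clen`, `ncd`, and `classify` are illustrative.

```python
import gzip
from collections import Counter

# Hypothetical stopword list for illustration only; the thesis's list is not given.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "for"}


def preprocess(text):
    """Lowercase, split on whitespace into unigrams, and drop stopwords."""
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS)


def clen(text):
    """Length in bytes of the gzip-compressed UTF-8 text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x, y):
    """Normalized compression distance between two strings."""
    cx, cy = clen(x), clen(y)
    cxy = clen(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(query, train, k=3):
    """Majority vote among the k training texts closest to the query by NCD.

    `train` is a list of (text, label) pairs. This is an exact search,
    standing in for the approximate index described in the abstract.
    """
    q = preprocess(query)
    ranked = sorted(train, key=lambda item: ncd(q, preprocess(item[0])))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Calling `classify(article_text, labeled_articles, k=3)` returns the majority label among the three nearest training articles. Each query compresses every training text once more, which is the cost the thesis aims to reduce by replacing exact k-NN.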

QA76.9 A43 B46 2025