000 02194nam a22001817a 4500
003 FT8935
005 20260112133338.0
050 _aQA76.9 A43 B46 2025
100 1 _aBenavides, Sean Lester C.; Masapol, Cid Antonio F.
245 _aAn enhancement of Jiang, Z., et al.,’s compression-based classification algorithm applied to news article categorization
264 1 _cc2025
300 _bUndergraduate Thesis: (Bachelor of Science in Computer Science) - Pamantasan ng Lungsod ng Maynila, 2025
336 _2text
_atext
_btext
337 _2 unmediated
_a unmediated
_b unmediated
338 _2volume
_avolume
_bvolume
505 _aABSTRACT: This study enhances the compression-based classification algorithm proposed by Jiang et al. for news article categorization, improving both classification accuracy and computational efficiency. The original algorithm is sensitive to stopwords, has limited ability to detect semantic similarities, and classifies inefficiently because of the computational cost of k-nearest neighbors (k-NN). Online news sites, for example, may miscategorize articles, which reduces their searchability and hinders users’ ability to find relevant content. To address these issues, the study applies preprocessing techniques such as stopword removal and unigram extraction to refine feature selection and reduce redundancy. The gzip compression method is optimized to detect textual patterns more efficiently, improving classification performance. Additionally, the k-NN algorithm is replaced with Approximate Nearest Neighbors Oh Yeah (ANNOY), enhancing scalability and reducing execution time. Experiments conducted on multiple datasets show classification accuracy increasing by an average of 6.47% and processing time decreasing by an average of 2.45% on large datasets. These results highlight gzip’s effectiveness as a lightweight, training-free method for text classification. The proposed enhancements offer a practical and computationally efficient approach to news article categorization, particularly in resource-constrained environments.
942 _2lcc
_cMS
999 _c37422
_d37422