An enhancement of the gibberish classification algorithm for detecting gibberish content in text document / Jasmine P. Laurente, Carla Johnica D. Quilop.
By: Jasmine P. Laurente, Carla Johnica D. Quilop.
Date: March 2016.
Description: 28 cm. 92 pp.
Content type: text
Media type: unmediated
Carrier type: volume

| Item type | Current location | Home library | Collection | Call number | Status | Date due | Barcode | Item holds |
|---|---|---|---|---|---|---|---|---|
| Book | PLM | PLM Filipiniana Section | Filipiniana-Thesis | T QA76.9.L38.2016 | Available | | FT6071 | |
Thesis: (BSCS major in Computer Science) - Pamantasan ng Lungsod ng Maynila, 2016.
ABSTRACT: The Gibberish Classification Algorithm aims to detect whether a text is valid or was typed randomly on a keyboard. It returns a percentage: a low value indicates valid text, and a high value indicates gibberish text. A result below 50% means the text is likely valid; a result above 50% means the text is likely gibberish. The algorithm is optimized for the English language and for longer texts. It still works on shorter texts, such as a single sentence, but the result is less accurate. The algorithm never returns a percentage lower than 1%, except when the input string is null or empty, in which case it returns 0%. The algorithm checks three things. First, it checks whether the amount of unique characters in a chunk of 35 characters is within the usual range. Second, it checks whether the amount of vowels among the letters is within the usual range. Third, it checks whether the word/char ratio is within the usual range. The final percentage is computed from these three checks.

The researchers' purpose is to improve the Gibberish Classification Algorithm so that its final percentage more accurately reflects whether the text is gibberish. The enhanced algorithm bases its result on the correct spelling and structure of words in the English language rather than on the ranges of unique characters, vowels, and word/char ratio. Because the Gibberish Classification Algorithm is still at an early stage, it still returns some incorrect values: it sometimes produces a high percentage for a clearly valid sentence and, conversely, a low percentage for gibberish input. While studying the existing algorithm, the researchers encountered three problems. First, correctly spelled words are considered gibberish: 15 of the 25 sample valid inputs were evaluated as gibberish (60% incorrect results). Second, the algorithm returns a valid percentage for sentences that use numerous punctuation marks: 17 of the 25 sample invalid inputs were evaluated as valid (68% incorrect results). Third, words that mix uppercase and lowercase letters are considered valid.

To improve the existing algorithm, the researchers propose the following solutions. First, reduce the 60% incorrect results for correctly spelled words to 10% by adding computations for words and sentences. Second, reduce the 68% incorrect results regarding punctuation to 18% by adding a computation for punctuation marks. Third, check the case of the letters in each word and consider a word gibberish if mixed cases are detected. Improving the Gibberish Classification Algorithm will be a great help to people, especially English proficiency teachers and anyone who wants to detect gibberish content in their documents.
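As an illustration only, the three measurements the abstract attributes to the existing algorithm, plus the proposed letter-case check, could be sketched in Python as follows. The abstract does not give the "usual ranges" or the formula that combines the checks into a percentage, so this sketch computes only the raw measurements. All function names are hypothetical, and reading "amount of vowels" as a ratio and "different cases" as irregular capitalization are assumptions, not the thesis's definitions.

```python
VOWELS = set("aeiou")
CHUNK_SIZE = 35  # chunk length stated in the abstract


def unique_chars_per_chunk(text, chunk_size=CHUNK_SIZE):
    """Average number of distinct characters per chunk of chunk_size characters."""
    if not text:
        return 0.0
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return sum(len(set(chunk)) for chunk in chunks) / len(chunks)


def vowel_ratio(text):
    """Fraction of letters that are vowels (assumed reading of 'amount of vowels')."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c in VOWELS for c in letters) / len(letters)


def word_char_ratio(text):
    """Number of whitespace-separated words divided by the total character count."""
    if not text:
        return 0.0
    return len(text.split()) / len(text)


def has_mixed_case(word):
    """Assumed case check: True unless the word's letters are all lowercase,
    all uppercase, or capitalized (first letter upper, rest lower)."""
    letters = "".join(c for c in word if c.isalpha())
    if not letters:
        return False
    capitalized = letters[0].isupper() and letters[1:].islower()
    return not (letters.islower() or letters.isupper() or capitalized)


if __name__ == "__main__":
    sample = "The quick brown fox jumps over the lazy dog."
    print(unique_chars_per_chunk(sample))
    print(vowel_ratio(sample))
    print(word_char_ratio(sample))
    print([w for w in "This is a tEsT sentence".split() if has_mixed_case(w)])
```

A full classifier would still need the range thresholds and the scoring rule, which only the thesis itself or the original algorithm's source can supply.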