Effective Feature Selection for Mining Text Data with Side-Information
Abstract— Many text documents carry side-information. Web documents, for example, often come with meta-data describing attributes such as the source or other details of the document's origin; location, ownership, and temporal information can likewise be treated as side-information. This large body of information can be used for text clustering, but it cuts both ways: informative side-information improves the quality of the representation used for mining and hence of the clustering, whereas noisy side-information degrades it, which makes mining with side-information risky. The existing system uses the Gini index as a feature selection method to filter the informative side-information from text documents. It is effective to a certain extent, but the number of remaining features is still large, and feature selection is needed to handle the high dimensionality of the data for effective text categorization. To improve document clustering and classification accuracy while reducing the number of selected features, the proposed system introduces a novel feature selection method called Effective Feature Selection (EFS), which aims to improve the accuracy and purity of document clustering with lower time complexity. EFS is a three-stage procedure comprising feature subset selection, feature ranking, and feature re-ranking.
Index Terms— Effective Feature Selection (EFS), feature subset selection, feature ranking, feature re-ranking, side-information.
International Journal for Trends in Technology & Engineering © 2015 IJTET JOURNAL