Abstract—Information discovery from semi-structured XML documents can be improved by using a high-performance classifier. One commonly used way to improve the performance of a document classifier is to accurately recognize and select the key features of the documents. The aim of this project is to present and implement a new mechanism for discovering the main features of semi-structured XML documents, so that the performance of a document classifier built on these feature subsets can be improved. Our suggested solution is to focus on document preprocessing techniques, especially feature weighting methods. Having studied the literature on feature weighting, we propose two novel methods for this purpose, named LBTF (Location-Based Term Frequency) and TFCRF (Term Frequency and Category Relevancy Factor). In the LBTF method, which is designed for semi-structured documents, the weight of a feature in a document is a function of its location (structural information) as well as its frequency (content information) in the document. The TFCRF method, designed for semi-structured and unstructured documents, considers the distribution of a feature across categories in addition to its distribution across documents, by taking into account the number of documents in each class.

To evaluate the proposed methods, we designed and implemented a semi-structured document classification system with two subsystems: a preprocessor and a classifier. Our experiments on the INEX document collection show a significant improvement in the performance measures of the SVM classifier (about 5%-9%) when the TFCRF and LBTF feature weighting methods are used instead of the other standard feature weighting methods we implemented, such as TF-based and IDF-based methods. Furthermore, the performance of the classifier based on our proposed feature weighting methods is not affected by the number of documents and features, and can be maximized with a small number of features.

Keywords—Feature weighting, Information discovery, Semi-structured document classification, Text mining, XML.