Introduction to Text Classification

Yutaka Sasaki
Toyota Technological Institute


Text classification (or text categorization) has been actively investigated since the 1980s. From the viewpoint of text mining, it can be regarded as a preprocessing technology that filters out irrelevant documents from a large-scale corpus. The task of text classification is to assign documents to predefined topics, such as economics, politics, and sport. Owing to the drastic increase in online textual information, e.g., email messages, online news, and web pages, as well as the large number of online scientific abstracts in resources such as MEDLINE, there is an ever-growing demand for text classification. An interesting question is how to achieve high performance when assigning multiple topics to documents in a targeted domain, and how to make the most of the multi-topical features of those documents. The task of classifying each document into multiple topics is called multi-topic text classification.

For example, automatically assigning multiple clinical codes to clinical free text is a typical multi-topic text classification problem. The Medical NLP Challenge 2007 [Pestian et al., 2007], which was organized by the Computational Medicine Center (CMC), targeted the task of assigning the ICD-9-CM clinical codes to clinical free text in clinical records.

Brief History

The origin of text classification goes back to the early 1960s [Sebastiani, 2002]. In the late 1990s, machine learning techniques were successfully applied to text classification. Support Vector Machines were applied to text classification in [Joachims, 1998; Dumais et al., 1998]. Maximum Entropy Models were also applied in [Nigam et al., 1999].

Multi-label classification of multi-topic text has been investigated in recent years. AdaBoost was extended to handle multiple labels in [Schapire and Singer, 2000]. In this approach, the task of assigning multiple topics to a text is regarded as a ranking of labels for the text. This ranking-based evaluation was inspired by Information Retrieval. In a text classification problem, however, we need a definite set of topics for each document rather than a ranking of topic candidates. [McCallum, 1999] proposed using the EM algorithm to train a mixture model for multi-label text. Parametric Mixture Models (PMM) were also proposed in [Ueda and Saito, 2002]. Maximum Entropy Models were extended to multi-labelled MEMs (MLME) in [Zhu et al., 2005].
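
The difference between a ranking of labels and a definite set of topics can be illustrated with a minimal Python sketch. The topic scores, the threshold value, and the function names (rank_topics, select_topics) are assumptions made up for this illustration. Thresholding is only one simple way to obtain a topic set, and it is not claimed to be the method of any of the papers cited above.

```python
# Illustrative sketch only: scores, threshold, and function names are hypothetical.

def rank_topics(scores):
    """Return the topics ordered from highest to lowest score (a ranking)."""
    return sorted(scores, key=lambda topic: scores[topic], reverse=True)


def select_topics(scores, threshold=0.5):
    """Return the definite set of topics whose score reaches the threshold."""
    return {topic for topic, score in scores.items() if score >= threshold}


# Made-up per-topic scores for a single document.
scores = {"economics": 0.81, "politics": 0.64, "sport": 0.07}

ranking = rank_topics(scores)                    # an ordering of all candidate topics
assigned = select_topics(scores, threshold=0.5)  # a definite topic set

print(ranking)
print(sorted(assigned))
```

A ranking always lists every candidate topic, whereas the selected set commits to a specific subset of topics for the document, which is what a text classification system has to output.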

Data sets

  • Reuters-21578 (The ModApte split is the standard training/test separation.)
  • 20 Newsgroups
  • CMC Medical NLP Challenge 2007 data

Tutorial slides

  • MIB Doctor Training Course slides

My publications

    Yutaka Sasaki, Brian Rea, and Sophia Ananiadou, Multi-Topic Aspects in Clinical Text Classification, IEEE International Conference on Bioinformatics and Biomedicine 2007 (IEEE BIBM-07), Silicon Valley, Nov. 2-7, pp. 62-67, 2007.

    Yutaka Sasaki, Brian Rea, and Sophia Ananiadou, Clinical Text Classification under the Open and Closed Topic Assumptions, International Journal on Data Mining and Bioinformatics (IJDMB), Inderscience Publishers, 2009. (to appear)

Relevant publications

    [Apte et al., 1994] Apte, C., Damerau, F.J. and Weiss, S.M. (1994) 'Automated Learning of Decision Rules for Text Categorization', ACM Transactions on Information Systems, Vol. 12, No. 3, pp. 233-251.

    [Aronson et al., 2007] Aronson, A.R., Bodenreider, O., Demner-Fushman, D., Fung, K.W., Lee, V.K., Mork, J.G., Neveol, A., Peters, L. and Rogers, W.J. (2007) 'From indexing the biomedical literature to coding clinical text: experience with MTI and machine learning approaches', ACL-07 Workshop on Biological, Translational, and Clinical Language Processing, June 2007, Prague, Czech Republic, pp. 105-112.

    [Berger et al., 1996] Berger, A.L., Della Pietra, S.A. and Della Pietra, V.J. (1996) 'A Maximum Entropy Approach to Natural Language Processing', Computational Linguistics, Vol. 22, No. 1, pp. 39-71.

    [Boutell et al., 2003] Boutell, M., Shen, X., Luo, J. and Brown, C. (2003) 'Multi-label Semantic Scene Classification', Technical Report 813, Department of Computer Science, University of Rochester.

    [Cortes and Vapnik, 1995] Cortes, C. and Vapnik, V.N. (1995) 'Support Vector Networks', Machine Learning, Vol. 20, pp. 273-297.

    [Crammer and Singer, 2001] Crammer, K. and Singer, Y. (2001) 'On the Algorithmic Implementation of Multi-class SVMs', Journal of Machine Learning Research, Vol. 2, pp. 265-292.

    [Crammer et al., 2007] Crammer, K., Dredze, M., Ganchev, K., Talukdar, P.P. and Carroll, S. (2007) 'Automatic Code Assignment to Medical Text', ACL-07 Workshop on Biological, Translational, and Clinical Language Processing, June 2007, Prague, Czech Republic, pp. 129-136.

    [Debole and Sebastiani, 2005] Debole, F. and Sebastiani, F. (2005) 'An Analysis of the Relative Hardness of Reuters-21578 Subsets', Journal of the American Society for Information Science and Technology, Vol. 56, pp. 584-596.

    [Dumais et al., 1998] Dumais, S.T., Platt, J., Heckerman, D. and Sahami, M. (1998) 'Inductive Learning Algorithms and Representations for Text Categorization', Proc. CIKM '98, pp. 148-155.

    [Frantzi and Ananiadou, 1999] Frantzi, K. and Ananiadou, S. (1999) 'The C-value / NC-value Domain Independent Method for Multi-word Term Extraction', Journal of Natural Language Processing, Vol. 6, No. 3, pp. 145-179.

    [Gower and Legendre, 1986] Gower, J.C. and Legendre, P. (1986) 'Metric and Euclidean Properties of Dissimilarity Coefficients', Journal of Classification, Vol. 3, pp. 5-48.

    [Joachims, 1998] Joachims, T. (1998) 'Text Categorization with Support Vector Machines: Learning with Many Relevant Features', Proc. of 10th European Conference on Machine Learning (ECML-98), pp. 137-142.

    [Kazama and Tsujii, 2005] Kazama, J. and Tsujii, J. (2005) 'Maximum Entropy Models with Inequality Constraints: A Case Study on Text Categorization', Machine Learning, Vol. 60, No. 1-3, pp. 169-194.

    [Lang, 1995] Lang, K. (1995) 'Newsweeder: Learning to Filter Netnews', Proc. 12th International Conference on Machine Learning (ICML '95), pp. 331-339.

    [McCallum, 1999] McCallum, A. (1999) 'Multi-label Text Classification with a Mixture Model Trained by EM', AAAI-99 Workshop on Text Learning, pp. 1-7.

    [Nigam et al., 1999] Nigam, K., Lafferty, J. and McCallum, A. (1999) 'Using Maximum Entropy for Text Classification', IJCAI-99 Workshop on Machine Learning for Information Filtering, pp. 61-67.

    [Patrick et al., 2007] Patrick, J., Zhang, Y. and Wang, Y. (2007) 'Developing Feature Types for Classifying Clinical Notes', ACL-07 Workshop on Biological, Translational, and Clinical Language Processing, June 2007, Prague, Czech Republic, pp. 191-192.

    [Pestian et al., 2007] Pestian, J.P., Brew, C., Matykiewicz, P., Hovermale, D.J., Johnson, N., Cohen, K.B. and Duch, W. (2007) 'A Shared Task Involving Multi-label Classification of Clinical Free Text', in Proc. of ACL-2007 Workshop on BioNLP, pp. 97-104.

    [Platt et al., 1999] Platt, J.C., Cristianini, N. and Shawe-Taylor, J. (1999) 'Large Margin DAGs for Multiclass Classification', in Proc. of NIPS-1999, pp. 547-553.

    [Sebastiani, 2002] Sebastiani, F. (2002) 'Machine Learning in Automated Text Categorization', ACM Computing Surveys, Vol. 34, No. 1, pp. 1-47.

    [Schapire and Singer, 2000] Schapire, R.E. and Singer, Y. (2000) 'BoosTexter: A Boosting-based System for Text Categorization', Machine Learning, Springer, Vol. 39, pp. 135-168.

    [Ueda and Saito, 2002] Ueda, N. and Saito, K. (2002) 'Parametric Mixture Models for Multi-Labeled Text', Advances in Neural Information Processing Systems 15, MIT Press, pp. 737-744.

    [Vapnik, 1995] Vapnik, V.N. (1995) The Nature of Statistical Learning Theory, Springer.

    [Yang, 1997] Yang, Y. (1997) 'An Evaluation of Statistical Approaches to Text Categorization', Technical Report CMU-CS-97-127, Computer Science Department, Carnegie Mellon University.

    [Yang and Liu, 1999] Yang, Y. and Liu, X. (1999) 'A Re-examination of Text Categorization Methods', Proc. the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99), pp. 42-49.

    [Zhu et al., 2005] Zhu, S., Ji, X., Xu, W. and Gong, Y. (2005) 'Multi-labelled Classification Using Maximum Entropy Method', Proc. the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR-05), pp. 274-281.

    (C)2009 Yutaka Sasaki
    Last update: 15 Dec 2009