Introduction to Biomedical Named Entity Recognition

Yutaka Sasaki
Toyota Technological Institute

Introduction

Research on Named Entity Recognition (NER) became active in the 1980s. At that time, NER mainly targeted proper nouns, such as person and company names, and numerical expressions, such as dates and percentages. Since the 1990s, the targeted categories have widened to include technical terms, such as gene and protein names.

NER techniques are important because many names are ambiguous, and they must be disambiguated before the content of a text can be handled at the semantic level. For example, "Washington" can be either a person name or a location name. NER tools assign a named entity category to each such ambiguous name: NER identifies "Washington" as a person name in "We obtained new data from Mr. Washington." Without this step, a computerized system cannot decide whether "Washington" in the text should be treated as a contact point or as a source point of the action. NER for biomedical terms is much harder because many terms coincide with general English words (e.g., "cat" is a protein name) and with names in other named entity categories (e.g., "ER" is a protein name).
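The "Washington" example can be made concrete with a toy sketch. The cue lists and the function name `classify_name` below are hypothetical and chosen only for illustration. Real NER systems, including those in the publications listed on this page, learn such contextual cues statistically rather than from hand-written rules.

```python
# Hypothetical cue lists, chosen only for illustration.
PERSON_TITLES = {"Mr.", "Mrs.", "Ms.", "Dr.", "Prof."}
LOCATION_PREPOSITIONS = {"in", "at", "to", "near"}


def classify_name(sentence: str, name: str) -> str:
    """Guess the category of `name` from the token just before it."""
    tokens = sentence.split()
    for i, token in enumerate(tokens):
        # Strip surrounding punctuation so "Washington." still matches.
        if token.strip('.,;:!?"') != name:
            continue
        previous = tokens[i - 1] if i > 0 else ""
        if previous in PERSON_TITLES:
            return "PERSON"
        if previous.lower() in LOCATION_PREPOSITIONS:
            return "LOCATION"
    # No usable cue was found next to any occurrence of the name.
    return "UNKNOWN"


print(classify_name("We obtained new data from Mr. Washington.", "Washington"))
print(classify_name("We obtained new data in Washington.", "Washington"))
```

A single preceding token is a very weak signal. The sketch only shows that the category is decided from context, not from the name itself.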

Data sets

  • JNLPBA-2004 Shared Task data
  • BioCreative data
  • GeneRIF
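Shared-task corpora for biomedical NER are commonly annotated with one tag per token in the IOB2 scheme (`B-` begins an entity, `I-` continues it, `O` is outside any entity). The following is a minimal sketch, assuming that layout, of how tagged tokens can be turned back into entity mentions. The function name `decode_iob2` and the sample sentence are illustrative assumptions and are not taken from any of the data sets above.

```python
def decode_iob2(tagged_tokens):
    """Collect (entity_text, category) pairs from (token, tag) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in tagged_tokens:
        if tag.startswith("B-"):
            # A new entity starts; close the one in progress, if any.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity in progress.
            current_tokens.append(token)
        else:
            # "O", or an "I-" tag that does not continue the current entity:
            # close the entity in progress and treat this token as outside.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Illustrative sentence with made-up annotations.
sample = [
    ("IL-2", "B-DNA"), ("gene", "I-DNA"), ("expression", "O"),
    ("requires", "O"), ("NF-kappa", "B-protein"), ("B", "I-protein"),
]
print(decode_iob2(sample))
```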

My publications

Yutaka Sasaki, Yoshimasa Tsuruoka, John McNaught, and Sophia Ananiadou, How to make the most of NE dictionaries in statistical NER, BMC Bioinformatics, 9(Suppl 11):S5, 2008.

Relevant publications

  • Lafferty J, McCallum A, Pereira F: Conditional random fields: probabilistic models for segmenting and labelling sequence data. Proceedings of the Eighteenth International Conference on Machine Learning (ICML-2001) 2001, 282-289.
  • Baum LE, Petrie T: Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics 1966, 37:1554-1563.
  • McCallum A, Freitag D, Pereira F: Maximum entropy Markov models for information extraction and segmentation. Proceedings of the Seventeenth International Conference on Machine Learning 2000, 591-598.
  • Kim J-D, Ohta T, Tsuruoka Y, Tateisi Y: Introduction to the Bio-Entity Recognition Task at JNLPBA. Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-2004) 2004, 70-75.
  • Tjong Kim Sang EF, Veenstra J: Representing Text Chunks. Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics (EACL-99); Bergen, June 8-12, 1999, 173-179.
  • Franzen K, Eriksson G, Olsson F, Asker L, Liden P, Koster J: Protein names and how to find them. International Journal of Medical Informatics 2002, 67:49-61.
  • Fukuda K, Tsunoda T, Tamura A, Takagi T: Toward information extraction: identifying protein names from biological papers. Proceedings of Pacific Symposium on Biocomputing 1998, 705-716.
  • Collier N, Nobata C, Tsujii J: Extracting the Names of Genes and Gene Products with a Hidden Markov Model. Proceedings of the 18th International Conference on Computational Linguistics (COLING'2000); Saarbrucken 2000, 201-207.
  • Kazama J, Makino T, Ohta Y, Tsujii J: Tuning Support Vector Machines for Biomedical Named Entity Recognition. Proceedings of ACL-2002 Workshop on Natural Language Processing in the Biomedical Domain 2002, 1-8.
  • Lee KJ, Hwang YS, Rim HC: Two-Phase Biomedical NE Recognition based on SVMs. Proceedings of ACL 2003 Workshop on Natural Language Processing in Biomedicine; Sapporo 2003, 33-40.
  • Okanohara D, Miyao Y, Tsuruoka Y, Tsujii J: Improving the Scalability of Semi-Markov Conditional Random Fields for Named Entity Recognition. Proceedings of the Forty-fourth Annual Meeting of the Association for Computational Linguistics (ACL-2006); Sydney 2006, 465-472.
  • Tanabe L, Wilbur WJ: Tagging Gene and Protein Names in Biomedical Text. Bioinformatics 2002, 18:1124-1132.
  • Tsuruoka Y: GENIA Tagger 3.0. [http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger]
  • Yamamoto K, Kudo T, Konagaya A, Matsumoto Y: Use of morphological analysis in protein name recognition. Journal of Biomedical Informatics 2004, 471-482.
  • Tsai TH, Sung CL, Dai HJ, Hung HC, Sung TY, Hsu WL: NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition. BMC Bioinformatics 2006, 7(Suppl 5):S11.
  • Zhou GD, Su J: Exploring Deep Knowledge Resources in Biomedical Name Recognition. Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-2004) 2004, 96-99.
  • Kim S, Yoon J: Experimental Study on a Two Phase Method for Biomedical Named Entity Recognition. IEICE Transactions on Information and Systems 2007, E90-D(7):1103-1120.
  • Finkel J, Dingare S, Nguyen H, Nissim M, Sinclair G, Manning C: Exploiting context for biomedical entity recognition: from syntax to the Web. Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-2004) 2004, 88-91.
  • Settles B: Biomedical Named Entity Recognition Using Conditional Random Fields and Novel Feature Sets. Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-2004) 2004, 104-107.
  • Rössler M: Adapting an NER-System for German to the Biomedical Domain. Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-2004) 2004, 92-95.
  • Park K-M, Kim S-H, Lee D-G, Rim H-C: Boosting Lexical Knowledge for Biomedical Named Entity Recognition. Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-2004) 2004, 76-79.
  • Song Y, Kim E, Lee GG, Yi B: POSBIOTM-NER in the shared task of BioNLP/NLPBA. Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-2004) 2004, 100-103.

(C) 2009 Yutaka Sasaki
Last update: 15 Dec 2009