Word and sentence boundaries in automatic text processing



Natural Language Processing, Computational linguistics, Preprocessing, Tokenization, Text segmentation


This paper explores the main linguistic challenges in preprocessing a corpus of theses and dissertations from the Oil and Gas domain. Besides raising questions specific to this domain and to scientific texts, we measured the extent to which these issues hinder automatic processing. We built a gold-standard corpus for tokenization and sentence segmentation that includes several difficult cases and is now available to the Portuguese NLP community. This corpus can be used to evaluate automatic tokenization methods and to improve the quality of subsequent processing steps.
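As an illustration of how a gold-standard tokenization can be used to evaluate an automatic tokenizer, the following minimal sketch computes precision, recall, and F1 over token spans. It is not the authors' evaluation procedure: the function names, the span-matching criterion, and the example sentence are assumptions made for this sketch, and the example is invented rather than taken from the corpus.

```python
def token_spans(tokens):
    """Map tokens to (start, end) offsets over the whitespace-free text."""
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def tokenization_prf(gold_tokens, pred_tokens):
    """Return (precision, recall, F1) of predicted tokens against gold tokens.

    A predicted token counts as correct only if its character span
    matches a gold token span exactly (an assumed, strict criterion).
    """
    if "".join(gold_tokens) != "".join(pred_tokens):
        raise ValueError("gold and predicted tokens must cover the same characters")
    gold, pred = token_spans(gold_tokens), token_spans(pred_tokens)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Hypothetical example: the tokenizer wrongly splits a decimal number.
gold = ["The", "well", "produced", "3.5", "bbl/d", "."]
pred = ["The", "well", "produced", "3", ".", "5", "bbl/d", "."]
precision, recall, f1 = tokenization_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Comparing character spans rather than token positions keeps the two sequences aligned after a single over-split or under-split, so one error does not cascade through the rest of the sentence.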







How to Cite

Cavalcanti, T., Silveira, A., de Souza, E., & Freitas, C. (2021). Word and sentence boundaries in automatic text processing. Revista Brasileira De Iniciação Científica, 8, e021033. Retrieved from https://periodicoscientificos.itp.ifsp.edu.br/index.php/rbic/article/view/348