TAN keywords for types of token definitions
Definitive list of key terms used to name standard token definitions.

http://viaf.org/viaf/299582703 tag:textalign.net,2015:agent:kalvesmaki:joel Joel Kalvesmaki

Started file.
Revised to suit new <token-definition>.
Added U+200B ZERO WIDTH SPACE to token definitions.

letters; letters only; general word characters only; general ignore punctuation; gwo
General tokenization pattern for any language, words only. Non-letters such as punctuation are ignored.
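
As a rough illustration of the words-only behavior, the sketch below uses Python's `re` module. The pattern `\w+` and the function name `tokenize_words_only` are assumptions made for this example, not the pattern stored in the TAN definition. Python's `\w` also matches digits and the underscore, so its notion of a word character may differ from the one the definition relies on.

```python
import re

# Assumed pattern: one or more word characters.
# Note: Python's \w also covers digits and "_", which may not match TAN's pattern.
WORDS_ONLY = re.compile(r"\w+")


def tokenize_words_only(text: str) -> list[str]:
    """Return runs of word characters; punctuation and spaces are dropped."""
    return WORDS_ONLY.findall(text)


print(tokenize_words_only("Well, that's all: done."))
# → ['Well', 'that', 's', 'all', 'done']
```

The apostrophe in "that's" is treated like any other punctuation mark here, so the word is split in two.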

letters and hyphens
General tokenization pattern for any language that keeps only word characters (as defined in Unicode) and the hyphen. All other characters are ignored.

letters and apostrophes
General tokenization pattern for any language that keeps only word characters (as defined in Unicode) and the apostrophe variants ' and ’. All other characters are ignored. Note that this pattern will produce misleading results for texts that use single quotation marks, because those marks are the same characters as the apostrophes and stay attached to the tokens.

letters hyphens and apostrophes; letters apostrophes and hyphens; letters, hyphens and apostrophes; letters, apostrophes and hyphens; letters, hyphens, and apostrophes; letters, apostrophes, and hyphens
General tokenization pattern for any language that keeps only word characters (as defined in Unicode), the hyphen, and the apostrophe variants ' and ’. All other characters are ignored. Note that this pattern will produce misleading results for texts that use single quotation marks, because those marks are the same characters as the apostrophes and stay attached to the tokens.
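
The hyphen and apostrophe variants above can be compared side by side with a small Python sketch. The character classes, the `PATTERNS` dictionary, and the `tokenize` helper are assumptions for illustration only. The hyphen is taken to be the ASCII hyphen-minus, and the real TAN patterns may include other characters.

```python
import re

# Assumed character classes, for illustration only.
# "-" is the ASCII hyphen-minus; ' is U+0027 and \u2019 is the curly apostrophe.
PATTERNS = {
    "letters and hyphens": re.compile(r"[\w\-]+"),
    "letters and apostrophes": re.compile(r"[\w'\u2019]+"),
    "letters, hyphens, and apostrophes": re.compile(r"[\w\-'\u2019]+"),
}


def tokenize(text: str, name: str) -> list[str]:
    """Tokenize text with the assumed pattern registered under the given name."""
    return PATTERNS[name].findall(text)


sample = "the well-known 'rule' isn't new"
for name in PATTERNS:
    print(name, tokenize(sample, name))
# letters and hyphens ['the', 'well-known', 'rule', 'isn', 't', 'new']
# letters and apostrophes ['the', 'well', 'known', "'rule'", "isn't", 'new']
# letters, hyphens, and apostrophes ['the', 'well-known', "'rule'", "isn't", 'new']
```

The token `'rule'` shows the caveat about single quotation marks: the quotes are indistinguishable from apostrophes, so they remain part of the token.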

letters and punctuation; general non space characters; general include punctuation
General tokenization pattern for any language that treats each series of letters as a word token and each individual non-letter character (e.g., a punctuation mark) as a token of its own.
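
A minimal Python sketch of this behavior follows. It assumes that spaces are never tokens (suggested by the name "general non space characters") and uses a hypothetical pattern, `\w+|[^\w\s]`, which is not necessarily the one stored in the TAN definition.

```python
import re

# Assumed pattern: a run of word characters, OR one single character
# that is neither a word character nor whitespace.
WORDS_AND_MARKS = re.compile(r"\w+|[^\w\s]")


def tokenize_with_punctuation(text: str) -> list[str]:
    """Return word runs plus each punctuation character as its own token."""
    return WORDS_AND_MARKS.findall(text)


print(tokenize_with_punctuation("Wait... really?!"))
# → ['Wait', '.', '.', '.', 'really', '?', '!']
```

Each dot of the ellipsis becomes a separate token, since non-letter characters are taken one at a time.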

nonspace
General tokenization pattern for any language, treating any contiguous run of nonspace marks as a word.
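
For comparison, a nonspace tokenizer can be sketched in Python as below. The pattern `\S+` and the function name `tokenize_nonspace` are assumptions for this example. What counts as a space in the actual TAN definition may differ from Python's `\s`.

```python
import re

# Assumed pattern: one or more characters that Python does not treat as whitespace.
NONSPACE = re.compile(r"\S+")


def tokenize_nonspace(text: str) -> list[str]:
    """Return contiguous runs of non-whitespace characters."""
    return NONSPACE.findall(text)


print(tokenize_nonspace("Wait... really?!  Yes."))
# → ['Wait...', 'really?!', 'Yes.']
```

Punctuation stays attached to the neighboring letters, so this pattern yields the fewest and longest tokens of the definitions listed here.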