Guide to the Text Alignment Network, Version 2021

Text Alignment Network: Official Guidelines
2015-present, Joel Kalvesmaki, kalvesmaki@gmail.com

All software, code, and dependencies (e.g., applications, functions, schemas, utilities, vocabularies) are released under a GNU General Public License, https://opensource.org/licenses/GPL-3.0. All other materials (such as this document), unless otherwise specified, are licensed under a Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0/

Latest stable version: http://textalign.net/release/TAN-2021/guidelines/
Development version: https://github.com/textalign/TAN-2021/tree/dev
Version 2021 (alpha), 2021-09-07

In case of contradictions, apparent or not, between these guidelines and the core TAN files, priority should be given first to the RELAX-NG schemas (compact syntax), then to the functions, and finally to these guidelines.

General overview

Introduction

Overview

The Text Alignment Network (TAN) is a framework that allows users, working independently and collaboratively, to share, find, create, edit, and explore digital texts and annotations. A customized extension of Text Encoding Initiative (TEI) XML, TAN is particularly suited for organizing and aligning texts with multiple versions (copies, translations, paraphrases), and for creating and editing text annotations such as quotations, translation clusters (word-to-word), and linguistic features. The foundation of TAN is a suite of XML formats, each designed for a specific task. The extensive validation routines maximize the syntactic and semantic interoperability of texts, annotations, and language resources. TAN comes with applications and utilities that open new frontiers in scholarly publishing, research, and teaching.

Why use TAN?

- Extensive error checking. Built-in TAN validation rules go well beyond the customary error-checking performed by other formats. Files linked in the network "talk" to each other, to let users know about changes and updates. More than one hundred types of content-based errors are checked. Through Schematron Quick Fixes, many of the problems can be corrected in a matter of seconds.
- Time-saving utilities. Enjoy enhanced editing functions in Oxygen XML Editor's Author mode. Highly customizable TAN utilities help you create, edit, and maintain TEI and TAN files. For example:
  - Body Builder: write rules to convert plain text or Word docx files into a preferred TAN/TEI structure and markup.
  - Body Remodeler: incrementally restructure a text to imitate an existing TAN/TEI file. In conjunction with Oxygen Author tools, this utility can save hours of labor in creating a collection of many versions of the same work.
  - Body Sync: update a TAN/TEI file so its transcription exactly matches that of another TAN/TEI file.
  - TAN-A-lm Builder: generate lexico-morphological data for a TAN/TEI file.
- Pathbreaking applications. Core TAN applications, written in XSLT, provide cutting-edge tools for textual research and analysis. For example:
  - Diff+: identify, analyze, and visualize text differences between any number of versions of a text.
  - Parabola: juxtapose in a single interactive HTML page all the versions of a work, along with annotations.
  - Tangram: identify quotations, paraphrases, and common text between two groups of texts.
- Intuitive text referencing. Unlike TEI, HTML, or other markup systems that rely heavily upon arbitrary identifiers that can be difficult to navigate and maintain, TAN points to text portions using familiar reference systems, or user-customized tokenization rules.
- Application development. TAN is built upon an extensive and robust XSLT function library, one of the few of its kind. Do you already use Natural Language Toolkit, Classical Language Toolkit, or comparable packages in programming languages to develop tools for textual and linguistic research? Do you have to process, analyze, and transform texts that are in tree structures? With more than 250 public functions, covering a range of tasks, from numerics to maps, checksums to tree manipulation, the TAN function library might have everything you need, and more, and help you stay within an XML environment. Many TAN functions are extremely useful, even outside of TEI or TAN.
- Semantic Web. TAN was designed at the outset to ensure that texts and their annotations would be rooted in the practices of the Semantic Web. Unlike many other formats, whose attribute values are almost always only human-readable, most TAN file components are tied to URIs, making them suitable for use in Semantic Web applications.

Rationale and purpose

Scholars frequently work with numerous versions of texts. Sometimes the original version has been lost, or survives only fragmentarily, and can be studied only through later translations, paraphrases, or quotations. Even when an original survives, its later versions are often worth study, revealing as they do something of how words, concepts, and works were preserved, altered, or combined by generations and cultures who created, read, and circulated the versions. Such textual comparison requires texts whose words, sentences, paragraphs, and other segments are aligned.

Such alignment can be challenging. Some versions might be defective, or follow an idiosyncratic sequence. One editor may have divided the text according to a system not easily applied to other versions. Identifying which words or phrases in a translation and its original correspond might result in complex, overlapping spans. And even larger segments such as sentences and paragraphs may not line up well. Further, every version of a text is part of a much larger, complex history of text reuse, and a complete study of that context requires engagement with other works and other languages, and collaboration across projects and fields of study.

Text Alignment Network (TAN) XML facilitates the exchange of multiple versions of texts and annotations on those texts. TAN syntax is suitable for humans to read and edit, expressive enough to allow scholars to register doubt and nuance, and sufficiently structured to permit complex computer-based queries across independent datasets. TAN is not a single format, but rather a suite of formats, one task per format. Because nearly all TAN data must be expressed in a way that computers can parse, the information can be used in Semantic Web applications (see ).

TAN has been designed to support two kinds of scholarly activity: creation and research.
When we create our primary sources or analyze them, we normally want what we create to be useful to our colleagues. TAN was designed to assist scholarly creative activities such as:

- Creating and sharing a transcription of a particular version of a textual work so that it is more likely to align with any other TAN version of that text created by someone else;
- Creating an index of quotations that is semantically rich and can be applied to any other version of the quoting or quoted works;
- Specifying exactly (e.g., word-for-word) where a source and its translation correspond, even with overlapping or ambiguous relationships, or where doubt or alternative possibilities of alignment need to be expressed;
- Listing the grammatical features of every word in a text or a language in a way that allows it to be compared easily against other languages and texts.

Shared TAN files form a decentralized, interoperable corpus of texts, a kind of Internet of primary sources and annotations. As this TAN-compliant corpus spreads into different linguistic, chronological, and geographical regions, third-party tools and applications can expand the repertoire of research questions beyond any single corpus, to help scholars fruitfully investigate broader, comparative questions such as:

- For classical Greek texts, how were words with the root -ιστημι ("stand") translated into ancient Latin? In what specific ways did the vocabulary of technical terms shift from pre-Christian translations into later, Christian ones?
- How does the reformed Chinese translation technique of Sanskrit Buddhist texts, attested by Dao An (312-385 CE), compare to reforms in the seventh and eighth centuries of Syriac translations of Greek texts?
- How do Arabic translations of Greek texts from the Abbasid period differ from contemporaneous translations from Sanskrit into Arabic?
- Can an anonymous English translation of a modern French novel be identified with known translators from that period?
- How do present-day translations of official United Nations documents differ across languages?

Neither the TAN format nor its applications answer such questions. But they can be used to start to work on answers, because the TAN function library includes many cutting-edge algorithms that cannot be found in other programming libraries, whether XSLT or not. What the Natural Language Toolkit (or the related Classical Language Toolkit) is for digital humanists using Python, TAN aspires to be for those using XSLT. For more on the function library see .

About the format

TAN differs from other text formats such as HTML, Microsoft Word, PDF, or Docbook. Each of those formats is interoperable only in the sense that any file can be reliably opened and displayed by the same software. Despite such software compatibility, the content, structured by each user, looks very different from one file to the next. If you receive from different people two versions of a particular literary work in the same file format (e.g., Word or PDF), there would be little likelihood that you could align them in a new document without a lot of extra work. These are presentation formats, designed to let the creator use his or her imagination to shape, structure, and present the material in highly stylized, creative ways. The formats are laissez faire, concerned mainly to ensure that each component is rendered properly, without regard for the meaning of those components.

Creating a text in TAN is like opening a word processor and telling it, "I don't care how the text looks. I want to ensure that it is in a meaningful structure that corresponds to any other version of that text. The appearance, which could take thousands of directions, can be worried about later."

The closest analogue to the TAN formats is the XML format developed by the Text Encoding Initiative, whose design catalyzed and continues to inspire the development of TAN. TAN is, in fact, a customized extension of TEI. TAN takes a handful of TEI concepts and extends them via stand-off annotation, to allow for overlapping annotations, to engage with the Semantic Web, and to support cross-project interoperability. TAN reduces some of the repetition that tends to be necessary in TEI files. For more on comparisons between TAN and TEI see .

Some other caveats:

- Although TAN comes with an extensive library of functions and templates, it is not what most people think of as a tool or application. It is not consumer, off-the-shelf software. It does not come with a graphical interface. Rather, it is a package of XML resources, particularly in XSLT, that allows programmers and developers to create customized applications and tools. If you work with an XML editor like Oxygen, your editing experience will be greatly enhanced by the TAN function library, which was designed in Oxygen, and optimized for it.
- The TAN formats are specialized. They are not meant to replace other common text formats such as TEI, Docbook, and so forth, or other alignment formats such as XLIFF or TMX. Converting a TAN file into these formats is usually straightforward, but will usually entail loss. Conversely, most conversions from one of these formats into TAN will not entail loss, but will be imperfect or incomplete, because many of these formats lack the data required by TAN. Conversion must be given careful thought, and can only be semiautomated. Each TAN format has a restricted field of inquiry, defined and explained in these guidelines.
- TAN is not for everyone. For example, if you are working on developing a transcription that imitates a particular print edition, you are better off using only TEI, or a version of TEI that you have customized. But once you want to bring that transcription into close comparison with other versions and study it intertextually, then TAN might be ideal.

Participation

Changes are made regularly to TAN, mainly in its development branch. If you have a TAN library, sharing it with other participants, particularly via Git, will help developers test any changes that have been made to the function library, and encourage others to contribute to your project. The TAN project is by no means finished. This version of TAN merely scratches the surface of what is possible. New participants to test, use, and develop TAN's schemas, functions, guidelines, and applications are welcome. Inquiries about participation should be sent to the project director, Joel Kalvesmaki, by email: director at textalign.net. Official announcements are made by email (Google Group) and by Twitter.

Starting off with the TAN format

If you think you are ready to jump in and get going, try . But if you are new to markup languages, or unfamiliar or uncomfortable with acronyms and technical terms such as XML, RDF, XPath, and Unicode, you should start with this chapter, which uses a simple example to illustrate the steps typically taken to create and edit TAN files, and to introduce new terminology. By the end of this chapter, you will have a sense of how to create and edit a small collection of TAN transcriptions and alignments.

In the TAN system, a transcription is a plain digital text that replicates a text found somewhere else, usually reproducing its script and spelling. The following, "E pluribus unum", is a (partial) transcription of a United States dollar. The term should be distinguished from a transliteration, which is a transcription rendered in a script other than the original. For example, ε πλουριμπυς ουνουμ would be a Greek transliteration of the previous transcription.

The chapter touches on a number of general concepts that are discussed only briefly. If you find a particular term new or confusing, follow the prompts for further reading. If you are already familiar with basic markup concepts, you should at least skim through the chapter, because TAN approaches some old problems in new ways.

Creating TAN transcription and alignment data

Let us take a simple example, that of aligning two English versions of the nursery rhyme Ring-a-ring-a-roses, sometimes known as Ring around the Rosie. Our goal here is to publish two versions of the nursery rhyme in the TAN format so that they are most likely alignable with any other TAN version of the poem that might appear. Although the TAN examples below look much like files in the examples subdirectory of the TAN library, they have been adjusted, to explain the formats better.

We begin by finding previously published versions that haven't been digitized. In this case we have taken an interest in the versions published in 1881 and 1987 (one published in the U.K. and the other, the U.S.). Each of these books has other rhymes, but we've decided to focus upon one nursery rhyme, so we type up (transcribe) that poem and nothing else:

Ring around the Rosie

1881 (U.K.) version       | 1987 (U.S.) version
Ring-a-ring-a-roses,      | Ring-a-round the rosie,
A pocket full of posies;  | A pocket full of posies,
Hush! Hush! Hush! Hush!   | Ashes! Ashes!
We're all tumbled down.   | We all fall down.

We must be sure to save each of the two transcriptions as plain text. Do not bother with a word processor (Word, OpenOffice, Google Docs, and so forth), which is too fancy for our needs. Word processors sometimes generate erroneous data, even when you export to plain text. And we are not concerned with italics, colors, fonts, margins, and so forth. We would be better off with a text editor, which opens and saves only text. But even those do not check to see if the rules of the TAN format have been followed. So the best tool is an XML editor, which like a text editor takes and creates only text. An XML editor is designed to follow the rules of XML, and so saves a lot of typing, and prevents many errors. More important, an XML editor will tell us when our TAN file is invalid, and will provide important help as we edit.

Software suitable for your needs comes in many styles and prices. In addition to the links in the paragraph above, you may wish to visit the comparative lists published on Wikipedia for both text editors and XML editors. TAN was developed using Oxygen, which is very powerful. If you are a new user, you are likely to find it overwhelming. Take advantage of tutorials and documentation associated with the XML editor you have chosen.

Our first task is to get these two versions into separate files with the appropriate markup. Each TAN transcription file has two major parts: a head and a body. For now, we focus on only the second part, the body, as well as a few of the necessary preliminary lines that stand at the opening of the file, before both the head and the body. First, the 1881 (U.K.) version:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
    <?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
    <TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2021" id="tag:parkj@textalign.net,2015:ring01">
        <head> . . . . . . . </head>
        <body xml:lang="eng">
            <div type="line" n="1">Ring-a-ring-a-roses,</div>
            <div type="line" n="2">A pocket full of posies;</div>
            <div type="line" n="3">Hush! Hush! Hush! Hush!</div>
            <div type="line" n="4">We're all tumbled down.</div>
        </body>
    </TAN-T>

And now the 1987 (U.S.) version:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
    <?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
    <TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2021" id="tag:parkj@textalign.net,2015:ring02">
        <head> . . . . . . . </head>
        <body xml:lang="eng">
            <div type="l" n="1">Ring-a-round the rosie,</div>
            <div type="l" n="2">A pocket full of posies,</div>
            <div type="l" n="3">Ashes! Ashes!</div>
            <div type="l" n="4">We all fall down.</div>
        </body>
    </TAN-T>

The examples above are in eXtensible Markup Language (XML). XML lets you take a text or a collection of data and structure it with angle brackets, < and >. Each file begins with a prolog, the first few lines that begin with <?. The first line simply states that what follows is an XML document. The next two lines in each example are processing instructions that point to the schemas: files that will be used to check to see whether or not our XML follows TAN rules, a process called validation. We will skip the details of those first three lines.
They will be identical, or nearly so, from one TAN file to the next. We can simply cut and paste them when we want to start a new TAN file. After the prolog comes an opening tag, signified by an angle bracket followed by a letter, here <TAN-T>. That opening tag, <TAN-T...> is answered by a closing tag, </TAN-T>, the last line. An opening tag and a closing tag mark the beginning and the end of one of the most important parts of an XML document, the element. For now, you can think of an element as a chunk of data. Every element is marked by a pair of tags. In this example, <head> is answered by </head>, <body> by </body> and each <div...> by </div>. Any element that has an opening tag must have a closing tag. If an element doesn't have anything between its opening and closing tags, the two of them can be collapsed into a single tag. That is, <a></a> can be simplified to <a/> (such empty elements are illustrated below). Elements and processing instructions are two of the seven basic XML ingredients, called nodes. The other five node types are text, comment, attribute, namespace, and document, some of which we will meet below. The element node is arguably the most important type. You will see it most often, and it is absolutely required for anything to be well-formed XML. Every XML file must have at least one element. (But it does not have to have attributes, text, comments, or processing instructions.) Elements nest within or beside each other, but they never overlap or interlock. That is, you cannot have <a><b>overlap</a></b>. The prohibition on overlapping elements is one of the cardinal rules of XML. The no-overlap rule keeps XML files tidy, and makes it easier for developers to write efficient applications. Any two nearby elements normally relate to each other either by one nesting inside the other or by one being adjacent to the other. 
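The rules just described (at least one element, no overlapping tags, and the equivalence of `<a></a>` and `<a/>`) can be tried out with any XML parser. The following is a minimal sketch in Python, chosen only for illustration; TAN's own tooling is written in XSLT. The helper name `is_well_formed` is hypothetical, and the check covers well-formedness only, not validation against the TAN schemas.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


nested = "<a><b>nested</b></a>"          # b sits entirely inside a
overlapping = "<a><b>overlap</a></b>"    # a closes before b does
no_element = "just some text"            # no element at all

# An empty element written with two tags, and with one collapsed tag.
long_form = ET.fromstring("<a></a>")
short_form = ET.fromstring("<a/>")
forms_equivalent = (
    long_form.tag == short_form.tag
    and long_form.text == short_form.text
    and len(long_form) == len(short_form)
)

print(is_well_formed(nested), is_well_formed(overlapping), is_well_formed(no_element))
print(forms_equivalent)
```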
Because elements relate to each other only by nesting or by adjacency, every XML file can be thought of as a tree, with the root at the trunk and the nested elements as branches, terminating in metaphorical leaves, that is, those elements that do not contain any other elements. It is helpful to use the tree metaphor when we describe the path we take, toward either the leaves or the root. In these guidelines, we may use the terms rootward and leafward when we want to trace movement up and down the levels of hierarchy in an XML document. You may also encounter the corresponding terms outermost and innermost. The metaphor is strengthened by the XML rule that there can be only one root element, i.e., the element that contains all other elements and is contained by none. In our examples above the root element is named TAN-T.

An XML document tree can also be profitably thought of as a family. Family names provide the most common terminology to describe the relationship between elements. In our examples above, <TAN-T> is the parent of <body>, and <body> is the parent of the four <div> elements. Likewise, each <div> is the child of <body>, and <body> is the child of <TAN-T>. Distant parental relationships can be described with the terms ancestor and descendant. <TAN-T> is the ancestor of every element it encompasses, and every element encompassed by <TAN-T> is its descendant. Paratactic relationships are also important. <head> and <body> are siblings to each other, and every <div> is a sibling to every other <div>. The terms "following" and "preceding" are the most common ways to describe the relationship of one sibling to another.

You may notice that some characters are inside opening tags, but not closing ones. In the opening tags for the <TAN-T>, <body>, and <div> elements there appear sets of pairs: a word and something within quotation marks, each of them separated by an equals sign. These stretches of text are called attributes.
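The family vocabulary and the notion of attributes can be explored programmatically. The sketch below, again in Python purely for illustration, parses the 1881 transcription and reads off parents, children, descendants, siblings, and attribute name/value pairs. It assumes the `<head>` has been collapsed to an empty element for brevity; `local_name` and `parent_of` are illustrative names, not part of TAN.

```python
import xml.etree.ElementTree as ET

TAN_NS = "tag:textalign.net,2015:ns"

# The 1881 transcription, with <head> collapsed to an empty element (assumption).
SAMPLE = """<TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2021"
        id="tag:parkj@textalign.net,2015:ring01">
    <head/>
    <body xml:lang="eng">
        <div type="line" n="1">Ring-a-ring-a-roses,</div>
        <div type="line" n="2">A pocket full of posies;</div>
        <div type="line" n="3">Hush! Hush! Hush! Hush!</div>
        <div type="line" n="4">We're all tumbled down.</div>
    </body>
</TAN-T>"""


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree puts before tag names."""
    return element.tag.split("}", 1)[-1]


root = ET.fromstring(SAMPLE)

# ElementTree keeps no parent pointers, so build a child -> parent lookup.
parent_of = {child: parent for parent in root.iter() for child in parent}

body = root.find(f"{{{TAN_NS}}}body")
divs = list(body)

children_of_root = [local_name(child) for child in root]
descendants_of_root = [local_name(e) for e in root.iter()][1:]  # skip the root itself
parent_of_first_div = local_name(parent_of[divs[0]])
following_sibling_of_first_div = divs[1]

print(local_name(root), children_of_root, parent_of_first_div)
print(following_sibling_of_first_div.attrib)  # attribute names mapped to values
print(root.attrib)                            # xmlns is not listed among the attributes
```

In this parser the namespace declaration does not appear among the root's attributes, which matches the remark that @xmlns only looks like an attribute.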
On the left side of the equals sign is the attribute name, and on the right side, within the quotation marks, is the attribute value. In the example above <TAN-T> has three attributes, @xmlns, @TAN-version, and @id (it is customary to signal attributes by writing @). We will skip @xmlns for now. It looks like an attribute, but it's really a pseudo-attribute, because it specifies the namespace of the XML file. Namespaces are an important but advanced topic, not discussed in this chapter. (See .) The value of @TAN-version indicates that the 2021 version of TAN is being used. @id is quite important. Every TAN file has an @id that uniquely names and permanently identifies the document itself. It should not be changed, even if we make edits. If you change the filename or a copy of it winds up being incorporated into another project, a stable @id will be quite important for finding it. An @id should be unique. The only time the value should be repeated in a file is when you are pointing to another version of the same file. In the <TAN-T>, the value of @id must always be what is called a tag uniform resource name (tag URN). A tag URN begins with tag:, followed by an email address or domain name that we own or owned. It is okay to use an obsolete address or domain; its purpose is to allow users to identify you, perhaps centuries from now, not to contact you. But you can use a current email address if you want to be contacted by those who use your file. After that email address or domain name comes a comma (no spaces) and a date on which we owned it, in the form of numbers for the year, year + month, or year + month + date, each item joined by hyphens, e.g., 2014-12-31. If we leave off a day value, it is assumed to be 01, the first of the month; if we leave off the month value it is assumed to be 01, January. 
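The date rules above (a missing day defaults to 01, a missing month to January) can be made concrete with a short sketch. The function name `parse_tag_urn` is hypothetical, the example identifiers other than `ring01` are made up, and the code only splits the string and fills in the defaults; it does not check that the email address or domain is well formed.

```python
from datetime import date


def parse_tag_urn(tag_urn: str):
    """Split a tag URN into (authority, date, name), filling in date defaults."""
    if not tag_urn.startswith("tag:"):
        raise ValueError("a tag URN must begin with 'tag:'")
    # Everything up to the next colon names the owner and date; the rest is the name.
    owner_and_date, _, name = tag_urn[4:].partition(":")
    authority, comma, date_text = owner_and_date.partition(",")
    if not comma or not date_text:
        raise ValueError("expected 'authority,date' after 'tag:'")
    parts = [int(piece) for piece in date_text.split("-")]
    if not 1 <= len(parts) <= 3:
        raise ValueError("date must be year, year-month, or year-month-day")
    # Pad a missing month and day with 1 (January, first of the month).
    year, month, day = (parts + [1, 1])[:3]
    return authority, date(year, month, day), name


print(parse_tag_urn("tag:parkj@textalign.net,2015:ring01"))
print(parse_tag_urn("tag:example.org,2014-12:demo"))      # hypothetical identifier
print(parse_tag_urn("tag:example.org,2014-12-31:demo"))   # hypothetical identifier
```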
In the examples above, parkj@textalign.net,2015 points to our fictive self, Jenny Park, who owned that particular email address on the stroke of midnight (Coordinated Universal Time) January 1, 2015. After that comes a colon, and then any name we wish to assign to the file. We have anticipated a simple collection of texts, so we've called the files ring01 and ring02. If we run out of names, or want to restart, we can simply use a new email-date preface, e.g., parkj@textalign.net,2015-01-02. Or we could change the way we build our tag URNs. Tag URNs are very useful. You do not need permission to create one. You don't need to register them. You are in control. You also signal who is responsible for the file. Hundreds of years from now, when that email will be defunct or perhaps owned by someone else, users might still be able to identify who was responsible. The element <body> contains our transcription. @xml:lang, required, specifies the principal language of the transcribed text. We use the standard 3-letter abbreviation for English. We could have used en, but the 2-letter convention supports only a handful of languages. (See for more.) Our transcription has been divided into four <div> elements. How we divide up the work is entirely up to us. But we must make sure that every bit of text is enclosed by a leaf <div> (i.e., one that contains no other <div>). Every <div> must be the parent of only other <div>s, or none at all. No <div> may mix text and other elements. An exception is made for text that is nothing but space (the space bar, the tab, or the new line). Space-only text can be mixed with elements as needed, which means that a TAN file can be indented however you like, without changing its meaning. The values of @type and @n indicate, respectively, the type of division and the name of the division. 
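The structural rules for `<div>` (text lives only in leaf divisions, a division holds either other divisions or text, and whitespace-only text is ignored) lend themselves to a simple check. The following sketch is an illustration under stated assumptions, not TAN's validation routine: the namespace is omitted to keep the code short, the second sample is a deliberately broken, made-up fragment, and `div_problems` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

GOOD = """<body xml:lang="eng">
    <div type="line" n="1">Ring-a-ring-a-roses,</div>
    <div type="line" n="2">A pocket full of posies;</div>
</body>"""

# Made-up counterexample: the outer div mixes text with a child div.
BAD = """<body xml:lang="eng">
    <div type="stanza" n="1">Ring-a-ring-a-roses,
        <div type="line" n="2">A pocket full of posies;</div>
    </div>
</body>"""


def has_real_text(text):
    """True if the text contains anything other than spaces, tabs, or newlines."""
    return bool(text and text.strip())


def div_problems(body):
    """List violations of the div rules found under a <body> element."""
    problems = []
    # Text sitting directly in <body> is not enclosed by any leaf div.
    if has_real_text(body.text) or any(has_real_text(c.tail) for c in body):
        problems.append("text found outside any <div>")
    for div in body.iter("div"):
        children = list(div)
        label = f'div type="{div.get("type")}" n="{div.get("n")}"'
        if any(child.tag != "div" for child in children):
            problems.append(f"{label}: contains an element other than <div>")
        own_text = has_real_text(div.text) or any(has_real_text(c.tail) for c in children)
        if children and own_text:
            problems.append(f"{label}: mixes text with child elements")
    return problems


print(div_problems(ET.fromstring(GOOD)))
print(div_problems(ET.fromstring(BAD)))
```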
We have used line in the first example, but we could easily have also used l (as we did in the second) or ln or any other phrase that we think will be intuitive to other users. The value is arbitrary, but gets explained by what is in the header (we will see how below). We have used Arabic numerals for the values of @n, but the value, once again, could have been anything. Here we've opted for a reference system that seems intuitive and will most likely apply to multiple versions of the work. But the Arabic numerals are not required. We could have used Roman numerals, or some other numbering or naming scheme that is standard in the field. The idea is to use the term that is most like what other people encoding a different version of the same text might use. Aside from the <head> element (discussed below), that's all we need in the TAN-T transcription. We can now move to alignment and annotation.

We now turn to a second TAN format, TAN-A. Whereas the first two examples, TAN-T, had to do with texts and transcriptions, TAN-A has to do with alignment and annotation. The TAN-A format allows us to align and annotate as many transcriptions as we wish, and to make claims about them. Let's begin, once again temporarily skipping <head>. The significant differences from the previous two TAN-T files are the schema reference, the root element, the @id, and the empty <body>:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-A.rnc" type="application/relax-ng-compact-syntax"?>
    <?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
    <TAN-A xmlns="tag:textalign.net,2015:ns" TAN-version="2021" id="tag:parkj@textalign.net,2015:ring-alignment">
        <head> . . . . . . . </head>
        <body/>
    </TAN-A>

In the prolog, the first line is identical to the first line of our transcription files.
The second and third lines, the processing instructions, are identical, except that href of the first of these points to the validation file specific to the TAN-A format. Even the fourth line looks like the two TAN-T files, other than the new name for the root element, <TAN-A>, and the new value for @id. The penultimate line, <body/>, is an empty element, and is equivalent to an opening tag immediately followed by a closing tag, i.e., <body></body>. The alternative form, <body/>, is a more succinct way to say that an element contains nothing. It will become apparent, when we discuss <head> below, why our <body> can be empty.

Let's look at a third TAN format, TAN-A-tok. This particular alignment file allows you to state precisely which words in one text correspond with the words in another. Because of this precision, such files can take more time to create. But before we even start, we need to decide what kind of relationship holds between the two texts. Let us pretend, for the sake of example, that the 1987 version is a direct descendant (and therefore variation) of the 1881 one. So our task is to show exactly what words or phrases in the older version correspond to those of the newer one. We will simplify here, and exclude punctuation (some linguists legitimately treat punctuation as words in their own right). The term word is notoriously difficult to define, so we will call them tokens, to avoid false connotations (hence the name of the file, TAN-A-tok, to refer to alignment of tokens). We now create a TAN-A-tok file:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-A-tok.rnc" type="application/relax-ng-compact-syntax"?>
    <?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
    <TAN-A-tok xmlns="tag:textalign.net,2015:ns" TAN-version="2021" id="tag:parkj@textalign.net,2015:TAN-A-tok,ring01+ring02">
        <head> . . . . . . . </head>
        <body reuse-type="general_adaptation" bitext-relation="B-descends-from-A">

            <align>
                <tok src="ring1881" ref="1" pos="1"/>
                <tok src="ring1987" ref="1" pos="1"/>
            </align>
            <align>
                <tok src="ring1881" ref="1" pos="2"/>
                <tok src="ring1987" ref="1" pos="2"/>
            </align>
            <align>
                <tok src="ring1881" ref="1" pos="3"/>
                <tok src="ring1987" ref="1" pos="3"/>
            </align>
            <align>
                <tok src="ring1881" ref="1" pos="4"/>
                <tok src="ring1987" ref="1" pos="4"/>
            </align>
            <align>
                <tok src="ring1881" ref="1" pos="5"/>
                <tok src="ring1987" ref="1" pos="5"/>
            </align>

            <align>
                <tok src="ring1881" ref="2" val="A"/>
                <tok src="ring1987" ref="2" val="A"/>
            </align>
            <align>
                <tok src="ring1881" ref="2" val="pocket"/>
                <tok src="ring1987" ref="2" val="pocket"/>
            </align>
            <align>
                <tok src="ring1881" ref="2" val="full"/>
                <tok src="ring1987" ref="2" val="full"/>
            </align>
            <align>
                <tok src="ring1881" ref="2" val="of"/>
                <tok src="ring1987" ref="2" val="of"/>
            </align>
            <align>
                <tok src="ring1881" ref="2" val="posies"/>
                <tok src="ring1987" ref="2" val="posies"/>
            </align>

            <align>
                <tok src="ring1881" ref="3" pos="1, 2"/>
                <tok src="ring1987" ref="3" pos="1"/>
            </align>
            <align>
                <tok src="ring1881" ref="3" pos="3 - 4"/>
                <tok src="ring1987" ref="3" pos="2"/>
            </align>
            <align>
                <tok src="ring1881" ref="4" pos="1"/>
                <tok src="ring1987" ref="4" pos="1"/>
            </align>
            <align>
                <tok src="ring1881" ref="4" pos="2"/>
            </align>
            <align>
                <tok src="ring1881" ref="4" pos="3"/>
                <tok src="ring1987" ref="4" pos="2"/>
            </align>

            <align>
                <tok src="ring1881" ref="4" pos="last-1"/>
                <tok src="ring1987" ref="4" pos="last-1"/>
            </align>
            <align>
                <tok src="ring1881" ref="4" pos="last"/>
                <tok src="ring1987" ref="4" pos="last"/>
            </align>
        </body>
    </TAN-A-tok>

Once again, the first four lines, the prolog and root element, should look familiar, with the only significant changes being the names of the validation files, the name of the root element (<TAN-A-tok>), and the value of @id.
The heart of the data is <body>, which has two key attributes: @reuse-type, which describes the activity that was performed to change one version into the other, and @bitext-relation, which specifies how one book relates to the other. Our two values, general_adaptation and B-descends-from-A, are arbitrary names that we define in the <head> (discussed later). (To understand the concepts behind reuse types and bitext relations, see ). You will also notice some lines that begin with <!--. These are comments, and can be placed within or beside any element, and can enclose any text we like, including line breaks. You may put a comment anywhere you like, as long as it is not inside a tag or attribute. <body> is the parent of one or more <align> elements, each of which correlates a set of tokens in each of the two texts, pointed to by its <tok> children. Each <tok> has, in this example, three attributes. @src takes a nickname (an @id reference) that points to one of the two transcriptions; we have used ring1881 and ring1987 for our two texts, but we could have just as easily used anything else such as a and b, or uk and us. @ref has a value that points to a specific <div> in the source TAN-T transcription; and @pos or @val specify which token is intended, either by word number (@pos) or text of the actual word (@val). Either technique is fine, and @pos and @val can be mixed, as in the example. It is generally a good idea to use @val: if you later fix a typo in the underlying transcription and thereby change the number of tokens, a @val reference might not be affected, whereas a @pos reference may end up pointing to the wrong token. You may also notice that the comma and hyphen can be used in @pos to point to multiple words within the same <div>, and that last and last-X (where X is a digit) can be used to point to a token by position counting from the end of a <div>. Each <align> can establish one-to-one, one-to-many, many-to-one, or many-to-many relationships between tokens from the two texts. 
A token may feature in multiple <align> elements. And if an <align> has <tok> elements belonging to only one source, such as in the fourth-to-last <align> above, we have what is called, in these guidelines, a one-sided alignment. This one-sided alignment indicates that the second word of line four of the 1881 version is excluded from the act that we have called adaptation. If this were a translation, it would be as if we were saying that this word was excluded from the translation. (A one-sided alignment containing tokens only of the later source might point to words that the translator added, i.e., what in translation studies is called explicitation.) If in our TAN-A-tok file we say nothing about a particular word in one of the sources, that silence should not be interpreted to mean that it has no counterpart in the other source. As creators of this file, we make no claim to providing an exhaustive account, and we are under no obligation to indicate every word-for-word correspondence. If we fail to mention certain words, all that can be implied is that we opted not to say anything about them. We could have aligned the two texts in different ways. Perhaps further study will reveal that we were in error to associate the second "ring" with "round" in line 1. We can make corrections, even after publication, and notify other users of our data about the change. There are also ways to express doubt or alternative opinions, and to credit (or blame) the person making the assertion. We can even correlate fragments of tokens (letters, prefixes, infixes, or suffixes). All these more advanced uses are discussed at .
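To make the @pos syntax described in this section concrete, here is a minimal Python sketch that expands a @pos value into 1-based token positions within a <div>. It is an illustration only, not the TAN validation code; the function name resolve_pos and its error handling are assumptions made for this example, and it presumes that the number of tokens in the <div> is already known.

```python
import re

# One term is either an integer, "last", or "last-X" (X digits).
_TERM = r"(?:last(?:-\d+)?|\d+)"
# One comma-separated item is a single term or a range "term - term".
_ITEM = re.compile(rf"\s*({_TERM})\s*(?:-\s*({_TERM}))?\s*")


def _term_to_int(term, token_count):
    """Convert one term to a 1-based position, counting from the end for 'last'."""
    if term.startswith("last"):
        offset = int(term[5:]) if len(term) > 4 else 0
        return token_count - offset
    return int(term)


def resolve_pos(pos, token_count):
    """Expand a @pos-style string into a list of 1-based token positions.

    Hypothetical helper: handles commas, hyphenated ranges, 'last', and 'last-X'.
    """
    positions = []
    for item in pos.split(","):
        match = _ITEM.fullmatch(item)
        if match is None:
            raise ValueError(f"unrecognized @pos item: {item!r}")
        start = _term_to_int(match.group(1), token_count)
        end = _term_to_int(match.group(2), token_count) if match.group(2) else start
        if not 1 <= start <= end <= token_count:
            raise ValueError(f"@pos item out of range: {item!r}")
        positions.extend(range(start, end + 1))
    return positions
```

Under these assumptions, in a <div> of five tokens the value last-1 resolves to position 4, and the values "1, 2" and "3 - 4" each resolve to two adjacent positions, which is how the example above pairs two tokens in one source with a single token in the other.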

TAN metadata (<head>)
At this point, we have finished four TAN files: two transcriptions (TAN-T), one macro-alignment file (TAN-A), and one micro-alignment file (TAN-A-tok). We've avoided discussing the <head> in each of them until now. Before getting into details, some important concepts need to be covered first. Unlike <body>, which carries the raw data, <head> contains what is oftentimes called metadata. That is, <head> contains data about the data that is in <body>. Because the TAN format is intended primarily to serve scholars, and because the format is heavily regulated (that is, there are numerous validation rules that supplement the standard XML ones), the metadata requirements are stricter than they are for Word documents, HTML, TEI, or other formats you might know better. They are stricter even than TEI rules. (But you'll be offered help that the TEI rules do not.) Scholars who find our file expect to know some things about it before they can responsibly use it. For example, what are the sources we have used? Who produced the data? When? What changes or adjustments have been made? What licenses govern the use of the data? The questions are not difficult to answer, but they require thought, care, and some time to answer. Some metadata questions apply only to one TAN format. For example, in a TAN-A-tok file, we ask what relationship holds between the two sources. But that question makes no sense for a TAN-T file, which is merely a transcription. Some questions apply universally across all TAN files, no matter what kind of data. TAN has been designed so that each <head>, no matter the format, handles metadata consistently. This reduces potential confusion, and helps other people using our data to find the information they want. More important, what we write in one file can be referenced by another, without duplication, and so will reduce the chance of errors. 
There is an old programmer's adage, Don't repeat yourself, oftentimes abbreviated DRY. The TAN format encourages you to be as DRY as possible. The opposite of DRY is WET: write everything twice or we enjoy typing. Another TAN principle is that each <head> should focus exclusively upon the scope of the data in <body>, and not on other things. For example, in a TAN-T file, we are concerned only about the transcription, so our metadata too should be concerned only with the transcription. We should indicate its source, but because our file is not about the source itself, we don't need to describe it further. We are not library catalogers, nor should we be. A TAN-T file is for transcribing, not for curating bibliographical data. Our obligation is merely to point a reader to complete and authoritative information, found elsewhere. TAN was also designed under the principle that all metadata should be useful to both humans and computers. For our example above, we must describe the work we have chosen (Ring around the Rosie) in a way that is comprehensible not just to the reader but to the computer. Computers have a difficult time with ordinary human names, so we have to approach the task in a special way. Take for example the 1881 book we have used for our first transcription. For the human reader we can write something like "Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]". But this human-readable string is too complex and syntactically opaque for computers and algorithms. A more computer-friendly identifier is the International Standard Book Number (ISBN), which distinguishes, for example, the 1984 version of Mother Goose illustrated by Kayoko Okumura from the one of the same year illustrated by William Joyce. The ISBNs for the Okumura version, 0671493159, and for Joyce's, 0394865340, can each be converted into a machine-actionable string called a uniform resource name (URN). The tag URN we made earlier is just one of many types of URNs. 
In this case we can create an ISBN URN as follows: urn:isbn:0-671493159 and urn:isbn:0-394865340. (Our 1881 version was published before the ISBN program was introduced. We will see below another way to give it a different kind of URN.) There are different URNs for different things: journals (via ISSNs, urn:issn:...), articles (DOIs, urn:doi:...), movies (ISANs, urn:isan:...), and so forth, which means that anyone can use them to refer unambiguously to a particular kind of thing. URN naming schemes must be registered with the Internet Assigned Numbers Authority (IANA) to ensure permanent, persistent, unique names for various types of things. It is okay for something to be assigned more than one URN, but never acceptable for one URN to be applied to more than one thing. (See IANA's registry and for a complete list of official URN schemes.) All URNs are simply names. They don't tell you where an object is. To provide a unique location we have uniform resource locators (URLs), e.g., http://academia.edu. Like URNs, URLs are also centrally regulated, with individuals or organizations buying the rights to domain names from a central registry (usually through a third-party vendor). Both URNs and URLs can be thought of as the same type of thing, namely, a uniform resource identifier (URI) or, in its internationalized form, an internationalized resource identifier (IRI). An IRI is a form of URI that allows characters from any alphabet in Unicode, not just Latin. URIs/IRIs are, in essence, nothing more than the set of all URNs and URLs. These four acronyms are easily confused and conflated, even by veterans. URIs and IRIs are basically the same thing, and they encompass URNs and URLs, a relationship and function that can be remembered by the last letter in each acronym: URIs/IRIs Incorporate both Locators (URL) and Names (URN). If those acronyms are confusing, don't worry. 
For our purposes, they are pretty much all the same, and from this point onward we'll stick with the term IRI (unless we really mean a location to find a file, which we'll call a URL). IRIs are essential to a system frequently called the semantic web or linked (open) data, which uses them as the basis for a simple universal data model. The semantic web allows one to make assertions that computers can "understand." If people, working independently, happen to use the same IRIs to describe the same things, then computers can be programmed to make associations between disparate, heterogeneous datasets. For example, if one scholar claims through IRIs that X is the mother of Y, and another claims in a different dataset that Y is the mother of Z, a computer can infer that X is the grandmother of Z, without the two scholars being aware of each other's work. The computer can also check for contradictions (e.g., someone claiming that Z is the mother of Y). When many scholars begin to use IRIs in their data, the result is a network that allows us or anyone else to discover connections across disciplines and projects, and to make discoveries that transcend any single project. TAN has been designed to be semantic-web friendly, and so requires in its <head> almost all data to be not just human-readable but also computer-readable, normally as an IRI. Our first task, then, in writing the <head> sections of our four TAN files is to look for IRI vocabulary that will be familiar to those most likely to use our files. In trying to find suitable IRIs, we will find that the persons, things, and concepts we want to describe will range from the highly familiar to the unfamiliar. Highly familiar: The two books that provide the basis of our transcription are catalogued and generally well known. A number of services provided by librarians provide controlled IRI vocabularies that can be used by anyone to unambiguously identify a particular version of a book. 
WorldCat (run by OCLC) and the Library of Congress are good examples. In our case, we have found Library of Congress IRIs for both editions of Mother Goose: http://lccn.loc.gov/12032709 and http://lccn.loc.gov/87042504. Observe that these two IRIs are also, perhaps confusingly, URLs (locations). If we paste these strings into our Web browser, we retrieve a record that describes the book. This locator does not lead us to the book itself, only to information about the book. Nevertheless, the Library of Congress has decided to make this URL also a name for the book, which means that it does double duty, both as a location for a Web page with information, and as a name for a book. This practice can easily confuse anyone new to the semantic web, because such URLs name in reality two types of things: an entity and a web resource to learn more about that entity. The idea is that hundreds of years from now, when the web page no longer exists, the name will still be valid. In the TAN system, you can apply as many IRIs to a concept as you like. In fact, it is a good practice to find and add as many IRIs as you think worthwhile, just in case someone can't figure out what you're trying to identify. Just make sure that any IRI you copy unambiguously points to the thing you have in mind. We now have IRIs for the sources. Let's now find an IRI for the work, Ring around the Rosie. The work is widely known, and even has a Wikipedia entry. That Wikipedia entry is a benefit. The Universities of Leipzig and Mannheim and Openlink Software have collaborated on a project called DBPedia, which provides a unique IRI for every Wikipedia entry in the major languages, and these can be used for naming. The DBPedia IRI in this case is http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses. Once again, this is both a name and a locator. It names a specific, intangible, abstract work, namely, a nursery rhyme that we've called Ring around the Rosie, no matter what specific version. 
But if you put that IRI into your browser, you will get back more information about that named object. Familiar to specialists: We will need to have IRIs for some of the people who edited the file. Here we're not interested in the authors of the books we transcribed. We are interested in identifying the people who helped make the TAN files themselves. Most people who write and edit TAN files will not be well-known, public figures. If they are, and if they are famous enough to have a Wikipedia entry, then a DBPedia IRI could be used. Or if some of the contributors are also published authors, there is a good chance that they are listed in the databases of either VIAF or ISNI, both of which publish unique IRIs for authors, editors, and other persons identified in online catalogues curated by libraries around the world. Most contributors to TAN files, however, will not be listed in these databases. In those cases, we can name these participants with an IRI that we "own." We have already done something like this by assigning tag URNs to our four TAN files (the value of @id in the root element). Our editors can do the same thing. If a student Robin Smith has been helping with proofreading, Robin can take an email address (even one that doesn't work any more) and a date when the email address was used and construct a tag URN such as tag:smith.robin@example.com,2012:self. This has a slight drawback in that we cannot type this string into our browser to find out more about this particular Robin, but it at least allows us to assign a name that will not be confused with that of another Robin Smith, for example the one identified by ISNI as http://isni.org/isni/0000000043306406. (If we want to go a step further, Robin could mint a URN from a domain name that she owns, and set up a linked data service that offers more information, human- and computer-readable. But this is not required, and it can be a hassle to set up and maintain.) 
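The recipe for a tag URN given above (an email address or domain name, a date on which it was in use, and a name of our choosing) can be sketched in a few lines of Python. This is only an illustration of the pattern seen in the examples in these guidelines; the function name make_tag_urn and its simple checks are assumptions for this sketch, and they do not cover every rule of the tag scheme.

```python
import re

# A date may be a year, a year-month, or a full date.
_DATE = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")


def make_tag_urn(authority, date, specific):
    """Build a string of the form tag:AUTHORITY,DATE:SPECIFIC.

    authority: an email address or domain name we controlled on the given date
    date:      YYYY, YYYY-MM, or YYYY-MM-DD
    specific:  any name we choose within our own namespace
    (Hypothetical helper; the checks below are deliberately minimal.)
    """
    if not authority or "," in authority or " " in authority:
        raise ValueError(f"implausible authority: {authority!r}")
    if not _DATE.fullmatch(date):
        raise ValueError(f"date must be YYYY, YYYY-MM, or YYYY-MM-DD: {date!r}")
    if not specific or " " in specific:
        raise ValueError(f"implausible specific part: {specific!r}")
    return f"tag:{authority},{date}:{specific}"


# Robin Smith's name for herself, as in the example above.
robin = make_tag_urn("smith.robin@example.com", "2012", "self")
```

Because the authority and date together belong to one person or organization, whoever mints the name is responsible only for keeping the final part unique within their own namespace.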
Let's take a more difficult challenge for locating an IRI, that of describing the @bitext-relation in our TAN-A-tok file. @bitext-relation draws from the discipline of stemmatology, which studies how manuscripts were copied, one to another, and tries to place these manuscripts in a chain of transmission, a kind of historical stemma (tree). We have to find an IRI that describes the relationship that we claim holds between two text-bearing objects. Making that clear is important, because our perspective about the relationship between the two books affects the decisions we make when we align words, and other scholars using our files will want to know what assumptions we had when we aligned the two texts. For the sake of illustration we posit that the version published in the 1987 Mother Goose is a direct but not immediate descendant of the 1881 version. Because no suitable IRI vocabulary yet exists for the relationships between texts, TAN itself has coined an IRI that can be used by anyone wishing to declare that, given two ordered sources, the second descends from the first through an unknown number of intermediaries: tag:textalign.net,2015:bitext-relation:a/x+/b. (The arbitrary symbol / signifies a step from one version to the next, and the x+ represents one or more versions as intermediate steps.) We'll use that one for now. We face a similar issue when thinking about text reuse, @reuse-type. Here we are concerned with creative activities such as translation, paraphrase, adaptation, and so forth. We generally consider the 1987 version to be an adaptation of the 1881 version. And there are no stable, well-published IRI vocabularies for text reuse. So we adopt an IRI that is part of TAN's standard vocabulary, tag:textalign.net,2015:reuse-type:adaptation:general. In the previous two cases, we could have come up with our own vocabulary. But the idea behind the semantic web is to use common, familiar vocabulary whenever possible. 
That's the same principle that led us to structure and label the poem in four consecutively numbered lines. We adopt conventions we expect others will likely follow. The built-in TAN vocabulary simply gives us a convenient lingua franca for describing some important, abstract concepts. For other examples of IRIs coined by TAN, see . Generally unfamiliar: Some things or concepts will be known to very few people, perhaps not even to us. If we plan to refer to that thing or concept often, it is preferable to coin a tag URN, as described above. But in some cases, we might find that a tag URN we minted for some concept or thing was, in hindsight, misleading or poorly constructed, because we had only superficially thought about the category. One other possibility is to assign a randomly generated IRI called a universally unique identifier (UUID), e.g., urn:uuid:3fd9cece-b246-4556-b229-48f22a5ae2e0. UUID URNs are very useful. The likelihood that a randomly generated UUID will be identical to any existing UUID is astronomically small, making them reliably unique names for anything (barring someone copying and reusing that UUID URN to name some other object or concept). Numerous free UUID generators can be found online. To humans, a UUID on its own is meaningless, unmemorable, and rather ugly. But it is a start. We always have the option, later, of supplementing it with other IRIs. It's perfectly fine to assign multiple IRIs to one object or concept. But the reverse is never true. One should never use one IRI to identify more than one object or concept. There is an exception when the IRI names a single class that has multiple objects or concepts. But even then, it should name only one class, not two or more of them.
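An online generator is not the only way to obtain a UUID URN. As a minimal sketch, Python's standard uuid module can produce one directly; the function name mint_uuid_urn is simply a label chosen for this example.

```python
import uuid


def mint_uuid_urn():
    """Return a freshly generated random (version 4) UUID as a urn:uuid: string."""
    return uuid.uuid4().urn


# Each call yields a new name; store the result, because it cannot be regenerated.
concept_iri = mint_uuid_urn()
```

The resulting string has the same shape as the urn:uuid: example above and can be placed in an <IRI> element like any other IRI.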

Creating TAN metadata (<head>)
Now that we have explored various IRI vocabularies for concepts related to our files concerning Ring-a-ring-a-roses, we can complete the metadata in our four TAN files. Let us start with the TAN-T file of the 1881 version: <TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2021" id="tag:parkj@textalign.net,2015:ring01"> <head> <name>TAN transcription of Ring a Ring o' Roses</name> <master-location href="http://textalign.net/release/TAN-2020/examples/ring-o-roses.eng.1881.xml"/> <license licensor="park"> <IRI>http://creativecommons.org/licenses/by/4.0/</IRI> <name>Attribution 4.0 International</name> </license> <work> <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI> <name>"Ring a Ring o' Roses" or "Ring Around the Rosie"</name> </work> <source> <IRI>http://lccn.loc.gov/12032709</IRI> <name>Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]</name> </source> <vocabulary-key> <person xml:id="park"> <IRI>tag:parkj@textalign.net,2015:self</IRI> <name>Jenny Park</name> </person> <div-type xml:id="line"> <IRI>http://dbpedia.org/resource/Line_(poetry)</IRI> <name>line of poetry</name> </div-type> <role xml:id="creator"> <IRI>http://schema.org/creator</IRI> <name xml:lang="eng">creator</name> </role> </vocabulary-key> <file-resp who="park"/> <resp roles="creator" who="park"/> <change when="2014-08-13" who="park">Started file</change> <to-do/> </head> . . . . . . . </TAN-T> <name>, the human-readable counterpart to the @id that is inside the root element, can be anything. And we can supply more than one <name>, in case we wish to provide alternative names of the file, or translations of them. One or more <master-location>s provide URLs where master versions of the file are kept (and maintained). We provide this as a courtesy to others who might be using our data. 
Anyone who validates their local copy of the file will be warned if it does not match the master version, and they will be told of the most recent changes. With a couple of keystrokes, they can update their local copy to match the master. This one-way communication system lets us silently and conveniently notify other users of changes. We do not have to keep track of who is using our file, and users do not have to pester us with questions about what changed when. <master-location> is mandatory only if we are finished with our to-do list, which is specified at <to-do>. If that element is empty, then we imply that we do not know anything further that should be done to the file. Conversely, any elements in <to-do> specify what remains to be done, and details will be returned to other users. That way you can release data that is useful but not completely perfect, and let users know about its deficiencies. This approach is ideal for formats such as TAN-A-tok, where you might have released only some of the data, and you are working on the rest. One day the link in <master-location> will be dead. But perhaps a copy of our file will be in circulation elsewhere. The document @id in the root element provides a way to identify files, independent of links, and perhaps locate them in unexpected places. <license> specifies the license under which we are releasing our data. This element has nothing to do with the copyright of the source we have used (although, having been published in 1881, the book is clearly in the public domain). That is, we are specifying what rights are attached to the data, not its source, i.e., if we have placed additional strictures on the content in <body>. In this example, we have released the data under a creative commons license. The child element <IRI> specifies a Creative Commons IRI, and <name> is the human-readable form. @licensor specifies who has granted the license, in this case our fictive Jenny Park (see below). 
The conjunction of <IRI> and <name>, the IRI + name pattern, recurs throughout TAN files. They are used to provide identifiers for vocabulary items. In an element that takes the IRI + name pattern, we may include as many children <IRI>s or <name>s as we like. But if we do so, we are stating that they are synonymous, i.e., that they all name the same thing. (Once again, an IRI is unique, so it should never be used to identify more than one thing.) <work> uses the IRI + name pattern to name the work we have chosen to transcribe. <source> points, through its IRI + name pattern, to a computer- and human-readable description of the book we have chosen. <vocabulary-key> contains vocabulary that we are using in our file. Inside, we can place more vocabulary items, and attach locally unique ids. For example, an IRI + name pattern is used for <person>, which identifies through a tag URN Jenny Park. The value of @xml:id allows us to use park any time we want to mention Jenny. In fact, we already have, at @licensor. Any mention of park will point to the appropriate item in <vocabulary-key>. There are a few other parts of <vocabulary-key>. <div-type> specifies an IRI + name pattern for line divisions, and the value of @xml:id means that we can use line any time we want to invoke the concept. Similarly, we have a <role>. The <IRI> value of <role> comes from the vocabulary of schema.org, which is maintained by Bing, Google, and Yahoo! in conjunction with the W3C (the nonprofit organization dedicated to universal Internet standards), but we could have used Dublin Core or some other IRI vocabulary describing behaviors, responsibilities, and roles. After the <vocabulary-key>, we get into parts of the file that specify who did what, when. First is a <file-resp>, whose value of @who, park, indicates that Jenny Park is the one primarily responsible for the file. <resp> specifies further who was responsible for doing what. 
If you decide to modify someone else's TAN file, you should credit / blame yourself for the changes. Your first point of order should be to add a <person> to the <vocabulary-key>, identifying yourself. You can then either add a <change> (see below) or a <resp> (you might need to specify a <role> in the <vocabulary-key>). You should not change the document's @id, unless your changes are so significant that it becomes altogether a new document, your document. TAN does not try to broker the age-old problem of determining the point at which a thing becomes something altogether different (e.g., the Ship of Theseus problem). Use your best intuition. Remember that <head> is focused on the data, not its sources, so the claim that Jenny Park is the creator pertains only to the data. No inference should be made about who was responsible for the printed source. If someone wants to know anything about the book, they should pursue the IRI identifier we have provided under <source>. <change> has attributes @when and @who to specify who made the change and when. The value of @when is always a date or a date + time, formatted according to the ISO standard syntax: [YYYY]-[MM]-[DD] or [YYYY]-[MM]-[DD]T[HH]:[MM]:[SS]. @who always carries an IDref that points to a person or organization. <change> does not take the IRI + name pattern, or even any children at all. It takes simply a plain-text description of what changed. So now we have finished one transcription file's metadata. You may have found it to represent a lot of typing: many names, IRIs, and so forth. Is there any way to shorten that load? Yes, there is. TAN is a vocabulary-based format. That is, there are standard vocabulary items that come with the TAN format, and you can design your own vocabulary, so that you can shorten the work involved, and to adhere to the best DRY principles. 
Our second example will look similar to the first one, but notice some shortcuts:<TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2021" id="tag:parkj@textalign.net,2015:ring02"> <head> <name>TAN transcription of Ring around the Rosie</name> <master-location>ring-o-roses.eng.1987.xml</master-location> <license which="by 4.0" licensor="park"/> <work> <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI> <name>Ring around the Rosie</name> </work> <source> <IRI>http://lccn.loc.gov/87042504</IRI> <name>Mother Goose, from nursery to literature / by Gloria T. Delama, 1987.</name> </source> <adjustments> <normalization which="no hyphens"/> </adjustments> <vocabulary-key> <div-type xml:id="l" which="line (verse)"/> <person xml:id="park" roles="creator"> <IRI>tag:parkj@textalign.net,2015:self</IRI> <name xml:lang="eng">Jenny Park</name> </person> </vocabulary-key> <resp roles="creator" who="park"/> <change when="2014-10-24" who="park">Started file</change> <comment when="2014-10-24" who="park">See p. 39 of source.</comment> <to-do/> </head> . . . . . . </TAN-T> In this example, <name>, <master-location>, and <source> have been modified to describe this file. Note, we haven't had to change <work>. <license> looks different, but in reality it is identical to our previous example, and that is because the IRI + name pattern has been replaced with @which. You may replace any IRI + name pattern with @which; its value must match a <name> in customized or standard vocabulary (a TAN-voc file). In this case, "by 4.0" points to TAN's standard vocabulary for licenses (see ). Here is what that looks like under the hood: <TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2021" id="tag:textalign.net,2015:tan-voc:licenses"> . . . . . . . <body affects-element="license"> <item> <IRI>http://creativecommons.org/licenses/by/4.0/</IRI> <IRI>tag:textalign.net,2015:license:by/4.0/</IRI> <name>by 4.0</name> <desc>attribution 4.0 international</desc> </item> . . . . . . . 
</body> </TAN-voc> Because the validation rules for TAN-voc files require every <name> to be unique, that element can be treated as a unique identifier, similar to @xml:id. We could have repeated the <license> from the previous TAN-T file. But the @which method is much quicker, cleaner, and DRY. Before <vocabulary-key> comes a new element, <adjustments>, which contains a <normalization> statement whose @which says no hyphens. That too points to a standard TAN vocabulary for normalizations: an IRI + name pattern for eliminating discretionary hyphens (see ). Here's what that vocabulary item looks like (invisible to you, but you can look at it any time you like in the vocabularies subdirectory of the TAN files): <TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2021" id="tag:textalign.net,2015:tan-voc:normalizations"> . . . . . . . <body affects-element="normalization"> <item> <IRI>tag:textalign.net,2015:normalization:hyphens-discretionary-removed</IRI> <name>no hyphens</name> <desc>Discretionary word-break line-end hyphens have been deleted.</desc> </item> . . . . . . . </body> </TAN-voc> As you might have inferred, the element <normalization> specifies how we have changed the data, namely, that we have opted to remove word-break line-end hyphenation. In other transcriptions we could use <normalization> to declare other kinds of changes we felt compelled to make, such as removing editorial comments or footnote signals. A healthy list of <normalization>s is a courtesy to users of our data, some of whom might passionately care about keeping or removing line-end hyphenation. Back to our example. <div-type> has a new value for @xml:id, the letter l, and in it too the IRI + name pattern has been replaced by @which, whose value, line (verse), is a standard vocabulary item (see ). A line of poetry is to be contrasted with a physical line on the page. Some lines of poetry take up two or more physical lines. 
For the physical line you would specify: which="line (physical)". There is also a new <comment> element, which is built much the same as <change>. (A <change>, after all, is just a comment about what has been changed.) That seems to be all there is. But if you've been attentive, you will have noticed that <role> from our first TAN-T file (inside <vocabulary-key>) is missing. That's because we don't need it, based on the same principle that lets us resolve @which. A vocabulary <name> can be invoked not only in @which, but in any attribute that points to values of @xml:id, in this case @roles. There is already a standard TAN vocabulary item with the <name> creator, so we can use it directly without having to declare an intermediate vocabulary item with an @xml:id. If we had defined something else in <vocabulary-key> with a @xml:id of creator, that item would take precedence and override the built-in TAN vocabulary. But we haven't, so the standard TAN vocabularies are the default.
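The order of precedence just described can be summarized in a small Python sketch: a reference is looked up first among the @xml:id values declared in <vocabulary-key>, and only then among the <name>s of the standard vocabulary. This is a simplified model and not the TAN validation code. The function name resolve_reference, the use of plain dictionaries, the exact-match comparison, and the two example.org IRIs are all assumptions made for illustration; the other IRIs are taken from the examples above.

```python
def resolve_reference(value, local_ids, standard_names):
    """Return the IRI that a reference such as @roles or @who points to.

    local_ids:      maps @xml:id values from <vocabulary-key> to an IRI
    standard_names: maps <name> values from standard vocabulary to an IRI
    Local ids win over standard vocabulary names (hypothetical helper).
    """
    if value in local_ids:
        return local_ids[value]
    if value in standard_names:
        return standard_names[value]
    raise KeyError(f"no vocabulary item found for {value!r}")


# Standard vocabulary, reduced to a name -> IRI lookup.
standard = {
    "by 4.0": "http://creativecommons.org/licenses/by/4.0/",
    "no hyphens": "tag:textalign.net,2015:normalization:hyphens-discretionary-removed",
    "creator": "tag:example.org,2021:role:creator",  # placeholder IRI, not TAN's
}

# The <vocabulary-key> of our second TAN-T file declares only a person.
local = {"park": "tag:parkj@textalign.net,2015:self"}

creator_iri = resolve_reference("creator", local, standard)  # falls back to standard
park_iri = resolve_reference("park", local, standard)        # found locally

# Declaring a local item with xml:id="creator" would override the standard one.
local_override = dict(local, creator="tag:example.org,2021:role:my-creator")
overridden_iri = resolve_reference("creator", local_override, standard)
```

In the second TAN-T file no local item has the id creator, so the lookup falls through to the standard vocabulary, which is why the <role> declaration could be dropped.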

Building TAN vocabulary
The first TAN-T transcription had a longer <head> than the second one did, and that is because for the former we used an explicit method, that of specifying every IRI and name, and then in the latter adopted shortcuts that took advantage of TAN vocabulary. TAN vocabularies are meant not merely to be a convenience; they are intended to avoid problems that beset projects that create many files with repeated data patterns. When (not if) you make changes to one file, you shouldn't have to remember all the other places where you might need to make the same changes. Move repeating data patterns to one master place, and let the other files point to that pattern. Then, when we need to make changes, we do so only once, at the master location. Stay DRY. The previous examples drew from standard TAN vocabulary, which is written in one of the other TAN formats, TAN-voc. There is a whole collection of standard TAN-voc files in the project subdirectory called vocabularies. We can write our own TAN-voc files, to collect the vocabulary items that we will use repeatedly from one file to the next. 
For example: <?xml version="1.0" encoding="UTF-8"?> <?xml-model href="../../schemas/TAN-voc.rnc" type="application/relax-ng-compact-syntax"?> <?xml-model href="../../schemas/TAN.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?> <TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2021" id="tag:parkj@textalign.net,2015:TAN-voc:standard"> <head> <name>Keywords for TAN files edited by Jenny Park</name> <license licensor="park" which="by 4.0"/> <vocabulary-key> <person which="Jenny Park" xml:id="park"/> </vocabulary-key> <file-resp who="park"/> <resp roles="creator" who="park"/> <change when="2019-10-08" who="park">Started file</change> <to-do> <comment when="2020-01-04" who="park">Need to check files for new vocabulary items.</comment> </to-do> </head> <body> <group affects-element="person"> <item> <IRI>tag:parkj@textalign.net,2015:self</IRI> <name xml:lang="eng">Jenny Park</name> </item> </group> <item affects-element="work"> <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI> <name>Ring a Ring o' Roses</name> <name>Ring Around the Rosie</name> </item> </body> </TAN-voc> In this example case, updates have been made to @id and <name>, and a <comment> has been added to <to-do>. The most significant difference is the <body>, which has two <item>s, one of which is wrapped in a <group>. Each @affects-element specifies one or more names of elements that the enclosed items affect, and the <item>s have the standard IRI + name pattern. <group>s may nest as you like. The difference between a grouped and ungrouped <item> is purely a matter of taste and convenience. The example above illustrates both methods. Whether you group your items or you do not, the practical effect does not change. The <vocabulary-key> has a <person> whose @which points to the body of the first <item>. That is, a TAN-voc file can draw from its own <body> for vocabulary, without repeating it in <vocabulary-key>. 
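To see why grouping is only a matter of convenience, here is a rough Python sketch that flattens a TAN-voc <body> into a plain list of items. It assumes that an <item> inside a <group> inherits the group's @affects-element; the function name collect_items is hypothetical, and the TAN namespace is omitted from the fragment to keep the example short.

```python
import xml.etree.ElementTree as ET

# Simplified copy of the <body> above, without the TAN namespace.
FRAGMENT = """
<body>
  <group affects-element="person">
    <item>
      <IRI>tag:parkj@textalign.net,2015:self</IRI>
      <name>Jenny Park</name>
    </item>
  </group>
  <item affects-element="work">
    <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI>
    <name>Ring a Ring o' Roses</name>
    <name>Ring Around the Rosie</name>
  </item>
</body>
"""


def collect_items(node, inherited=None):
    """Return every <item> under node, grouped or not, as a flat list."""
    found = []
    for child in node:
        # An element's own @affects-element wins; otherwise inherit it.
        affects = child.get("affects-element", inherited)
        if child.tag == "group":
            found.extend(collect_items(child, affects))
        elif child.tag == "item":
            found.append({
                "affects": affects.split() if affects else [],
                "iris": [e.text for e in child.findall("IRI")],
                "names": [e.text for e in child.findall("name")],
            })
    return found


items = collect_items(ET.fromstring(FRAGMENT.strip()))
for item in items:
    print(item["affects"], item["names"])
```

Both items come out in the same shape, whether or not they were wrapped in a <group>.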
Let's return to the <head>s of our two TAN-T files, and see how to incorporate our new TAN-voc vocabulary file. <TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2021" id="tag:parkj@textalign.net,2015:ring01"> <head> <name>TAN transcription of Ring a Ring o' Roses</name> <master-location href="http://textalign.net/release/TAN-2020/examples/ring-o-roses.eng.1881.xml"/> <license which="by 4.0" licensor="park"/> <work which="Ring around the Rosie"/> <source> <IRI>http://lccn.loc.gov/12032709</IRI> <name>Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]</name> </source> <vocabulary> <IRI>tag:parkj@textalign.net,2015:TAN-voc:standard</IRI> <name>Vocabulary for TAN files edited by Jenny Park</name> <location href="TAN-voc/park-projects.TAN-voc.xml" accessed-when="2020-01-10"/> </vocabulary> <vocabulary-key> <person xml:id="park" which="Jenny Park"/> <div-type xml:id="line" which="line (verse)"/> </vocabulary-key> <file-resp who="park"/> <resp roles="creator" who="park"/> <change when="2014-08-13" who="park">Started file</change> <to-do/> </head> . . . . . . . </TAN-T> <TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2021" id="tag:parkj@textalign.net,2015:ring02"> <head> <name>TAN transcription of Ring around the Rosie</name> <master-location>ring-o-roses.eng.1987.xml</master-location> <license which="by 4.0" licensor="park"/> <work which="Ring around the Rosie"/> <source> <IRI>http://lccn.loc.gov/87042504</IRI> <name>Mother Goose, from nursery to literature / by Gloria T. 
Delama, 1987.</name> </source> <vocabulary> <IRI>tag:parkj@textalign.net,2015:TAN-voc:standard</IRI> <name>Vocabulary for TAN files edited by Jenny Park</name> <location href="TAN-voc/park-projects.TAN-voc.xml" accessed-when="2020-01-10"/> </vocabulary> <adjustments> <normalization which="no hyphens"/> </adjustments> <vocabulary-key> <div-type xml:id="l" which="line (verse)"/> <person xml:id="park" which="Jenny Park"/> </vocabulary-key> <resp roles="creator" who="park"/> <change when="2014-10-24" who="park">Started file</change> <comment when="2014-10-24" who="park">See p. 39 of source.</comment> <to-do/> </head> . . . . . . </TAN-T> In each TAN-T file, a new <vocabulary> points to the project TAN-voc vocabulary file we have just created. Along with the customary IRI + name pattern is a new element, <location>, which specifies where the digital file was accessed and when (through @accessed-when). We may include as many of these <location> elements as we wish, with the most preferred or reliable one at the top. The validation process will consult only the first one that leads to an available document. The @accessed-when value is important, because TAN files talk to each other. The validator will look for changes in the file since we last accessed it, and if any changes are found, a warning with a summary of the changes will be returned. It is then up to us to determine if the alterations merit any action on our part. Similarly, anyone using or depending upon our file will be notified of any changes we make, through the same validation process. Once the <vocabulary> is in place, we can draw from our predefined vocabulary. Our revised versions of the <head>s are a bit more DRY, and certainly more compact and easier to read. The longer the TAN file, the more noticeable the improvement. And when our library grows into dozens, hundreds, or thousands of files, we'll be grateful that a change that affects all the files needs to be made only once.
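The two behaviors just described, falling back through <location>s and checking for changes since @accessed-when, can be sketched as follows. This is a minimal illustration, not the actual validation routine: the function names, the availability check passed in as a callback, the example.org address, and the 2020-02-01 change date are all hypothetical. Only plain YYYY-MM-DD dates are handled.

```python
from datetime import date


def first_available(locations, is_available):
    """Return the first location that is available, or None if none is."""
    for href in locations:
        if is_available(href):
            return href
    return None


def changes_since(change_dates, accessed_when):
    """Return the <change> dates that fall after @accessed-when."""
    cutoff = date.fromisoformat(accessed_when)
    return [d for d in change_dates if date.fromisoformat(d) > cutoff]


# Preferred location first, local copy second.
locations = [
    "http://example.org/park-projects.TAN-voc.xml",  # hypothetical address
    "TAN-voc/park-projects.TAN-voc.xml",
]
# Pretend that only the local copy can be reached.
reachable = {"TAN-voc/park-projects.TAN-voc.xml"}

print(first_available(locations, lambda href: href in reachable))
print(changes_since(["2019-10-08", "2020-02-01"], "2020-01-10"))
```

Any change dated after the stored @accessed-when is a candidate for the warning that validation returns.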
In general, when you share your files with other people, you need to make sure that you also share your vocabulary files too. There is an alternative method, that of sending what is called a resolved TAN file, which encapsulates all the vocabulary, but that is a slightly more advanced topic. See . Now that we have created the metadata for our transcriptions, let's turn to the alignment files. Those <head>s will look slightly different, because they are not concerned with transcriptions per se. We start with the TAN-A file:<TAN-A xmlns="tag:textalign.net,2015:ns" TAN-version="2021" id="tag:parkj@textalign.net,2015:ring-alignment"> <head> <name>div-based alignment of multiple versions of Ring o Roses</name> <master-location href="http://textalign.net/release/TAN-2020/examples/TAN-A/ringoroses.div.1.xml"/> <license which="by_4.0" licensor="park"/> <source xml:id="eng-uk"> <IRI>tag:parkj@textalign.net,2015:ring01</IRI> <name>Transcription of ring around the roses in English (UK)</name> <location href="../ring-o-roses.eng.1881.xml" accessed-when="2015-03-10"/> </source> <source xml:id="eng-us"> <IRI>tag:parkj@textalign.net,2015:ring02</IRI> <name>Transcription of ring around the roses in English (US)</name> <location href="../ring-o-roses.eng.1987.xml" accessed-when="2014-08-13"/> </source> <vocabulary-key> <person xml:id="park" which="Jenny Park"/> </vocabulary-key> <resp who="park" roles="creator"/> <change when="2014-08-14" who="park">Started file</change> <to-do> <comment when="2018-08-09-04:00" who="park">Finish file.</comment> </to-do> </head> . . . . . . </TAN-A> Much of the code above will look similar to the previous two examples. The file's <name> and <master-location> are updated. Just like TAN-T files have <source>s, so TAN-A files do as well, except that those sources are always TAN-T transcription files, and they take the IRI + name + location pattern we saw above in <vocabulary>. 
Because alignment files take only TAN transcription files as sources, each <source>'s <IRI> always takes the @id value of the target TAN-T transcription file. <name> is arbitrary. It may replicate exactly the title found in the transcription file, or it may be modified, perhaps to harmonize better with the descriptions of the other source names, or the role of the source within the TAN-A file. Our TAN-A file could contain any number of <source>s, and not necessarily for the same work. The order in which we put the <source>s does not necessarily mean anything. This <head> explains why the <body> of our TAN-A file is allowed to be empty. We have already specified which sources are to be aligned and where they are to be found. Any user or processor of a TAN-A file may assume that every <div> in every source should be automatically aligned upon the basis of shared values of @n. Meanwhile we turn to our fourth file, TAN-A-tok, whose <head> might look like this:<TAN-A-tok xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:TAN-A-tok,ring01+ring02"> <head> <name>token-based alignment of two versions of Ring o Roses</name> <master-location href="http://textalign.net/release/TAN-2020/examples/TAN-A-tok/ringoroses.01+02.token.1.xml"/> <license which="by-nc-nd_4.0" rights-holder="park"/> <token-definition src="ring1881 ring1987" which="letters"/> <source xml:id="eng-uk"> <IRI>tag:parkj@textalign.net,2015:ring01</IRI> <name>Transcription of ring around the roses in English (UK)</name> <location href="../ring-o-roses.eng.1881.xml" accessed-when="2015-03-10"/> </source> <source xml:id="eng-us"> <IRI>tag:parkj@textalign.net,2015:ring02</IRI> <name>Transcription of ring around the roses in English (US)</name> <location href="../ring-o-roses.eng.1987.xml" accessed-when="2014-08-13"/> </source> <vocabulary-key> <bitext-relation xml:id="B-descends-from-A" which="a/x+/b"/> <person xml:id="park" which="Jenny Park"/> </vocabulary-key> <change when="2015-01-20" 
who="park">Started file</change> </head> . . . . . . </TAN-A-tok> The TAN-A-tok <head> looks similar to the previous examples, except that <vocabulary-key> has some new content. <bitext-relation> states through @which or an IRI + name pattern the stemmatic relationship we think holds between the two sources. We have used @which and the value a/x+/b, pointing to a standard TAN vocabulary item for bitext relations: <TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2021" id="tag:textalign.net,2015:tan-voc:bitext-relation"> . . . . . . <item> <IRI>tag:textalign.net,2015:bitext-relation:a/x+/b</IRI> <name>a/x+/b</name> <desc>direct descent, B descends from A, one or more mediaries</desc> </item> . . . . . . </TAN-voc> <token-definition> specifies how we have defined our word tokens. @src has more than one value, specifying that the same tokenization rule should be applied to both sources. @which points to this standard TAN vocabulary item: <TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2021" id="tag:textalign.net,2015:tan-voc:tokenizations"> . . . . . . <item> <token-definition pattern="[\w&#xad;&#x200b;&#x200d;]+"/> <name>letters</name> <name>letters only</name> <name>general word characters only</name> <name>general ignore punctuation</name> <name>gwo</name> <desc>General tokenization pattern for any language, words only. Non-letters such as punctuation are ignored.</desc> </item> . . . . . . </TAN-voc> Up until now, all vocabulary items have taken the IRI + name pattern. The one above does not have an IRI, only a <token-definition> with a @pattern. The value of @pattern, which may look like gibberish, is a regular expression. "Regular" here does not mean ordinary; rather it relates to Latin regula, rule. Regular expressions are rule-based patterned text searches. This particular pattern says that a token is defined as any contiguous string of word characters (\w), soft hyphens (U+00AD), zero-width spaces (U+200B), or zero-width joiners (U+200D).
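A rough Python equivalent of this token definition is sketched below. It is an approximation only: Python's \w is not identical to the \w of the XML regular expressions that TAN uses, so results can differ on symbols and combining marks. The function name tokenize is hypothetical.

```python
import re

# Word characters, plus soft hyphen (U+00AD), zero-width space (U+200B),
# and zero-width joiner (U+200D).
TOKEN_PATTERN = re.compile(r"[\w\u00ad\u200b\u200d]+")


def tokenize(text):
    """Return the tokens of text in order, skipping punctuation and spaces."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Ringel, Ringel, Reihe!"))
# A soft hyphen inside a word does not split the token.
print(len(tokenize("Hol\u00adder")))
```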
This is TAN's default tokenization pattern, and it will be assumed for any TAN-A-tok file that lacks a <token-definition>. TAN adopts this default to capture what we commonly mean in ordinary conversation by "word." When we refer someone to the nth word in a sentence, we most often ignore punctuation marks. For more on token definitions see and . See also . In our <vocabulary-key> we could have also included a <reuse-type>, but we have intentionally omitted it here, because we have

<body
                  bitext-relation="B-descends-from-A" reuse-type="general_adaptation">

. The value for @reuse-type, general_adaptation, corresponds to a <name> in a standard TAN vocabulary item for reuse types. We don't need to invoke a <reuse-type> in the <vocabulary-key> because we are going directly to the name of the vocabulary item. Notice that general_adaptation has an underscore instead of a space. That's because @reuse-type can take multiple values, which are separated by spaces. So spaces in names need to be replaced by an underscore, or a hyphen if we prefer. The values of <name> are never case-sensitive, and the space, hyphen, and underscore are treated as equivalent. (@id values, on the other hand, are always case-sensitive, and never have spaces.)
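The name-matching rule can be illustrated with a short Python sketch. It is written under the assumptions stated in the paragraph above (case-insensitive names; space, hyphen, and underscore equivalent) and is not the normalization used by the TAN function library. The function names, and the second value other-type in the usage example, are hypothetical.

```python
import re


def normalize_name(name):
    """Lowercase a vocabulary name and treat space, hyphen, underscore alike."""
    return re.sub(r"[ _-]", " ", name.strip()).lower()


def names_in_attribute(value):
    """Split a multi-valued attribute on spaces, then normalize each name."""
    return [normalize_name(part) for part in value.split()]


print(normalize_name("general_adaptation") == normalize_name("General Adaptation"))
print(names_in_attribute("general_adaptation other-type"))
```

Because the attribute is split on spaces before names are normalized, a name containing a space must be written with an underscore or hyphen.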

Aligning across projects We now have a collection of five TAN files: two TAN-T transcriptions, a TAN-A alignment/annotation file, a TAN-A-tok word-for-word alignment file, and a TAN-voc file for vocabulary shared across the files. Let us imagine what it might be like to connect our TAN collection to a TAN file made by someone else. Let us assume that we have found elsewhere, in a German project, a TAN transcription of a work that looks quite similar to our own:<?xml version="1.0" encoding="UTF-8"?> <?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?> <?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?> <TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:hans@beispiel.com,2014:ringel"> <head> <name>TAN Transkription, Ringelreihen mit Riederfallen</name> <master-location>http://beispiel.com/TAN-T/ringel.xml</master-location> <license> <IRI>http://creativecommons.org/licenses/by/4.0/</IRI> <name>Creative Commons Namensnennung 4.0 International Lizenz</name> <desc>Dieses Werk ist lizenziert unter einer Creative Commons Namensnennung 4.0 International Lizenz.</desc> </license> <licensor who="schmidt"/> <work> <IRI>tag:beispiel.com,2014:texte:holderbusch</IRI> <name>"Die Kinder auf dem Holderbusch"</name> </work> <version> <IRI>urn:uuid:31648039-3dbb-49b9-b66e-9bd2cd11630e</IRI> <name>zweite Version</name> </version> <numerals priority="letters"/> <source> <IRI>http://www.worldcat.org/oclc/4574384</IRI> <name>Franz Magnus Böhme, Deutsches Kinderlied und Kinderspiel: Volksüberlieferungen aus allen Landen deutscher Zunge, gesammelt, geordnet und mit Angabe der Quellen. 
Leipzig, 1897.</name> </source> <adjustments> <normalization> <IRI>tag:kalvesmaki@gmail.com,2014:normalization:hyphens-discretionary-off</IRI> <name>Keine Bindestriche</name> </normalization> </adjustments> <vocabulary-key> <div-type xml:id="Zeile"> <IRI>http://dbpedia.org/resource/Gedichtzeile</IRI> <name>Gedichtzeile</name> </div-type> <div-type which="poem" xml:id="Gedicht"/> <person xml:id="schmidt" roles="Produzent"> <IRI>tag:hans@beispiel.com,2014:selbst</IRI> <name xml:lang="eng">Hans Schmidt</name> </person> <role xml:id="Produzent"> <IRI>http://schema.org/producer</IRI> <name xml:lang="eng">Produzent</name> </role> </vocabulary-key> <file-resp who="schmidt"/> <resp who="schmidt" roles="Produzent"/> <change when="2014-08-13" who="schmidt">Anfang</change> <comment when="2014-08-13" who="schmidt">unten auf der Z. 438, recht</comment> <to-do/> </head> <body xml:lang="deu"> <div type="Gedicht" n="1"> <div type="Zeile" n="a">Ringel, Ringel, Reihe!</div> <div type="Zeile" n="b">Sind der Kinder dreie,</div> <div type="Zeile" n="c">Sitzen auf dem Holderbuch,</div> <div type="Zeile" n="e">Schreien alle: husch, husch, husch!</div> </div> </body> </TAN-T> It seems that this 19th-century German version is quite similar to our two English versions. We have some alignment options open to us. Two more sets of word-for-word alignments would be interesting, but remember, just because we find a text that nicely aligns with others does not mean that we must align them, or that for a given alignment we must align everything. In this case, we choose not to worry about word-for-word alignments, and we focus here only on the TAN-A alignment, so that, for example, we can use the built-in TAN application to display the three versions in parallel, a reading tool to study more closely intertextual relationships. To that end, we first observe some differences between this transcription and our other two. First, the value of <work> is not the one we have given our two versions. 
Second, <numerals> specifies by its value for @priority that any ambiguous numerals should be interpreted as letter numerals, not Roman (that's important, e.g., for a <div> with an @n value c, which could mean 3 [a, b, c, ...] or the Roman numeral for 100). Next, the lines are wrapped in a <div> for the whole poem (Gedicht) and they have been lettered instead of numbered. And last, the editor seems to have made a typographical error, making the last line e instead of the expected d. These five differences typify inconsistencies one commonly finds in digital texts from different projects of the same work. There are a few other differences in this third transcription that do not affect our alignment. <version> is used to distinguish different versions of the same work found on the same text-bearing object. That is, if we are transcribing a bilingual edition, we can use <version> to specify which of the two versions we are encoding. Notice that the <IRI> value is a UUID. In this case the editor was not prepared to deploy a formal IRI naming scheme (perhaps using a tag URN) that would be satisfactory for work-versions. Also, the <div-type> is defined as http://dbpedia.org/resource/Gedichtzeile (Gedichtzeile = line of poetry), so it doesn't intersect with our IRIs for the vocabulary item line. But <div-type> is not used to align versions, and validation isn't affected, so we do not concern ourselves here with trying to reconcile the different IRIs. These are points we can easily reconcile in our TAN-A file, which we now expand to include the German version.
We make the following adjustments (emphasized):<TAN-A xmlns="tag:textalign.net,2015:ns" TAN-version="2021" id="tag:parkj@textalign.net,2015:ring-alignment"> <head> <name>div-based alignment of multiple versions of Ring o Roses</name> <master-location href="http://textalign.net/release/TAN-2020/examples/TAN-A/ringoroses.div.1.xml"/> <license which="by_4.0" licensor="park"/> <source xml:id="eng-uk"> <IRI>tag:parkj@textalign.net,2015:ring01</IRI> <name>Transcription of ring around the roses in English (UK)</name> <location href="../ring-o-roses.eng.1881.xml" accessed-when="2015-03-10"/> </source> <source xml:id="eng-us"> <IRI>tag:parkj@textalign.net,2015:ring02</IRI> <name>Transcription of ring around the roses in English (US)</name> <location href="../ring-o-roses.eng.1987.xml" accessed-when="2014-08-13"/> </source> <source xml:id="ger"> <IRI>tag:beispiel.com,2014:ringel</IRI> <name>Transcription of an ancestor of Ring around the roses in German</name> <location accessed-when="2014-08-22">http://beispiel.com/TAN-T/ringel.xml</location> <location accessed-when="2014-08-22">../TAN-T/ring-o-roses.deu.1897.xml</location> </source> <adjustments src="ger"> <skip div-type="Gedicht"/> <rename n="e" by="-1"/> </adjustments> <vocabulary-key> <person xml:id="park" which="Jenny Park"/> <alias id="ring" idrefs="ger eng-us"/> </vocabulary-key> <resp who="park" roles="creator"/> <change when="2014-08-14" who="park">Started file</change> <change when="2014-08-22" who="park">Added German version.</change> <to-do> <comment when="2018-08-09-04:00" who="park">Finish file.</comment> </to-do> </head> . . . . . . </TAN-A> The first major change is the insertion of a third <source>, pointing to the new file and specifying its name and IRI. Note that two <location>s have been provided, one for the original and another for a local copy we have saved. Validation will take into account only the first document available. 
If we wanted to work primarily off our local copy, we would have put that <location> first. By placing it second, we allow the validation engine to work primarily off the master version and therefore look for updates and changes. If that version is unavailable, validation will be made against the second, local copy. <adjustments> specifies through its @src that only the German version should be adjusted by the contained instructions. The enclosed <skip> says, in effect, to ignore the wrapping <div> for purposes of alignment. The <rename> takes care of the apparent typographical error, and anchors the German version to the U.S. one. Note that the German version uses e, but we have used 5. But we could have used e, or even the Roman numeral v, had we wished to. Every TAN file's numeration system is evaluated locally, independent of any external files. We need not reconcile the a, b, and c @n values in the German version, because these will be automatically treated as equivalent to 1, 2, and 3. The TAN format supports four numeration systems other than Arabic numerals: Roman numerals (uppercase or lowercase), alphabetic numerals (a, b, c, ..., z, aa, bb, ...), digit-alphabet combinations (e.g., 1a, 1e, 4g), and alphabet-digit combinations (e.g., a4, a5, b5). The last two systems are interpreted as a two-tier numbering system. The second major change, to address the German version's different value of <work>, is the addition of an <alias>, which allows us to assign one or more vocabulary items a common id. Wherever the value ring is used, it stands in for ger and eng-us, which point to the two TAN-T files. You may be familiar with this concept from critical editions, where a siglum, e.g., A might stand for several other sigla, e.g., a, b, and c. So every time you see something said about A, you know that by implication it is true of a, b, and c. Every TAN-T file has only one work and only one written source.
So if you wish to make a claim about a particular work or source, you can use a TAN-T's id as a surrogate. That is, the @id in <source> can stand in to represent either the work or the book or manuscript from which the text has been taken. So if we make claims in our TAN-A file about a written source or a work, ring would assert the claim to be true for the works pointed to by the German and the U.S. version. (We do not need to specifically mention eng-uk in the <alias>, since it has the same work IRI as the U.S. version does.) Alternatively, instead of <alias>, we could simply have adjusted our TAN-voc file, adding the German version's <IRI> value to the appropriate vocabulary item, and use that id. The last major insertion is a new <change>, documenting when we made the alterations. Its @when effectively updates the version of our TAN-A file. With these additions, the German version is now aligned with the other two. We could have made our work simpler just by directly modifying our local copy of the German version. But such a change would not have affected the master copy. What happens when the owner of the German file makes changes? At that point we would be faced with a version conflict: changes in the original, and our own changes in the copy. We would struggle to reconcile the differences. And we would have to repeat that exercise every time the German file was updated. By keeping our local copy of the German file unchanged, and making simple adjustments in our TAN-A file, we can keep our local copy synchronized with the master file and yet make the adjustments needed to coordinate with ours. The purpose statement in these guidelines says that TAN was "designed to maximize the syntactic and semantic interoperability of texts, annotations, and language resources." Here we see the importance of the qualifier "maximize." In no world will there ever be (nor should there be, it seems) a single, indisputable way to divide a given work.
The TAN format does not change that reality. Rather, it provides a convergent ecosystem in which different practices can be easily reconciled, to help editors and authors enhance cross-project interoperability without artificially forcing conformity, or suppressing legitimately different outlooks. Perhaps Hans Schmidt, the producer of the German version, can be contacted (e.g., through his tag URN). We do so, and we suggest that he modify the version to make it align better. Perhaps he has reasons for labeling the lines with letters, and perhaps he is reluctant to explicitly identify this poem with Ring around the Rosie. That is within his rights. But the conversation might lead to our pointing out that n="e" should probably be n="d" and that there is an apparent typographic error in the last line. Or perhaps we're the ones in error. (The original, printed book has the poem twice on page 438, one with the spelling "Holderbuch" at line 3, the other, "Holderbusch".) If Schmidt chooses to correct his master file, he can add a new <change>, and thereby tacitly notify anyone else using the file that corrections have been made. At this point we have a network of six TAN files, five from our collection and one from outside. Although simple and small, this network could be extended to address some creative and complex research questions. Applications based on XSLT stylesheets could be used to automatically align the versions for reading and study, or to perform statistical analysis. What you've read so far is only a cursory introduction to TAN features. Study the rest of these guidelines, as well as example TAN libraries, and you will find numerous ways to develop TAN files, and to use them to enhance your research, teaching, and writing.
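The way values of @n in this section are read as numbers can be sketched in Python. The sketch covers only Arabic, Roman, and alphabetic numerals, not the two-tier combinations, and it rests on assumptions: that aa, bb, ... continue the alphabetic series as 27, 28, ..., and that Roman is preferred unless letters are given priority. The function names are hypothetical, and this is not the TAN function library.

```python
import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}
ROMAN_SHAPE = re.compile(
    r"m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})"
)


def roman_to_int(s):
    """Return the value of a well-formed Roman numeral, else None."""
    s = s.lower()
    if not s or not ROMAN_SHAPE.fullmatch(s):
        return None
    total = 0
    for ch, nxt in zip(s, s[1:] + " "):
        value = ROMAN_VALUES[ch]
        # Subtract when a smaller numeral precedes a larger one (e.g. iv).
        total += -value if ROMAN_VALUES.get(nxt, 0) > value else value
    return total


def alpha_to_int(s):
    """Return 1..26 for a..z, 27.. for aa, bb, ... (assumed), else None."""
    s = s.lower()
    if not s or len(set(s)) != 1 or not "a" <= s[0] <= "z":
        return None
    return (len(s) - 1) * 26 + ord(s[0]) - ord("a") + 1


def interpret_n(n, priority="roman"):
    """Interpret an @n value as an integer, or return None if it is not one."""
    if re.fullmatch(r"[0-9]+", n):
        return int(n)
    roman, alpha = roman_to_int(n), alpha_to_int(n)
    if roman is not None and alpha is not None:
        # Ambiguous value such as "c": the stated priority decides.
        return alpha if priority == "letters" else roman
    return roman if roman is not None else alpha


# The German version's lines, read with letters given priority.
print([interpret_n(n, priority="letters") for n in ["a", "b", "c", "e"]])
```

With letters given priority, c is read as 3 rather than 100, which is the ambiguity that <numerals> settles in the German file.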

Detailed description This part of the guidelines provides a detailed description of the design and structure of the formats of the Text Alignment Network. The material follows the organization of the schema files (kept in the schemas subdirectory), so both can be studied in tandem. outlines, in a non-technical way, the principles and technical foundations of the TAN format. , , , and describe each TAN format, class by class. Each chapter starts with theory or scholarly context before expanding on technical points. The chapters in this part have been written with the assumption that you have already read the previous part () and that you have already started to create or edit a TAN collection. Because readers will come from different specialties, all acronyms, abbreviations, and concepts are defined and explained, albeit tersely, to explain how they affect the use of TAN. Suggestions for further reading are provided for those who want a more thorough introduction to a topic. General underpinnings This chapter retains something of the introductory spirit of the previous one by providing an overview of the fundamental principles and technologies behind TAN. The goal is to explain the design of the format. Although this chapter assumes on your part no prior knowledge of any particular technology, it is also not meant to be a tutorial. Links to further reading will take you to good introductory material.

Design principles The TAN formats have been designed around a few basic principles: Scholarly habits Be patient. Simplify. Stay focused. Don't repeat yourself. Don't state the obvious. Use familiar conventions. Scholarly freedom Express doubt. Offer alternatives. Exercise independence. Invite interdependence. Scholarly responsibility Declare your assumptions. Make your work citable. Satisfy scholars' expectations: Who did what when? What are your sources? How do you define your terms? What alterations have you made to your sources? What rights do I have to use your material? General utility Use stable technology. Keep design predictable, consistent. Make each datum human readable. Make each datum computer actionable.

Format organization The Text Alignment Network is a modular suite of XML encoding formats, each one designed for a specific type of textual data, divided into three classes: texts (class 1), text alignments and annotations (class 2), and everything else (class 3). Class 1: representations of textual objects, i.e., transcriptions. (See note on transcriptions versus transliterations.) Each transcription file contains the text of a single work from a single text-bearing object (which we term scriptum; see ), whether physical or digital. There are two types of transcription file: a standard generic format (TAN-T) and a gentle customization of TEI All (TAN-TEI). These two types are differentiated by the root element, <TAN-T> and <TEI> respectively. Class 2: annotations on class-1 texts, and alignment declarations. There are two types of alignment, one for broad, general alignments and another for granular, word-for-word alignments. The former, with <TAN-A> as the root element, aligns any number (one or more) of class-1 files, and allows one to annotate those files. The latter, <TAN-A-tok>, aligns only pairs of class-1 files, on a word-for-word basis. Lexico-morphology files, <TAN-A-lm>, are used to encode the lexical and morphological (or part-of-speech) forms of individual words from a single class-1 file, or of a language in general. Class 3: everything else. <TAN-voc> collects and labels vocabulary items used in other TAN files. TAN catalog files have the root element <collection>, and they index locally available TAN files, and selective parts of their metadata. <TAN-mor> is used to define the grammatical categories or features of a given language and to specify rules for lexico-morphological codes in dependent TAN-A-lm files. TAN adopts a stand-off approach to annotation or markup. In the alternative method, inline markup, which you may be familiar with from TEI or HTML, an annotation is applied directly to the base text, e.g.,

<p>He said
                  <quote>"Jump!"</quote></p>

, where the inner element <quote> annotates the third word. In stand-off annotation, however, <p>He said "Jump!"</p> would be left as-is, and somewhere else there would be an annotation that states that the third word is a quotation. If the stand-off annotation is in the same file, it is an internal stand-off annotation. If the annotation is in a different file, it is an external stand-off annotation. For many common, simple cases, inline annotation is simple, convenient, and straightforward. But as inline annotations are added, the benefits slowly diminish. When parts of a file attract multiple markup elements, the file can become difficult to read and navigate. Stand-off annotation provides several benefits: An editor can focus on a limited set of closely related questions. A source text without inline annotations is less cluttered, and therefore easier to read, than one with inline annotations. Editors can work on separate annotation files based upon the same master transcription file, even if they have very different research interests. A single annotation can refer to two or more texts (e.g., identification of quotations), and not have to prioritize, or be located in, any single one. Complementary or competing annotations can be made, and those annotations may point to concurrent or overlapping spans of text (a major problem for inline annotation, where according to XML rules no element may interlock or overlap with another). A corpus of stand-off external annotation files becomes, collectively, a complex dataset, supporting lines of research that might not have been anticipated by any single project. Editorial labor can be conducted without central coordination, as individuals work at their own pace, independently. When an error is found in a transcription file, it can be corrected in a single place, in the master.
Anyone using a copy of that master file will be notified in the validation process of changes that have been made and they can deal with them accordingly. Any data file can be updated independent of any other that points to it, or to which it points. Cross-file links required in stand-off annotation networks files, which can then be combined and transformed in any number of ways to produce a wide variety of derivative documents (e.g., collated versions, statistical analysis). The stand-off approach works toward a principle often valued in computer science, that of the disaggregation of data. That is, in a master format, data of a particular type should not be entangled with other types of data. It can later be reaggregated in all kinds of ways, but that is an end product, not the way master data should be stored and managed. It is analogous to the way any well-run kitchen keeps its ingredients separate, until it is time to cook or bake a variety of products. We keep separate our flour, eggs, sugar, and so forth, until we find out what a recipe calls for, at which point we combine those ingredients in a variety of ways. It would be terrible if you were asked to make muesli (or granola), and found that someone had already turned the ingredients you wanted into a cake! Stand-off annotation is not without problems and vulnerabilities. For example: When (not if) the base text changes, the editor is unaware of how that change will affect any stand-off annotations. Not having the annotated text and the annotation in the same reading space can be an inconvenience. When searching for, or querying, the base text, stand-off annotations can be difficult or impossible to incorporate to refine a selection. When using the material for other purposes, it can be cumbersome or challenging to reintegrate annotations with the base text. Linking an annotation to its base text requires extra work and maintenance.
Normally this involves building and administering a library of identifiers. Adding and removing ids, or checking them for errors, can be time-consuming and confusing. These are important challenges, but TAN validation rules have been designed to mitigate such problems. The last problem listed above is perhaps the greatest barrier to stand-off annotation. TAN approaches pointing in a very different way, one that is closer to current scholarly habits. See . Furthermore, TEI inline annotations are supported. In general, you are encouraged to use TEI inline annotations where they are simple and make sense. But when the markup accumulates, threatens to create overlapping structures, or poses other difficulties, TAN class 2 files can be an ideal way to build and curate annotations.

Assumptions in the creation of TAN data All creators and users of TAN files are expected to share a few basic assumptions. First, all TAN-compliant data is to be understood as largely derivative. That is, data files express no originality or creativity independent of their sources (but see below about interpretation). A TAN file should be created with the intent of adhering as closely as possible to some model or archetype. For example, a transcription is assumed to replicate faithfully some earlier digital edition or text-bearing material object (e.g., stone, papyrus, manuscript, printed book for written text; audiovisual media for oral or performative texts). Morphological files and alignment files should describe as clearly and as reliably as possible their source transcriptions. In creating and publishing a TAN file you claim to have offered a good-faith representation or description of something; in using a TAN file, you hold the creator to that expectation. Second, all core TAN files are interpretive. That is, they are permeated by editorial assumptions and opinions that might not be shared by everyone. If there is any semblance of originality or creativity in a TAN file it is in that interpretive outlook. For example, if you edit a transcription file you must decide how to handle unusual letterforms and other visible marks. Your decisions will be influenced by your perspective on the original text and its native writing system, and how you interpret and use Unicode. If you write an alignment file, you must make decisions about what factors caused one text to be transformed into another. Lexicomorphological files require you to commit to one or more grammars and dictionaries, which adopt certain perspectives on language, and you must discern how best to handle cases of vagueness and ambiguity. No TAN file ever stands completely outside the interpretive act. 
In creating and publishing a TAN file you claim to have disclosed as best you can the assumptions behind your interpretive outlook; in using a TAN file, you hold the creator to that expectation. Third, all core TAN files are applicable. That is, the interpretive impulse is assumed to be coupled with an equally strong desire to make the data as useful to as many users as possible, even those who may not share your assumptions or interpretation. TAN files are intended for intertextual comparison, so idiosyncrasies of a particular text-bearing object will be regarded by some users as either uninteresting or an obstacle. A creator of a transcription file should normalize and segment texts, adopting the most widely used reference systems, so as to optimize the alignment process. Morphological files should depend whenever possible upon commonly accepted grammars and lexica. Alignment files should work with comprehensible categories of text reuse. No TAN file will always be applicable to everyone, but it should be as suitable to as many as possible, for as many purposes as possible. In creating a TAN file you claim to use common, shared conventions whenever possible, and to note any departures; in using a TAN file, you hold the creator to that expectation. Fourth, TAN data is to be considered accurate, but not necessarily precise or complete. For example, if a TAN-A file claims that the opening of Plato's Republic book 3 quotes from Homer's Iliad, the claim is true and accurate, but is neither precise nor complete. Parts of the opening of book 3 are certainly not quotations, and the whole of the Iliad is not quoted in the Republic. Or take a TAN-A-tok file. The token-for-token alignment of two texts might be selective, and focus only on the points of interest to the editor. Although the TAN formats permit a great deal of both precision and comprehensiveness, neither is mandated, except where explicitly noted by the TAN specifications. 
In creating a TAN file you claim to make accurate assertions; in using a TAN file, you should hold the creator to that expectation, but you must assess for yourself how precise and complete it is.

Core technology TAN depends upon a set of relatively stable technologies. Those technologies and the underlying terminology are briefly explained below, with attention paid to interpretive decisions that affect validation rules.

Unicode

What is it? Unicode is the worldwide standard for the encoding, representation, and exchange of digital texts. The standard is maintained by a nonprofit consortium whose goal is to represent all the world's writing systems, living and historical. The Unicode standard allows us to share texts in any alphabet, syllabary, or ideographic system reliably, regardless of how that text is rendered (e.g., fonts, display). With more than 128,000 characters, Unicode is almost as complex as human writing itself. The entire sequence of characters is divided into blocks, each one reserved, more or less, for a particular script or group of characters. Within each block, characters may be grouped further. Each character is assigned a single number called a codepoint. Codepoints are numbered according to the hexadecimal system (base 16), which uses the digits 0 through 9 and the letters A through F. (The decimal number 10 is hexadecimal A; decimal 11 = hex B; decimal 16 = hex 10; decimal 79 = hex 4F.) It is helpful to think of Unicode as a very long table of sixteen columns, a glyph in each square; this is illustrated nicely in this article. It is common to refer to Unicode characters by their value and perhaps by their name. The value customarily starts "U+" and continues with the hexadecimal value, usually at least four hexadecimal characters. When the official Unicode name is given, it is normally in uppercase.

Examples: Unicode characters

Character      Unicode value    Unicode name
" " (space)    U+0020           SPACE
®              U+00AE           REGISTERED SIGN
ю              U+044E           CYRILLIC SMALL LETTER YU

In an XML file, nearly any Unicode codepoint may be used, either by typing or pasting the character directly, or by using XML entities. An XML entity is a proxy for some other text, marked by an ampersand, some text, and then the semicolon. For example, &amp; represents the ampersand and &lt; stands for <. To access specific Unicode characters an entity may start &#x followed by the hexadecimal codepoint (if you prefer to work with decimal codepoints, leave off the x). For example, the XML hex entity &#x44e; (or &#1102; in decimal) is a proxy for the Cyrillic small letter yu.
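The relationship between a character, its codepoint, and its numeric XML entities can be checked with a few lines of code. The following is only an illustrative sketch in Python (not part of the TAN function library, which is written in XSLT); the helper name u_plus is hypothetical.

```python
import xml.etree.ElementTree as ET

def u_plus(char: str) -> str:
    """Return the customary U+XXXX label for a single character (hypothetical helper)."""
    return f"U+{ord(char):04X}"

# Hexadecimal 44E and decimal 1102 are the same codepoint.
yu_hex_label = u_plus("ю")
yu_decimal = int("44E", 16)

# An XML parser resolves hex and decimal character references to the same character.
parsed = ET.fromstring("<t>&#x44e;&#1102;</t>").text
```

Here parsed holds the Cyrillic small letter yu twice, once from each form of the entity.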

Unicode normalization Unicode rules provide guidance on how text should be normalized, to identify equivalent variations. For example, the character o (U+006F: LATIN SMALL LETTER O) followed by the combining accent ¨ (U+0308: COMBINING DIAERESIS) should be treated as identical in meaning to the single character ö (U+00F6: LATIN SMALL LETTER O WITH DIAERESIS). There are two codepoints that could be used for the Greek question mark (;), and normalization converts the less preferred codepoint to the other. TAN validation rules require all data to be normalized according to the Unicode NFC algorithm (the most common of the four normalization methods). Any text in a TAN file that is not NFC normalized will be marked as invalid. A supplied Schematron Quick Fix will let users automatically normalize text (for editing tools such as Oxygen that support Schematron Quick Fixes). This enforcement of NFC normalization helps to make sure that texts are fairly compared.
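The effect of NFC normalization can be tried outside of TAN. The following is a minimal sketch using Python's standard unicodedata module, not the TAN validation routine itself; the function names to_nfc and is_nfc are assumptions made for illustration.

```python
import unicodedata

def to_nfc(text: str) -> str:
    """Return the NFC-normalized form of the text."""
    return unicodedata.normalize("NFC", text)

def is_nfc(text: str) -> bool:
    """True if the text is already in NFC, i.e., normalizing changes nothing."""
    return to_nfc(text) == text

# o + COMBINING DIAERESIS composes to the single character U+00F6.
composed = to_nfc("o\u0308")

# U+037E GREEK QUESTION MARK is converted to U+003B SEMICOLON.
question_mark = to_nfc("\u037e")
```

A check like is_nfc corresponds to the test that marks non-normalized text as invalid, and to_nfc to what a quick fix would apply.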

Unicode characters with special interpretation The characters U+200B ZERO WIDTH SPACE, U+200D ZERO WIDTH JOINER, and U+00AD SOFT HYPHEN placed at the end of a leaf <div>, perhaps followed by space that will be ignored (see below), signal that the text is to be joined with any subsequent text (i.e., the next leaf <div>). Accordingly, any TAN function that needs to extract text from a leaf <div> structure will delete from the end of its text the U+200B, U+200D, or U+00AD character and its trailing space. (By contrast, text from a leaf <div> that does not end this way will first be space-normalized, then a single space will be appended.) Because these special line-end characters are difficult to distinguish visually from spaces and hyphens, their XML entities, &#x200b;, &#x200d;, and &#xad;, should be preferred in any XML output. Much has been written about the different ways U+00AD SOFT HYPHEN has been or should be used and interpreted. Debate will no doubt continue. TAN design assumes that the soft hyphen marks a place in a word where a line break has occurred, is allowed to occur, or both. In situations where the text is printed or displayed, any soft hyphen that does not mark a word broken by a line should not be displayed.
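The joining rule for leaf <div>s can be sketched as follows. This is an illustration in Python under the assumptions stated in the comments, not the TAN function itself; the names leaf_div_text, JOINERS, and XML_SPACE are hypothetical.

```python
import re

# The three characters that, at the end of a leaf div, signal joining.
JOINERS = "\u200b\u200d\u00ad"

# XML space characters: space, tab, carriage return, newline.
XML_SPACE = re.compile(r"[ \t\r\n]+")

def leaf_div_text(raw: str) -> str:
    """Return the text of one leaf div as it would be joined to the next."""
    # Space-normalize: collapse runs of XML space, strip both ends.
    text = XML_SPACE.sub(" ", raw).strip(" ")
    if text and text[-1] in JOINERS:
        # Drop the special character (its trailing space is already gone).
        return text[:-1]
    # Otherwise the text ends in exactly one space.
    return text + " "

# A word broken across two leaf divs by a soft hyphen is rejoined.
joined = "".join(leaf_div_text(d) for d in ["exam\u00ad \n", "  ple   text "])
```

In this sketch the soft hyphen and the space after it disappear, so the two leaf <div>s yield one unbroken word followed by the rest of the text.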

Combining characters At the core level of conformance, Unicode does not dictate whether combining characters (accents, modifying symbols) should be counted independently, or as part of a base character, nor do core XML technologies. In most cases, this point is negligible. But it can affect regular expressions and XPath expressions (see below). Two of the class-2 formats allow the counting of characters. Such counting is assumed to be made exclusively of individual base (non-combining) characters (each perhaps followed by one or more combining characters). Therefore one character is defined as the regular expression \P{M}\p{M}*, bound to global variable . Any numerical reference made in a TAN file to an individual character, i.e., through @chars, is interpreted by counting only non-combining characters. When the nth character is requested, TAN functions will return the nth base character along with any combining characters that immediately follow. For example, a̳b̈́c͠d consists of four base characters, interleaved with three combining characters, technically seven total. But for @chars, which counts only base characters, there are at most four characters. A value of 1 picks both the base character and its combining character, a̳. TAN rules stipulate that combining characters must have a preceding base character. Any <div> that, after any initial space, starts with a combining character will be marked as invalid. See also .
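The counting rule can be illustrated with a short sketch. Python's built-in re module does not support \p{M}, so this example, which is an assumption-laden illustration and not TAN code, uses Unicode general categories instead; the function name base_char_groups is hypothetical.

```python
import unicodedata

def base_char_groups(text: str) -> list:
    """Split text into base characters, each with its trailing combining characters."""
    groups = []
    for ch in text:
        # General categories Mn, Mc, and Me correspond to \p{M}.
        if unicodedata.category(ch).startswith("M") and groups:
            groups[-1] += ch
        else:
            # A base character, or a stray leading combining character
            # (which TAN would mark as invalid).
            groups.append(ch)
    return groups

# Four base characters, three combining characters, seven codepoints in all.
sample = "a\u0333b\u0308c\u0360d"
groups = base_char_groups(sample)
first = groups[0]  # what a @chars value of 1 would select
```

The sample string is written with escapes so that the combining characters remain visible in the source.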

Unicode points not allowed Because TAN files are not scriptum-oriented (see ), the following characters will generate an error if found in a TAN file: U+00A0 NO-BREAK SPACE U+2000 EN QUAD U+2001 EM QUAD U+2002 EN SPACE U+2003 EM SPACE U+2004 THREE-PER-EM SPACE U+2005 FOUR-PER-EM SPACE U+2006 SIX-PER-EM SPACE U+2007 FIGURE SPACE U+2008 PUNCTUATION SPACE U+2009 THIN SPACE U+200A HAIR SPACE
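A check for the characters listed above is easy to sketch. The following Python illustration is not the TAN validation code; the names DISALLOWED and find_disallowed are hypothetical.

```python
# U+00A0 plus the contiguous run U+2000 through U+200A.
DISALLOWED = {0x00A0, *range(0x2000, 0x200B)}

def find_disallowed(text: str) -> list:
    """Return (position, U+XXXX label) for every disallowed space character."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) in DISALLOWED
    ]

hits = find_disallowed("a\u00a0b\u2009c")
```

Note that U+200B ZERO WIDTH SPACE falls just outside the range and is therefore not reported.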

Further reading Unicode Consortium Unicode (Wikipedia)

eXtensible Markup Language (XML)

What is it? Defined by the W3C, the eXtensible Markup Language (XML) is a markup language that can be extended to allow anyone to define the structure and rules of a document type. For a quick, simple introduction to XML see . XML is one of many formats that can be described as tree-based formats. Others include JSON, HTML, YAML, and Markdown. All of the preceding formats can be expressed in XML, but not the other way around. This does not mean that XML is inherently superior. (For some purposes, it is overkill.) But it does mean that XML is the lingua franca for treelike data structures. For more on the relationship between XML and other treelike formats, especially JSON, see the Invisible Markup Community Group.

Schemas and validation TAN validation files are found in the schemas subdirectory. Each TAN file is validated by two types of schema files, one dealing with major rules concerning structure and data type, written in RELAX-NG, the other with more complex, detailed rules, written in Schematron. The RELAX-NG rules are written primarily in compact syntax (*.rnc), and then converted to XML syntax (*.rng). For TAN-TEI, the special format One Document Does it all (TAN-TEI.odd) is used to adjust the rules for TEI All. The ODD file is then processed by TEI stylesheets into compact and XML RELAX-NG formats. The Schematron files are generally quite short. The primary work is done by an extensive function library written in XSLT. For the most part, the Schematron files arbitrate between the file and the validation results calculated by the TAN function library. For a detailed overview of this process, see . Some validation engines that process a valid TAN-compliant TEI file may return an error such as

conflicting ID-types for attribute "who" of element "comment" from namespace "tag:textalign.net,2015:ns"

Such a message alerts you to the fact that by mixing TEI and TAN namespaces, you open yourself up to the possibility of conflicting xml:id values. It is your responsibility to ensure that you have not assigned duplicate identifiers. An XML editor may be configured to ignore this discrepancy. (In Oxygen XML Editor go to Options > Preferences... > XML > XML Parser > RELAX NG and uncheck the box ID/IDREF.)

Space characters and normalization By default in XML, unless otherwise specified, consecutive space characters (space, tab, newline, and carriage return) are considered equivalent to a single space. This gives editors the freedom to format XML documents as they like, balancing human readability against compactness. In XML, space normalization is performed by stripping leading and trailing whitespace and replacing sequences of one or more whitespace characters with a single space (U+0020). All TAN formats assume space normalization, with an extra caveat for leaf <div>s. Initial space is always stripped. If a leaf <div> ends in the zero width space, the zero width joiner, or the soft hyphen (see ) the character is suppressed along with any ending space; otherwise the text is normalized to end in a single space character (whether or not there are space characters in the leaf <div> itself). If retention of multiple spaces or spaces of specific sizes is important for your files and research, then you should not be working with the TAN format, which cannot be used to replicate the appearance of a scriptum (see ). Pure TEI (and not TAN-TEI) is a better alternative, since it allows for a literal use of space, and supports the creation of scriptum-oriented XML files. Once you finish with that scriptum-oriented transcription, you might be ready to prepare a second one oriented toward intertextual analysis, at which point TAN would be ideal. For more on space see guidance in the W3C recommendation.

Mixed, non-mixed, and semi-mixed content In many popular XML formats such as TEI, XHTML, and Docbook some elements allow a mixture of elements and nonspace text as children, e.g., <div>Some <span>text</span></div>. These are called mixed content models. The TAN formats, aside from TAN-TEI, are committed to a non-mixed content model, e.g.,

<div><span>Some </span><span>text</span></div>

Nonspace text nodes and elements are never siblings. The practical effect of this decision is that TAN files may be indented as you like, and whitespace text may be placed anywhere, without altering the meaning. The exception is TAN-TEI files, which allow any kind of TEI construction, including mixed content. Many projects do not consider the implications of how they render space, however, and you should study the topic closely. An expanded TAN file (see ) may include what we term a semi-mixed content model, in which any element may have one and only one nonspace text node along with any children elements. That nonspace text node may appear at the beginning or the end of the children nodes. This applies only to the expansion of TAN files, not to TAN files themselves.
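The difference between mixed and non-mixed content can be tested mechanically. The sketch below uses Python's standard xml.etree.ElementTree and is only an illustration of the definition given above, not a TAN validation routine; the function name has_mixed_content is hypothetical.

```python
import xml.etree.ElementTree as ET

XML_SPACE = " \t\r\n"

def has_mixed_content(xml_text: str) -> bool:
    """True if any element has both child elements and nonspace text as children."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        children = list(elem)
        if not children:
            continue  # text-only (or empty) elements are never mixed
        # Text before the first child, plus the text after each child.
        pieces = [elem.text] + [child.tail for child in children]
        if any(p and p.strip(XML_SPACE) for p in pieces):
            return True
    return False

mixed = has_mixed_content("<div>Some <span>text</span></div>")
non_mixed = has_mixed_content("<div><span>Some\n  </span><span>text</span></div>")
```

The second document passes regardless of how it is indented, because only whitespace ever sits beside an element.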

Namespaces

What are they? XML allows users to create document types of whatever kind. One person may wish to use the element <band> to refer to a musical group; another might use this element to encode radio frequencies. Perhaps someone wishes to mention a musical group and a radio frequency in the same document, which would entail mixing two very different types of elements, each named band. XML allows users to mix vocabularies, even when those vocabularies use the same element names. Disambiguation is accomplished by associating an element name with a kind of family name. That family name is an IRI (see below). The actual full name of an element, then, is the local name plus the IRI that qualifies its meaning, e.g., band{http://music-example.com/terms/} and band{http://frequency-example.com/terms/}. The IRI (the family name) is called the namespace, a term that might seem vague or confusing. It has nothing to do with space. It is merely a term of art to qualify a name. In the world there are many cities that have the same name. We use the name of the state, region, or even country to explain which city we mean. As region names are to city names, so namespaces are to element (and some attribute) names. Namespaces can be declared in an XML document. When they appear, they look a lot like attributes. (They aren't.) They take the form xmlns="http://music-example.com/terms/" (this defines the default namespace) or xmlns:[PREFIX]="http://frequency-example.com/terms/" (this assigns a namespace to a prefix) placed inside an opening tag. For example, <band xmlns="http://music-example.com/terms/">...</band> declares http://music-example.com/terms/ to be the default namespace for <band> and all descendants, unless explicitly overridden. To return to our example, different <band>s can be combined through namespaces:

<band xmlns="http://music-example.com/terms/">
    <band xmlns="http://frequency-example.com/terms/"> ... </band>
</band>

<band xmlns="http://music-example.com/terms/" xmlns:e2="http://frequency-example.com/terms/">
    <e2:band> ... </e2:band>
</band>

<e1:band xmlns:e1="http://music-example.com/terms/" xmlns:e2="http://frequency-example.com/terms/">
    <e2:band> ... </e2:band>
</e1:band>

Namespaces allow us to mix elements as we like. But this also means that when you point to, or refer to, an element, you should always be aware of what its namespace is.
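How a parser sees the "full name" of a namespaced element can be observed directly. The following sketch uses Python's standard xml.etree.ElementTree purely for illustration; note that this library writes the full name as {namespace}local, the reverse order of the notation used above. The example URIs are the illustrative ones from the prose.

```python
import xml.etree.ElementTree as ET

MUSIC = "http://music-example.com/terms/"
FREQ = "http://frequency-example.com/terms/"

DOC = f"""
<band xmlns="{MUSIC}" xmlns:e2="{FREQ}">
    <e2:band>...</e2:band>
</band>
"""

root = ET.fromstring(DOC)

# Each element's tag carries its namespace, so the two bands are distinct.
full_names = [elem.tag for elem in root.iter()]

# Pointing to an element requires naming its namespace...
found = root.find("f:band", {"f": FREQ})

# ...while a bare local name matches nothing here.
not_found = root.find("band")
```

The prefix used in a query (f above) need not match the prefix used in the document (e2); only the namespace itself matters.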

TAN namespace and prefix The TAN namespace is tag:textalign.net,2015:ns. The recommended prefix is tan. The namespace does not change from one version of TAN to another. The TAN-TEI format uses as its default the TEI namespace, , normally given the prefix tei. But in a TAN-TEI file, the head and its descendants are in the TAN namespace. All TAN functions and core global parameters and variables are set in the TAN namespace.

The Text Encoding Initiative

What is it? The Text Encoding Initiative (TEI; ) is a consortium of scholars and scholarly organizations that maintains the rules and documentation behind a collection of XML formats intended for encoding texts. TEI files have been used widely by libraries, museums, publishers, and individual scholars to prepare and publish texts for online research, teaching, and preservation. In addition to the guidelines themselves, the Consortium provides a variety of resources and training events for learning TEI, information on projects using the TEI, a bibliography of TEI-related publications, and software. TEI provided the impetus for the creation of TAN, and continues to inspire its development. TEI was designed to be highly customizable, to suit the needs of individuals or communities of practice. One of the TAN formats, TAN-TEI, is one such customization, based as it is on an ODD file that is in the same directory as the rest of the schemas. TAN-TEI schemas are generated on the basis of the official TEI All schema that is available at the time of release. TAN-TEI files and standard, out-of-the-box TEI All files are not automatically interchangeable. TAN-TEI expects all metadata to be human- and computer-readable, whereas TEI metadata is geared primarily to human readability. TAN-TEI tightly regulates the structure of the text, whereas TEI allows for a variety of structures. In any conversion process to and from TEI and TAN-TEI, some human intervention may be required, and conversion in either direction may entail loss. For more about the strictures placed upon the TEI All schema see . See also and .

Further reading Text Encoding Initiative

Data types Being written purely in XML technologies, TAN uses data types defined in the W3C's official specifications, e.g., strings, booleans, integers. The following data types require some special comments.

Languages TAN adopts for language identification Best Current Practice (BCP) 47, which standardizes identifiers for languages and scripts. For most users of TAN, this will be a simple two- or three-letter abbreviation, sometimes supplemented with a hyphen and an abbreviation designating a script or regional subtag. For example, eng, eng-GB, and eng-Cyrl-GB refer, respectively, to English (in general), English from the United Kingdom, and English from the United Kingdom written in the Cyrillic script. (In BCP 47 the region subtag for the United Kingdom is GB, and a script subtag precedes a region subtag.) As a general rule, values of this type should begin with a three-letter language code, preferably lowercase. (The two-letter codes cover only a few dozen languages; the three-letter codes support thousands of them.) ISO codes for human languages appear in @xml:lang and <for-lang>. The former states what language the enclosed text is in. The latter is an empty element that simply points to a specific language. For example, <for-lang> in the context of a TAN-mor file indicates which languages the file was written for. TAN has several global variables and functions useful for working with language codes. See .

Dates and times For dates and dates + times, TAN adopts the corresponding XML data types, which follow ISO syntax. That syntax begins with years (the largest unit) and ends with days, seconds, or fractions of seconds (the smallest). The simplest date takes this form: YYYY-MM-DD. If a time is included, it is specified by continuing the string, first with a T (for time) then the form hh:mm:ss.sss(Z|[-+]hh:mm). For example, 2016-09-20T20:38:27.141-04:00 is an ISO date-time for Tuesday, September 20, 2016 at 8:38 p.m., Eastern Time Zone.
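The example date-time can be taken apart with any ISO-aware library. The following sketch uses Python's standard datetime module and is offered only as an illustration of the syntax, not as part of TAN.

```python
from datetime import datetime, timedelta, timezone

# Parse the example: date, T, time with milliseconds, then the UTC offset.
stamp = datetime.fromisoformat("2016-09-20T20:38:27.141-04:00")

day_of_week = stamp.strftime("%A")   # name of the weekday
offset = stamp.utcoffset()           # four hours behind UTC
in_utc = stamp.astimezone(timezone.utc)
```

Converting to UTC moves the moment past midnight, which is why the time zone offset matters when comparing or sorting date-times.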

Further reading BCP 47 official specifications BCP 47 technical details W3C specification Wikipedia entry on ISO 8601

Identifiers and their use (IRIs, URIs, URLs, URNs, UUIDs) TAN makes extensive use of the following identifiers: IRI: Internationalized Resource Identifier, a generalization of the URI system, allowing the use of Unicode; defined by RFC 3987 URI: Uniform Resource Identifier, a string of characters used to identify a name or a resource; defined by RFC 3986 URL: Uniform Resource Locator, a URI that identifies a Web resource and the communication protocol for retrieving the resource. URN: Uniform Resource Name, a term that originally referred to persistent names that used a bare urn: scheme, but is now applied to a variety of systems that have registered with the IANA. URNs are generally best thought of as a subset of URIs. UUID: Universally Unique Identifier, a computer-generated 128-bit number that may be attached as an identifier to any entity. UUIDs can be built into a URN by prefixing them with urn:uuid:. See also .
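Building a URN from a UUID can be sketched in a couple of lines. This example uses Python's standard uuid module for illustration only; the UUID value shown is an arbitrary sample, not an identifier used by TAN.

```python
import uuid

# A fixed sample UUID (arbitrary), so that the result is reproducible.
sample = uuid.UUID("12345678-1234-5678-1234-567812345678")
sample_urn = sample.urn

# A freshly generated random UUID, suitable as a new identifier.
fresh_urn = uuid.uuid4().urn
```

Each call to uuid.uuid4() produces a different value, so fresh_urn will differ on every run.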

Resource Description Framework (RDF) and Linked Open Data

What are they? Identifiers are used in many contexts for many purposes. One such purpose is called Linked Open Data (LOD), also known as the Semantic Web, which aims to allow cross-project interoperability of data. It relies upon a very simple data model called Resource Description Framework (RDF), recommended by the World Wide Web Consortium (W3C). The term "Resource" (the R in RDF) refers to any person, place, or concept: anything at all, whether you think of it as a resource or not. "Description" is overly specific, too, since RDF was designed to support general assertions, descriptive or not. Perhaps it is easiest to think of RDF as a standardized way to make assertions, as if the name were simply "Assertion Framework." It is a way to make claims about things in the world. The RDF data model rests upon the concept of a statement, made of three parts: subject, predicate, and object. Subjects and predicates take identifiers that name things. The object may take an identifier or just data. As people independently identify concepts with the same URLs, they create RDF datasets that can be combined, synthesized, and compared. RDF statements found across the web allow inferences no individual project could ever anticipate. The Semantic Web recommends the use of URLs as identifiers. That way, if a computer encounters a URL naming a concept, it can be programmed to go to the web resource and retrieve other RDF statements, recursively. So URL identifiers look like a web page address (e.g., http://...), but they are first and foremost names for things. Ideally, those URLs will still name those things after the domain name expires and the web resource cannot be found. Although RDF statements must be made of only three components, it is possible in a roundabout way to create more complex assertions. In one technique, the assertion itself is given a URL, and then RDF statements are made about the assertion. Such assertions are in some cases not easily integrated with other RDF statements. 
Users who query an RDF database will not find relevant complex RDF statements unless they build their queries to anticipate such situations (or the query engine has been customized).

TAN claims and RDF Much of TAN can be converted to RDF statements. In fact, TAN may be one of the most human-friendly ways to read and write RDF. For example, consider how one might express "Person X's name is 'Dave Smith'." Compare this snippet (taken from ), written in Turtle, the RDF syntax generally regarded as the most human-readable, ...

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://biglynx.co.uk/people/dave-smith>
    rdf:type foaf:Person ;
    foaf:name "Dave Smith" .

...with the TAN equivalent:

<person>
    <IRI>http://biglynx.co.uk/people/dave-smith</IRI>
    <name>Dave Smith</name>
</person>

These TAN and RDF expressions are interchangeable. But in more complex claims, it is, at this time, not clear whether all assertions in TAN can be losslessly converted to the RDF model. Every class-2 file makes a claim about the text, and there must always be attached to the claim someone that must be blamed or credited for the assertion. TAN also permits such claims to be modified through traditional adverbs. This is best seen in the TAN-A <claim>, which allows a person to nuance a claim to a degree that is difficult or impossible to express in traditional RDF. For example, RDF does not allow one to say "Person X is not the author of text Y," but TAN does. TAN claims can also be quite complex. Whereas the standard RDF claim consists of three components (subject, predicate, object), most TAN claims have more. Every TAN claim must have at the minimum: a claimant (no RDF counterpart; the person, organization, or algorithm that asserts the claim), a subject (counterpart to RDF subject), and a verb (counterpart to RDF predicate). Verbs can be defined to permit, require, or disallow other claim components, such as adverbs or objects, many of which are permitted by default. Most TAN claims involve more than three components, so converting a TAN claim to RDF requires creating a complex RDF statement. 
In many cases, this requires the use of RDF* instead of RDF (link below). Many TAN claims involve textual subjects or objects. References to parts of text can be quite complex, and they must be made with reference to other entities. It is doubtful whether a given specific textual subject or object can be satisfactorily reduced to an unambiguous IRI, because such an IRI would need to include a mechanism to resolve the meaning of the syntax. Such an IRI must not only explain the work's reference system, but also identify the chosen version, scriptum, and perhaps token definition and numeration system. Many texts have more than one "canonical" reference system, so an IRI might point to two different textual passages, thereby breaking a cardinal rule of IRIs: although an entity may be given multiple IRIs, it is never acceptable for an IRI to be ambiguous. There is, at present, no widely accepted solution to this problem, although attempts have been made through CTS URNs and DTS URNs. For more details see and <claim>.

Further reading W3C recommendation Linked Data Linked Open Vocabularies RDF* CTS URNs DTS URNs

Tag URNs TAN files make extensive use of tag URNs (see ). In fact, TAN's namespace is itself a tag URN (). A tag URN has two parts: Namespace. tag: + an e-mail address or domain name owned by the person or organization that has authorized the creation of the TAN file + , + an arbitrary day on which that address or domain name was owned + :. The day is expressed in the form YYYY-MM-DD, YYYY-MM, or YYYY. A missing MM or DD is implicitly assigned the value of 01. Name of the subject. An arbitrary string (unique to the namespace chosen) chosen by the namespace owner as a label for the subject (e.g., the file, a work, a scriptum). If you are providing a tag URN for a TAN file, that name can be the same as the filename, but it is a good practice not to do so, because filenames may need to be changed. You should pick a name that is at least somewhat intelligible to human readers. It is a good idea to build a name via categories, from most general to most specific. For example tag:pat@example.com,2014:work:aristotle-pseudo:secreta-secretorum might be used as an IRI to name the work the Secret of Secrets attributed to Aristotle. A TAN file that transcribes a particular version of this text might look like this: tag:pat@example.com,2014:transcription:scriptum:badawi-1954:work:secrets. Although you may use any tag URN coined by someone else, when you create a tag URN, you may use only namespaces you own or owned. Care should be taken in choosing the name, because you are the sole guarantor of its uniqueness. It is permissible for something to have multiple identifiers, but never acceptable for an identifier to name more than one thing. It is a good practice to keep a master checklist of tag URNs you have created. If you find yourself forgetting, or think you run the risk of creating duplicate tag URNs, you should start afresh by creating a new namespace for your tag URNs, if only by changing the date in the tag URN namespace. 
Tag URNs

tag:jan@example.com,1999-01-31:TAN-T001
tag:example.com,2001-04:work:usc22.1
tag:evagriusponticus.net,2014:tan-a-lm:Evagrius_Praktikos_grc_Guillaumonts
tag:bbrb@example.org,1995-04-01:pos-grc

The first example comes from someone who owned the email address jan@example.com on January 31, 1999 (at the stroke of midnight, Coordinated Universal Time). The other examples follow a similar logic. The namespaces of the second and third examples are tied to the owners of specific domain names. The 2014 in the third example is shorthand for the first second of January 1, 2014. TAN files are identified and named via tag URNs, not URLs, for several reasons: Permanence. Authors of TAN data are creating files that are meant to be relevant for decades and centuries from now, well after most domain names today have changed ownership or fallen into obsolescence, and well after the creators are dead. URLs are not designed for such longevity. Responsibility. The TAN format requires every piece of data to be attributable to someone (a person, a group of persons, or an algorithm). A tag URN connects the identifier with the responsible person or group. URLs cannot identify the person or organization responsible for the name. Accessibility. Tag URNs have almost no barriers. They can be created by anyone who has an email address. No one has to register with a central authority. You can begin naming anything you want, any time you want, without anyone's approval, and without paying anything. Ease. Tag URNs are easy to use. All you need is an email address, which is very easy to get. You can use a domain name too, but many potential TAN authors never have owned a domain name, and never will, barring them from creating or publishing linked open data under the classic model, where you coin URLs in a domain you own. 
Many of those who do own domain names cannot or do not wish to configure, populate, maintain, and troubleshoot servers with the referral mechanisms recommended by Semantic Web advocates (see ). Scholarly citation norms. In the Semantic Web, the conflation of URL qua name with URL qua location is considered by many a virtue because the single string does double duty, both naming the resource and pointing to a location where more can be learned. Although the combination is elegant from the perspective of an engineer, it is confusing to many others. URLs are commonly thought to be merely locations for data, not names for things. It also goes against an important principle in scholarly citation practices, namely, the name of a publication should always be distinguished from where it might be found. Further reading: RFC 4151, the official definition of tag URNs
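The two-part structure described above can be illustrated with a short Python sketch. It is not part of TAN; the function name parse_tag_urn and the simplified pattern are assumptions made for illustration, and the pattern does not enforce every character rule of RFC 4151. It does show how a missing month or day defaults to 01.

```python
import re
from datetime import date

# Simplified pattern: "tag:" + authority + "," + date + ":" + name.
# Assumption: this is looser than the full grammar in RFC 4151.
TAG_URN = re.compile(
    r"^tag:(?P<authority>[^,:\s]+),"
    r"(?P<year>\d{4})(?:-(?P<month>\d{2}))?(?:-(?P<day>\d{2}))?"
    r":(?P<name>\S*)$"
)

def parse_tag_urn(urn):
    """Split a tag URN into its authority, date, and name (hypothetical helper)."""
    m = TAG_URN.match(urn)
    if m is None:
        raise ValueError(f"not a tag URN: {urn!r}")
    # A missing month or day is implicitly assigned the value 01.
    day = date(int(m["year"]), int(m["month"] or 1), int(m["day"] or 1))
    return {"authority": m["authority"], "date": day, "name": m["name"]}

if __name__ == "__main__":
    print(parse_tag_urn("tag:example.com,2001-04:work:usc22.1"))
```

The namespace of a tag URN is the authority plus the date, so two URNs share a namespace when both of those parts agree.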

Regular expressions Regular expressions are patterns for searching text. The term regular here does not mean ordinary. Rather, alluding to the Latin root regula (rule), it refers to a rule-based method of finding and replacing text through patterns. Regular expressions come in different flavors, and have several layers of complexity. TAN regular expressions adhere closely to the recommendation of XSLT 3.0 (XML Schema Datatypes plus some extensions), as outlined in XPath Functions 3.1. XML Schema Datatypes define regular expressions differently than does Perl, one of the most common flavors of regular expression. For example, the pipe symbol, |, is treated as a word character (\w) in XML regular expressions, but as a nonword character in Perl. For convenience, here are the codepoints in the range U+0020..U+00FF that are considered word characters according to XML (and therefore TAN): Word characters (\w):

$ + 0 1 2 3 4 5 6 7 8 9 < = > A B C D E F G H I J K L M N O P Q
                           R S T U V W X Y Z ^ ` a b c d e f g h i j k l m n o p q r s t u v w x y z
                           | ~ ¢ £ ¤ ¥ ¦ ¨ © ª ¬ ® ¯ ° ± ² ³ ´ µ ¸ ¹ º ¼ ½ ¾ À Á Â Ã Ä Å Æ Ç È É Ê Ë
                           Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð
                           ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

Non-word characters (\W):

! " # % & ' ( ) * , - . / : ; ? @ [ \ ] _ { } ¡ § «  ¶ · »
                           ¿
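The two lists above can be approximated outside an XML processor. The Python sketch below assumes the XML Schema definition of \w as every character except those in the Unicode general categories P (punctuation), Z (separators), and C (other). The helper name is_xml_word_char is hypothetical, and results for a few characters can vary with the version of the Unicode database bundled with Python.

```python
import unicodedata

def is_xml_word_char(ch):
    """Approximate XML Schema \\w: anything that is not punctuation (P),
    a separator (Z), or an 'other' character (C)."""
    return unicodedata.category(ch)[0] not in "PZC"

if __name__ == "__main__":
    chars = [chr(cp) for cp in range(0x20, 0x100)]
    # Control and spacing characters are classified too, but print invisibly.
    print("word:   ", " ".join(c for c in chars if is_xml_word_char(c)))
    print("nonword:", " ".join(c for c in chars if not is_xml_word_char(c)))
```

Note that Python's own \w behaves differently (it accepts the underscore and rejects the pipe), which is why the sketch classifies by Unicode category instead of calling the re module.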

The placement of some of these characters may seem to you counterintuitive or wrong. But at this point complaining will not change the conventions. Any apparent mistakes are definitive ones. Just familiarize yourself with the conventions. A regular expression search pattern is treated just like a normal search pattern until the computer reaches a special character:

. [ ] \ | ^ $ ? * +
                     { } ( )

Here is a brief key to how those special characters behave in regular expressions when they are first found. (Some of these special characters change their meaning if they are found inside square brackets; on this point, see the recommended reading below):

Special characters in regular expressions

Symbol    Meaning
.         any character
|         or (union)
?         zero or one
*         zero or more
+         one or more
[ ]       a class of characters
( )       a group
^         beginning of a line or string (doesn't capture any characters)
$         end of a line or string (doesn't capture any characters)

If you need to use any of those special characters as characters in their own right, then you need to escape them, by prefixing the character with an escape character, \.

Escaped special characters in regular expressions

Symbol    Meaning
\\        backslash (an escaped escape character)
\^        a caret sign (must be escaped with the \)
\$        dollar sign (escaped)
\(        opening parenthesis (escaped)
\[        opening square bracket (escaped)

The escape character appearing before some letters accesses certain classes of characters:

Character class escapes in regular expressions

Symbol            Meaning
\w                any word character
\W                any nonword character
\s                any of the four standard spacing characters: space (U+0020), tab (U+0009), newline (U+000A), carriage return (U+000D)
\S                anything not a spacing character
\d                any digit (0-9)
\D                anything not a digit
\p{IsGujarati}    any character from the Unicode block named Gujarati

Some examples of regular expressions:

Examples of regular expressions, each applied to the string "Wi-fi, good. A_hem* isn't!"

Expression: ^.+$
Meaning: one whole line of characters
Matches: "Wi-fi, good. A_hem* isn't!"

Expression: [ae]
Meaning: a or e
Matches: "e"

Expression: [a-e]
Meaning: a, b, c, d, or e
Matches: "d", "e"

Expression: [^ae]+
Meaning: one or more characters that are anything except a or e
Matches: "Wi-fi, good. A_h", "m* isn't!"

Expression: .i
Meaning: any character followed by i
Matches: "Wi", "fi", " i"

Expression: (.i)
Meaning: when a character followed by an i is found, treat it as a capture group (used only in a search pattern)
Matches: "Wi", "fi", " i"

Expression: [aeiou]\w*
Meaning: any lowercase vowel along with every word character that follows
Matches: "i", "i", "ood", "em", "isn"

Expression: [t*].
Meaning: any t or * and the following character. Note that the asterisk, if inside a character class, represents itself.
Matches: "* ", "t!"

Expression: \s+
Meaning: one or more space characters
Matches: " ", " ", " "

Expression: \w+
Meaning: one or more word characters (the underscore is a nonword character in XML, per the lists above)
Matches: "Wi", "fi", "good", "A", "hem", "isn", "t"

Expression: \W+
Meaning: one or more nonword characters
Matches: "-", ", ", ". ", "_", "* ", "'", "!"

Expression: [^q]+
Meaning: one or more characters that are not a q
Matches: "Wi-fi, good. A_hem* isn't!"

The examples above provide a taste of how regular expressions are constructed and read.

Regular Expressions and Combining Characters A regular expression might be ambiguous in the context of combining characters. Suppose we have a string of three characters, áb (i.e., an acute accent over the a; the codepoints are, in XML entities, a&#x301;b). The regular expression a. will in some search engines include the b and in others not. Unicode has differentiated three levels of support for regular expressions (see official report). Only level-one conformance is guaranteed in XPath, and therefore in TAN. Combining characters fall in level two. In TAN, character counts depend exclusively upon base characters, not combining ones (see ). TAN includes several functions that usefully extend XML regular expressions. See .

Further reading:

Various tutorials on Regular Expressions
Wikipedia, Regular Expressions
Regular Expressions in XSLT 3.0
Unicode and Regular Expressions
XML Schema Datatypes
A New \u: Extending XPath Regular Expressions for Unicode

Common patterns and structures This chapter provides general background to the elements and attributes that are common to all TAN files. For more detailed discussion, see . This chapter does not discuss TAN catalog files, on which see .

Common patterns

IRI + name pattern Both humans and computers need to read and write TAN metadata. Very often what is readable to humans is unreadable to computers, and vice versa. So the TAN format requires that all metadata be provided whenever possible in both forms. Although this rule may appear to introduce redundancy and therefore opportunities for error, the clarity is critical. It is the only way at present to ensure that any person or algorithm that approaches the data can parse and use it. In addition, doubly expressed metadata provides a safeguard much like a checksum: human- and computer-readable descriptions should comport. Any discrepancy signals a problem that should be checked. Some metadata, such as that inside <comment> or <change>, are neither easily nor profitably translated into a computer-actionable string. In such cases only the human-readable form is required. Other metadata involve regular expressions (e.g., @pattern) or ISO-compliant dates (e.g., @when), both of which are well formed and are usually human-legible. Such data are not repeated, although they may be explained via <desc> or <comment>. Those exceptions aside, all other metadata takes what is called the IRI + name pattern: one or more <IRI>s followed by one or more <name>s then zero or more <desc>s. This is the core pattern for nearly all TAN vocabulary items.
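As a rough illustration of the ordering rule just described, the following Python sketch checks whether a sequence of child element names follows the IRI + name pattern. The function name follows_iri_name_pattern is hypothetical, and the sketch looks only at element order; it says nothing about the other checks that TAN validation performs.

```python
import re

def follows_iri_name_pattern(child_names):
    """Return True if the names are one or more IRI, then one or more name,
    then zero or more desc (hypothetical helper, order check only)."""
    codes = {"IRI": "I", "name": "n", "desc": "d"}
    # Encode each child as one letter; unknown elements become "?".
    encoded = "".join(codes.get(name, "?") for name in child_names)
    return re.fullmatch(r"I+n+d*", encoded) is not None

if __name__ == "__main__":
    print(follows_iri_name_pattern(["IRI", "IRI", "name", "desc"]))
    print(follows_iri_name_pattern(["name", "IRI"]))
```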

Digital entity metadata pattern Some entities identified by the IRI + name pattern will be digital resources. In those cases, the IRI + name pattern is extended. There must be one or more <location>s, with @href and @accessed-when, which signal where the resource is and when it was last consulted. In validation, only the first document available will be used. Extra <location>s might prove helpful for applications. There may be an optional <checksum>, to more accurately specify which version of a file was consulted. If the entity is a TAN file, then <IRI> must be a valid tag URN that matches the @id value of the TAN file being referred to. Because there is only one @id in a TAN file, any IRI + name pattern that points to it will have only one <IRI>. If the entity is not a TAN file, then any IRI may be used, including its resolved URL. @accessed-when states when a file was last accessed. During validation, the target file will be checked. Any changes before that date will be ignored; those after will be reported, normally as warnings. See . All these requirements may seem excessive, since in other formats (HTML, TEI), to refer to another file one needs simply a link, via @href or @src. But TAN files are meant to be valid long after their creation, when @href may point to broken links. An <IRI> might allow one to find a missing file. It also helps specify which file is intended. Sometimes one file gets overwritten by a different one.

Edit stamp Most TAN elements allow for an optional edit stamp, an @ed-who and an @ed-when, stating who created or edited the enclosed data and when. Neither attribute is allowed without the other. @ed-when is one of the attributes that help determine a file's version. See . An edit stamp is much like a <change> without a narrative. The attributes simply mark the element where a change has been made. If a description of the alteration is considered necessary, <change> should be used.

Overall structure All TAN-compliant files, no matter the type or class, follow a common basic structure: (1) a prolog, normally with at least two processing instruction nodes; (2) a root element; and (3) a head, a body, and an optional teiHeader and tail. Prolog and processing instruction nodes: The standard prolog of every XML file should begin:

<?xml version="1.0" encoding="UTF-8"?>

XML version 1.1 is a permissible alternative, and encoding="UTF-8" is optional. After that come two processing instructions specifying the two schema files required for validation:

<?xml-model href="[PATH]/[ROOT-ELEMENT-NAME].rn[g OR c]"?>

<?xml-model href="[PATH]/TAN.sch"?>

The first processing instruction node points to the RELAX-NG schema that declares the major, structural rules. The second points to the finely tuned rules, written in Schematron. Both processing instructions are required, except in systems where those processing instructions are implicitly understood (e.g., an Oxygen project or framework). [PATH] represents the pathname to the schema file, whether local or on a server, and [ROOT-ELEMENT-NAME] stands for the name of the file's root element (the element that is the ancestor of all other elements in the document and the descendant of none). It is your choice whether you use .rnc or .rng as the extension for the RELAX-NG schema. The former is the compact syntax and the latter, the XML format. They are equivalent. The schemas are written initially in the compact syntax, then converted to the XML format.

TAN files permit three different levels of Schematron validation: terse, normal, and verbose. A phase may be specified with a pseudoattribute phase in the prolog, e.g., <?xml-model href="TAN.sch" phase="verbose"?>. But it is customary not to specify the phase, since most users will want to pick the level of validation desired at a given time. Verbose takes the longest time, and terse the shortest. Verbose provides the most feedback, terse the least. But some files will not show any difference in results from one phase to the next. For more on validation, see .

Root element: The name of the root element identifies the type of TAN file:

Root TAN elements

Root element name    Type of data                                  TAN class
<TAN-T>              plain text transcriptions                     1
<TEI>                TEI transcriptions                            1
<TAN-A>              division-based alignments and annotations     2
<TAN-A-tok>          token-based alignments                        2
<TAN-A-lm>           lexico-morphological annotations              2
<TAN-mor>            part of speech / morphology patterns          3
<TAN-voc>            glossaries                                    3
<collection>         catalog of TAN files                          3

<collection> is provided here only to complete the table. None of the material in this chapter applies to this special class 3 format. See . Each root element takes a mandatory @id and @TAN-version. On @id, see below. @TAN-version must be 2021, the current version of TAN. All TAN elements fall under the namespace tag:textalign.net,2015:ns. In most cases, the namespace is declared in the root element. (The only exceptions are TAN-TEI transcription files, which take as a default namespace http://www.tei-c.org/ns/1.0 everywhere but in /TEI/head, which takes the TAN namespace.) For more about namespaces, see . Root element children: Most root elements take two mandatory children: <head> and <body>, the latter containing data and the former, metadata (data about the data). Root elements of TAN-TEI files take three children: <teiHeader>, <head>, and <text>. The apparent duplication of a head element is necessary: the <teiHeader> does not satisfy TAN metadata requirements, and the TAN header does not try to do what the teiHeader does. See . All TAN files may take one final optional child, <tail>, a private use element that allows any well-formed XML. It was introduced initially to experiment with methods in improving the efficiency of validation and applications, but it can be used for a variety of tasks or applications. Nothing in a TAN file should be dependent upon the <tail>. That is, if you are editing a TAN file and you add a <tail>, assume that it will be disregarded by other users. Similarly, you may delete any TAN file's <tail> without consequence.

Identifying TAN files: @id Every TAN file requires in its root element an @id, which must take the form of a tag URN (see for syntax). The file's @id is the primary way other TAN files will refer to it, and it may be used in RDFa, JSON-LD, and linked open data (see ). A tag URN begins with a namespace component, and concludes with the identifying string. The namespace of @id must match at least one other tag URN namespace from the <IRI> of a <person> identified by <file-resp>. See . In choosing a value for @id you might imitate the filename, but this is normally not a good idea, since files are frequently renamed, often with good reason. A TAN file's @id should not be changed, especially after public release. The name should remain permanent and stable, even if flaws in the name are recognized. On occasion during editing, it will become clear that revisions are so deep that the file is altogether a different kind of thing. If a previous version has been published, then coining a new @id is advised, to make a clean break. You may document the connection by supplying <predecessor>, which establishes a line of ancestry. If you take someone else's data and alter it, then you should not change the @id. To ensure that you are credited with any revisions you make to the file (if you are allowed; see <license>), you should add yourself as a <person> and then document your alterations through <change> or @ed-when and @ed-who. You might also add a <predecessor> element, pointing to the previous version of the file. The @id is the only file-specific metadatum positioned outside <head>. It is placed as rootward in the document as possible to make clear that it names the entire document.

TAN file versions The version of a TAN file is identified by the most recent date in a file's @when, @ed-when, and @accessed-when. Whenever you change a TAN file that has already been published, provide at least an edit stamp () in the part of the file you changed, or add a new <comment> or <change>, so that anyone validating a TAN file dependent upon yours will be warned that changes have been made. The user may then either continue to process the file (the changes may be minor or inconsequential) or pause and see if anything on their end needs to be changed.
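The rule for identifying a file's version can be sketched in Python as follows. This is an illustration only, not TAN's own implementation: the function name file_version and the sample markup are hypothetical (the sample is not a valid TAN file), and the sketch assumes date or dateTime values without mixed time zones.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# The three attributes named above as determining a file's version.
DATE_ATTRS = ("when", "ed-when", "accessed-when")

def file_version(xml_text):
    """Return the most recent date found in @when, @ed-when, or @accessed-when.
    Raises ValueError if the file has none of these attributes."""
    root = ET.fromstring(xml_text)
    stamps = [
        datetime.fromisoformat(el.attrib[attr])
        for el in root.iter()
        for attr in DATE_ATTRS
        if attr in el.attrib
    ]
    return max(stamps)

# Hypothetical, heavily abbreviated sample for illustration only.
SAMPLE = """<TAN-T xmlns="tag:textalign.net,2015:ns">
  <head>
    <change when="2020-01-15" who="pat">Started file</change>
    <location href="x.xml" accessed-when="2021-03-02"/>
  </head>
  <body><div n="1" ed-when="2020-12-31" ed-who="pat">text</div></body>
</TAN-T>"""

if __name__ == "__main__":
    print(file_version(SAMPLE).date())
```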

Attribute inheritability and priority Some attributes affect not merely their parent element but all their parent's descendants. This phenomenon is called inheritability. Some attributes are non-inheritable. That is, the attribute relates only to the parent element. Examples: @pattern, @flags. If TAN schema documentation for an attribute does not state anything about the inheritability of an attribute's values, it should be treated as non-inheritable. Most inheritable attributes are weakly inheritable. That is, inheritance stops at any descendant that has the same attribute. For example, @xml:lang set to eng specifies that its text nodes are in English, but it might contain another element whose @xml:lang is set to lat. If text has multiple ancestors with different @xml:langs, the closest (leafward-most) is the only one that counts. Other inherited attributes are cumulative. That is, their values combine as one goes from root to leaf. For example, if an element with @cert wraps another, and each one has a @cert value of 0.5, it means that the claim behind the wrapped element has only 25% certainty. @n in a <div> is indirectly cumulative for the purposes of resolving values of @ref. Any given <div> has one or more implied references, formed by all permutations of concatenating values of inherited @ns. Cumulative inherited attributes are infrequent, and the documentation specifies how each one behaves. Some attributes within the same element have interpretive priority. @claimant, for example, has priority over @cert. That is, the two attributes in the same element are to be interpreted to mean: "@claimant has @cert confidence about the following claim:...." It does not mean that one is uncertain whether the claimant made the claim.
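The difference between weak and cumulative inheritance can be seen in a small Python sketch. It is an illustration under stated assumptions rather than TAN's resolution code: the walk function and the sample fragment are hypothetical, weak inheritance is shown with @xml:lang, and cumulative inheritance with @cert, whose values are multiplied from root to leaf.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def walk(el, lang=None, cert=1.0, out=None):
    """Collect (tag, effective language, effective certainty) for each element."""
    out = [] if out is None else out
    lang = el.get(XML_LANG, lang)            # weak: the nearest declaration wins
    cert = cert * float(el.get("cert", 1))   # cumulative: values multiply
    out.append((el.tag, lang, cert))
    for child in el:
        walk(child, lang, cert, out)
    return out

# Hypothetical fragment, not a valid TAN file.
DOC = ET.fromstring(
    '<div xml:lang="eng" cert="0.5"><div cert="0.5"><quote xml:lang="lat"/></div></div>'
)

if __name__ == "__main__":
    for row in walk(DOC):
        print(row)
```

In the sample, the inner div inherits eng and has an effective certainty of 0.25, while quote overrides the language with lat but still carries the accumulated certainty.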

Defining words and tokens At the heart of interaction between class-1 and class-2 files is the need to identify words. This poses a problem at the outset. The term word is notoriously difficult to define, no matter the context or language. For example, "New York" and "didn't" can each be reasonably defined as being either one or two words. Furthermore, some scholars consider punctuation to be words (e.g., commas in modern prose, representing "and"), whereas others ignore them as being anachronistic or capricious (e.g., medieval manuscripts or modern editions of ancient texts). In the end, the many meanings for "word" reflect the diversity of scholarship. TAN follows the field of corpus linguistics and avoids word in favor of the proximate term token: one or more characters defined not according to grammar but according to a regular expression (see ). In TAN, a token is purely a string definition, used to segment and to point. A token in TAN does not entail any linguistic categories. Neither editors nor users of TAN data should infer that a <tok> points to a morpheme, a lexeme, or any other linguistic entity. There will frequently be a fortuitous correlation between the two, but it is not guaranteed. TAN was developed with a concern for ancient literature, where punctuation is generally ignored as being late or not central to the text. Happily, even in contemporary use, most people ignore punctuation when they count words. Therefore the default <token-definition> defines a token as being any continuous string of word characters (\w), the soft hyphen, the zero-width space, or the zero-width joiner, formally defined by: <token-definition regex="[\w&#xad;&#x200b;&#x200d;]+"/> This pattern closely resembles what is ordinarily thought of as words, but perhaps with some surprises (see above, ). If no <token-definition> is explicitly given, the default token definition above will be used.
If you are working with modern texts, where punctuation might be important to name and number, try the built-in keyword letters and punctuation: <token-definition regex="[\w&#xad;&#x200b;&#x200d;]+|[^\w&#xad;&#x200b;&#x200d;\s]"/> This expression defines a token as a sequence of word characters or any single character that is neither a word character nor a space. The string (I go!) would have five tokens: (, I, go, !, and ). For other standard TAN token definitions see <token-definition>. You may customize your own <token-definition>. But keep in mind that TAN files were meant to be shared across fields and disciplines. You should define tokens in a way users of your texts expect. Two class-2 TAN annotation files with different tokenization systems can be challenging to collate.
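The two token definitions discussed above can be imitated in Python, with the caveat that Python's re module does not implement XML's \w. The sketch below therefore classifies characters by Unicode general category (an approximation of XML Schema's \w), and adds the soft hyphen, zero-width space, and zero-width joiner by hand. The names is_token_char and tokenize are hypothetical.

```python
import unicodedata

# Soft hyphen, zero-width space, zero-width joiner.
EXTRA = {"\u00ad", "\u200b", "\u200d"}
XML_SPACES = " \t\n\r"

def is_token_char(ch):
    """Approximate [\\w plus the three extra characters] from the default definition."""
    return ch in EXTRA or unicodedata.category(ch)[0] not in "PZC"

def tokenize(text, punctuation=False):
    """Split text into tokens. With punctuation=True, imitate the
    'letters and punctuation' definition: every single character that is
    neither a token character nor a space becomes its own token."""
    tokens, current = [], ""
    for ch in text:
        if is_token_char(ch):
            current += ch
            continue
        if current:
            tokens.append(current)
            current = ""
        if punctuation and ch not in XML_SPACES:
            tokens.append(ch)
    if current:
        tokens.append(current)
    return tokens

if __name__ == "__main__":
    print(tokenize("(I go!)"))
    print(tokenize("(I go!)", punctuation=True))
```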

Metadata (<head>) No matter how much one TAN format differs from another, the metadata follows the same basic structure. Anyone getting a TAN file, no matter its class or type, is assumed to want to know, and therefore to find easily and predictably, the following:

the stable name of the file;
its version;
its sources;
other files upon which it depends or with which it otherwise has an important relationship;
the most significant parts of the editorial history;
the linguistic or scholarly conventions that have been adopted in creating and editing the data;
the license, i.e., who holds what rights to the data, and what kind of reuse is allowed;
the persons, organizations, or entities that helped create the data, and the roles played by each.

To answer these questions completely, consistently, and predictably, the <head>, a mandatory child of the root element, takes a common pattern across all TAN formats, making TAN files predictable across a variety of formats. The TAN <head>, intended to be concise and focused, compels you to provide metadata for the data that is governed by <body>, but it does not accommodate metadata for the metadata. TAN metadata centers on the data itself and not on other things. For example, <head> requires you to name the people who helped create or edit the data, but you are not expected to tell us about them. Merely give good <IRI>s to point to authoritative sources that provide background information. The principles above explain why the TEI extension of TAN requires two heads, one for TEI and the other for TAN. The <teiHeader> supports the creation of metadata that has little or no relevance to the content of <body>, has its own unique structure, has very few metadata that are required, and is not designed to incorporate IRIs. Although <teiHeader> and TAN's <head> overlap in some respects, they cannot be mapped onto each other. Each has a different purpose, so both must be retained.
In what follows we provide a general overview of the TAN <head>, focusing on its general structure, and some of the principles that affect other parts of the TAN ecosystem.

Key Information Key information about the file as a whole is the first section of a <head>. This includes <name>, perhaps one or more <desc>s, and perhaps one or more <master-location>s, which point to locations for authoritative versions. <master-location> is optional, but not if <to-do> (see below) is empty.

Key Declarations Each <head> in a TAN file has a declaration section, pertaining to how the file should be used: <license> and <numerals>. <license> stipulates the license(s) under which the persons or organizations listed in its @licensor are releasing the data. The license applies only to the data in <body>, not to its sources. The distinction is important, and helpful. It is much easier for you to decide and state the rights and license behind your own work than to speak for others. Declaring who holds what rights over your source(s) may be not only difficult but risky, and is therefore optional, best handled in a <desc> or <comment>. When using a TAN file, you should investigate the entire chain of rights. You may find discrepancies between the license of a TAN file and that of its sources. For example, you might create a complete TAN-based lexico-morphological analysis of a 20th-century novel, and legitimately release the TAN data under a public domain license, even though the novel itself is under copyright. Users must be aware of and respect licenses, and know that the license in a TAN file may not be the license of its sources. TAN adopts the Creative Commons licenses as its default license vocabulary. See . <numerals> may be used to declare whether an ambiguous numeral should be interpreted as an alphabetic numeral or a Roman numeral (default). See the entry for <numerals> as well as the section on numeration systems. Many TAN files allow in this section <token-definition>, which specifies a definition for tokens, perhaps tailored via @src to a specific class-2 file. See and <token-definition>.

Networked Files The third major section of <head> accommodates links and references to other files. Some files are essential to processing the TAN file, while others are less important. The two most critical types of files are marked by <inclusion> and <vocabulary>. The files pointed to by these elements should be considered constituent parts of the dependent TAN file. In the validation process, failure to access any one of them (calculated recursively) is a fatal error. <inclusion> and <vocabulary> were developed to reduce duplication (and therefore potential error) in collections of TAN files. Many if not most TAN files are created alongside or in the context of a project, where certain data patterns are repeated. Explicit repetition from one file to the next makes them prone to error. Changes might be made in one file but not in another, introducing version conflicts. <inclusion> and <vocabulary> provide a specialized method of inclusion that leads to cleaner, smaller files. In general, you should first try using <vocabulary>, which points to TAN-voc files that collect vocabulary items common to the project. If that element does not do what you want, then try <inclusion>. It is normally easier to diagnose a complex set of <vocabulary>s than a complex set of <inclusion>s.

Vocabularies Oftentimes, from one file to the next, an editor needs to refer repeatedly to a common set of things, e.g., manuscripts, works of literature, or persons who helped edit the files. Projects are advised to create their own <TAN-voc> files, populated with commonly used vocabulary. Once set up, the TAN-voc file must be linked to via a <vocabulary> in the <head> of each TAN file that draws from the vocabulary. Vocabulary items can then be invoked either by pointing to <name> values, or by assigning an @xml:id to a vocabulary item placed in the <head>'s <vocabulary-key>. If you draw upon <name>, you may make alterations to capitalization. Hyphens, spaces, and underscores are treated as interchangeable. Capitalization and spelling of @xml:id, however, must be strictly followed. Vocabulary (TAN-voc) files tend to require frequent change and expansion, so it is recommended that you depend upon only those TAN-voc files that are part of your project, and not those from a different project. In the host file, any attribute that takes multiple IDrefs, e.g., @who, @type, @subject, may take a mixture of values that refer to numerous vocabulary items via @xml:id or <name>. But in these attributes spaces are reserved to delimit multiple values, which means that if you refer to a <name>, spaces must be replaced with the underscore or hyphen. A @which in the host file, however, can take no more than one value, so using spaces is fine. @id and @xml:id are case-sensitive, and do not allow spaces. @which and therefore <name> are not case-sensitive, and the space, hyphen, and underscore are equivalent. If you point to @id or @xml:id you must respect case and punctuation. If you are pointing to a <name> you can ignore case, and you should probably replace the space with a _. TAN includes a number of standard vocabulary (TAN-voc) files for a variety of concepts commonly used in textual scholarship (see ). 
Vocabulary items have been defined for more than one hundred types of textual divisions, and any of these can be invoked simply by using their names (see ). <vocabulary> itself may take @which, but only to point to one of the extra TAN vocabularies listed in . You cannot point to a customized TAN-voc file via @which. This restriction avoids some complexity in the validation routine. See on how to use this feature. Files pointed to by <vocabulary> are considered an essential part of any TAN file. Failure to find the target file will throw a fatal error during validation.
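The matching rules for <name> values described above (case is ignored; space, hyphen, and underscore are equivalent) can be sketched in Python. The helper names normalize_name and same_name are hypothetical, and the sketch makes no claim about how TAN's own functions normalize names internally.

```python
import re

def normalize_name(value):
    """Reduce a <name> value or a @which value to a comparable form:
    lowercase, with hyphens and underscores treated as spaces."""
    return re.sub(r"[ _-]", " ", value.strip()).lower()

def same_name(a, b):
    """True if two strings would point to the same <name> under these rules."""
    return normalize_name(a) == normalize_name(b)

if __name__ == "__main__":
    print(same_name("Sermon_on-the Mount", "sermon on the mount"))
```

Values of @id and @xml:id would not go through such a function, since they are case-sensitive and must be matched exactly.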

Inclusions Whereas vocabularies do not change the host document, inclusions do. Unlike other forms of inclusion you might be familiar with, TAN inclusion is targeted at select elements, never an entire file. TAN inclusion is a two-step process. First, a TAN file is linked to, and therefore made available for inclusion, via <inclusion>s (inside <head>). Like <vocabulary>, an <inclusion> does nothing on its own. It merely points to a file that is eligible for inclusions. No actual inclusions occur until the next step. Second, select parts of the included file are invoked in the dependent file. To do so, insert an element X in a valid location, but with nothing but @include, with one or more values (space-delimited), each pointing to the @xml:id value of an <inclusion>. In the validation process, that element X will be replaced with all element Xs found in the inclusion file, resolved recursively, and ignoring duplications (deeply equal elements). For example, a TAN-T file might have a

<div include="poem1"/>

The validation routine will replace that element with every rootmost <div> in the included file called poem1. Any host file that includes elements from another file inherits any vocabulary associated with the inclusion, and along with it @xml:id values. This may result in IDrefs pointing to two or more distinct vocabulary items, which may be a benefit or a hindrance. Be familiar with the items you are including. TAN inclusion is very practical for texts. Textual works commonly nest inside each other. By setting up your class-1 files as a series of inclusions, you can reduce validation time, both in the file and in class-2 files that depend upon the transcriptions. See the examples subdirectory for a sample of a Gospel of Matthew including the Sermon on the Mount including the Lord's Prayer. The inclusion technique is also especially useful for vocabulary (TAN-voc) files. A single master TAN-voc file can include other vocabulary files, each devoted to a particular type of item (e.g., one for works, one for scripta). Project files then need to link merely to the master TAN-voc file. You can include a TAN file that itself includes other TAN files. Inclusion is recursive. In any recursive system, circularity is fatal. That is true for TAN inclusion as well, but only within the scope of specified element names. It is perfectly legal for two files to include each other, as long as they do not try to include (directly or indirectly) the same elements, or try to consult each other to resolve any vocabulary. Files pointed to by <inclusion> are considered an essential part of any TAN file. Failure to find the target file will throw a fatal error during validation.
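The replacement step can be pictured with a simplified Python sketch. It is not TAN's validation routine: it handles only one level of inclusion (no recursion into the included file's own inclusions), ignores namespaces and vocabulary, and uses serialized XML as a crude stand-in for deep equality. The function names rootmost and resolve, and the dictionary that maps inclusion ids to parsed files, are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def rootmost(el, name):
    """Yield descendants called `name` that have no ancestor of that name below `el`."""
    for child in el:
        if child.tag == name:
            yield child
        else:
            yield from rootmost(child, name)

def resolve(parent, inclusions):
    """Replace each child carrying @include with the matching rootmost
    elements of the same name from the files it points to."""
    new_children = []
    for child in list(parent):
        ids = child.get("include")
        if ids is None:
            resolve(child, inclusions)
            new_children.append(child)
            continue
        seen = set()
        for idref in ids.split():
            for found in rootmost(inclusions[idref], child.tag):
                key = ET.tostring(found)  # crude stand-in for deep equality
                if key not in seen:
                    seen.add(key)
                    new_children.append(found)
    parent[:] = new_children

# Hypothetical, unnamespaced fragments for illustration only.
HOST = ET.fromstring('<body><div include="poem1"/></body>')
INCLUDED = ET.fromstring(
    '<TAN-T><body><div n="1"><div n="a"/></div><div n="2"/>'
    '<div n="1"><div n="a"/></div></body></TAN-T>'
)

if __name__ == "__main__":
    resolve(HOST, {"poem1": INCLUDED})
    print(ET.tostring(HOST, encoding="unicode"))
```

In the sample, the duplicated div numbered 1 is kept only once, and the nested div numbered a is carried along inside its parent rather than being pulled out on its own.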

Other related files A TAN file may point to a number of other types of files. The more that are mentioned, the richer the network. <predecessor> and <successor> point to versions of the file that precede and postdate it. <source> is another type of related file, but it may or may not link to another file. In class-2 files <source> always points to a class-1 TAN file. In class-1 and class-3 files, <source> may point either to a file or to a scriptum (see ). <see-also> can be used to point to any file that has some relationship to a TAN file. The required @relationship points to one or more <relationship> vocabulary items. There is no standard TAN vocabulary for relationships. Normally, when a file-to-file relationship is considered important, it becomes a full-fledged standard TAN element. Some TAN formats allow special types of related files (e.g., <redivision> and <model> for class-1 files). See metadata descriptions under specific classes or formats.

Adjustments The fourth major section of <head>, which is optional, consists of <adjustments>, which specifies changes that have been made (class 1), or should be made (class 2), to the sources. In class-1 files, these consist of <normalization>s and <replace>s; see . Class-2 files allow <skip>, <rename>, <equate>, and <reassign> as adjustments; see .

Local vocabulary items and ID assignments: <vocabulary-key> The fifth major part of <head>, <vocabulary-key>, allows you to declare any vocabulary items specific to the file. It also allows you to take vocabulary items that exist in other TAN-voc files (whether pointed to by <vocabulary> or part of the standard TAN vocabulary) and assign them @xml:ids that are valid only in the current file. Anything in <vocabulary-key>, and any TAN-voc files pointed to via <vocabulary>, will override default TAN vocabulary. These id assignments can be supplemented with <alias>es, each of which assigns a single id to a group of one or more ids. This practice resembles what editors of critical texts do when naming groups of manuscripts. Each manuscript is given a siglum, say a single lowercase Greek or Latin letter, and the manuscripts are grouped together into families, with each family given its own siglum, say an uppercase letter. If the editor wishes to indicate that a whole family of manuscripts departs from a particular reading, the family siglum is all that is needed. An <alias> works much the same way, and can be used for any vocabulary items. For example, if a textual division can be legitimately called both a rubric and a heading, you could assign rubr and hd as ids in the <vocabulary-key> to the vocabulary items for the rubric and the heading, and then insert

<alias xml:id="rubrichead" idrefs="rubr hd"/>

. Then, in that file, <div n="1" type="rubrichead"> would identify that <div> as being both a rubric and a head. Unlike other pointing attributes, the @idrefs of an <alias> cannot point to the <name> value of vocabulary items. They can refer only to the id values of locally defined instances of @xml:id. This restriction reduces confusion, and avoids some complexity in the resolution and validation of a TAN file. <alias>es may recurse, as long as there is no circularity. That is, @idrefs in an <alias> may refer to any @xml:id or @id, not only to a vocabulary item but to another <alias>. In most cases <alias> should refer to items of the same type. In a few situations mixed groups do not pose a problem, for example mixing <person>s, <algorithm>s, and <organization>s. TAN validation will indicate whether mixed typology introduces errors. Because @xml:id may not contain certain types of characters, such as common punctuation marks, and because <alias> must be able to coin unusual ids (especially for grammatical features), @id may be used instead of @xml:id in <alias>.
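The recursive behavior of <alias> described above can be illustrated outside TAN. The following is a minimal Python sketch, not TAN's own resolution code: it assumes the aliases of a file have already been read into a dictionary that maps each alias id to the list of ids in its @idrefs. The function name resolve_alias and the second alias, headings, are hypothetical.

```python
def resolve_alias(idref, aliases, _seen=()):
    """Expand an id into the ordered list of non-alias ids it stands for.

    `aliases` maps an alias id to the list of ids in its @idrefs.
    Any id that is not a key of `aliases` is treated as an ordinary
    vocabulary item id and returned unchanged.
    """
    if idref in _seen:
        # The same alias was reached twice along one chain: circularity.
        chain = " -> ".join(_seen + (idref,))
        raise ValueError(f"circular alias: {chain}")
    if idref not in aliases:
        return [idref]
    resolved = []
    for target in aliases[idref]:
        for item in resolve_alias(target, aliases, _seen + (idref,)):
            if item not in resolved:  # keep first occurrence only
                resolved.append(item)
    return resolved


aliases = {
    "rubrichead": ["rubr", "hd"],
    "headings": ["rubrichead", "title"],  # hypothetical alias of an alias
}
print(resolve_alias("headings", aliases))
```

Tracking the chain of aliases already visited is one simple way to report circularity as an error instead of recursing forever.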

Responsibility The sixth section of a <head> declares who is responsible for the file. It consists of a <file-resp> and one or more <resp>s. The persons, organizations, or algorithms pointed to in <file-resp> must include at least one who has a tag URN whose namespace matches the namespace in the tag URN of the root element's @id. This requirement strengthens the effort to make sure that each TAN file is associated with the person or persons who are or were responsible for the file. <person>s so identified by <file-resp> are called primary agents, and are bound to the global variable $primary-agents. If a claim is made in a TAN file, and no @claimant is explicitly declared, it is assumed that the $primary-agents are making the claim.
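The namespace requirement can be pictured with a short Python sketch. It is an illustration only, not the TAN validation code. It assumes that the namespace of a tag URN is everything up to and including the date (i.e., tag: plus the authority, a comma, and the date), and all identifiers below are invented, using example.com and example.org.

```python
import re

# "tag:" + authority + "," + date (YYYY, YYYY-MM, or YYYY-MM-DD), then ":"
TAG_NAMESPACE = re.compile(r"^(tag:[^,:]+,\d{4}(?:-\d{2}){0,2}):")


def tag_namespace(iri):
    """Return the namespace part of a tag URN, or None if iri is not one."""
    match = TAG_NAMESPACE.match(iri)
    return match.group(1) if match else None


def has_matching_agent(file_id, agent_iris):
    """True if at least one agent IRI shares the namespace of the file id."""
    namespace = tag_namespace(file_id)
    if namespace is None:
        return False
    return any(tag_namespace(iri) == namespace for iri in agent_iris)


file_id = "tag:example.com,2014:tan-t:sample-text"      # hypothetical @id
agents = [
    "http://viaf.org/viaf/000000000",                   # hypothetical, not a tag URN
    "tag:example.com,2014:self",                        # hypothetical, same namespace
]
print(has_matching_agent(file_id, agents))
```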

Change log The change log, the seventh section of the <head>, consists of one or more <change>s, which provide a partial history of the file. The entire history is calculated from every attribute that has a date or dateTime value, and can be fetched via the function tan:get-doc-history() or the global variable $doc-history. The change log is an effective way to communicate with those who might use your files. In all likelihood, a user will download a local copy from the master location. You might later make changes or updates to your master copy. Anyone depending upon a copy will be warned, during Schematron validation, of each <change> that postdates the value of their @accessed-when. If you have introduced an important or disruptive change, you can mark your <change> with @flag, which allows the following values: warning (default value), error, info, fatal. Marking a change as info lowers its level of importance; error raises it. The value fatal will halt the validation process in the dependent file altogether. If you receive change messages during validation, and you want to stop them, merely update the value of @accessed-when to the current date.
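The warning logic just described amounts to a date comparison followed by a severity lookup. The Python sketch below illustrates it under simplifying assumptions: each <change> is represented as a dictionary with hypothetical keys when, flag, and text, and dates are compared at day precision. It is not the Schematron routine itself.

```python
from datetime import date


def pending_change_messages(changes, accessed_when):
    """Return (flag, text) pairs for every change newer than accessed_when.

    A change without an explicit flag is reported at the default
    level, "warning".
    """
    messages = []
    for change in changes:
        if change["when"] > accessed_when:
            messages.append((change.get("flag", "warning"), change["text"]))
    return messages


changes = [  # hypothetical change log
    {"when": date(2020, 1, 15), "text": "Started file."},
    {"when": date(2021, 3, 2), "text": "Corrected typos in chapter 2."},
    {"when": date(2021, 6, 9), "flag": "fatal", "text": "Renumbered all divisions."},
]
messages = pending_change_messages(changes, accessed_when=date(2021, 1, 1))
for flag, text in messages:
    print(f"[{flag}] {text}")
# A dependent file would stop validating if any message is fatal.
print(any(flag == "fatal" for flag, _ in messages))
```

Moving accessed_when forward to the current date empties the list, which corresponds to the advice above for silencing the messages.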

Pending work The last section of a <head> lists all pending tasks that yet need to be applied to a file. These are itemized as a list of <comment>s in <to-do>. A file with an empty <to-do> is assumed to be no longer in progress, so there must be a <master-location> provided. Like the change log, the <to-do> effectively communicates cautionary notes to those who might use your files. Anyone depending upon a copy will be warned, during Schematron validation, of each item in the list. The report is not dependent upon when the file was last consulted (@accessed-when), because this is a collection of standing, unresolved issues. One benefit of <to-do> is that you can release your material before it is finished. Other users will have fair warning about what is imperfect or incomplete.

Class-1 TAN files, representations of textual objects (scripta) This chapter provides general background to class-1 TAN files and their elements and attributes. For detailed discussion of individual elements or attributes, see . Class-1 TAN files preserve segmented transcriptions of books, manuscripts, papyri, stones, or any other objects with writing on them, collectively termed here scripta (sg. scriptum). Class-1 files are the foundation of any TAN project. No TAN-A-tok or TAN-A-lm file can be created without at least one class-1 file, and most TAN-A files depend upon many of them. There are two types of class-1 formats, identified by the root element. <TAN-T> is a simple, generic format, with plain text inside a simple tree structure. <TEI> (also referred to in this manual as TAN-T(EI)), on the other hand, can be complex and highly expressive. Because the two formats function almost identically, the generic TAN-T format is described first, followed by supplemental comments on TAN-TEI.

Principles and assumptions

General (For more general principles and assumptions applying to all TAN files, not just class 1, see .) Class-1 formats are designed for faithful but judiciously normalized digital transcriptions. Each TAN-T(EI) file is devoted exclusively to a single version of a single work found in a single scriptum (text-bearing object), segmented and uniquely labeled with a single, preferably familiar, reference system. Editors of TAN-T(EI) files should be able to read, write, and proofread texts in the languages of the transcriptions. They should understand the texts well enough to segment them and label them according to the conventions used for those works. They should be able to distinguish the text of a primary source from its editorial apparatus. They should be familiar with normalizing conventions for texts from the period, language, and culture. They should know how the transcription might be used in other scholarly fields, e.g., translation studies, corpus linguistics. Editors need not understand everything about their texts, and they need not have any specialized skill in grammar or lexicography. They need not know the morphology of individual words, or how individual parts of the text have been translated. Those skills are more profitably applied to other TAN formats. TAN-T(EI) editors stand at the foundation level of the Text Alignment Network. Because other files will depend upon TAN-T(EI) files, careful proofreading is important. Eliminating as many typographical errors as possible before publication will maximize the utility of a TAN-T(EI) file. On the other hand, TAN has been designed with the assumption that most files in circulation have typographical errors that can and should be corrected as they are found. If you are aware that a text needs proofreading, but you still want to make it available, simply leave a <comment> in the <to-do> part of the <head>. 
If you are creating a TAN-T(EI) file, you are doing so primarily to facilitate alignment and annotation, which requires use of a suitable reference system (see reference systems). Transcription files should be segmented and labeled according to a reference system that is familiar and can be easily applied to other versions of the same text in other languages. If possible, semantic mileposts (clauses, sentences, paragraphs, chapters) should be prioritized over visual ones (lines, columns, pages, volumes). Any transcription can be furnished with multiple reference systems, but it is advisable to do so on the basis of separate files, linked by <redivision>s in the <head>. See .

Domain model Contributors and users of TAN files must sharply distinguish between a scriptum (text-bearing object) and a conceptual work, e.g., between a specific printed copy of the Iliad and the Iliad conceived generally. The former has materiality (digital files are treated here as being material) and the latter does not. Even though both are constitutively necessary for any transcription, the two are always differentiated in the TAN-T(EI) format: <source> and @src point to physical exemplars; <work>, @work, and <version> to the conceptual. Adherence to this distinction is quite important. Some readers may be reminded at this point of the domain model defined by the Functional Requirements for Bibliographic Records (FRBR), which identifies in its Group 1 (Products of intellectual & artistic endeavor) four types of entities: work, expression, manifestation, and item. A work is "a distinct intellectual or artistic creation" and an expression is the conceptual, immaterial realization of a work. Both work and expression are terms for conceptual, non-material entities. A manifestation, on the other hand, is "the physical embodiment of an expression" and an item is a single exemplar of a manifestation. Quotations in this section come from International Federation of Library Associations and Institutions, Functional Requirements for Bibliographic Records: Final Report, amended and corrected (February 2009), .

Examples of FRBR Group 1 Entities

Work | Expression | Manifestation | Item
Iliad | Caroline Alexander's English translation of the Iliad | The print run identified with ISBN 978-0062046284 | A specific copy
The Psalms | The (Hebrew) Masoretic Psalter | The 1820 printing of George Offor's edition of the Hebrew Psalms | Biblioteca Palatina Cod. Parm. 1699
A River Runs Through It | Norman Maclean's original version; the 1992 film version | Print run ISBN 0226500608; Blu-ray disc UPC code 004339632533 | Author's personal print copy; reference print CGB 7432-7438 (deposited in the Library of Congress)

TAN's domain model differs slightly. The most important difference is abandonment of FRBR's expression, which was found to be problematic when developing sample TAN data. The term expression was intended to describe a conceptual, non-material entity, but the FRBR guidelines defined and explained it in vague or material terms. The problems are illustrated by wording in the specifications: "Expression encompasses, for example, the specific words, sentences, paragraphs, etc. that result from the realization of a work in the form of a text....defined, however, so as to exclude aspects of physical form, such as typeface and page layout, that are not integral to the intellectual or artistic realization of the work as such." (ibid., p. 19, emphasis added) That is, expression includes integral aspects of physical form (e.g., typeface that is integral to the realization). "Inasmuch as the form of expression is an inherent characteristic of the expression, any change in form (e.g., from alpha-numeric notation to spoken word) results in a new expression." (p. 20, emphasis added) Even the very term expression and FRBR's preferred synonym, realization, imply materiality (nothing can be expressed or realized without a material medium). Further, FRBR's expression does not easily handle creative adaptations of works that are themselves arguably works in their own right. For example, Euripides' Medea was adapted several centuries later by Seneca the Younger. Seneca's Medea is arguably merely an expression, yet it has itself been subject to various editions and performances, i.e., expressions. But FRBR does not accommodate expressions of expressions. If Seneca's Medea is treated as a work in its own right, its expression relationship to Euripides' origin is lost, since FRBR does not accommodate works that are expressions of other works. In the TAN domain model, expression is altogether dropped. There is only one type of conceptual, non-material entity, namely, a work. 
The term version in TAN is applied to a work that substantially follows some other work, e.g., translations and adaptations. But such versions are themselves still works. One work is indicated to be the version of another in a class-1 file through the <work> and <version> declarations. As for material entities, FRBR's manifestation and item are combined in TAN through the term scriptum. A scriptum is a text-bearing object, e.g., book, manuscript, pamphlet, tombstone, traffic sign, digital file (digital media is interpreted as being material). When scriptum is used in a TAN file, it points either to a single physical item or to a set of physical items that for all intents and purposes are indistinguishable (i.e., a scriptum reproduced mechanically). A scriptum that points to a manuscript points only to that one particular manuscript. But a scriptum that points to a printed book or a digital file is understood as applying to all copies of that printed book or digital file. There is at present no formal mechanism to specify whether a scriptum points to one object or a set of objects. The distinction must be inferred from a scriptum's IRI + name pattern. In cases of potential ambiguity, it is up to creators of a TAN file to assign to the scriptum IRIs that avoid confusion. For example, to point to Edward Gibbon's personally annotated copy of the 1763 edition of Herodotus (now held by the Wren Library, Trinity College, Cambridge University), one should not use or , which point to the set of all copies. In this case, one may need to mint their own IRI, based on the Wren Library's acquisition number, RW.50.15. In summary, the TAN domain model defines two kinds of entities: works and scripta. Works, which are immaterial, conceptual entities, may contain other works, or they may be versions of other works or work-versions. Scripta, which are material entities, may contain other scripta, and they may refer either to a single object or to a set of copies. 
A work may be instantiated in many scripta, and similarly, a scriptum may contain many works. Most work-scriptum relationships can be inferred from the <head> of a class-1 file, and they may be expressed in a <TAN-A> file.

Examples of TAN Entities

Work | Scriptum
Iliad; Caroline Alexander's English translation of the Iliad | The print run identified with ISBN 978-0062046284; a specific copy
The Psalms; the (Hebrew) Masoretic Psalter | The 1820 printing of George Offor's edition of the Hebrew Psalms; Biblioteca Palatina Cod. Parm. 1699
Norman Maclean's A River Runs Through It; the 1992 film A River Runs Through It | Print run ISBN 0226500608; author's personal print copy; Blu-ray disc UPC code 004339632533; reference print CGB 7432-7438 (deposited in the Library of Congress)

One version, one work, one scriptum, one reference system Every TAN-T(EI) file must be restricted to a transcription of a single version of a single work found on a single scriptum, segmented and labeled according to a single reference system. This principle is critical to the success of the network. It reduces the risk of confusion and simplifies the files. It follows the generally advisable principle that complex data should be disaggregated into several simple data structures. Different types of complexity can be built later, as needed.

One scriptum Each TAN-T(EI) file must transcribe one and only one text-bearing object or scriptum. It may be a digital file, a book, a manuscript, a stone, a sign, or a bottlecap. If the object you've chosen has been made mechanically and is virtually indistinguishable from other objects created by the same process (e.g., copies of a printed book or copies of a digital file), then the entire set of copies (what some cataloguers call a manifestation) is to be regarded as the scriptum. Identifying and naming a scriptum might require an editor's discernment and judgment. For example, some manuscripts have been split up, their parts now residing in multiple libraries around the world; other manuscripts are composites, made of several manuscripts. In such cases, you may need to define your scriptum in a way that might not match the way others define it. But the decision is your prerogative, not theirs. You have both the right and responsibility to define your object in the way that you think will most benefit users of your files. The scriptum is declared via <source>, which either takes the IRI + name pattern, or points to a <scriptum> vocabulary item. It is a good idea to name your scriptum with an <IRI> value in the form of an http URL that points to a detailed entry in a library catalogue. Doing so allows users to retrieve extensive, structured bibliographical information. You also save yourself the hassle of having to write a detailed, structured bibliographical description. If a URL cannot be found for <IRI>, you may simply coin a tag URN or a UUID. Alternatively, if you find another TAN file that uses the same scriptum-source, you can add its <name>s and <IRI>s to the existing IRI + name pattern. Multiple <name>s and <IRI>s for a vocabulary item are encouraged. If you need to specify exactly where on a scriptum a work-version appears (e.g., page range), <comment> or <desc> should be used.

One Work The transcription must be restricted to a single creative work, identified by <work> (part of the declarations section of <head>). Many scripta have more than one work. Identifying the creative work you transcribe is, once again, your prerogative. Suppose the scriptum you have is a Bible. You define the work. Perhaps you wish to encode the entire Bible and treat it as a single work. Or maybe you wish to treat only the New Testament as the work, or the Tetraevangelion, or the Gospel of Matthew, or a specific episode in that gospel, or merely the Beatitudes. Use whichever work you like, but make sure that the TAN-T(EI) file contains nothing but the work you have declared. It should be a complete representation of what is found on the object, even if only partially preserved, and respect as far as is practical the order of the text in the scriptum. Normally the order in which the text appears in the scriptum will match the logical order provided by the reference system (see below). But when there are discrepancies, the order of the scriptum should take priority. The requirement to provide the entirety of the work-version as found on the scriptum is a significant departure from the fourth principle of , which allows for incomplete assertions or data. The transcription in a class-1 file should include the entirety of the work-version chosen, within the particular scriptum. If you are aware that the transcription is incomplete, leave a <comment> to that effect in the <head>'s <to-do>, identifying which portions are missing from the transcription. Well-known works may have a suitable IRI already assigned to them, say by means of a DBpedia entry. Most works have not been assigned IRIs or are named in IRI vocabularies that are not well known. You may assign any work your own URN, through a UUID or a tag URN.

One version The transcription must be restricted to a single version of the work, identified perhaps by <version> (part of the declarations section of <head>). In most cases, <version> is unnecessary, because <work> in conjunction with <source> will normally identify a particular work-version. But if the source carries multiple versions (e.g., a bilingual edition of a text), then <version> should be included, to specify which version has been transcribed. <version> can also be used to declare explicitly that the work mentioned in <version> is a version of the work mentioned in <work>. If you have a scriptum with multiple versions of a work, and you wish to transcribe them all, each version should be given its own separate TAN-T(EI) file. There may be cases where individual textual divisions are repeated, not so much because they represent a different version, but because they are variants that are integral to the work-version chosen. For example, an edition of a poem may occasionally have a line that is repeated by the editor as a possible local variation. Creating a separate file for such individual cases would be both impractical and misleading. Standard TAN vocabulary for div types includes as a standard item variant, to accommodate occasional variants. For example: . . . . . <div type="title" n="title"> <div type="variant" n="orig">The Place</div> <div type="variant" n="subscript" xml:lang="grc">Ὁ Τόπος</div> </div> . . . . . Notes should be included only if they are an integral part of the primary work (i.e., by the same author, not by a later editor). If you think the notes to a work are important, and legitimately a work in their own right, consider putting them in their own TAN-T(EI) file, or converting them to claims in a TAN-A file. Very few work-versions have IRIs. It is advisable to assign a tag URN or a UUID. 
If the IRI you have used for <work> is in a namespace that you own or control, then you are entitled to modify it, and you may wish merely to add a suffix to the work IRI. For example, you might have tag:urn:example.com,2001:work:a defined for the work; a 1987 German translation might be specified as tag:urn:example.com,2001:work:a:ver:1987:deu.

One reference system Every TAN transcription must be segmented into a hierarchy of labeled divisions, defined in the <body> through <div>s and their @n values. Those divisions, whenever possible, should align with the reference system that prevails for the work across different versions or translations, in what is sometimes called a canonical reference system. Because even the most familiar reference system admits degrees and dispute, the term canonical is problematic. It is avoided in these guidelines. We refer simply to a work's reference system. If you have your choice, preference should be given to reference systems that follow the semantic contours of the work, not the physical features of a particular scriptum. Chapter, paragraph, and sentence numbers are preferable to volume, page, and line numbers, because other versions of the work (e.g., translations, paraphrases) will only roughly, if at all, follow a reference system based on features found in a particular scriptum. Sometimes a scriptum-based reference system is inescapable, or is the most common reference system for a work (e.g., Porphyry's commentary on Aristotle's Categories). It is perfectly acceptable to adopt that system, but it may entail more labor during the alignment process. Translations using this system will rarely correspond to the points of division. If a given work has more than one common reference system (e.g., the works of Plato and Aristotle, which have two reference systems—logical and scriptum-oriented—both of which are standard and important), then one good practice is to create two class-1 files with identical transcriptions, each one structured by its own reference system. Place in each file a <redivision> pointing to the other. When you validate either file in the verbose phase, you will be notified if there are textual discrepancies between the transcriptions. 
If you are using Oxygen or another XML editor that supports Schematron Quick Fixes, you will be provided a way to update one text to match the other with just a few keystrokes. Having two or more alternatively divided editions can be quite useful. They could serve as the basis for reference cross-indexes, or to help convert other versions of the work from one reference system to the other. Alternatively, you can use the TAN-TEI approach. Choose one reference system as the primary way to label your <div>s, and convert the other references to anchors such as <pb>, <cb>, <lb>, <milestone>. Under this method, the logical references (those based on logical units such as paragraphs, chapters, sections) are best given to the <div>s, and the material ones to the anchors. Bear in mind, however, that typological semantics are diminished with anchors, and there is no convenient way to retain hierarchical structures, or disambiguate one anchor-based reference scheme from another. If there is a good reference system, but the divisions are overly lengthy, you may introduce subdivisions. But there is no guarantee that the provisional subdivisions you introduce will be adopted by other editors who create or edit TAN versions of the same work. Editors working independently upon the same text and subdividing it will likely produce their own schemes. Class-2 formats provide a mechanism via <adjustments> to reconcile some basic differences. But a discordant scheme might be best handled simply by creating a copy, and restructuring it according to the preferred system, making sure related files refer to each other through <redivision>. If a work does not have a reference system, or if you think that the ones that exist are inadequate or misguided, create one of your own. If you develop your own reference system, be sure to design it so that it can be easily applied to any version of the work, including translations. Prefer logical divisions of text over scriptum-based ones. 
TAN supports five major methods of reference numeration:

Arabic numerals. 1, 2, 3, etc.

Roman numerals. Values up to 5000, utilizing i, v, x, l, c, d, and m, uppercase or lowercase, with liberal syntactic rules (within a roman numeral, any digit preceding one of a higher value will be deducted from the total value; all others are added).

Alphabetic sequences. The 26-letter Latin alphabet, with numbers higher than 26 (or any multiple of 26) beginning with the letter a incrementally repeated, e.g., y (25), z (26), aa (27), bb (28), … aaa (53). Uppercase or lowercase allowed. This is not the hexavigesimal (base 26) system, where a is 0, b is 1, z is 25, aa is 00, ab is 01, etc.

Arabic numerals + alphabetic sequences. Arabic numerals followed immediately by an alphabetic sequence. The second item is to be calculated as a subsequence of the first item, with the lack of a second item taking highest priority. E.g., 4, 4a, 4b, 4c....

Alphabetic sequences + Arabic numerals. As above, but with the alphabetic sequence preceding the Arabic numerals.

See tan:letter-to-number() and references there to TAN functions for converting numbering systems. The TAN validation process attempts to convert all values of @n to Arabic numerals. Some values are ambiguously Roman numerals or alphabetic sequences. For example, c could mean 3 (alphabetic sequence) or 100 (Roman numeral). Such numerals are assumed to be Roman, unless you supply in the <head> a <numerals> and assign @priority to specify letters (or roman, the implicit, default value).
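To make the first three numeration methods and the Roman-versus-letter ambiguity concrete, here is a small Python sketch. It is an illustration under stated assumptions, not a port of the TAN functions: the liberal Roman rule is read as "a digit is deducted when the digit immediately after it is higher", alphabetic sequences are taken to be one letter repeated, and the two compound methods (e.g., 4a) are left out. The function names are hypothetical.

```python
ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(text):
    """Liberal Roman numeral conversion; None if not Roman or above 5000."""
    digits = text.lower()
    if not digits or any(ch not in ROMAN for ch in digits):
        return None
    values = [ROMAN[ch] for ch in digits]
    total = 0
    for i, value in enumerate(values):
        following = values[i + 1] if i + 1 < len(values) else 0
        # Assumption: deduct a digit only when the next digit is higher.
        total += -value if value < following else value
    return total if 0 < total <= 5000 else None


def letters_to_int(text):
    """Alphabetic sequence: a=1 ... z=26, aa=27, bb=28, ... aaa=53."""
    letters = text.lower()
    if not letters or not (letters.isascii() and letters.isalpha()):
        return None
    if len(set(letters)) != 1:  # must be one letter repeated
        return None
    return (len(letters) - 1) * 26 + (ord(letters[0]) - ord("a") + 1)


def n_to_int(n, priority="roman"):
    """Convert a simple @n value to an integer, or None if not numeric."""
    if n.isascii() and n.isdigit():
        return int(n)
    roman, letters = roman_to_int(n), letters_to_int(n)
    if roman is not None and letters is not None:
        return letters if priority == "letters" else roman
    return roman if roman is not None else letters


print(n_to_int("c"), n_to_int("c", priority="letters"), n_to_int("aaa"))
```

The priority argument mirrors the role the text assigns to @priority on <numerals>: it matters only when a value can be read both ways.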

Extra @n vocabulary Sometimes @n is given values consisting not of numerals but of names, commonly to identify works within works, e.g., poems within a cycle, books in the Bible, or Surahs of the Qur'an. For non-numerical values of @n, differing conventions are a common problem. The abbreviation chosen by one person or project is rarely the same as that adopted by the next. To avoid this long-standing issue, you may want to use extra TAN vocabulary for @n. If you include in the head of your TAN file

<vocabulary which="bible eng"/>

, then any non-numeric values of @n will be checked against the corresponding TAN-voc file (in this case, the TAN-voc file at /vocabularies/extra/n.bible.eng.tan-voc.xml, one of several available in that directory). This, in turn, will allow other files to refer to that <div> by any other <name> that is a synonym. For example, in a class-1 file pointing to the TAN English Bible vocabulary above, a

<div type="book" n="matt">...</div>

would be regarded as containing the work the Gospel of Matthew. Any class-2 file that refers to that class-1 file as a source may use any synonym listed in the extra vocabulary file n.bible.eng.tan-voc.xml, i.e., Mt, Mat, Matt, or Matthew (or their lowercase equivalents). An extra benefit of this method is that such <div>s are also marked as works in their own right, identified by the <IRI>s of the target TAN vocabulary items. If you use extra TAN vocabulary, it is recommended you include in the declarations section of your <head> an <n-alias>. This element, along with its @div-type, specifies exactly which types of <div>s are eligible for this kind of aliasing on @n. Technically, it is not necessary, but including it can considerably speed the validation process on long files. The goal behind the extra vocabularies is to eliminate the need to worry about what abbreviations are used to name well-known, unnumbered <div>s. It is hoped that in future releases of TAN these extra vocabularies will grow in number and quality. You can write your own TAN-voc file to build your own set of n aliases. The standard extra TAN @n vocabularies should provide a good model:

Normalizing transcriptions You should declare how you have normalized the transcription via <adjustments> and its children, e.g., <normalization> or <replace>. (For suggestions on values for <normalization> see .) Generally speaking, normalization entails the suppression of things extraneous to or separable from the work-version you have chosen. You are encouraged to omit parenthetical editorial insertions (especially quotation references inserted by a modern editor), stray handwritten remarks, discretionary word-breaking hyphens, editorial comments, inserted cross-references, and reference numerals (page numbers, section numbers, etc.). If chapter 4 of a text begins "4." or "IV" then leave out that labeling numeral. You have already indicated it in @n, so there is no need to clutter the transcription with it. Remember, scholars who use your file will be concerned with things like word-for-word alignments and lexico-morphological analysis, and putting in a modern editor's "4" might contaminate research results. For the same reason, you should resolve ligatures and correct unintended typographical errors. The goal is a transcription whose text is free of the interpretive voice of later editors. You should remove from the text anything that is not part of the work proper and would interfere with detailed word-for-word alignment, or would require extra preprocessing or postprocessing work for other users. If you are breaking a transcription into individual lines, and you are required to break a word, do so with the soft hyphen (U+00AD), the zero-width space (U+200B), or the zero-width joiner (U+200D). TAN processors will automatically normalize the space of every leaf <div>. If either of those special characters is found at the end then it will be deleted and the text from the next leaf <div> (if there is one) will immediately follow without intervening space; if those two characters do not occur at the end, then a space will be added at the end. 
Regardless of how a leaf div ends, the rest of its space will be normalized. For more details, see . In a digital source, variable lengths of special spacing marks (e.g., General Punctuation U+2000..U+200B) and other Unicode points not allowed by TAN (see ) should be converted to ordinary characters, and superscript combining Roman letters (U+0363..U+036F) should probably be converted to their non-combining counterparts. All Unicode must be normalized to NFC forms (see ). In some ambiguous areas, you can use TAN-TEI both to normalize and to preserve what is in the scriptum. Suppose, for example, a manuscript has reference numerals that are sui generis. That is, these reference numbers do not correspond to the "canonical" reference scheme, and are scribal adjustments to the text's structure (sometimes mistaken). On the one hand, such reference numerals are metadata, and should arguably be deleted; on the other, they are part of the text, and witness to how a text was read and changed over time. A middle-ground approach would move these references to TAN-TEI's

<milestone rend="[TEXT]">

, substituting [TEXT] for the reference text. In that way, the numerals are properly removed from the main text, but the information is retained. Generally speaking, TEI's @rend is an excellent way to remove something from a transcription while keeping it in the file. Overall, normalization is a difficult, understudied topic. Scholars are not in the habit of documenting everything they normalize. Many have so internalized their normalization principles that they are unaware of them. Not all decisions will be clear-cut. You may justly hesitate before normalizing orthography, punctuation, accentuation, or capitalization. Some aspects of Unicode that permit different conventions may need special consideration. You may need to deliberate on whether an unusual or rarely used Unicode character might be misinterpreted or hinder searches. Document any decisions in the <adjustments>. Whether you use <normalization> or <replace> is up to you. The former can be used to apply a class of changes to a vocabulary item. The latter provides a precise, regular-expression-based method of describing exactly what has been changed, and the order in which those changes took place. A <replace> might help users to better understand the path that led from the input to the output, but the process cannot be reverse-engineered to produce the original. If it is important to document exactly what the pre-normalized version of a text was like, use <predecessor> or a similar element available in the key links section of the <head> (see ) to point to the original. If you find it very difficult to bring yourself to normalize to the depth advised above, try first making a (non-TAN) TEI file, and create the transcription you have in mind as the ideal. Once that is finished, create a second, TAN version, and be more aggressive in your normalization, with <see-also> pointing to the first approach.
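The rules above for Unicode normalization and for the ends of leaf <div>s can be sketched in a few lines of Python. This is only an illustration, not the TAN function library: the names normalize_leaf and join_leaves are hypothetical, and treating the soft hyphen, zero-width space, and zero-width joiner as equivalent word-break markers is an assumption drawn from the list given above.

```python
import unicodedata

# Characters the guidelines name for breaking a word at the end of a leaf div.
# Treating all three alike is an assumption of this sketch.
WORD_BREAK_CHARS = ("\u00ad", "\u200b", "\u200d")  # soft hyphen, ZWSP, ZWJ


def normalize_leaf(text: str) -> str:
    """NFC-normalize one leaf div, collapse its space, and settle its ending."""
    text = unicodedata.normalize("NFC", text)
    text = " ".join(text.split())  # collapse whitespace runs, trim both ends
    if text.endswith(WORD_BREAK_CHARS):
        return text[:-1]  # drop the marker; the next leaf follows directly
    return text + " "  # otherwise a single space closes the leaf


def join_leaves(leaves: list[str]) -> str:
    """Concatenate normalized leaf divs in document order."""
    return "".join(normalize_leaf(leaf) for leaf in leaves)


if __name__ == "__main__":
    print(repr(join_leaves(["The quick bro\u00ad", "wn   fox", "jumps"])))
```

A word broken across two lines with a soft hyphen is rejoined, while every other leaf contributes exactly one trailing space.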

Normalizing annotations The footnotes or endnotes in a scriptum should be normalized. Many, most, or all should likely be deleted. Before deciding, distinguish those that are an intrinsic part of the work you're transcribing from those that aren't. Those that aren't can be removed, or they can be put into a separate TAN-T(EI) file, perhaps linking the two through <see-also>, and hopefully structuring both files with the same reference system, to facilitate alignment. Another way to approach the task is to convert some or all of the notes you're removing into <TAN-A> <claim>s. Footnotes, endnotes, glosses, or marginalia that are intrinsic parts of the work present special challenges for encoding in general, and normalization in particular. First is the issue of connecting an annotation to the text annotated. When we encounter a superscript number (a note signal) while reading the text of a printed book, we infer that we are being invited to find a companion footnote that comments on the text we have just read. But specifically what text? Is it only the preceding word? Is it a word or phrase that occurs earlier in the sentence? Does the annotation cover earlier sentences, the entire paragraph, or even prior paragraphs? For some notes, identifying the text being annotated requires interpretation. In a digital file, the connection between an annotation and its text cannot be so vague; it requires a decision and a commitment. Here are three possible ways to approach annotations in a TAN file:

1. Use the <note> feature of TAN-TEI (see related TEI documentation). This will allow you to connect the annotation to merely an anchor in the text, i.e., to no text whatsoever.

<div n="1" type="p">
   <p>The process occurred in New York, among other places.<ref rend="1"/>
      <note><p><ref rend="1"/>On New York, see: X.</p></note>
   </p>
</div>

2. Move each annotation into a <div> with a @type that implies that it is an annotation (e.g., scholium) and place it immediately after the <div> it annotates.

<div n="1" type="p">The process occurred in New York, among other places.</div>
<div n="n1" type="footnote">On New York, see: X.</div>

Note in the example above that n1 is used to make sure that 1 unambiguously points to only one <div>.

3. As #2, but also write a <TAN-A> file that more precisely connects each annotation to the text it annotates.

<claim verb="annotates">
   <subject src="text" ref="n1"/>
   <object src="text">
      <from-tok ref="1" val="The"/>
      <through-tok ref="1" val="York"/>
   </object>
</claim>

The first option is expeditious, and will allow you to be as precise or imprecise as you like. Validation is not affected, but you should be aware that the <note> will be treated as a constituent part of its parent <div>. The second option is also relatively easy, but it entails a decrease in precision. The third option provides immense precision, permits multiple annotations on the same text range, and allows notes to target overlapping ranges of text. But the task could be time-consuming, if only because you will need to determine the range of text targeted by each annotation, and the targeted text might be quite messy or vague. You will need to take stock of how precise and comprehensive you choose to make your connections. (See also accuracy, precision, and comprehensiveness.) Remember that the note signals in the main text and in the footnote area are metadata meant to help readers link corresponding passages of texts and, in the spirit of normalizing, should be deleted. In a TAN-TEI file you can replace a note signal with <ref> (see above).

Class 1 metadata The <head> of a class-1 file is much like that of other formats, with some extra options. In the key declarations area (see ), class-1 files allow <n-alias>. See for context on how to use this element. In the section devoted to links to other digital resources (see ), class-1 files allow several extra types of files. One <model> is allowed, to point to another class-1 file that provides a model for the reference system that has been adopted. The model should be the same work. It may be in a different language, or come from a different source/scriptum. During verbose validation, any differences between a class-1 file and its model will be presented as warnings, since small differences are nearly always inevitable. Zero or more <redivision>s are allowed. Each one points to an alternative transcription that restructures the same transcription according to a different reference system. A class-1 file and any redivisions must have identical text in the <body>. <redivision> is an important response to the knotty, longstanding problem that besets texts that admit multiple reference systems. In a traditional TEI file, one must adopt a primary reference system, and add other reference systems through milestone-like anchors. TEI anchors do not have the semantic underpinnings needed to switch from one primary reference system to another. The TAN alternative is to encode the same transcription in multiple files, one per reference system, linked through <redivision>. This may appear to contradict another principle, that one should not repeat oneself. But that is the easier problem to repair. During verbose validation, a file's text will be checked against every <redivision>, and specific areas that differ will be flagged. Should users wish, a Schematron Quick Fix will allow a user to synchronize a local file against a redivided version. Zero or more <annotation>s point to class-2 files that use the file as a <source>. 
This type of linked resource is helpful for keeping track of key alignments and annotations. Zero or more <companion-version>s point to different versions of the same work in the same scriptum. This feature is useful for correlating multiple versions of a work that appear in a single scriptum, e.g., the original text and a facing translation in a bilingual edition. The adjustment section of the <head> (see ) allows zero or more <normalization>s and <replace>s. See .
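The check that a file and its <redivision> carry identical text, with differing areas flagged, can be pictured with a short Python sketch. It assumes the text of each <body> has already been extracted and space-normalized into a single string. The function name differing_regions is hypothetical, and word-level comparison with difflib is a choice made for this illustration, not a description of how TAN validation is implemented.

```python
from difflib import SequenceMatcher


def differing_regions(text_a: str, text_b: str) -> list[tuple[str, str]]:
    """Return (a_words, b_words) pairs for each region where the texts differ."""
    a, b = text_a.split(), text_b.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    regions = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":  # 'replace', 'delete', or 'insert'
            regions.append((" ".join(a[a1:a2]), " ".join(b[b1:b2])))
    return regions


if __name__ == "__main__":
    local = "in the beginning was the word"
    redivided = "in the begining was the word"
    for mine, theirs in differing_regions(local, redivided):
        print(f"local: {mine!r}  redivision: {theirs!r}")
```

An empty result means the two transcriptions agree word for word, whatever their <div> structure.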

Class 1 data The sole purpose of the <body> of a class-1 file is to contain an ordered, segmented transcription of a single version of a single work from a scriptum. <body> must take @xml:lang, specifying the predominant language of the text. If a change in language occurs in a descendant <div>, ensure that its @xml:lang also changes. <body> takes one or more <div>s, each of which governs either other <div>s, or text (or TEI elements), but never both. TAN files adopt a non-mixed content model (see ). The term leaf div refers to those <div>s that contain only text, and not other <div>s. Within this treelike structure of <div>s, the concatenation of @n values, starting from the most rootward <div>, provides the reference system used by class-2 files to refer to parts of TAN-T(EI) files. A given <div> may have more than one reference, if its @n or any @n it inherits has multiple values. Every permutation is calculated, and they are treated as synonymous ways to refer to that <div>. The rule of combinatorial inheritance also applies when @n has as its value a range of numbers. For example, if @n has the value "1-3" then it will match for 1, 2, and 3. Such ranges are important for translations, where there might not be precise one-to-one correlation with the divisions in the original. Applications that handle texts with one-to-many alignment mappings can use different strategies to reconcile the differences. See tan:merge-expanded-docs() for discussion. In previous versions of TAN, there was a requirement that each leaf <div> should have a unique reference. That requirement has been relaxed, because there are cases where non-unique leaf <div>s are required. Some scripta are encoded such that leaf divs are broken up (see Bodéüs's edition of Aristotle's Categories, at 2a35, 2b5, and 2b6b). And some translations must be encoded so that leaf divs interleave. Further, one TAN-T's leaf divs might easily become another TAN-T's non-leaf divs, and vice versa. 
The distinction between leaf and non-leaf div is arbitrary, so both types should be expected to adhere to the same kind of reference system rules. In a TAN-T(EI) file, for any two <div>s that share the same reference, it is not allowed that one be a leaf <div> and the other not. To do otherwise would entail a mixed content model. It is also further assumed that all <div>s that share the same reference are consecutive, constituent parts of the same <div>. That is, any two <div>s with the same reference are not alternatives to each other, but are rather disjoint parts. For true alternatives, see discussion above on using variant in @type.
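The rule of combinatorial inheritance described above can be illustrated with a brief Python sketch. It is a simplification under stated assumptions: the function names are hypothetical, only ranges of Arabic numerals are expanded (other numeral systems are left alone), and the full stop used to join the steps is an arbitrary choice.

```python
from itertools import product


def expand_n(n: str) -> list[str]:
    """Expand one @n value: split on spaces or commas, unroll numeric ranges."""
    values = []
    for item in n.replace(",", " ").split():
        start, sep, end = item.partition("-")  # only the hyphen-minus, U+002D
        if sep and start.isdecimal() and end.isdecimal():
            values.extend(str(i) for i in range(int(start), int(end) + 1))
        else:
            values.append(item)
    return values


def references(n_path: list[str]) -> list[str]:
    """All synonymous references for a div, given @n values from root to div."""
    return [".".join(combo) for combo in product(*(expand_n(n) for n in n_path))]


if __name__ == "__main__":
    # A verse-level div whose book-level ancestor has two synonymous names.
    print(references(["Mt Matt", "5", "1-3"]))
```

Two names at the top level and a three-member range at the bottom yield six references, all of which point to the same <div>.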

Transcriptions using the Text Encoding Initiative (<TEI>) This section is to be read in conjunction with and , which address related technical issues.

TAN-TEI Some creators and editors of transcriptions will find the rather stripped-down TAN-T format inadequate. Some may wish to mark up the text further. Some may already have a library of transcriptions whose annotations are desirable to keep, even if they are not of interest to every user. In these cases, you should use TAN-TEI, a customization of the Text Encoding Initiative (TEI) format, which is well known for its expressiveness, its stability, its flexibility, and its widespread use in textual scholarship. TEI was designed to be maximally expressive and flexible, to serve the detailed needs of scholars in the humanities. In serving this mission, TEI has come to define more than five hundred different elements, and more than two hundred attributes (roughly six times more than are defined in TAN). Of course, any given TEI file uses only a small subset of those elements and attributes, and TEI itself comes in different flavors, from TEI Lite, which uses only 75 attributes and 140 elements, to TEI All, which opens up almost the entire library. Although TEI XML is oftentimes described as a standard, it lacks characteristics one normally expects of a standard. It is very flexible, admits flavors and interpretation, and is best used when it is customized. Individuals and projects may define their own subset of TEI elements, to constrict or expand the allowable rules as they see fit. TAN-TEI is one of those customizations, based on TEI All. The major difference between TEI All and TAN-TEI is that the latter imposes extra strictures, to ensure that transcriptions are maximally likely to be interchangeable with other TAN-TEI files. All TEI files are validated against a TEI-conformant schema, normally expressed as an XML DTD, RELAX NG, or W3C Schema. TAN's TEI-conformant schema is based upon the TAN-TEI.odd file in the schemas directory, converted to RELAX-NG files (TEI.rnc and TEI.rng) that define the structural rules of TAN-TEI files. 
There is an additional layer of validation, through the related Schematron process (TEI.sch), which performs detailed validation not possible in a TEI-conformant schema. In the discussion below, it is important to distinguish between structural validation and Schematron validation. See .

TEI customization TAN's customization of the TEI can be summarized as follows (the default namespace in this section is the TEI namespace, http://www.tei-c.org/ns/1.0):

Synopsis of TAN-TEI customization (TEI element, followed by its strictures)

<TEI>
    must have @id with tag URN
    must have @TAN-version
    takes a new child element, <head>, placed between <teiHeader> and <text>; it and its descendants must be in the TAN namespace, xmlns:tan="tag:textalign.net,2015:ns"

<text>
    There are no extra strictures, but during Schematron validation (not RELAX-NG), this element and any children <front> and <back> will be ignored. Of its children, only <body> will be Schematron validated.

<body>
    must take @xml:lang
    any non-<div> children will be ignored during Schematron validation; most often only <div> should be children
    contents must be restricted to a single version of a single work
    any and all text nodes will be treated as part of the transcription

<div>
    may encompass a textual division of whatever size you like (TEI defines <div> as being larger than block-like or paragraph-like textual divisions; TAN's <div> is much more like HTML's)
    must take elements; either they all are <div>s (perhaps interleaved with anchors such as <pb>) or none of them are <div>s (non-mixed model)
    must take @type and @n (or only @include)
    @type may take multiple values, space delimited, pointing via IDref to a vocabulary item
    @n allows synonyms, sequences, and ranges, and must match the regular expression defined by $tan:attr-n-regex. If @n is to be given more than one value, those items must be separated by a space or a comma. A hyphen-minus, - (U+002D, the most common form of hyphen), always has special meaning in @n, specifying a range. This feature is useful for cases where a <div> straddles more than one standard reference number (e.g., a translation of Aristotle that cannot be easily tied to Bekker numbers). If you need to use a hyphen-like character in an @n that does not specify a range of numbers, consider ‐ (U+2010 HYPHEN), ‑ (U+2011 NON-BREAKING HYPHEN), ‒ (U+2012 FIGURE DASH), – (U+2013 EN DASH), or − (U+2212 MINUS SIGN).
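The strictures on <div> can be pictured as a small recursive check. The Python sketch below is illustrative only and is no substitute for the RELAX-NG and Schematron validation that TAN supplies. The function name div_problems is hypothetical, and the set of elements treated as anchors (pb, lb, cb, milestone) is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
DIV = f"{{{TEI_NS}}}div"
# Assumption: which TEI elements count as anchors that may sit between divs.
ANCHORS = {f"{{{TEI_NS}}}{name}" for name in ("pb", "lb", "cb", "milestone")}


def div_problems(div: ET.Element, path: str = "") -> list[str]:
    """Collect violations of the <div> strictures, recursing into child divs."""
    here = f"{path}/{div.get('n', '?')}"
    problems = []
    has_include = div.get("include") is not None
    if not has_include and (div.get("type") is None or div.get("n") is None):
        problems.append(f"{here}: needs @type and @n (or only @include)")
    children = list(div)
    child_divs = [c for c in children if c.tag == DIV]
    if not children and not has_include:
        problems.append(f"{here}: must contain elements")
    # Non-mixed model: next to child divs, only anchors are tolerated.
    if child_divs and any(c.tag != DIV and c.tag not in ANCHORS for c in children):
        problems.append(f"{here}: mixes <div>s with other content")
    for child in child_divs:
        problems.extend(div_problems(child, here))
    return problems


if __name__ == "__main__":
    sample = (
        '<div xmlns="http://www.tei-c.org/ns/1.0" type="ch" n="1">'
        '<div type="p" n="1"><p>Text</p></div><pb n="2"/><p>stray</p>'
        '<div type="p"><p>More</p></div></div>'
    )
    for problem in div_problems(ET.fromstring(sample)):
        print(problem)
```

The sample reports two problems: a stray <p> beside sibling <div>s, and a child <div> that lacks @n.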

TAN-TEI files have two heads, each designed for different purposes. Whereas the TAN <head> is meant to be brief and restricted to only those matters relevant to the transcription, the <teiHeader> permits quite an expansive range of metadata, and may be used to encode a variety of things, including those that are tangential or irrelevant to the data. Unlike the TAN <head>, whose data is designed to be both computer- and human-readable, <teiHeader> was designed for data to be read principally by humans; although it can accommodate IRIs, it was not designed around them. Further, a TAN <head> can never be empty and valid; a bare-bones <teiHeader> with no actual text content, such as the following, is considered valid:

<teiHeader>
   <fileDesc>
      <titleStmt><title/></titleStmt>
      <publicationStmt><p/></publicationStmt>
      <sourceDesc><p/></sourceDesc>
   </fileDesc>
</teiHeader>

TAN's Schematron validation process ignores the contents of <teiHeader>, since its contents are unpredictable and therefore not reliably parsable. If your <teiHeader> has any kind of metadata that needs to appear in the TAN <head> (see and ), the conversion needs to be performed manually, since (as mentioned above) the two headers are incommensurate, and writing each one requires a different mentality. In a TAN-TEI file, the TAN <head> must be in the TAN namespace, i.e.,

<head xmlns="tag:textalign.net,2015:ns">

. Alternatively you might write <tan:head xmlns:tan="tag:textalign.net,2015:ns">, but this would require all descendant elements to be prefixed tan:. Within any leaf <div>, you may use whatever TEI markup you wish, to whatever level of depth or complexity. Most users of your TAN-TEI file will be interested in the text; only a subset will care about any markup within leaf <div>s. TEI files are flexible, permitting different approaches to markup. A TAN-TEI file should not be scriptum-oriented, i.e., it should not try to replicate how the text appears or looks on the object. That is because the TAN-TEI file will be used in intertextual comparisons, where the transcription is compared to transcriptions from a wide variety of sources.
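The skeleton requirements for a TAN-TEI root element (a tag URN in @id, a @TAN-version, and a TAN-namespace <head> between <teiHeader> and <text>) can be checked with a few lines of Python. This is a hedged sketch rather than part of TAN: the function name skeleton_problems and the sample @id value are hypothetical, and the check covers only the points just listed.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
TAN = "{tag:textalign.net,2015:ns}"


def skeleton_problems(xml_text: str) -> list[str]:
    """Report missing pieces of the TAN-TEI root skeleton."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != TEI + "TEI":
        problems.append("root element is not TEI in the TEI namespace")
    for attr in ("id", "TAN-version"):
        if root.get(attr) is None:
            problems.append(f"root lacks @{attr}")
    tags = [child.tag for child in root]
    try:
        in_order = (
            tags.index(TEI + "teiHeader")
            < tags.index(TAN + "head")
            < tags.index(TEI + "text")
        )
    except ValueError:  # one of the three children is absent
        in_order = False
    if not in_order:
        problems.append("expected teiHeader, then a TAN-namespace head, then text")
    return problems


if __name__ == "__main__":
    doc = (
        '<TEI xmlns="http://www.tei-c.org/ns/1.0" '
        'id="tag:example.com,2021:demo" TAN-version="2021">'
        '<teiHeader/><head xmlns="tag:textalign.net,2015:ns"/><text/></TEI>'
    )
    print(skeleton_problems(doc))
```

If the namespace declaration on <head> is omitted, the element falls into the TEI namespace and the ordering check fails, which mirrors the requirement stated above.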

Converting TEI to TAN-TEI You may have a TEI file that you wish to convert to TAN-TEI. As a matter of practicality, it is helpful to envision the conversion process as falling into three steps:

1. Structure: insert new processing instructions (pointing to files to perform TAN-TEI structural and Schematron validation); adjust root element by supplying a tag URN for @id and @TAN-version.

2. Metadata: create new <head xmlns="tag:textalign.net,2015:ns"> and populate it.

3. Data: edit <body> to make sure all text nodes are restricted to the content of a single version of a single work; restructure <body> content into nesting <div>s with correct @type and @n values.

It has been the experience of those who have made TEI to TAN-TEI conversions that step 2 is the most time-consuming, particularly in finding suitable IRIs. But step 3 should not be underestimated, either. Many people write TEI files with a focus on the original textual object, and they do not normalize to the level expected in a TAN file. Some TEI files have been written with little attention paid to space and space normalization. Some TEI files are so laden with annotations that the text is impossible to read. In general, the simpler the TEI file the better, with annotations pushed to external files. Some TEI markup is already implicit, or is easily calculable (e.g., <w> to mark words, which should already comport with the tokenization declared in the <head>; users of <w> easily lose track of where space is and isn't). Some TEI markup can be expressed in a class-2 file (e.g., lexico-morphological data, which should be expressed in a TAN-A-lm file). If you have a TEI .odd file that you wish to preserve while incorporating the TAN .odd file, you may be able to do this manually, integrating your .odd file with TAN's. In the future, an application may be written to assist in this process. When you write your new .odd file, you will want to generate a set of .rng or .rnc files and place them in the TAN schemas directory. Be sure to give them a unique name, something other than TEI.* or TAN-TEI.*, so that your generated schema files do not overwrite the standard TAN ones.

Class-2 TAN files, annotations of texts This chapter provides general background to class-2 TAN files. For detailed discussion of individual elements and attributes see . There are three types of class-2 files: TAN-A files provide broad, macroscopic alignment of multiple versions of any number of works. They also support a wide variety of annotations on texts. TAN-A-tok files provide narrow, microscopic alignment of any two class-1 files, annotating word-for-word or character-for-character correspondences between the two texts. TAN-A-lm files express annotations pertaining to lexico-morphology (grammatical part-of-speech), for either a single class-1 file or a language in general. In translation studies, it is common to use the term source (or sources) to refer to a text that has been translated and the term target to refer to the translation. TAN, however, has been designed for situations where it may not be clear which text is the target and which is the source. Further, there is a more generic use of source and target that prevails in many other contexts. In these guidelines, therefore, the term target never refers to a text as such (rather, it normally refers to a file that is being pointed to), and when we use the word source, we are referring only to one of the class-1 files upon which a class-2 alignment depends.

Common elements

Class 2 metadata (<head>) Class-2 files share a few common features in their metadata, mostly to facilitate the human-friendly reference system discussed below. All class-2 files have as their sources nothing other than class-1 files. Therefore each <source> must take the . Editors of class-2 files must be able to name or number word-tokens in a transcription, and to determine an appropriate definition of "token," via an optional <token-definition>. See . Inevitably, some class-1 sources for the same work will differ from each other. Perhaps works or div types were not defined with the same IRIs, or perhaps one version follows an idiosyncratic reference system. If sources need to be reconciled, alterations may be specified in <adjustments>, which stipulates a set of actions that should be applied to the sources that have been named. The following adjustment actions are supported: <skip>, to allow you to ignore specific <div>s, deeply or shallowly. <rename>, to allow you to rename specific <div>s. <equate>, to allow you to provisionally establish some @n values as being synonymous. <reassign>, to allow you to split leaf <div>s and move their parts elsewhere in the structure. These adjustment actions allow you to reconcile discordant sources without changing them directly. Skips, renames, and equates are first applied to the source as received. If a particular source <div> is the target of more than one adjustment action, only the first one will be applied according to action priority: <skip>, <rename> based on @ref, <rename> based on @n, then <equate>. This action priority also corresponds to the amount of time needed to process the adjustments. Numerous <skip> actions are applied very quickly. Numerous <reassign>s, however, can be time-consuming, because they require tokenizing the text. Because of this priority order, some actions might not be performed. 
For example, if you deeply skip a <div>, no renaming adjustments will be made to its children. Skips, renames, and equates are applied in one pass, based on the original reference system, then <reassign>s are applied to the newly adjusted source. If you rename a div, then want to reassign it, you must do so based on the new name, not the original. Each adjustment action adds time to the validation routines. On lengthy texts these can become quite time-consuming. Take, for example, the Tanakh / Old Testament in Hebrew, Greek Septuagint, and English (King James Version). Each of these differs from the others in the names of books, and the numeration of some chapters and verses (primarily the books of Psalms, Jeremiah, Joel, and Hosea). To completely reconcile these three versions requires at least 1 <skip>, 237 <rename>s, 3 <equate>s, and 31 <reassign>s. Applying these actions to all three versions can take about two minutes (tested on a computer with an Intel i5-8250U, 12 GB RAM), before any other significant validation checks on anything inside the <body> of the class-2 file. In earlier generations of TAN, this process took upwards of an hour. If such processing times are unacceptable, you are advised to keep <adjustments> to a minimum or to apply them to relatively small texts. Further, adjustment actions were intended primarily to address common irregularities between files, to apply some last-minute touches, or perhaps to drop certain parts of texts. Adjustments were not designed to provide extensive, deep corrections. If a source must be changed in numerous places to reconcile it with other sources, you should create a new version of the source, reorganized as you prefer. Then in both the new and original versions of the class-1 files insert <redivision>, <predecessor>, <successor>, or <see-also> to link the two versions. There is a TAN application that remodels one text in the image of another. See applications/remodel/remodel text.xsl. 
The output of that application requires editing, but it can reduce the amount of work required. TAN tools for Oxygen's Author mode can also be used to correct that newly segmented text.
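The action priority for the first adjustment pass (skip, rename based on @ref, rename based on @n, equate) and the effect of a deep skip on descendants can be modeled in a few lines of Python. This is a toy model for illustration, not TAN's implementation: the string labels, the tuple representation of a reference, and both function names are hypothetical.

```python
from typing import Optional

# Priority order as stated in the guidelines; the labels are this sketch's own.
PRIORITY = ["skip", "rename-by-ref", "rename-by-n", "equate"]

Ref = tuple[str, ...]  # a div reference as a tuple of @n steps, root first


def winning_action(candidates: list[str]) -> Optional[str]:
    """Pick the one action applied to a div, or None if nothing targets it."""
    return min(candidates, key=PRIORITY.index, default=None)


def plan_first_pass(targets: dict[Ref, list[str]], deep_skips: set[Ref]) -> dict[Ref, str]:
    """Decide which action, if any, is applied to each targeted div."""
    applied = {}
    for ref, candidates in targets.items():
        # Descendants of a deeply skipped div receive no adjustment at all.
        if any(len(ref) > len(s) and ref[: len(s)] == s for s in deep_skips):
            continue
        action = winning_action(candidates)
        if action is not None:
            applied[ref] = action
    return applied


if __name__ == "__main__":
    targets = {
        ("1",): ["rename-by-n", "skip"],
        ("1", "2"): ["rename-by-ref"],
        ("2",): ["equate", "rename-by-n"],
    }
    print(plan_first_pass(targets, deep_skips={("1",)}))
```

In the sample, div 1 is skipped rather than renamed, its child 1.2 is never renamed because its parent is deeply skipped, and div 2 is renamed rather than equated.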

Class 2 data (<body>) Data types differ greatly between the class-2 formats. However, they all share one thing in common: the <body> consists of a series of claims, and responsibility for those claims should be attributed to the persons, organizations, or algorithms making the claims. Therefore, each <body> may take @claimant and perhaps @claim-when, specifying by IDref who should be credited with or blamed for the material. If either attribute is missing, it is assumed that the claims are the responsibility of the persons listed in <file-resp>. The values of @claimant and @claim-when are weakly inheritable.

Class 2 pointer syntax: referencing texts The class-2 formats have been designed to be human readable, particularly text references. In ordinary conversation, when referring to specific parts of a work, we prefer to use the numbers or names of pages, paragraphs, sentences, lines, words, letters, and so forth, and sometimes relational words (e.g., "first"). We might say, for example, "See page 4, second paragraph, the last four words." Sometimes we quote the very text itself: "See page 4, second paragraph, first sentence, second occurrence of 'pull'." Those familiar conventions are the basis for the TAN pointer syntax, which differs from other pointer systems (e.g., URLs, XPath, and XPointer). TAN pointers apply common reference terminology to four strata of a text: works, divisions, word tokens, and characters. Works, defined above (see ), are declared by the source (which may not have more than one work). Divisions are defined by the <div> structure of each source. Tokens are words of the text in those divisions, defined according to one or more <token-definition>s declared in the class-2 file. And characters are defined as individual base letters in a word token (any modifier character is treated in concert with the last preceding base character; see ). This approach not only makes the syntax human readable but mitigates the effect of changes to the sources. For example, if a <div> is deleted, moved, or changed, the alteration affects only references specific to that <div> and its descendants; the rest of the reference system remains intact. The four parts of TAN's reference system are explained below, but you should consult other parts of the guidelines, or study TAN examples, to see how they are used in practice.

Referencing works: @work This section applies only to TAN-A files, because the other class-2 files do not make claims about works per se. TAN-A files refer to works via meaningful IDrefs that point to the class-1 sources that transcribe the work/work-version, e.g., work="hamlet". The reference is understood to apply not merely to that particular source, but to any TAN-T file that claims to transcribe that work or work-version. (On the relationship between works and work-versions see .) Thus, the id of the source-scriptum becomes a proxy or alias for the work. Any work may also be defined through a vocabulary item <work>, either locally in the <vocabulary-key> or in a TAN-voc file linked via <vocabulary>. The work would then be referred to by @xml:id or <name> of the particular vocabulary item.

Referencing textual divisions: @ref Portions of text, i.e., <div>s, perhaps altered if <adjustments> have been invoked (see ), are pointed to via @ref. A @ref is constructed by taking the values of @n in the <div> in question along with its ancestor <div>s, and joining them with non-word characters. For example, @ref="I.1.1" might point to the following:

<div type="act" n="1">
   <div type="scene" n="1">
      <div type="line" n="1"> . . . . . . </div>
      . . . . . .
   </div>
   . . . . . .
</div>

A @ref can express sequences and ranges of <div>s. In the example ref="1.2-4, 1.5", the hyphen and comma are reserved characters, signifying ranges and series respectively. A hyphen always means "from...through" and a comma always means "and". In the TAN format, commas are always paratactic, not hypotactic. For example, if referring to Hamlet, ref="I,2,3" is not a reference to a single <div>, act I scene 2 line 3, but rather to three of them: act I, act 2, and act 3 (notice how the commas in the attribute value behave like the commas in the written phrase). If you mean to say act I, scene 2, line 3 try ref="I.2.3" or ref="I 2 3". The periods (full stops) in @ref="I.1.1" are hypotactic markers, but they are arbitrary, and could be replaced with any mix of non-word characters you like (except the hyphen or comma), including spaces, e.g., ref="I:1 1". The numeral system is also arbitrary. You may use any supported numeration system (see section on numeration systems), even if the source uses a different one. Semantic equivalents to the preceding example are ref="A I i" and ref="1:a:I". Just remember, if you use either the Roman numeral system or alphabetic sequences, include a <numerals> in the <head> to specify which system should prevail in case of ambiguities (e.g., whether c means 3 or 100). Roman numerals are the default, but it is a good idea to be explicit.
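The reserved roles of comma, hyphen, and other non-word characters in @ref can be made concrete with a short Python parser. It is a minimal sketch under stated assumptions: the function names are hypothetical, numerals are left exactly as written (no conversion between numeral systems), and a shortened range end such as the 4 in 1.2-4 is returned as given, without attempting to resolve it against the range start.

```python
import re
from typing import Optional

Steps = list[str]


def _steps(text: str) -> Steps:
    """Split one reference into its hierarchical steps on non-word characters."""
    return [step for step in re.split(r"\W+", text) if step]


def parse_ref(ref: str) -> list[tuple[Steps, Optional[Steps]]]:
    """Parse @ref into (start, end) pairs; end is None for a single div."""
    items = []
    for part in ref.split(","):  # a comma always means "and"
        start, hyphen, end = part.partition("-")  # a hyphen means "from...through"
        items.append((_steps(start), _steps(end) if hyphen else None))
    return items


if __name__ == "__main__":
    print(parse_ref("1.2-4, 1.5"))
    print(parse_ref("I,2,3"))
    print(parse_ref("I:2 3"))
```

The value I,2,3 yields three separate references, whereas I.2.3, I 2 3, and I:2 3 all parse to the same single three-step reference.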

Referencing tokens: @pos and @val To point to a token one normally uses <tok>, with one or more attributes, in three possible configurations:

1. @val or @rgx alone: one or more tokens are pointed to by value. For example, val="bird" points to every occurrence of the token bird; rgx="b.+d" finds every word that begins with a b, ends with a d, and has some characters in between. Every value of @rgx is implicitly bound to the beginning and end of the string (see below).

2. @pos alone: one or more tokens are pointed to by numerical position, via one or more digits, or the phrase last or last- plus a digit, joined by hyphens or commas. For example, 2, 4-6, last-2 - last refers to the second, fourth, fifth, sixth, antepenult, penult, and final tokens in a passage. The numerical value to which the keyword last resolves depends upon the context length.

3. @val or @rgx combined with @pos: a combination of the previous two methods. For example, val="bird" pos="2, 4" picks the second and fourth occurrences of the token bird.

During Schematron validation, if @pos is missing, it is assumed to mean * or 1 - last; if neither @val nor @rgx appears, the assumption is @rgx with value .+ (any characters). That is, by default, @pos points to every instance and @val/@rgx to every token. When using @pos make sure you know the context. For example, the attribute combination

val="bird" pos="last-1"

will produce an error if the token bird does not occur at least two times in the given context. It is advisable to use @val or perhaps @rgx, and not merely @pos. If your source's text changes, and there is no @val, it may be difficult to determine the original intent of a claim, to determine whether changes need to be made. @val is easier than @rgx to process in applications, particularly when compiling statistics or estimating probabilities. Furthermore, @val is generally speaking more efficient to process than is @rgx. A @rgx is more efficient only if it replaces numerous instances of @val. @rgx is a regular expression that must match an entire word-token. For example, @rgx="re.d" will match the tokens "rend" and "read" but will not match "already", "rends", or "bread". If you wish to allow for characters at the beginning or end, use ".*re.d.*". For more on regular expressions, see .
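The interplay of @val, @rgx, and @pos described in this section can be sketched in Python. The sketch is an approximation under stated assumptions: select_tokens and expand_pos are hypothetical names, the passage is assumed to be already tokenized into a list, and Python's re.fullmatch stands in for TAN's implicitly anchored regular expressions (the two regular expression dialects differ in details).

```python
import re

# One term of @pos: a digit, "last", or "last-N" (assumed written without inner spaces).
TERM = r"(last(?:-\d+)?|\d+)"

def _resolve(term, count):
    """Turn a @pos term into an integer, given how many candidates exist."""
    if term.startswith("last"):
        offset = int(term[5:]) if len(term) > 4 else 0
        return count - offset
    return int(term)

def expand_pos(pos, count):
    """Expand e.g. '2, 4-6, last-2 - last' into a list of 1-based positions."""
    positions = []
    for item in pos.split(","):
        m = re.fullmatch(rf"\s*{TERM}\s*(?:-\s*{TERM})?\s*", item)
        if m is None:
            raise ValueError(f"cannot parse @pos item: {item!r}")
        start = _resolve(m.group(1), count)
        end = _resolve(m.group(2), count) if m.group(2) else start
        if not 1 <= start <= end <= count:
            raise ValueError(f"@pos item {item.strip()!r} is out of range for {count} candidates")
        positions.extend(range(start, end + 1))
    return positions

def select_tokens(tokens, val=None, rgx=None, pos=None):
    """Return the 1-based positions, within the passage, of the tokens pointed to."""
    if val is not None:
        hits = [i for i, t in enumerate(tokens, 1) if t == val]
    else:
        pattern = rgx if rgx is not None else ".+"   # default when @val and @rgx are absent
        hits = [i for i, t in enumerate(tokens, 1) if re.fullmatch(pattern, t)]
    if pos is None:                                   # default: 1 - last
        return hits
    # @pos counts among the matching tokens, not among all tokens.
    return [hits[p - 1] for p in expand_pos(pos, len(hits))]

passage = "a bird saw a bird".split()
print(select_tokens(passage, val="bird", pos="last-1"))
```

With a passage that contains bird only once, the same call raises a ValueError, which mirrors the validation error described above.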

Referencing characters: <code><link linkend="attribute-chars" >@chars</link></code> Individual letters are always specified by @chars, which points to a specific position, e.g.,

chars="2, 7, last"

. Combining characters are excluded from these counts; see .

General annotations and alignments (<code><link linkend="element-TAN-A" ><TAN-A></link></code>) TAN-A is the format for macroscopic, division-based alignment and annotations of class-1 sources. It allows you to align any number of versions of any number of works on the basis of <div>s. The A also stands for annotations, because the TAN-A format allows you to make general assertions, usually but not necessarily about texts. TAN-A is a type of advanced RDF for textual scholarship (see ).

Root element and header The root element of a TAN division-based alignment file is <TAN-A>. TAN-A's <head> has zero or more <source>s. Any concepts that will be mentioned in the <claim>s (the only children of <body>) need to be supplied in <vocabulary-key> or an associated TAN-voc file invoked by <vocabulary>.

Data (<code><link linkend="element-body"><body></link></code>) The <body> of a TAN-A file takes, in addition to the customary optional attributes (see ), @claimant, @object, @subject, or @verb, stipulating the default values for any enclosed claims. The rest of the body consists of zero or more <claim>s, each of which represents one or more claims. Claims can be made about a variety of things, e.g.: to index quotations and allusions; to specify the subjects and topics dealt with in particular textual passages; to connect commentary or notes from one source to another; to indicate where other scripta have different readings (apparatus criticus); to establish work-version relationships. <claim>'s data model is inspired by the Resource Description Framework (RDF; see ), where each statement consists of three items termed a subject, a predicate, and an object. The first and third are thought of as nodes, and the second as a connector (or edge) between the nodes. RDF adopts a graph model, where the connector (edge) always links exactly two nodes. RDF is adequate for a limited range of scholarly assertions. An RDF statement lacks context or qualifiers. No simple RDF statement, called a triple, can indicate who made the assertion, or when, or if it was uttered with any doubt or nuance. Sometimes we wish to claim a bare negation, e.g., "Aristotle was not the author of De mundo", which cannot be expressed in RDF. Any TAN claim that is exported to any RDF format should adopt the principles of RDF*, which allows for complex, reified RDF statements. As of this writing, the specifications for RDF* are still being written. TAN's <claim> extends the graph RDF model into a hypergraph, where the connector (edge) links two or more nodes. The following adjustments are made:
1. Every claim must have at least one claimant, some person, organization, or algorithm to be credited/blamed for the assertion.
2. Every claim must have at least one subject, the topic of the claim.
3. Every claim must have at least one verb (in RDF called predicate), specifying something about the subject.
4. Every claim may have at least one adverb, qualifying the verb.
5. Every claim may assert a level or range of certainty, between zero and one, reflecting how certain the claimant is of the claim.
6. Every claim may have at least one object, an entity or value expected by the verb.
7. Every claim may have at least one temporal qualifier, restricting the claim to a specific time.
8. Every claim may have at least one locative qualifier, restricting the claim to a specific geographical region.
9. Every claim may have other components specially defined by the verb. Currently, this entails for select verbs a language qualifier (@in-lang, <in-lang>) and a reference qualifier (<at-ref>).
Items 1-3 above are required parts of any claim. Items 4-9 may be rendered as being required, optional, or disallowed by a <verb>'s definition. For example, a <verb> representing an idea that in normal discourse is intransitive (e.g., sleep) can be defined such that <object> is not allowed. Furthermore, a <verb> may be defined to restrict what kinds of objects or subjects are allowed. For example, the standard TAN verb lacks_text_at (see vocabularies/verbs.TAN-voc.xml) is defined to allow only scripta as a subject. No objects are allowed. Rather, a <claim> with this verb expects one or more <at-ref>s, which restrict the claim to a particular passage in a TAN-T file. Examples: A <verb> can specify that an object must be data, and it can also define the type of data allowed and its permitted lexical form. <verb>s take a special extension to their IRI + name pattern, permitting constraints that specify what is allowed, required, or prohibited. 
Some examples of <verb> vocabulary items: Examples of verb vocabulary items <verb xml:id="wrote"> <IRI>http://rdaregistry.info/Elements/u/P60663</IRI> <name>is author of</name> <constraints> <subject status="required" item-type="person"/> <object status="required" item-type="work version"/> </constraints> </verb> . . . . . . . <verb group="zero_objects"> <IRI>tag:textalign.net,2015:verb:lacks-text</IRI> <name>lacks text</name> <name>lacks text at</name> <desc>At the <at-ref>, the textual entity referred to by the subject lacks any text. The claim takes no object.</desc> <constraints> <subject status="required" item-type="scriptum"/> <object status="disallowed"/> <at-ref status="required"/> </constraints> </verb> . . . . . . . <verb xml:id="survives-in-original-language"> <IRI>tag:kalvesmaki.com,2014:verb:work-survives-in-original-language</IRI> <name>original work is extant to what degree</name> <desc>This verb is used to describe the degree to which a work survives in the original language of composition. It takes as object an xs:double between 0 and 1, representing the approximate percentage that is extant. This property does not stipulate how close to the first or earliest version the extant material is.</desc> <constraints> <subject status="required" item-type="work version"/> <object status="required" content-datatype="double" content-lexical-constraint="[01]\.0*|0\.\d+"/> </constraints> </verb> Other examples of <verb>s can be found at vocabularies/verbs.TAN-voc.xml. Claims may refer to other claims. That is, <claim>s can nest inside each other (e.g., X claims that Y claims that Z claims that...). Or a <claim> may take an @xml:id, whose value can then be cited as the object or subject of any other <claim>. If a <claim> is about a work or source in general, as a whole, one or more IDrefs may be placed in @subject or @object. But if the claim is about a specific part of the textual object, then more information is needed, so the attributes cannot be used. 
Such textual references come in three flavors: assertions pertaining to a work, assertions pertaining to a work in only some versions, and assertions pertaining to scripta. In the first case, <subject> or <object> must take @work, with IDrefs pointing to vocabulary items for <work>s. In the second case, @src is used, pointing by IDref to the applicable <source>s. In the third case @scriptum is used, pointing to vocabulary items for <scriptum>. Remember, you may combine commonly grouped IDrefs in an <alias>. A @work means that the claim applies to any versions of the work, whether a source or not; a @src specifies that the claim applies only to the specific <source>s, and not to every possible version. In each case, <subject> or <object> may be given more attributes and elements to restrict the claim to specific parts of the work or source, with @ref, <tok>, @val, @pos, and @chars, following the conventions used in pointing to parts of texts (see ). If a <subject> or <object> points via @scriptum to a scriptum, specifying the claim necessarily takes a different approach than that used for @work or @src. Bear in mind that these guidelines encourage you to avoid scriptum-oriented methods of dividing class 1 files. Clarifying a portion of a scriptum (e.g., a particular manuscript folio number) therefore requires an apparatus that likely does not correspond to a TAN file. For that reason, a <subject> or <object> with a @scriptum can be restricted to a particular region through descendant <div>s that specify via @n and @type specific parts of the scriptum. These scriptum filters, unlike TAN-T <div>s, are always empty; their sole purpose is to point in native terms to a specific region on a scriptum. Multiple values in any component of a <claim> are distributed, which means that one <claim> might contain multiple assertions. For example,

<claim subject="A B" verb="taught promoted" object="X Y Z"/>

has within it twelve claims (the combinatory permutations of the three attributes' individual values). The exception to this general rule is @adverb, whose multiple values are taken as ampliative and restrictive. For example,

<claim subject="A" adverb="probably not" verb="taught" object="X"/>

is a single claim, not two, even though @adverb has two values. A limited set of verbs have been defined in standard TAN vocabulary; see . The strictures defined in these verbs are checked during Schematron validation. For a brief discussion on defining your own verbs in a TAN-voc file see . Aspects of the discussion can be illustrated with select examples of claims: Examples of claims <claim subject="cpg2440-syr" verb="translates" object="cpg2440"/> . . . . . . . <claim subject="Λ" adverb="perhaps" verb="reads"> <at-ref src="grc" ref="1 a 5"> <tok pos="1-2"/> </at-ref> <object>τις ἀποδιδῷ</object> </claim> . . . . . . . <claim subject="cpg2430" verb="has-incipit"> <object xml:lang="grc">Ἐπειδή μοι πρώην δεδήλωκας ἀπὸ τοῦ ἁγίου ὄρους ἐν τῇ Σκίτει καθεζομένῳ</object> </claim> . . . . . . . <claim verb="edits" adverb="partially" object="cpg2430"> <subject which="Muyldermans_1932"> <div n="84-89, 91-92" type="page"/> </subject> </claim> . . . . . . . <claim verb="paraphrases"> <subject work="pr" ref="13"/> <object work="nt" ref="1Th 2:6"/> </claim> . . . . . . . <claim verb="quotes"> <subject src="grc-Mu1931" ref="I 87"/> <object work="lxx" ref="Wis 13:5"/> </claim> . . . . . . . <claim verb="alludes_to"> <subject work="KG" ref="II 86"/> <object work="lxx" ref="Ex 25:30"/> <object work="nt" ref="Heb 9:2"/> </claim>
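The distribution rule for multiple values in a <claim> can be illustrated with a short Python sketch. It is not how the TAN functions are written; distribute_claim is a hypothetical name, and the sketch assumes that each attribute value is a space-delimited list of IDrefs, with @adverb kept whole as described above.

```python
from itertools import product

def distribute_claim(subject, verb, obj, adverb=None):
    """Expand one <claim> with multi-valued attributes into its atomic claims.
    The adverb is not distributed: all of its values qualify each claim jointly."""
    return [
        {"subject": s, "verb": v, "object": o, "adverb": adverb}
        for s, v, o in product(subject.split(), verb.split(), obj.split())
    ]

claims = distribute_claim("A B", "taught promoted", "X Y Z")
print(len(claims))   # 2 subjects x 2 verbs x 3 objects

single = distribute_claim("A", "taught", "X", adverb="probably not")
print(single)
```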

Token-based annotations and alignments (<code><link linkend="element-TAN-A-tok" ><TAN-A-tok></link></code>) TAN-A-tok files facilitate the microscopic alignment of two related sources. The format is intended to allow you to specify exactly where, how, and why two transcriptions align, and to do so on the most granular level possible. TAN-A-tok files also allow you to express levels of confidence or alternative opinions. A TAN-A-tok file takes two class-1 sources, which should be two different versions of the same work. Most often, one will be a translation of the other, but the format can be used for two versions of the text in the same language, e.g., paraphrase, revision. Creators and editors of TAN-A-tok files should be able to read the languages of their sources and to explain as precisely as possible the relationship between the two sources. They should be prepared to think about and specify types of textual reuse. TAN-A-tok files tend to be more demanding to create and edit than TAN-A files are because of the level of detail involved. To simplify the file, token alignment is restricted to two texts, referred to jointly as a bitext. Each half of the bitext must be a TAN-T(EI) file. It is assumed that those two sources share some special relationship, direct or indirect, and relate through one or more types of textual reuse: translation, paraphrase, commentary, and so forth. Some of these bitexts, such as literal translations, may line up quite nicely word for word. Others, such as paraphrases, may line up sporadically, vaguely, ambiguously, or, in places, not at all. Annotating a bitext is oftentimes not easy, and requires you to consider and declare assumptions you have made in two key areas: the relationship that holds between two scripta and the types of reuse that were involved in turning one version into the other (or a common ancestor into both). Relationship of sources' scripta. 
What is the physical relationship or history that connects the two sources' scripta? Is one a direct descendant (copy) of the other? If not, what common ancestor do they share? Here you should consider the material aspect of the bitext, because you are trying to answer how object A's text relates to object B's. See . Types of reuse. What categories of text reuse do you consider operative? Users of your data should be informed of the paradigm you bring to your analysis. You may wish to keep your categories nondescript and somewhat vague, using generic terms such as translation, paraphrase, quotation, without much specificity. On the other hand, you may subscribe to a detailed view of text reuse. Perhaps you have adopted field-specific categories such as obligatory explicitation, optional explicitation, pragmatic explicitation, or translation-inherent explicitation. You may also wish to declare secondary types of reuse that may have intervened, such as scribal omission or dittography. You must declare at least one type of reuse, whether one you define yourself or one built into the TAN format. See .

Root Element and Header The root element of a token-based alignment file is <TAN-A-tok>. The TAN-A-tok header builds upon the core and class 2 headers (see and ). TAN-A-tok files take exactly two <source>s. The sequence is arbitrary. Each <source> must take an @xml:id. <vocabulary-key> takes, in addition to all the elements allowed in class-2 files (see ), two elements unique to TAN-A-tok: <bitext-relation> and <reuse-type>. The former describes the genealogical relationship between each source's scripta. The second attends to the qualitative aspect of the bitext relationship. See above.

Data (<code><link linkend="element-body"><body></link></code>) The <body> of a TAN-A-tok file takes, in addition to the customary optional attributes (see ), required @bitext-relation and @reuse-type, which take one or more IDrefs from <bitext-relation> and <reuse-type>, indicating the default values that govern the alignment. <body> has only one type of child: one or more <align>s, each of which collects sets of <tok>s from one or both sources, known collectively as a token cluster. Clusters may overlap, to handle translations in which words fall in one-to-one, one-to-many, many-to-one, and many-to-many relationships. The independence of token clusters allows you to register differences of opinion about the same set of tokens. An <align> may take an @xml:id, in case you or someone else wishes to refer to a particular <align>. Nothing should be inferred from silence in a TAN-A-tok file. There is no requirement that everything in a source must be encoded or described. In writing and editing a TAN-A-tok file you do not commit yourself to saying everything possible about the bitext. You might choose to encode only a few token clusters. Tokens that are not referred to should not be interpreted as gaps in a translation. All that can be inferred is that the creators and editors of the TAN-A-tok file have said nothing about the tokens. (See discussion on comprehensiveness.) In fact it is oftentimes preferable to have a TAN-A-tok file that points to only a selection of tokens; a file with tens of thousands of <align>s could take a very long time to validate, or to process in applications. Any token may be a member of as many <align>s as you like. In fact, this is preferred if you wish to register competing claims or alternatives. 
If you wish to declare that one or more words in a source were omitted from a translation or inserted into one—that is, words in one source have no match in the other—you must do so through a one-sided alignment, i.e., a token cluster that has tokens from only one source. A one-sided alignment implies insertions or omissions. If there are multiple values in @reuse-type or @bitext-relation, the intersection, not the union, of those values is to be understood. For example, reuse-type="translation paraphrase" would indicate that the token cluster results from an activity that is both translation and paraphrase, not one or the other. If a particular <align> might be one reuse type or the other, but not both, then create two <align>s, qualifying each one with a different value for @reuse-type. Then add @cert, indicating through a decimal number between 0 and 1 how confident you are that that particular reuse-type is accurate. @cert2 can also be added, in case you do not want to commit yourself to such a precise number. Commonly, <tok>s include @ref, pointing to a leaf <div>. But this is not required. The @ref may point to a <div> that takes other <div>s, or @ref may be altogether absent. If a <tok> lacks a @ref then it means that the claim is true for all instances of that word in the source, no matter where found. Examples of TAN-A-tok anas <align> <tok src="ring1881" ref="2" val="pocket"/> <tok src="ring1987" ref="2" val="pocket"/> </align> . . . . . . . <align reuse-type="stylistic_minus"> <tok src="grc" ref="Col 1 4" pos="11 - 12"/> <tok src="syr" ref="Col 1 4" pos="7" chars="last-2 - last"/> </align>

Lexico-morphology (<code><link linkend="element-TAN-A-lm" ><TAN-A-lm></link></code>) TAN-A-lm files are used to annotate a class-1 source by specifying the lexical and morphological properties of its tokens or morphemes. Every TAN-A-lm file has two different types of dependencies: a class 1 source (optional) and the grammatical rules defined in one or more TAN-mor files. This section therefore should be read in close conjunction with the chapter on TAN-mor files. TAN-A-lm files are either source-specific or language-specific. Source-specific TAN-A-lm files depend exclusively upon one class-1 source. Source-specific TAN-A-lm files are useful for closely analyzing the grammatical properties of the words in one particular text. Well-curated source-specific TAN-A-lm files are enormously useful for other applications, e.g., quotation detection. Any source-specific TAN-A-lm file can be converted into a language-specific one, to be used as noted below. Language-specific TAN-A-lm files depend upon an unknown number of sources. Some language-specific TAN-A-lm files might be based upon a small, specific corpus, perhaps just one text. Others might rely upon a vast, general one. Language-specific TAN-A-lm files are useful for building language resources for computer applications. Many language-specific TAN-A-lm files become the basis for a local language catalog, which can be used to populate a new source-specific TAN-A-lm file.

Principles and assumptions Editors of TAN-A-lm files should understand the vocabulary and grammar of the languages of their sources. They should have a good sense of the rules established by the lexical and grammatical authorities adopted. They should be familiar with the conventions and assumptions of the TAN-mor files being used. Although you must assume the point of view of a particular grammar and lexicon, you need not hold to a single one. In addition, you may bring to the analysis your own expertise and supply lexical headwords unattested in published authorities. Although TAN-A-lm files are simple, they can be laborious to write and edit, more than any other type of TAN file. They can also be hard to read if the morphological codes are cryptic. It is customary for an editor of a TAN-A-lm file to use tools to create and edit the data.

Root Element and Header The root element of a lexico-morphological file is TAN-A-lm. If the file is source-specific, <source> points to the one and only TAN-T(EI) file that is the object of analysis. If the file is language-specific, <for-lang> is used in the declarations section of the <head> to indicate the languages that are covered. For highly inflected languages, language-specific TAN-A-lm files can be enormous in size or quantity. To improve performance when validating and processing numerous or large language-specific TAN-A-lm files, the <head> may also include <tok-starts-with> and <tok-is>. It is common for language-specific TAN-A-lm files to be cataloged in a <collection> file. These become part of the local language catalog, bound to the global parameter $tan:lang-catalog-map, found in parameters/params-application-language.xsl. By including in that parameter your collections to language-specific TAN-A-lm files, you open up those resources to use in a variety of other applications. In that <collection> file, the individual <doc>s that point to language-specific TAN-A-lm files should include as children any <tok-starts-with> and <tok-is> as in the original. 
Example of a catalog entry for a language-specific TAN-A-lm file <doc href="lat-tan-a-lm-abu.xml" TAN-version="2021" id="tag:kalvesmaki.com,2015:tan-a-lm:lat:perseus:abu" lexicon="LS" morphology="perseus-dik" claimant="xslt1" root="TAN-A-lm"> <name xmlns="tag:textalign.net,2015:ns">Perseus lexico-morphological permutations devoted exclusively to abu</name> <license xmlns="tag:textalign.net,2015:ns" which="Attribution-ShareAlike 3.0 Unported" licensor="perseus"/> <for-lang xmlns="tag:textalign.net,2015:ns">lat</for-lang> <tok-starts-with xmlns="tag:textalign.net,2015:ns">Abu</tok-starts-with> <tok-starts-with xmlns="tag:textalign.net,2015:ns">abu</tok-starts-with> <tok-starts-with xmlns="tag:textalign.net,2015:ns">abú</tok-starts-with> </doc> Conversion from a source-specific TAN-A-lm to a language-specific one is a one-way operation. There is at present no mechanism for automatically reconstructing the corpus that underlies a language-specific TAN-A-lm file. <vocabulary-key> takes the elements other class-2 files take (see ). It also permits two elements unique to TAN-A-lm: <lexicon> (optional) and <morphology> (mandatory). Any number of lexica and morphologies may be declared; the order is inconsequential. There is, at present, no TAN format for lexica and dictionaries. So even if a digital form of a dictionary is identified through the IRI + name pattern, the Schematron validation routine will not attempt to check the TAN-A-lm data against the lexical authorities cited. Because you or other TAN-A-lm editors are likely to be authorities in your own right, <person> can be treated as if a <lexicon>, and be referred to by @lexicon.

Data (<code><link linkend="element-body"><body></link></code>) The <body> of a TAN-A-lm file takes, in addition to the customary optional attributes found in other TAN files (see ), @lexicon and @morphology, to specify the default lexicon and grammar. <body> has only one type of child: one or more <ana>s (short for analysis), each of which matches one or more tokens (<tok>) to one or more lexemes or morphological assertions (<lm>, which takes zero or more <l>s followed by one or more <m>s). An <ana> may take a @tok-pop, to specify the number of tokens that the assertion applies to. This is particularly helpful for language-specific files based upon a limited corpus of texts, where the underlying data for the assertion might be difficult or impossible to retrieve. The token population can be used to calibrate levels of certainty, or to compare statistical profiles of one TAN-A-lm file against another. If you wish to point to a linguistic token that straddles more than one token, you should use multiple <tok>s, wrapping them in a <group>. Any token may be the object of as many <ana>s as you like. In fact, this is preferred if you wish to register competing claims or alternatives. Claims within an <ana> are distributed. That is, every combination of <l> and <m> (governed by <lm>) is asserted to be true for every <tok> or <group>. If an <lm> lacks an <l>, the token value itself, calculated for each <tok>, is taken to be the default value of the lexeme. All assertions are assumed to be made with 100% confidence unless @cert is invoked. This still holds even when a <tok> is the subject of multiple <ana>s, because it is possible to be completely confident that a given word has two different grammatical profiles in the target text (e.g., puns, wordplay). Many TAN-A-lm files will be generated by an algorithm that automatically lists all possible morphological values of each token. 
It is advised that such automatic calculations always include in their output @cert, with weighted values. That is, if an algorithm identifies two possible lexico-morphological profiles for a word, but one occurs nine times more often than the other, then it is advised that this be reflected in the two resultant elements, e.g.: <lm cert="0.9">...</lm> and <lm cert="0.1">...</lm>. If an algorithm is written with a more sophisticated way to weigh possibilities, then adjust the value of @cert accordingly. Be certain that the <algorithm> is credited in the <vocabulary-key> and in a <resp>. As with TAN-A-tok files, not every word needs to be explained or described. In fact, complete coverage is oftentimes undesirable, because it produces files that are overly long and time-consuming to validate or process. A TAN-A-lm file is rendered more efficient when claims can be grouped. If a particular token invariably has a single lexico-morphological profile, this can be declared once, in a <tok> that does not have @ref. If the token has a particular profile in a given region of text, it can be specified through a @ref that encompasses the specified region. You do not need to provide a <tok> for every token, which would entail restricting @ref to leaf divs. You may do so, but such an approach can result in very long files that are time-consuming to validate, process, and edit. It is more advantageous to declare lexico-morphological properties more generally, thereby replacing numerous leaf-div <tok>s. The benefits in processing time are significant. In early versions of TAN, the lexico-morphological values of the Greek Septuagint (8.3 MB) were converted to a TAN-A-lm file of 407,811 <tok>s, one per token per leaf div, grouped in 52,703 <ana>s (25.8 MB). Early 2020 validation routines took about 25 minutes (2018 validation routines took hours). The long processing time is due primarily to the TAN-A-lm file itemizing every single token in the text. 
That same file was revised to be more declarative along the lines advocated above. If a particular token had only one lexico-morphological profile throughout the text, then every instance was reduced to a single <ana>, with no @ref in <tok>. When a particular token value had different lexico-morphological profiles, @ref targeted the rootmost <div> that encompassed them all. This revision resulted in a smaller file (15.8 MB; 158,376 <tok>s in 54,335 <ana>s) that validated in about a third of the time (8.5 minutes). In general, there is always a trade-off between convenience and efficiency. If your priority is speed, you should break a large file into several smaller ones, perhaps recombining them in a master file via <inclusion> (see ). Applications can be written to convert TAN-A-lm <m> data from one morphological system to another. This is a two-step process facilitated by the functions tan:morphological-code-conversion-maps() and tan:convert-morphological-codes(). See documentation in these guidelines or in functions/language/TAN-fn-language-extended.xsl. Examples of TAN-A-lm data <ana> <group> <tok ref="1" pos="1 - last-1"/> </group> <lm> <l>ring-a-ring-a-rose</l> <m>NNS</m> </lm> </ana> . . . . . . . <ana> <tok ref="10 6 3 2" pos="4"/> <tok ref="10 6 3 3" pos="15"/> <tok ref="10 6 4 2" pos="37"/> <lm> <l>Σωκράτης</l> <m>n e - s - - - m g -</m> </lm> </ana> . . . . . . . <ana> <tok val="τούτῳ"/> <lm> <l>οὗτος</l> <m cert="0.358311302048909457">p d - s - - - m d</m> <m cert="0.241688697951090546">p d - s - - - n d</m> <m cert="0.2">p - - s - - - m d</m> <m cert="0.2">p - - s - - - n d</m> </lm> </ana> . . . . . . . <ana> <tok val="ABERRO"/> <tok val="Aberro"/> <tok val="aberro"/> <lm> <l>aberro</l> <m>v - 1 s p i a</m> </lm> </ana>
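Two points from this section, the distribution of claims within an <ana> and the weighting of @cert by frequency, can be sketched in Python. The sketch makes several simplifying assumptions that are not part of the TAN specification: the <ana> is given without a namespace, only direct-child <tok>s that carry @val are handled (no <group>, @ref, @pos, or @rgx), and a missing @cert on <m> falls back to one on <lm> and then to 1.0. The names expand_ana and certs_from_counts are hypothetical.

```python
import xml.etree.ElementTree as ET

def expand_ana(ana_xml):
    """List every (token, lexeme, morphological code, certainty) asserted by one <ana>."""
    ana = ET.fromstring(ana_xml)
    toks = [t.get("val") for t in ana.findall("tok")]
    assertions = []
    for lm in ana.findall("lm"):
        lexemes = [l.text for l in lm.findall("l")]
        morphs = [
            (m.text, float(m.get("cert", lm.get("cert", "1"))))
            for m in lm.findall("m")
        ]
        for tok in toks:
            # If <lm> has no <l>, the token value itself serves as the lexeme.
            for lexeme in (lexemes or [tok]):
                for code, cert in morphs:
                    assertions.append((tok, lexeme, code, cert))
    return assertions

def certs_from_counts(counts):
    """Turn raw frequency counts of competing profiles into @cert weights."""
    total = sum(counts.values())
    return {profile: n / total for profile, n in counts.items()}

ana = """<ana>
  <tok val="ABERRO"/><tok val="Aberro"/><tok val="aberro"/>
  <lm><l>aberro</l><m>v - 1 s p i a</m></lm>
</ana>"""
for assertion in expand_ana(ana):
    print(assertion)
```

A simple proportional weighting such as certs_from_counts is only one option; an algorithm with better evidence should set @cert accordingly.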

Class-3 TAN Files, Varia This chapter provides general background to class-3 TAN files, which are devoted to formats that do not fit the other two classes. For detailed discussion of specific elements and attributes, see .

Vocabulary (<code>TAN-voc</code>) All too often, a project has a set of vocabulary it draws from time and again. Repeating it in every file can be both tedious and treacherous. If a project with hundreds of TAN files decides to change or augment its vocabulary it could take a long time to find and make all the changes, everywhere and consistently. The TAN-voc format addresses that problem. It is intended to allow a project to define, edit, and augment the IRI + name patterns for recurrent vocabulary. TAN includes several standard TAN-voc files under the subdirectory vocabularies, supporting commonly used concepts such as token definitions, div types, licenses, and many more. For a complete list of predefined TAN keywords, see . It is quite common for a person or team to build vocabulary items gradually while developing a corpus, which means that TAN-voc files tend to change and grow. You can organize your vocabulary in whatever manner makes sense. You might create one large TAN-voc file that has all your project's vocabulary. Or you might break out the vocabulary, one file per type. Each approach has strengths and weaknesses. If you break your vocabulary into many files, you should designate one of them as your point of main import, and include the other TAN-voc files via <inclusion>s (along with <group include="[IDREFS]"/> or <item include="[IDREFS]"/>, pointing to the IDrefs of the included TAN-voc files). Doing so prevents you from having to insert numerous <vocabulary>s in your other TAN files. For more details on how this format relates to other TAN formats, see .

Root Element and Head A TAN-voc file has <TAN-voc> as the root element. The <vocabulary-key> of a TAN-voc file takes, in addition to core vocabulary items, any number of <group-type>s. A TAN-voc file may draw directly from the vocabulary in its body, as if it were referring to itself via <vocabulary>.

Data (<code><link linkend="element-body"><body></link></code>) The <body> of a TAN-voc file consists simply of <item>s or <verb>s, perhaps gathered into groups via <group> or @group. These groups have, at present, no effect upon other TAN files that use them, but they have been valuable in certain applications. For example, the standard TAN-voc file for <div-type> (vocabularies/div-types.TAN-voc.xml) groups textual division types into a rudimentary typology that allows applications to be designed to decide programmatically whether a particular division should be treated as a block or inline element, or whether it should be indented. The @affects-attribute or @affects-element, both weakly inheritable, defines the scope of the vocabulary items, i.e., what elements or attributes can the items be legitimately used for. The vocabulary item will be eligible only for specified attributes or elements. Nearly all <item>s in a TAN-voc file contain the IRI + name pattern or a derived pattern. The only exceptions are <item>s pertaining to token definitions, which instead of <IRI>s take <token-definition>s. See . <verb> includes, in addition to the IRI + name pattern, the option to have <constraints> added. Those constraints define what components are permitted in any <claim> that uses the <verb>. At this time, verb constraints are an experimental feature. Only those constraints that mirror standard TAN vocabulary for verbs, vocabularies/verbs.TAN-voc.xml, will be supported during validation. Study that file for examples of how to build a <verb>. See on the use of verbs in a TAN-A file.
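What it means for <constraints> to define the components permitted in a <claim> can be pictured with a small Python sketch. It is not the Schematron validation routine: check_claim is a hypothetical name, constraints are reduced to a plain mapping of component name to status, the item-type and datatype checks are left out, and the status value "optional" is assumed from the prose (only "required" and "disallowed" appear in the examples in these guidelines).

```python
def check_claim(components, constraints):
    """Compare the components present in a claim against a verb's constraints.
    `components` is the set of component names the claim supplies;
    `constraints` maps a component name to 'required', 'optional', or 'disallowed'."""
    problems = []
    for name, status in constraints.items():
        present = name in components
        if status == "required" and not present:
            problems.append(f"{name} is required")
        elif status == "disallowed" and present:
            problems.append(f"{name} is not allowed")
    return problems

# Statuses taken from the 'lacks text' verb shown in the TAN-A chapter.
lacks_text = {"subject": "required", "object": "disallowed", "at-ref": "required"}

print(check_claim({"subject", "at-ref"}, lacks_text))   # no problems
print(check_claim({"subject", "object"}, lacks_text))   # two problems
```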

Morphological Concepts and Patterns (<code>TAN-mor</code>) TAN-mor files are used to delineate the morphological characteristics or features of a given language, to assign codes to those features, and to define rules governing the application of those codes. It is a kind of schema language for the grammar of human languages. The format allows specificity, flexibility, and responsiveness. Grammatical rules may be constructed to return warnings and error messages to users who use a code or pattern incorrectly, or not in accordance with best practices. Such rules may be qualified, or made contingent upon certain conditions. This chapter should be read in close conjunction with .

Principles and Assumptions Certain assumptions and recommendations are made regarding morphology files, complementing the more general ones; see . TAN-mor files are restricted exclusively to describing the categories and rules for the grammar of a natural language. Editors of these files should be well versed with the grammar of the languages they are describing, and generally acquainted with how the grammars of comparable languages work. The TAN-mor format has been designed under the assumption that patterns of word inflection and formation can be categorized, classified, named, and described. It has also been assumed that scholars may reasonably differ, perhaps radically, on how grammatical features should be defined and applied. TAN-mor allows scholars to declare clearly their operative assumptions and views. It is up to other users to decide whether or not to adopt them. The TAN-mor format has also been designed to cater to two different approaches to morphological codes: categorized or uncategorized. Categorized codes are interpreted according to position. a b c would mean something different than c b a. For example, Perseus () has traditionally categorized codes for morphological analysis of Greek, Latin, and other highly inflected languages. Every code has ten positions, each one corresponding to a major grammatical category, with the first two being the major and minor parts of speech, and the subsequent categories devoted to person, number, tense, and so forth. Each word that is analyzed must have a value, even if a hyphen or null. A d in one position means something different from a d in another. Uncategorized codes, on the other hand, assign one unique code to each grammatical feature. In this approach, codes may be combined and arranged at will. a b c would be identical to c b a. 
This approach is viable for any language (including highly inflected ones such as Greek or Latin), but it is in practice most often applied to languages that are not highly inflected, e.g., the Brown and Penn sets for English. TAN-mor morphological codes may not include either the space or the hyphen, and unlike IDrefs, they are case insensitive. For example, the codes NOUN and noun are interchangeable.

Root Element and Header The root element of a morphological rule file is <TAN-mor>. Zero or more <source>s refer to the grammars or related works that account for the morphological rules. If the categories, codes, and rules are not based upon any published work, then <source> may be omitted. Any TAN-mor file without a source may be inferred to be based upon the personal knowledge of the persons or organizations identified in <file-resp>. A language declaration is made in the header: one or more <for-lang>s.

Data (<body>) The <body> of a TAN-mor file takes the customary optional attributes found in other TAN files (see ). <body> contains interleaved rules and grammatical codes, either categorized or not. Grammatical rules consist of a series of <rule>s, perhaps filtered by attribute tests, and perhaps filtered by children <where>s with attribute tests. These tests are evaluated against the context of the various <m>s in a dependent TAN-A-lm file. Attribute tests are as follows:

@m-matches (regular expression): <m> matches the pattern.
@tok-matches (regular expression): one of the values of <tok> in the given <ana> matches the pattern.
@m-has-codes (space-delimited strings): <m> has the specified feature codes.
@m-has-how-many-codes (integer): <m> has the given number of feature codes.

If all the attributes in a <rule> or any of its children <where>s evaluate true against a context, then the process allows the actual rules to be evaluated. Those rules are found in the enclosed <assert>s or <report>s, which declare rules that must be followed, or must never be followed, by any dependent TAN-A-lm file. An <assert> or <report> will be checked only if the conditions declared by the attributes in the enclosing <rule> or <where> are met. An <assert> also has one or more of the truth conditions above. If the test proves false in a given <m> then the <m> will be marked as erroneous and the message included by the <assert> should be returned. <report> has the same effect, but the test looks for the opposite boolean value: the error and message will be returned only if the test proves true.

Mixed with the rules are codes, either categorized or not. If categorized, there are zero or more <category>s. Each one sorts <code>s into groups, assigning them <val>s that are unique within the <category>. Sequence is important. The first <category> defines the features allowed in the first code position, the second in the second, and so forth. If not categorized, then there are simply one or more <code>s. Each <code> has a @feature that points to one or more vocabulary items for a grammatical feature, either by IDref or by name. TAN has a standard vocabulary file for grammatical features: vocabularies/features.TAN-voc.xml. This vocabulary file encodes 746 grammatical features declared in the OLiA Reference Model for Morphology, Morphosyntax and Syntax (). See . <code> must have a <val>, which contains the actual code used, and it may take one or more <desc>s, to explain how the grammatical features should be interpreted for a given language. This is the ideal place to provide examples. In addition to the examples below, see the sample TAN-mor files in the examples directory.

Examples of rules and codes

   <rule m-has-how-many-codes="2-10">
      <report m-matches="^c">A conjunction has no other inflectional properties.</report>
      <report m-matches="^r">A preposition has no other inflectional properties.</report>
      <report m-matches="^i">An interjection has no other inflectional properties.</report>
      <report m-matches="^y">An acronym has no other inflectional properties.</report>
   </rule>
   . . . . . . .
   <rule m-matches="^. i">
      <assert m-matches="^[dp]">An interrogative must be either a determiner (d) or a pronoun (p).</assert>
   </rule>
   . . . . . .
   <code feature="accusative"><val>accusative</val></code>
   <code feature="nominative"><val>nominative</val></code>
   <code feature="case_dative"><val>dative</val></code>
   <code feature="case_genitive"><val>genitive</val></code>
   <code feature="case_vocative"><val>vocative</val></code>
   . . . . . .
   <category feature="feature_person">
      <code feature="first"><val>1</val></code>
      <code feature="second"><val>2</val></code>
      <code feature="third"><val>3</val></code>
   </category>

TAN Catalog Files (<code>collection</code>) TAN catalog files are used to locate relevant TAN files and to support the XSLT function collection(). They catalog or index any TAN files within a local directory and perhaps its subdirectories. These catalog files must always be named catalog.tan.xml. They depart from all other TAN files in their structure. They have no namespace. They have neither body nor head. Rather, they are patterned off the catalog.xml description provided by Saxonica (). Any XML file passed to the stylesheet

applications/create/create TAN catalog file.xsl

will automatically generate one of these files, cataloging all the files in the local directory. The root element of a catalog file is <collection>, with children <doc>s that hold simple metadata about the TAN files that are in a directory and its subdirectories. Only TAN files may be registered in a <doc>. A <doc> may include other material such as each file's resolved <head>, but this is not mandated.
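As a rough illustration only, a catalog file of the kind described above might look like the following sketch. The href attribute is an assumption based on the Saxon catalog pattern these files are modeled on, the file paths are hypothetical, and any further metadata that the TAN stylesheet writes into each <doc> is omitted.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- catalog.tan.xml: hypothetical sketch. No namespace, no head, no body. -->
<collection>
   <!-- One <doc> per TAN file in the directory or its subdirectories.
        The href attribute is assumed from the Saxon catalog pattern. -->
   <doc href="ar.cat.grc.1949.minio-paluello.ref-logical.xml"/>
   <doc href="translations/apocr.eng.kjv.1760.xml"/>
</collection>
```

A file actually generated by the stylesheet is the authority on what each <doc> contains; consult one before relying on this shape.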

Using the Text Alignment Network Working with TAN files This chapter presents ways to manage, create, edit, and share TAN files. These suggestions, based upon the experience of users, are both brief and general. To get into specifics, read the other chapters in this part of the guidelines, as well as the appendixes.

Installation and local setup The TAN suite can be downloaded from a master data repository listed at . The project has been developed using the version-control software Git. Whether you download the files directly or you use Git, place the TAN code wherever is most convenient on your computer. No extra steps are necessary. Once you've downloaded the files, you have everything you need. The one exception pertains to the output/js directory, which has Javascript libraries that are designed to handle certain types of output from TAN applications. Documentation in a TAN application will let you know what Javascript dependencies are required. Unlike many other applications, you do not install the TAN suite, and you do not have to put it in a specific place on your local drive. There is no executable file in the suite. You will work with TAN through Oxygen, another XML editor, a text editor, or (if you are a power user) the command line. You will be creating and editing TAN files. Those files may be set up in whatever directory structure you prefer. Because TAN files are part of a network, they are meant to be shared and interlinked. So it is beneficial to develop predictable directory structures. However you organize your TAN files, keep them separate from the suite of core TAN files. Many TAN projects will involve dozens of versions of a particular work, and it is easy to get confused as to what file does what. Naming files becomes a challenge (the filename, not the @id, on which see ). In projects with many text versions, it is recommended that your names for class-1 files start with an acronym or short abbreviation for the author and work, followed by the language code, the last name of the editor/author of the scriptum, the date when the scriptum was created or published. If you have a transcription that has been redivided into multiple TAN files linked to each other via <redivision>, the reference system might need to be mentioned in the filename. 
Some suggestive examples: ar.cat.grc.1949.minio-paluello.ref-logical.xml: Aristotle's Categories, in Greek, from the 1949 edition by Minio Paluello, following a reference system based on semantic units (paragraphs, sentences, independent clauses). apocr.eng.kjv.1760.xml: apocrypha, English, King James Version, 1760 edition. If the file adopted an unusual reference system, that would be important to include in the name. tlg0059.tlg031.perseus-grc1-Pl.Ti.xml: Plato's Timaeus in Greek. This filename has some duplication in that the catalog number tlg0059 already implies Pl and tlg031, Ti, but only an elite few know the meaning of the numerical codes used by the Thesaurus Linguae Graecae. pl.ti.grc.1905.burnet.stephanus.xml: Plato's Timaeus in Greek, Burnet's 1905 edition divided into a system that approximates Stephanus numbers. Many classicists refer to Stephanus numbers in Plato's corpus and Bekker numbers in Aristotle's as canonical, as if the systems are immutable and unambiguous. But any edition that claims to follow Stephanus or Bekker numbers always makes slight adjustments to that system. Words do not always break exactly where they do in the 19th-century edition, and words and phrases here and there get transposed, inserted, or deleted, inevitably throwing off the lineation. Making one's edition conform exactly to the original line numbers is frequently a fool's errand. Some TAN applications, such as , use filenames to order output. If you wish your class-1 files to be read in chronological order according to source, then it is a good practice to put the date in ISO form (YYYY(-MM(-DD)?)?), placed before any alphabetizable elements that are less important. In sum, a good sequence for ordering components in a filename would be: collection, work, language/version, date, editor/author, reference system. Class-2 files are tougher. They unite multiple files and concepts, so comprehensive filenames could become very long or unpredictably structured. 
One approach is to make sure that each class-2 file is given a brief but meaningful name that points to the research question that motivated its creation. Some examples: ar.cat.grc.1949.minio-paluello-sem-TAN-LM-sample.xml: a sample of lexico-morphological data for Aristotle's Categories, in Greek. Each source-specific TAN-A-lm file has no more than one source, so including the source in the filename does not pose a challenge. nt.grc-syr.selections.TAN-A-tok.xml: a selection of word-for-word correspondences between the Syriac and Greek New Testaments. plato.general.TAN-A.xml: a general alignment and annotation file concerning Plato's works. Class-3 filenames are a bit easier. It is recommended that TAN-mor files begin with the language code then an acronym for the person or group responsible for creating the rules and codes. TAN-voc files are written generally to serve a specific project or collection, so the collection name and the type of vocabulary should suffice. Examples: eng.example.com,2014.1.xml: tagging scheme #1 for English, by the owner of the domain example.com in 2014. ar.cat.general.TAN-voc.xml: general vocabulary items serving a project dealing with Aristotle's Categories. If you have a local copy of someone else's TAN collection, and you wish to create TAN files that depend on them, you will in all likelihood use relative URLs pointing to copies of the files stored locally. If those files have <master-location>s pointing to their master copies, you should occasionally validate them, to see if there have been any updates. If you need to move a TAN file from one directory to another, you should think about any internal links that might need to be updated. A standard TAN utility, , will copy a file for you and update any relative values of @href. That application does not delete the old file, because file deletion is treated as a security risk in XSLT.

Working with Oxygen XML Editor If you use an advanced XML editor such as Oxygen, you should edit your TAN collection through a project file, which will help you easily administer your TAN files and validate them automatically. Included with the standard TAN suite is a basic Oxygen project file, TAN.xpr. Use it as-is, or make a copy and configure it to your tastes. You will find that under Configure Transformation Scenarios there are preinstalled generic options for the standard TAN utilities and applications. When you open a TAN file in Author mode, you will find a variety of editing tools, primarily for class-1 files. Browse the options in the menu, the toolbars, and the context-click menu, to see what is possible. In a future version of TAN, more documentation will be provided on how to use these tools. The project file discussed above relies upon an Oxygen framework file, tan.frameworks, which drives the functionality of the project. If you have another project already underway, you can incorporate the tan.frameworks file directly, combining it with your other Oxygen tools.

Creating and populating TAN files TAN is a representational format. Every TAN file models some source. If those sources are non-digital, it is a relatively straightforward task to create and populate a TAN file. Just start editing, using a template (e.g., a file from the examples directory). In some cases, you might benefit by starting with an algorithm. For example, optical character recognition (OCR) on an edition might give you a dirty but useful start for a TAN-T file. Applying OCR to a printed index of quotations might be the first step to a TAN-A file. Despite the computer's assistance, most of the time will be spent correcting the conversion. Thoughtful attention is needed to make these files suitable for use. In many other cases, you want to take something that already exists digitally and convert it into a TAN format. Many times, when you find a Word file, a web page, or a plain text file that can serve as the basis for a TAN file, the first impulse is to copy the desired content, paste it into the body of a new TAN file, then manually adjust and correct it. That solution is quick and easy, but short-sighted. You may find, hours into the task, that you made a major mistake so early in the process that you cannot backtrack. Perhaps you have accidentally deleted all punctuation when you didn't mean to. Or you eliminated line breaks that you didn't realize at the time were useful signals about where <div>s should be separated. Even if all goes well, after all that hard work you might discover that the pre-TAN data sources you started out with have been updated, and other things have been corrected. If any significant time has elapsed, you may have forgotten what procedure you followed to convert the data. And even if you do remember, you will have to repeat the steps, and dread the day when those pre-TAN sources are updated yet again. Save yourself time and hassle. Stop fixing files by hand.
Instead, build a system to convert the files. Create an automated or semiautomated workflow that can be applied when needed, so that pre-TAN files can be channeled at will into your TAN library. This approach to the editorial task takes some extra investment at the outset, but in the long run it can save you many hours of labor. A very useful utility is , which allows you to create a list of changes to be made to a particular document, to convert it to TAN-T or TAN-TEI (or even generic TEI). Or if you or a project member has experience in XSLT, develop your own stylesheets. When you find mistakes such as those described above, no harm is done. You can simply adjust the Body Builder configuration or XSLT file and re-run your process, each time getting better and better results. This approach requires extra work, initially. Establishing a stable transformation process can be time-consuming, since it requires repeated sequences of trial, error, and diagnosis. But the investment pays off in the long run, especially if you are dealing with dozens, hundreds, or thousands of files. The routines you write for one set of files might be useful for the next.

TAN validation

The process A TAN file is validated when it, along with its associated TAN schemas, is passed to a validation engine. Validation can be set up either by pointing explicitly to the schemas within a TAN file (via <?xml-model ?> statements in the prolog), or by setting up an Oxygen project or framework to automatically apply the schemas to TAN files (see ). There are two types of TAN validation. First, the file structure is checked against RELAX-NG files that define the attributes, elements, and patterns that are allowed or required in a given TAN format. These files are kept in the schemas project subdirectory, according to format name. If you are editing a TAN-T file, for example, its RELAX-NG schema is schemas/TAN-T.rnc. The RELAX-NG files are written principally in the compact syntax (.rnc), then converted to XML syntax (.rng). The TAN-TEI format is an exception. Behind the schema schemas/TAN-TEI.rnc is a master file schemas/TAN-TEI.odd. This file, linked as it is with the other RELAX-NG files, is processed by TEI stylesheets to generate the master TAN-TEI.rnc and TAN-TEI.rng files that validate TAN-TEI files. The ODD file is processed against TEI All, the largest of the TEI formats, in the version available at the time of the release of a given TAN version. The second type of validation uses Schematron to apply rules that cannot be expressed in RELAX-NG, e.g., no @when should have a date in the future. More than one hundred types of errors are checked during Schematron validation. For a comprehensive list see and . Some of these errors can be quite time-consuming for a computer to check. For example, if a class-1 file has a <redivision>, the texts of the two files should be identical. On short texts, the comparison can be made in seconds; on longer ones it might take minutes (see next section, on efficiency). Therefore Schematron validation allows three different levels: terse, normal, and verbose. The names reflect not only how long each phase takes but also how much feedback is provided. The Schematron files themselves are rather small. The majority of the work is done by the TAN function library, which takes the file, resolves it, and expands it, inserting errors and help messages along the way. A greatly reduced version of the expanded file, containing only warnings and errors, is then passed back to the Schematron processor as a global variable. The Schematron processor returns as messages any errors or warnings found in the generated file, and any suggested corrections as Schematron Quick Fixes. For more details about the TAN validation process, see .
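The explicit approach, pointing to the schemas from the prolog, might look like the sketch below. It is a prolog fragment, not a complete file. The relative path ../schemas/ and the Schematron file name TAN-T.sch are assumptions made for illustration, and so is the use of the phase pseudo-attribute to request the terse level; only schemas/TAN-T.rnc is named in these guidelines.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Structural validation: the RELAX-NG schema (compact syntax) for TAN-T.
     The relative path is hypothetical and depends on where the file is stored. -->
<?xml-model href="../schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
<!-- Rule-based validation: Schematron. The file name TAN-T.sch and the
     phase value are assumptions, not taken from the guidelines. -->
<?xml-model href="../schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron" phase="terse"?>
<!-- The root element of the TAN-T file would follow here. -->
```

If you rely on an Oxygen project or framework instead, these processing instructions are unnecessary, because the editor associates the schemas for you.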

Efficiency TAN's Schematron validation specifies a process that is much more computationally intensive than is its RELAX-NG counterpart. The longer and more complex your TAN file and its dependencies, the longer it will take to validate. Files such as the Ring-a-roses examples in the examples subdirectory will take a split second to validate, but a TAN-T file of the Old Testament of the King James Version has been known to take about 25 seconds to validate in the normal phase, and the whole Bible, about a minute. A TAN-A-lm file with a full morphological analysis of a very long TAN-T file will take a much longer time to validate. Tests were performed on a TAN-A file that had three very large TAN-T sources (each about 1.6 MB and 8,100 elements). If the TAN-A file had 125 claims, Schematron validation under the normal phase took about 13 seconds (run on Oxygen 22.1 on a Windows 10 laptop with an Intel i5-8250U @ 1.60GHz). When the number of claims was expanded to 546, the same process took 63 seconds. When the file had 5,421 claims, the file took 78 minutes, 45 seconds to validate. Much of the extra time is due to the Schematron evaluation process, not the preparatory work performed by the TAN function library. The library component of the three tests above took up 8.3 seconds, 27.1 seconds, and 23 minutes 57 seconds, respectively. The time complexity of the Schematron component grows faster than does that of the XSLT. The figures above are a very significant improvement over the time required in the 2018 version, and no doubt future versions of TAN will bring optimizations to the validation process. Nevertheless, you may need to make decisions that pit speed against convenience. If you want validation to be quick, break files into smaller ones, perhaps to be joined later in a single TAN file via <inclusion>s. Validating ten component files each with ten thousand elements will take less time in aggregate than validating one long file with one hundred thousand elements.
Had the example TAN-A file mentioned above been split into 43 different files, the time required for validating the entire collection would have been reduced by 88%.
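The 88% figure can be approximated from the timings reported above. The working below rests on an assumption not stated explicitly in the text: that each of the 43 files would hold about 126 claims and would validate in roughly the 13 seconds measured for the 125-claim file.

```latex
\begin{align*}
\text{claims per file} &\approx 5421 / 43 \approx 126 \\
\text{time for 43 files} &\approx 43 \times 13\,\text{s} = 559\,\text{s} \\
\text{time for the single file} &= 78\,\text{min}\ 45\,\text{s} = 4725\,\text{s} \\
\text{reduction} &\approx 1 - \frac{559}{4725} \approx 0.88
\end{align*}
```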

Sharing TAN files TAN files have been designed to be shared and linked, just like any network of files. Most often, TAN files will be created and distributed as collections, not single files. One way to distribute a collection is to make it available as a repository via Git or some other version control software (VCS). This approach has many advantages. You can collaborate with a wide variety of people, and preserve an editorial history that allows you to branch or backtrack, if needed. VCS features and tools are extremely fast and useful. Collections may also be distributed through shared syncing services (e.g., Drive, Box, or Dropbox), or put on a Web server. In the latter case, it may be difficult for users to browse or download your collection of TAN files wholesale. In that case, you may wish to expose the collection as a compressed ZIP archive. This saves on your server's bandwidth, and it still exposes the files for XML processing. But a ZIP archive is not suitable for linking from one TAN file to another, nor is it appropriate as a target of <master-location>. Unpacking a compressed file requires writing to the disk, which is treated as a security risk during validation. Such zipped archives are good ways to distribute a collection, but they should not be used as a primary repository or a master location. When you share a TAN file, make sure to include its dependencies, the files pointed to by <vocabulary> or <inclusion>. If you are simply trying to email a single file, you could send a resolved version, which does not require any other dependencies (see ).

Using TAN Applications and Utilities TAN files are suited for dozens of types of applications. A few have been developed and successfully tested on select projects. The most mature of these have been provided in the subdirectories applications and utilities. Utilities are designed to assist in importing, exporting, creating, and editing TAN files. They tend to support straightforward tasks, and the code is relatively stable. Applications, on the other hand, support study and research. Most of these take a set of TAN files, process them, and create interactive, dynamic HTML files that let you study and analyze textual features and relationships. Applications can have quite complicated code bases, and tend to have features that are not fully supported, or are in the planning phase. TAN utilities and applications are written in XSLT (XSL Transformations), version 3.0. XSL, which stands for Extensible Stylesheet Language, was the predecessor language. XSLT is very powerful, and has a distinctive syntax and design. Many people do not know how even to begin to use it. Even some seasoned programmers approaching XSLT for the first time can find it baffling or impenetrable. An XSLT application is rather different from others that may be more familiar to you. This chapter begins with a basic orientation to XSLT. You may not be ready to write anything in XSLT, but you can begin to read and understand an XSLT file. We then look at how to run an XSLT application, and then look at the standard TAN utilities and applications.

First things to know about XSLT

The process In most computer applications, the expected rules are rather straightforward. Given zero or more inputs, zero or more outputs are returned. Many times the application is driven by a graphical user interface (GUI), to allow the user to configure the application. XSLT applications do not have a GUI. They also have a somewhat different approach to input and output. In the classic approach to XSLT, the input consists of an XSLT stylesheet and an XML file, passed to a processor. But there is opportunity for secondary input. And classically there is one output, but XSLT provides the opportunity to create secondary outputs. The basic model is depicted here: The classic view presented here does not take into account another way of configuring an XSLT application, where a particular starting point is designated, the initial template. In those cases, primary input is unnecessary.
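The initial-template approach mentioned above can be illustrated with a minimal sketch. XSLT 3.0 reserves the name xsl:initial-template for a default entry point; the <message> element and its text are hypothetical, invented for this example.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
   <!-- The conventional XSLT 3.0 starting point when no primary input is supplied -->
   <xsl:template name="xsl:initial-template">
      <!-- Hypothetical primary output -->
      <message>This result was produced without any primary input.</message>
   </xsl:template>
</xsl:stylesheet>
```

How the initial template is invoked depends on the processor or editor you use; consult its documentation for the relevant option.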

The classic XSLT process In the classic XSLT process, there are three key requirements: an XML file, to catalyze the process; a master XSLT file, to declare the rules that should be followed; an XSLT processor. The process begins, actually, with the processor, which is normally given URLs that specify where to find the input, the stylesheet, and where to place the output. The processor fetches the XSLT stylesheet, and looks for any associated components. After compiling the master stylesheet and its dependencies, the rules are applied to the catalyzing XML file. Along the way, the processor may fetch secondary input documents, if the XSLT file so instructs. After all the rules have been applied, the processor saves the primary result document, if there is one, to the specified target URL. If the XSLT rules state that secondary result documents should be saved at certain locations, the processor does so. Therefore, in any XSLT operation, there are really two possible types of input and two types of output. We use the terms primary input for the catalyzing XML file and secondary input for input that is added during the process. We use the term primary output for the main result tree and secondary output for any other output created along the way. The terms primary and secondary refer only to their position in the process, not their importance to the application. Indeed, there are XSLT applications where the secondary input and secondary output are far more important than the catalyzing input or primary output. Sometimes the primary input does not matter at all, and sometimes there is no primary output. You will normally have direct control over the primary input, because you will need to select an XML file to catalyze the process. But any control you might exercise over the secondary input could be hidden. The application might derive secondary input based upon your primary input, or it might provide parameters, to allow you to control the secondary input. Likewise, you normally have full control over where the primary output should go. But you may not have that kind of control over the secondary output. When you get an XSLT file, try to understand first of all what kinds of input are expected, and what types of output are returned, and where. In general, if there is not good documentation and the XSLT does not come from a trusted source, do not try to run it.
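The four kinds of input and output described above can be seen together in one small stylesheet. This is a sketch under stated assumptions: the parameter name secondary-input-url, the file names notes.xml and log.xml, and the element names in the output are all hypothetical, and notes.xml would need to exist next to the stylesheet for the sketch to run.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
   <xsl:output indent="yes"/>

   <!-- A parameter that lets the user steer the secondary input (hypothetical default) -->
   <xsl:param name="secondary-input-url" select="'notes.xml'"/>

   <!-- This rule is catalyzed by the document node of the primary input -->
   <xsl:template match="/">
      <!-- Secondary input: fetched during the process. A relative URL here
           is resolved against the location of the stylesheet. -->
      <xsl:variable name="secondary-input" select="doc($secondary-input-url)"/>

      <!-- Primary output: whatever is written directly to the main result tree -->
      <summary>
         <primary-root><xsl:value-of select="name(*)"/></primary-root>
         <secondary-root><xsl:value-of select="name($secondary-input/*)"/></secondary-root>
      </summary>

      <!-- Secondary output: saved to its own location, relative to the primary output -->
      <xsl:result-document href="log.xml">
         <log>Processed <xsl:value-of select="base-uri(.)"/></log>
      </xsl:result-document>
   </xsl:template>
</xsl:stylesheet>
```

Reading a stylesheet for xsl:param, doc(), and xsl:result-document in this way is a quick first check on what it will read and where it will write.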

Syntax An XSLT file is itself an XML document, and can be treated in every way as an XML document. If there is something you can do to an XML document, you can do it to an XSLT file too. The XML syntax makes the code somewhat more verbose than the syntax of other languages. Many of the instructions are placed in elements, which frequently have opening and closing tags. Unless otherwise specified, white space is flexible, and the document can be reformatted and indented as one likes. Most XSLT files are indented, but in most cases that indentation can be changed or removed without affecting the output. XML in general uses namespaces, to allow mixed vocabularies. So too, XSLT files can interleave elements from different namespaces. In general, most XSLT files do not define a default namespace: that is up to the designer to do. All the XSLT elements are in the namespace http://www.w3.org/1999/XSL/Transform, customarily bound to the prefix xsl. Because an XSLT file is itself XML, it can be designed to be the primary input of an XSLT process, even its own. Running an XSLT file against itself can be useful in cases where the primary input is irrelevant.
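A short sketch may make the interleaving of namespaces concrete. Here the designer has chosen to declare XHTML as the default namespace, so unprefixed elements belong to XHTML while prefixed ones are XSLT instructions. The page content is hypothetical.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns="http://www.w3.org/1999/xhtml" version="3.0">
   <!-- Elements with the xsl prefix are instructions to the processor -->
   <xsl:template match="/">
      <!-- Unprefixed elements fall in the default (XHTML) namespace
           and are copied to the output as literal result elements -->
      <html>
         <head><title>Sample</title></head>
         <body>
            <p>The input's root element is named <xsl:value-of select="name(*)"/>.</p>
         </body>
      </html>
   </xsl:template>
</xsl:stylesheet>
```

Because this stylesheet is itself XML, it could serve as its own primary input, in which case the paragraph would report the name xsl:stylesheet.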

Modular design An XSLT file may invoke other XSLT files, or be invoked by them, through the <xsl:import> and <xsl:include> instructions. Inclusions and imports are recursive: the processor looks not just for the ones it imports/includes, but the ones they import/include, and so forth. The modular approach to XSLT allows developers to be more efficient and effective when writing code. Routines that serve one process well can serve another. But it also means that when you first open up an XSLT file, you may not understand what it does until you trace the chain of <xsl:import> and <xsl:include> instructions, and find all the stylesheets it depends on. That process can be cumbersome, but it is straightforward. More challenging is asking yourself whether the file you began with is a master stylesheet (the intended starting point for a process), or if it is itself a dependency. You may not be able to tell, without documentation. Tracing these lines of dependence is important, because you need to find the appropriate starting point, and understand how it relates to the network of XSLT files.
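The top of a master stylesheet often looks something like the following sketch. The two module file names are hypothetical and would need to exist for the stylesheet to compile; they are here only to show where to start tracing dependencies.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
   <!-- An imported module has lower precedence: declarations in this file
        override conflicting ones in the imported file (hypothetical path) -->
   <xsl:import href="shared/default-rules.xsl"/>
   <!-- An included module is treated as if its declarations were written
        in this file (hypothetical path) -->
   <xsl:include href="shared/text-utilities.xsl"/>

   <!-- A rule of this file's own, handing the input on to the other rules -->
   <xsl:template match="/">
      <xsl:apply-templates/>
   </xsl:template>
</xsl:stylesheet>
```

Each of the two referenced files may have its own <xsl:import> and <xsl:include> instructions, which is why the tracing has to be repeated for every module you open.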

Declarative statements In most programming languages, you write a list of things for the computer to do, in a specified order, governed by conditional branching. This list-like approach to programming is called imperative programming. XSLT has imperative components, but at its heart, it is a declarative programming language. That is, an XSLT programmer writes not a list of steps to be followed but rather a set of rules or principles that should be observed. It is up to the processor to determine the most efficient path to honor those rules or principles. Imperative and declarative programming can be compared to real-world examples. Suppose you have a pile of candies that need to be sorted. Imperative programming is like telling a child: get one candy; if it is like such-and-such, put it here; repeat. Declarative programming is like telling that same child, I do not care how you do it, but make sure that the final groups look like such-and-such. If you are familiar with Cascading Style Sheets (CSS) you might appreciate better how the XSLT programmer approaches a task. In CSS, styling instructions are provided by selector patterns that match certain elements within the HTML file. CSS instructions can frequently be placed in different groups and orders, and with different levels of specificity, to infer priority. It is up to the browser to take those rules and find the most efficient way to apply the styles. Such a declarative approach allows the writer of CSS to efficiently write, edit, and maintain some rather complex code. Because of its declarative approach, the order of an XSLT's root element children is flexible. Most often, order does not matter. The children of the root element, called declarations, are special, because they stipulate the rules or principles that should be followed. All of the declarations of the stylesheet's modules are also taken into consideration. 
This means that when you are reading a particular section of an XSLT file, you might think you understand what is being done, but there may be declarations in other parts of the file or its inclusions/imports that affect whether the particular component you are looking at is called, or in what priority. As a general rule of thumb, when you read an XSLT file to understand what it does, do not put much importance on the order of its declarations. They will not be followed in that order. There are cases where order is important, but coming freshly to an XSLT file, try to get a bird's-eye overview of all the components. Look at all the declarations, wherever they are found. As you read, don't look for the application's steps. Try to understand the intended outcome.
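A small sketch may make the declarative approach more concrete. The two template rules below are hypothetical and assume an input with <div> elements in no namespace. They can be written in either order with the same result, because the processor chooses between them by the specificity of the match pattern, not by their position in the file.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
    <!-- General rule: any <div> becomes a <section> -->
    <xsl:template match="div">
        <section><xsl:apply-templates/></section>
    </xsl:template>
    <!-- More specific rule: the predicate gives this pattern a higher
         default priority, so it wins for chapter divs even though it
         comes second -->
    <xsl:template match="div[@type = 'chapter']">
        <chapter><xsl:apply-templates/></chapter>
    </xsl:template>
</xsl:stylesheet>
```

Neither rule says when it should run. Each merely declares what should happen whenever a matching node is encountered, much as a CSS rule does.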

Variables and parameters In most programming languages, you can write something like the following pseudocode... x = 1; x = x - 1; return x ...and expect the output 0. The variable x starts with the value 1, but then changes, because variables are mutable. In XSLT, variables and parameters are immutable. You cannot change the value of a variable or parameter. A variable can be destroyed (and along with it, its value), and then a new instantiation of the variable can be created, but once again, within its life (scope), it does not change. If you see two <xsl:param> or two <xsl:variable> instructions that create variables with the same name, they are in different scopes (or the XSLT is invalid). Both variables and parameters might be in a namespace. If there is a colon in the name, the variable or parameter is bound to a particular namespace. Check the prefix to see its namespace. As a user of an XSLT stylesheet, you should not worry too much about any XSLT variables. Certainly, you can change them if you want, but at that point you are stepping into the role of developer. We assume here you are interested primarily in using, not altering, an XSLT application. You should focus, instead, upon parameters, but only a certain kind: global, relevant parameters. Global parameters are found exclusively as children of a root element. That is, they are declarations (see previous section). Any parameters that are more deeply nested are local parameters, and you shouldn't change them. Not all global parameters are relevant. If you have a master stylesheet that includes another one, that stylesheet may have global parameters that are designed to accommodate some other including XSLT application. Normally, you will know which global parameters are relevant for your purposes only by studying the file's documentation, or its code. Every global parameter is a developer's invitation to the user to configure the XSLT application.
Some parameters exercise an enormous influence over the type of output; others have no effect whatsoever; yet others might cause the application to crash if you put in the wrong value. Before you try to change a parameter, you should understand something about data types. See the discussion of data types under Configuring global parameters.
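The contrast with mutable variables can be sketched as follows. All names here ($start, $start-minus-one, $x) are invented for illustration. Instead of reassigning a value, an XSLT developer binds each new value to a new name or a new scope.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all" version="3.0">
    <!-- Global parameter: a child of the root element, and therefore
         an invitation to the user to configure the application -->
    <xsl:param name="start" as="xs:integer" select="1"/>
    <!-- Global variable: derived from $start and never reassigned.
         Note the spaces around the minus sign; $start-1 would be read
         as a variable named "start-1". -->
    <xsl:variable name="start-minus-one" as="xs:integer" select="$start - 1"/>
    <xsl:template match="/">
        <!-- Local variable: exists only within this template -->
        <xsl:variable name="x" as="xs:integer" select="$start-minus-one"/>
        <result><xsl:value-of select="$x"/></result>
    </xsl:template>
</xsl:stylesheet>
```

With the default value of $start, the result element contains 0, the same outcome as the pseudocode, but reached without changing any variable after it has been bound.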

XPath language XSLT relies upon a sublanguage called XPath, which is itself a proper subset of another powerful XML programming language, XQuery. You will most commonly read or use XPath expressions in the context of the @select attribute in various XSLT instruction elements. XPath is an enormous topic, and well worth learning. Because this chapter is geared to helping new users quickly get comfortable with using and configuring an XSLT application, we introduce here some very common, useful XPath expressions. They are presented according to four basic concepts: navigation, filter expressions (predicates), operators, and functions.

Navigation Every XML file is a tree, and at the heart of XPath is a language for traversing that tree. XPath gets its name because it was designed to provide a path from one point to many. An XPath expression always assumes some kind of starting point for the path. That starting point is called the context, which is commonly a node inside an XML tree. Because this short guide is aimed at users who are configuring global parameters, we will assume in our examples here that the context is the primary input XML document. That means that the context is the document node of the primary input. When an XPath expression begins with a single slash, the document node is selected. The following example shows how to bind to the global parameter $doc-a the document node of the primary input. <xsl:param name="doc-a" as="document-node()" select="/"/> Once you start an XPath expression, you extend it by adding new components. This builds the path of traversal. Commonly you want to traverse downward through the tree, toward the leaves. You do this most frequently by element name. If the element is in a namespace, you either need to start with the appropriate prefix, or else use an asterisk (which stands for any namespace), followed by a colon. The following example selects the <tei:TEI> root element of the primary input XML document. If the root element is not named TEI or it is not in the namespace bound to the prefix tei, then you will get an error, because this global parameter expects exactly one item, no more, no less. <xsl:param name="tei-root-element" as="element()" select="tei:TEI"/> The previous example would have worked as well with /tei:TEI, which says, in effect, go to the document node, then go to the element TEI. We have left it off because we are assuming that the document node of the primary input document is the context (i.e., the assumed starting point for an XPath expression). 
Another XPath expression comparable to the example above would be *:TEI, which selects the root element if its name is TEI, regardless of what namespace it is in. The nested elements of the tree can be traversed by separating element names with the slash. The following example navigates from the document node leafward to the TEI's body, three levels deep. This example also shows how to use the asterisk alone, which stands for any element. <xsl:param name="tei-text" as="element()?" select="tei:TEI/*/tei:body"/> If you want to go deeply into the document, and select a variety of elements, you can do so with the double-slash operator, which navigates down to all descendants. <xsl:param name="tei-abs" as="element()*" select="tei:TEI//tei:ab"/> The example above selects every <ab> in a TEI document. If one <ab> nests inside another, both are picked. To select an attribute, use the @ sign. In the following example the XPath expression points to an attribute that is bound to a namespace via the prefix xml. One commonly finds @xml:id, @xml:lang, @xml:space, but most of the attributes you encounter will not have namespaces, even if their parent elements have them. <xsl:param name="tan-t-lang" as="attribute()" select="tan:TAN-T/tan:body/@xml:lang"/> To select any attribute, use @*. The following example selects all the attributes in <change> elements in a TAN file. Note the use of the asterisk for the root element. This expression will work no matter which TAN format is used. <xsl:param name="change-attrs" as="attribute()+" select="*/tan:head//tan:change/@*"/> You can use parentheses and commas to group and add nodes. In this example, the XPath expression points to the TAN <body>, then selects all the children comment nodes, text nodes, and elements. 
<xsl:param name="interesting-nodes" as="item()*" select="*/tan:body/(text(), comment(), *)"/> There is a slightly simpler way to do the preceding example, and it also finds any processing instructions: <xsl:param name="interesting-nodes" as="item()*" select="*/tan:body/node()"/> In an XPath expression node() finds everything except attributes and namespaces. There is much, much more about XPath navigation, but the samples above should get you started. See XPath 3.1 for comprehensive, technical coverage.
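The navigation steps above can be tried out in a self-contained sketch. The sample TEI fragment is invented and is held in a variable, so the stylesheet can be run against any primary input. The element and attribute names in the report are likewise only illustrative.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="#all" version="3.0">
    <!-- Hypothetical sample document, built inline -->
    <xsl:variable name="sample" as="document-node()">
        <xsl:document>
            <tei:TEI>
                <tei:text>
                    <tei:body xml:lang="grc">
                        <tei:ab>outer <tei:ab>inner</tei:ab></tei:ab>
                    </tei:body>
                </tei:text>
            </tei:TEI>
        </xsl:document>
    </xsl:variable>
    <xsl:template match="/">
        <!-- Each attribute reports what one path expression finds.
             Expected: 1, 1, 2, grc -->
        <report root-elements="{count($sample/tei:TEI)}"
            bodies="{count($sample/*:TEI/*/tei:body)}"
            all-abs="{count($sample//tei:ab)}"
            body-lang="{$sample/tei:TEI/tei:text/tei:body/@xml:lang}"/>
    </xsl:template>
</xsl:stylesheet>
```

Because the paths start from $sample rather than from the primary input, the context is stated explicitly. In a global parameter the same paths would normally be written without the variable, as in the examples above.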

Filter expressions (predicates) An XPath expression that traverses a tree might return more nodes than you want. You can reduce what is captured by applying a predicate, which is an XPath expression that filters results. A predicate consists of an XPath expression enclosed by two square brackets, inserted in the middle of, or at the end of, another XPath expression. The predicate must be placed in an XPath expression immediately to the right of the step you want to filter. For every context node found, the predicate will be evaluated as a boolean. If the predicate is true, the node is retained, otherwise it is discarded. A very simple example shows how to pick the second <div> in the body of a TAN-T file: <xsl:param name="second-div" as="element()?" select="tan:TAN-T/tan:body/tan:div[2]"/> This predicate, [2], returns true if a given node is the second child <div> of <body>. The simple numeral 2 in the filter expression is actually shorthand for a slightly longer expression based on XPath functions (discussed below), [position() eq 2]. The next example finds every <div> that has an attribute of @xml:lang. <xsl:param name="divs-with-lang" as="element()*" select="tan:TAN-T/tan:body//tan:div[@xml:lang]"/> This predicate, too, is shorthand, in this case for [exists(@xml:lang)], which uses another XPath function. Predicates may nest. Any nesting predicate still takes as its context the step immediately to the left. This example finds every TEI <div> element, but only if it has a <p> that has a <quote>. <xsl:param name="divs-with-quoting-ps" as="element()*" select="tei:TEI/tei:text/tei:body//tei:div[tei:p[tei:quote]]"/> Predicates may chain, simply by appending predicates. The following example restricts the previous example to the first qualifying <div> under each parent. <xsl:param name="divs-with-quoting-ps" as="element()*" select="tei:TEI/tei:text/tei:body//tei:div[tei:p[tei:quote]][1]"/> The position of chained predicates is important. 
Whereas the preceding example filtered the <div>s then picked the first one, the next example finds the first <div> (one that does not have a preceding sibling <div>), and retains it only if it has a <p> with a <quote>. <xsl:param name="divs-with-quoting-ps" as="element()*" select="tei:TEI/tei:text/tei:body//tei:div[1][tei:p[tei:quote]]"/> The previous two examples look very similar, but they produce very different results. Predicates may be placed anywhere in an XPath expression. The following gets all top-level <div>s only if the root element has an @TAN-version, a distinctive marker of all TAN files. <xsl:param name="top-level-divs" as="element()*" select="*[@TAN-version]/*/*:body/*:div"/>
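The difference that predicate order makes can be checked with a small self-contained sketch. The sample data are invented, use no namespace, and are held in a variable, so any primary input will do.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    exclude-result-prefixes="#all" version="3.0">
    <!-- Hypothetical sample: three sibling divs, the first without a quote -->
    <xsl:variable name="sample" as="document-node()">
        <xsl:document>
            <body>
                <div n="1"><p>plain</p></div>
                <div n="2"><p><quote>quoted</quote></p></div>
                <div n="3"><p><quote>quoted again</quote></p></div>
            </body>
        </xsl:document>
    </xsl:variable>
    <xsl:template match="/">
        <results>
            <!-- Filter, then take the first survivor: div 2, so n="2" -->
            <filter-then-first n="{$sample/body/div[p[quote]][1]/@n}"/>
            <!-- Take the first div, then filter: div 1 has no quote,
                 so nothing is selected and n is empty -->
            <first-then-filter n="{$sample/body/div[1][p[quote]]/@n}"/>
        </results>
    </xsl:template>
</xsl:stylesheet>
```

The two expressions differ only in the order of their predicates, yet the first returns a <div> and the second returns nothing.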

Operator expressions We have already seen some basic XPath operator expressions, namely, in the comma and the parentheses. XPath has many more operator expressions, some of which should be immediately recognizable: + for addition, - for subtraction, * for multiplication, and div for division. (The slash is not used for division, to avoid clashes with the step separator.) The keyword to, with an integer on either side (the smaller on the left), creates a range, e.g., (1 to 10). XPath also has comparison expressions. Although < and > can be used for "less than" and "greater than", those symbols interfere with XML syntax. Instead, use the expressions lt and gt. The expressions le and ge can also be used, to mean less than or equal to, and greater than or equal to, respectively. For checking equality, you will most often use the = expression. There is also eq, but this can be used only to compare exactly two items. The = is very powerful, because it will return true if there is any item in the sequence on the left hand side that is equal to any item in the sequence on the right. Consider, for example, an XPath statement that compares two sequences, each with two integers: (1, 2) = (2, 3). The statement is true because there is at least one pair of equal items. Because the expression = is used so frequently to compare sequences, you might think of it as meaning "overlaps with." Complex expressions can be combined with and, or, and grouped with parentheses, as needed. As you work with XSLT global parameters, you will find that most operator expressions are used within the filtering predicates. The following finds all <div>s with an attribute @type whose value is "chapter". <xsl:param name="chapter-divs" as="element()*" select="//*:div[@type = 'chapter']"/> This expression finds the top-level divs in 2nd, 3rd, 4th, and 8th place: <xsl:param name="some-divs" as="element()*" select="//*:body/*:div[position() = (2 to 4, 8)]"/> The following example returns any <div> whose values of @n and @type match. <xsl:param name="dupl-n-and-type-divs" as="element()*" select="//*:div[@type = @n]"/>
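A few of these operators can be tried out without any input data. The sketch below uses invented result element names and can be run against any primary input, since it reads nothing from it.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
    <xsl:template match="/">
        <results>
            <!-- A range: the four integers 1 2 3 4 -->
            <range><xsl:value-of select="1 to 4"/></range>
            <!-- div, not the slash, performs division: 3.5 -->
            <division><xsl:value-of select="7 div 2"/></division>
            <!-- lt avoids the angle bracket: true -->
            <less-than><xsl:value-of select="3 lt 5"/></less-than>
            <!-- The sequences share the item 2: true -->
            <overlap><xsl:value-of select="(1, 2) = (2, 3)"/></overlap>
            <!-- No item in common: false -->
            <no-overlap><xsl:value-of select="(1, 2) = (3, 4)"/></no-overlap>
        </results>
    </xsl:template>
</xsl:stylesheet>
```

The last two lines show why = can be read as "overlaps with": only one shared item is needed for the comparison to be true.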

Functions XPath expressions become enormously powerful when combined with the language's 155 standard functions. You have already seen two of them, position() and exists(). In a brief survey like this, it is possible to illustrate only a few of the most common standard functions you are likely to use when configuring the global parameters of an XSLT application. last(): returns an integer representing the size of the context. The following examples contain an implicit position() eq, just the same as the filter expression example above, with [2]. <xsl:param name="last-div" as="element()?" select="//*:body/*:div[last()]"/> <xsl:param name="penultimate-div" as="element()?" select="//*:body/*:div[last() - 1]"/> count(): returns the number of items in a sequence. The following returns all TAN-T <div>s that have more than three children <div>s. <xsl:param name="populous-divs" as="element()*" select="//tan:div[count(tan:div) gt 3]"/> not(): returns true if the expression it contains is false, or false if it is true. This function is very widely used, to great effect. The first example below finds all leaf divs, and the second, all leaf elements: <xsl:param name="leaf-divs" as="element()*" select="//*:div[not(*:div)]"/> <xsl:param name="leaf-elements" as="element()*" select="//*[not(*)]"/> Whereas the = operator is very popular, its counterpart, !=, is not used very much, because its results tend to be uninteresting. The true complement of = comes with not(), as illustrated in this example, which retrieves all <div>s that are not of a certain type: <xsl:param name="certain-divs" as="element()*" select="//*:div[not(@type = ('ep', 'title', 'pref'))]"/> lower-case() / upper-case(): converts a string to all lowercase / uppercase values. This example looks for any text node that has a certain value, but only after it has been rendered lowercase. <xsl:param name="some-elements" as="text()*" select="//text()[lower-case(.) = 'a b c']"/> Note the use of the period, which is shorthand for the context item. normalize-space(): takes a string, removing all space from the beginning and end, and replacing any consecutive block of intermediary space with a single space. This function is very useful when you wish to compare texts that may be indented. The preceding example might have missed some text nodes that had initial or trailing space. It can be adjusted as follows: <xsl:param name="some-elements" as="text()*" select="//text()[normalize-space(lower-case(.)) = 'a b c']"/> Many times XPath functions must call each other. You may nest them, as in the example above, or you may use the arrow operator, =>, which passes the value on its left to the function on its right. Use the syntax you are most comfortable with. <xsl:param name="some-elements" as="text()*" select="//text()[(lower-case(.) => normalize-space()) = 'a b c']"/> contains() / starts-with() / ends-with(): tests to see if the string in the first parameter contains / starts with / ends with the string in the second. The following finds all elements that contain the text "straw": <xsl:param name="some-elements" as="element()*" select="//*[contains(., 'straw')]"/> contains-token(): tests to see if the string in the first parameter has as one of its "words" the string in the second, based on segmenting the first string at blocks of space. The preceding example would have picked up "strawberry"; in the next example, using contains-token(), "strawberry" would not be selected: <xsl:param name="some-elements" as="element()*" select="//*[contains-token(., 'straw')]"/> matches(): tests to see if the string in the first parameter matches the second, which is a regular expression. Several TAN applications rely heavily upon regular expressions, which provide a very powerful way of finding and replacing text. See the discussion of regular expressions elsewhere in these guidelines. 
The following example finds any text node with one of the seven weekday names in English: <xsl:param name="text-nodes-with-weekdays" as="text()*" select="//text()[matches(., '(Sun|Mon|Tue|Wednes|Thurs|Fri|Satur)day')]"/> There are, of course, many, many more XPath functions. For the complete list, along with all the specifications, see XPath Functions and Operators 3.1.
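Several of the functions surveyed above can be exercised on plain strings, with no input data. The following sketch uses invented test strings and result element names, and assumes a processor that supports XPath 3.1 (needed for the arrow operator and contains-token()).

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
    <xsl:template match="/">
        <results>
            <!-- Nested calls and the arrow operator both yield: a b c -->
            <nested><xsl:value-of
                select="normalize-space(lower-case('  A  b   C '))"/></nested>
            <arrow><xsl:value-of
                select="'  A  b   C ' => lower-case() => normalize-space()"/></arrow>
            <!-- Substring test: true -->
            <contains><xsl:value-of
                select="contains('strawberry jam', 'straw')"/></contains>
            <!-- Whole-token test: false, because "straw" is not a word here -->
            <token><xsl:value-of
                select="contains-token('strawberry jam', 'straw')"/></token>
            <!-- Regular expression test: true -->
            <weekday><xsl:value-of
                select="matches('Due Wednesday', '(Sun|Mon|Tue|Wednes|Thurs|Fri|Satur)day')"/></weekday>
            <!-- != is true as soon as any pair differs: true -->
            <not-equal><xsl:value-of
                select="('ep', 'title') != ('ep', 'title')"/></not-equal>
            <!-- not() with = is the real complement: false -->
            <complement><xsl:value-of
                select="not('ep' = ('ep', 'title'))"/></complement>
        </results>
    </xsl:template>
</xsl:stylesheet>
```

The last two results illustrate why != is rarely what is wanted: two identical sequences are still "not equal" under !=, because some pair of their items differs.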

Configuring and running an XSLT application

Configuring global parameters Once you have determined the master XSLT stylesheet for the application, you may want to configure it by adjusting the values given to the global parameters. You have several possible strategies: Work with a configuration file. If you are comfortable writing some simple XSLT code, you might create a small XSLT file that has nothing but an <xsl:import> whose @href value points to the original stylesheet. Copy from the master XSLT stylesheet only those <xsl:param>s that you want to change. This method is quick to set up and easy to use, but it also means that you do not have immediate access to documentation. Overwrite the values in the master XSLT stylesheet directly. This method is quick, but it also means that you might not easily restore the original settings, unless you make a backup copy. Also, if you are using configuration files, their default values will change. That could be good or bad, depending upon your setup. Work from a copy of the master XSLT file. This method allows you to customize the entire application, and consult as needed the original settings in the master file. As with configuration files (see above), you can make new copies as new situations emerge. You should make certain that any working copies are in the same subdirectory as the original, to keep links intact. Manage transformations from Oxygen. Oxygen XML Editor has a powerful feature, Configure Transformation Scenarios, which allows you to create custom configurations for an XSLT application. Oxygen has good documentation on how to use this flexible feature, which can be combined with any of the preceding three options. Oxygen allows you not only to configure the parameters but to manage input and output. One drawback is that you are presented with all the global parameters that can be found, whether or not they are really relevant. Documentation associated with a particular parameter may be missing or truncated. 
You should use this feature in conjunction with any documentation that comes with the XSLT application. Whatever method you adopt for configuration, first find the relevant global parameters. Once you have them, you should always ensure you understand what type of data is expected, and in what quantity. Data types. XSLT is a strongly typed programming language. The data bound to variables and parameters is always at least implicitly typed. Many variables or parameters specify exactly what kind of data is expected. Those that do not are assigned some default type by the XSLT processor. Most data types you encounter will be of two sorts: atomic types, and nodes. Examples of atomic types are integers, booleans, strings, and dates. Examples of nodes are elements, attributes, comments, and processing instructions. There are other types, but we will focus here on the most common. Quantities. In XSLT, there are four quantity categories: (1) zero or one; (2) exactly one; (3) zero or more; (4) one or more. Each of these is specified by adding to a data-type declaration a quantifier: ?, nothing, *, and +.

Quantifiers and data types

Quantity      | Symbol | Atomic type example | Node type example
zero or one   | ?      | xs:string?          | element()?
exactly one   | (none) | xs:boolean          | document-node()
zero or more  | *      | xs:dateTime*        | attribute()*
one or more   | +      | xs:integer+         | comment()+
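The configuration-file strategy described above might look like the following sketch. The file name my-application.xsl is hypothetical and must be replaced by the path to a real master stylesheet. The two parameter names are borrowed from examples elsewhere in this chapter and are assumed, for illustration only, to be global parameters of that stylesheet.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema" version="3.0">
    <!-- Point to the master stylesheet (hypothetical path) -->
    <xsl:import href="my-application.xsl"/>
    <!-- Only the parameters to be changed are repeated here. Declarations
         in an importing stylesheet take precedence over those in the
         stylesheet it imports, so these values replace the defaults. -->
    <xsl:param name="ignore-comments" as="xs:boolean" select="true()"/>
    <xsl:param name="start-at-depth" as="xs:integer" select="2"/>
</xsl:stylesheet>
```

To use such a file, run the configuration file, not the master stylesheet, as the stylesheet of the transformation. Keeping the declared type of each parameter identical to the one in the master stylesheet avoids surprises.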

Below are some of the more common data types you will find in global parameters, along with several examples going from simple values up to more complex assignments based upon XPath expressions or XSLT constructions. For more background, see . Focus is placed upon data types and quantities expected in select TAN applications and utilities. Strings. A string is a concatenated sequence of characters. Even when the value consists only of Arabic numerals, a string will be read and interpreted as a text, not as an integer. In the following example, the string value is specified by the single quotation marks within the double quotation marks. The double-quotation marks delimit the value of the attribute, and the single-quotation marks specify that the value is a string. If you did not include the single quotation marks, it would be interpreted as an XPath expression pointing to the name of a child element within the context. <xsl:param name="text-a-to-compare" as="xs:string?" select="'Every day'"/> When more than one string is expected, the strings should be separated by a comma. It is also common to surround the series with parentheses, for visual clarity. This example assigns to the parameter a sequence of two strings. <xsl:param name="text-a-to-compare" as="xs:string+" select="('day', 'night')"/> In the next example, @select is replaced by the text node within the parameter. This technique can be useful if the value expected will be space-normalized, and you want to wrap text, and you do not need to create multiple strings. <xsl:param name="text-a-to-compare" as="xs:string?">Every day</xsl:param> The next example takes the primary input XML and converts it to a string. Such conversion is called casting. Keep in mind that the context node of any global parameter is the primary input XML document. <xsl:param name="text-a-to-compare" as="xs:string" select="string(/)"/> Perhaps you need to supply a path to some input. 
The following example traverses the tree to a particular @href within the primary input. The string value in that attribute will be treated like a URL, and it will be resolved relative to the base URI of the primary input. <xsl:param name="path-to-source" as="xs:string" select="resolve-uri(/*/tan:head/tan:predecessor/tan:location/@href, base-uri(/))"/> If a parameter allows multiple values, and you need to change those values frequently, you might want to bind options to global parameters or global variables of your own creation... <xsl:variable name="dir-1-path" as="xs:string" select="'../../novels/book-a'"/> <xsl:variable name="dir-2-path" as="xs:string" select="'test/comparanda'"/> <xsl:variable name="dir-3-path" as="xs:string" select="'test/logs'"/> <xsl:variable name="dir-4-path" as="xs:string" select="'../brown/texts'"/> ...then update the master global parameter on a case-by-case basis. <xsl:param name="secondary-input-relative-uri-directories" as="xs:string+" select="$dir-1-path, $dir-4-path"/> The preceding example allows you to quickly change from one set of data to another. Booleans. A boolean is a true/false value. If a parameter expects a boolean, you should use some XPath expression that evaluates to a boolean, even if it is a simple one, such as true() or false(). If you need to express the value as a string, it should be either "true", "false", "0", or "1", and it should be cast explicitly with xs:boolean(), because a string literal in @select is not converted to a boolean automatically. <xsl:param name="ignore-comments" as="xs:boolean" select="false()"/> <xsl:param name="preoptimize-string-order" as="xs:boolean" select="xs:boolean('true')"/> Integers. To supply an integer, you need only use numerals, perhaps preceded by a hyphen if it is negative. You should not use quotation marks, or the parameter's child text node. There will be no confusion of the integer with an XPath step, because no element's name may begin with a digit. <xsl:param name="start-at-depth" as="xs:integer" select="1"/> <xsl:param name="ngram-auras" as="xs:integer+" select="(2, 1)"/> Decimals. 
Decimals are much like integers, but require decimal points. If the decimal is between 1.0 and -1.0, the decimal point should be preceded by a zero, e.g., -0.99. <xsl:param name="diff-threshold-of-interest" as="xs:decimal" select="0.2"/> Elements. If a global parameter expects elements as input, you must construct them inline, or provide an XPath expression that directs the processor to the elements in question. The following example shows how to construct a parameter that might be fed into tan:batch-replace(). <xsl:param name="additional-batch-replacements" as="element()"> <replace pattern="(\d\d)/(\d\d)/(\d\d\d\d)" replacement="$3-$1-$2" message="Converted U.S.-style date to ISO-style"/> </xsl:param> The parameter used in the previous example might need to be given numerous elements. In those cases it might be convenient to put them in a separate XML file and point to it with an XPath expression. Note that the quantifier must then allow more than one element: <xsl:param name="additional-batch-replacements" as="element()*" select="doc('batch-replacements.xml')/*/tan:replace"/>
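The data types and quantities discussed above can be combined in one small, self-contained sketch. All parameter names and values below are invented for illustration and do not belong to any particular TAN application. The stylesheet reads nothing from the primary input, so it can be run against any XML file.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all" version="3.0">
    <!-- One or more strings -->
    <xsl:param name="labels" as="xs:string+" select="('day', 'night')"/>
    <!-- Exactly one decimal -->
    <xsl:param name="threshold" as="xs:decimal" select="0.2"/>
    <!-- Exactly one boolean, cast explicitly from a string -->
    <xsl:param name="verbose" as="xs:boolean" select="xs:boolean('true')"/>
    <!-- Zero or more elements, constructed inline -->
    <xsl:param name="replacements" as="element()*">
        <replace pattern="\s+" replacement=" "/>
        <replace pattern="colour" replacement="color"/>
    </xsl:param>
    <xsl:template match="/">
        <!-- Expected with the defaults above: 2, true, true, 2 -->
        <report label-count="{count($labels)}"
            threshold-is-decimal="{$threshold instance of xs:decimal}"
            verbose="{$verbose}"
            replacement-count="{count($replacements)}"/>
    </xsl:template>
</xsl:stylesheet>
```

If a value does not fit the declared type or quantity, for example an empty sequence bound to $labels, the processor should raise a type error instead of running with bad data, which is one reason to respect the declarations when configuring a parameter.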

Starting the XSLT process Running an XSLT application can be done in several ways. As noted above, at the heart of the process is the XSLT processor. The goal is to find the means to feed the primary input and the master stylesheet into the processor, and to tell the processor where to place the output. From the command line. Processors such as Saxon allow you to initiate the process from the command line. Windows: Press the Windows key; type "cmd" and click "Command Prompt"; type the letter of the drive where you plan to run the process, followed by a colon, e.g., e:; using the command cd, navigate to the directory where your files are, e.g., cd myfiles. Macintosh: Open the Terminal app; using the command cd, navigate to the directory where your files are, e.g., cd myfiles. From there, follow the instructions provided by the vendor of the XSLT processor. Saxon provides instructions for its product in its online documentation. A simple command-line instruction might look like the following: java -cp "E:/xslt processors/saxon-he-10.0.jar" net.sf.saxon.Transform -s:init.xml -xsl:app.xsl -o:primary-output.xml From Oxygen XML Editor. Oxygen provides numerous ways to initiate the XSLT process, including the following: XSLT Debugger Perspective. This editing mode changes the appearance of Oxygen, putting eligible primary input files on the left, XSLT files in the middle, and an output pane on the right. You can choose the processor you prefer, and pick your primary input and master stylesheet. Running the application provides interactive output, with many diagnostic tools, letting you learn how the output came about. Transformation Scenarios. You can choose Configure Transformation Scenarios, and create a highly customized set of conditions for running an XSLT application. These methods, and other more sophisticated approaches, are described by the vendor in its documentation.

TAN utilities and applications All TAN utilities and applications share the same basic architecture. Once you have figured out how to use one TAN application, you are well on your way to being able to use the others as well. Each TAN utility and application has its own purpose, which means that its expected input and output will differ quite a bit from the others. Nevertheless, all TAN utilities and applications share a common set of features, to assist users.

Application/utility setup All TAN utilities are in the utilities directory of the TAN files; the applications are in the applications directory. Within those directories, there is one subdirectory per utility or application. And within that subdirectory, there are only two XSLT files, accompanied perhaps by further subdirectories. One of the XSLT files has "configuration" in the name, and it allows you to customize a particular application or utility for your projects. The other XSLT file is the master stylesheet for the utility or application in question, and it has the same name as its parent directory. Subdirectories contain the heart of the code, and other important dependencies. The file structure is designed to make the main point of entry quite clear. Having a directory with so few files should, we hope, inspire you to fill it up with copies designed for specific situations.

The master stylesheet All master stylesheets for TAN utilities and applications share a common structure. They are designed to be as user-friendly as possible, and to focus exclusively on configuration settings that the user may want to change. Preamble. Every master stylesheet begins with a long series of comments, indicating the name of the application, its version (an ISO date), and a brief description of what it does. The preamble includes a statement of the intended primary input, secondary input, primary output, and secondary output. Cautionary notes may be included. If the utility or application has areas that are known to need development, these will be listed. Global parameters. After the preamble a series of global parameters is presented. Each one is preceded by a comment that explains the expected value. The parameters may be organized in blocks according to stages or topics. Some of the parameters may be localized versions of the standard TAN global parameters declared by files in the main parameters directory. The values in the master stylesheet of the application will take precedence over the default values. Import statement. At the end of the master stylesheet is an <xsl:import> statement, pointing to the core stylesheet. That instruction may be followed as well by other comments and declarations that users should not change.
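The three-part structure described above can be pictured with the following skeleton. It is a hypothetical sketch, not a copy of any actual TAN master stylesheet: the application name, the parameter, and the path incl/sample-utility-core.xsl are all invented, and the file will run only if the imported file exists. Placing the import after the parameters follows the description above and assumes an XSLT 3.0 processor.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema" version="3.0">
    <!-- PREAMBLE (hypothetical)
         Name: Sample Utility
         Version: 2021-09-07
         Description: illustrates the layout of a master stylesheet
         Primary input: any XML file
         Primary output: a report -->

    <!-- GLOBAL PARAMETERS -->
    <!-- Should comments in the input be ignored? Expected: one boolean. -->
    <xsl:param name="ignore-comments" as="xs:boolean" select="false()"/>

    <!-- IMPORT STATEMENT: users normally change nothing from here on -->
    <xsl:import href="incl/sample-utility-core.xsl"/>
</xsl:stylesheet>
```

Because the master stylesheet is the importing file, the parameter values it declares take precedence over any defaults declared in the file it imports.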

The core stylesheet Every master stylesheet points via its import statement to a single XSLT file in the incl subdirectory. That XSLT file is the core stylesheet. As an everyday user of the application, you will find this core stylesheet to be of little or no importance. But anyone doing any kind of customization or development should be aware of how it works, and this description is aimed at those developers. Each core stylesheet follows a common structure. It begins with <xsl:include> instructions that point to the TAN function library, and perhaps other important components. Next come metadata about the application: its name, its IRI, a change message to be reported, and a variety of descriptions about the application, and its expected input and output. A change log and a list of features to work on may be included. The dates within those parameters dictate the version of the application. All this metadata is used in several ways: to populate the comments of the master stylesheet, to populate the contents of these guidelines, and perhaps to supplement the output. The master data is here in the stylesheet. The development branch of the TAN project includes a maintenance directory. Within it is a Schematron file that makes sure that the master and core stylesheets of any given utility or application are synchronized. After the metadata come the XSLT declarations that drive the process. The output for most TAN utilities and applications requires multiple ordered stages. A given stage might have a strong declarative element, but the stages themselves are set carefully in a sequence, signposted by global variables that incrementally build the primary or secondary output. At the end of the core stylesheet are two unnamed templates. Each one points to the document node of the primary input XML file, and so one of the two will always be the initial, starting template. 
The first of these templates is for diagnostics and is controlled by a static parameter that allows a developer to turn it on or off. It normally reports back the values of the global variables, set in process order. If that first template is turned off, then the second one takes over, and it drives the messaging system, the primary output tree (bound to some global variable), and initiates any processes necessary for <xsl:result-document> instructions required to generate secondary output. Any primary or secondary output that results in a TAN file must be credited to or blamed upon the application or utility. The metadata for the application will be added to the output TAN file's vocabulary, and an appropriate entry will be added to the change log.

Developing with TAN

This chapter addresses anyone who wants to develop their own applications using TAN. Some may want to experiment with, revise, or extend the code that already exists. Others may be developing their own XQuery or XSLT application, and intend to use select TAN functions. Yet others may want to customize the standard TAN applications or utilities, perhaps as part of a pipeline or workflow, or for populating a website.

TAN is very developer-friendly. The function library is one of the richest and largest of its kind. If you are accustomed to doing natural language processing through the Natural Language Toolkit, Classical Language Toolkit, or a comparable package, you may find that TAN has the building blocks you need to do the same activities within an XSLT or XQuery environment.

General design features

All TAN digital assets are organized primarily by role. At the heart of TAN is its function library. This library is the foundation for the schemas that validate TAN files, as well as applications and utilities. All of those resources contribute to a large share of the content in these guidelines.

TAN dependencies

The TAN function library is so named because it relies heavily upon functions. But, because it is written in XSLT, there are also global parameters, global variables, templates, keys, and other declarations. Certain design principles have been adopted in designing and organizing these declarations.

Validation mode. The TAN function library was designed first and foremost to drive the validation process. That process prioritizes discarding parts of the primary input file that are no longer needed for error-checking. As the TAN function library grew to support utilities and applications, a sharp distinction needed to be drawn between processing for validation and processing for other purposes. The static global parameter $tan:validation-mode-on exerts a significant influence upon many operations. Files in the functions subdirectory whose names include the keyword extended are excluded from the package when validation mode is on. By default, validation mode is off, and everything in the TAN function library is fetched.

Named templates. In general, functions have been preferred over named templates. This allows TAN operations to be used in XPath expressions, and contributes to more concise code. Named templates have been used only when result documents need to be created, or when tunnel parameters need to be preserved.

Functions. All functions have their visibility declared public or private. You are welcome to use private functions, but keep in mind that they are generally specialized. Some functions have parallel cached and non-cached versions, to support environments where memoized functions are not allowed. Many functions have multiple versions based on the number of parameters (arity). Lower-arity functions contain comments that point to the highest-arity version, which is fully annotated by enclosed comments. The comments are placed inside the <xsl:function>, so that if a function needs to be copied or moved, the documentation always accompanies it. Documentation shares a common structure: first, the intended input; second, the intended output; third, other notes; finally, kw: followed by a comma-delimited list of keywords categorizing the function.

Template modes. Every template mode has an associated <xsl:mode> declaration, which always defines the default behavior of the template. To reduce the chance of interference with XSLT applications that might include the library, there is only one template that defines behavior for all template modes (mode="#all"), at a very low priority, for elements that contain validation error messages. That means that you can use <xsl:include> or <xsl:import> without worrying about conflicts with template modes in your host application. All mode names are set in the TAN namespace, to avoid conflicts with dependent resources.

Keys. For convenience, all keys are kept in files at functions/setup.

Character maps. For convenience, all character maps are kept in files at functions/setup.

Global parameters. Most global parameters are invitations to the user to configure the environment, and they are placed in the main parameters directory. A few global parameters are reserved for technical processes, and they are kept in files at functions/setup. All global parameters are bound to the TAN namespace. The exceptions to this rule are the global parameters unique to specific utilities and applications; they are placed in no namespace. Doing this has helped solidify the boundaries of the TAN function library.

Global variables. Development work revealed that global variables, even those that were not used, frequently slowed the validation process. Therefore global variables are kept to a minimum within the standard components, but are used more extensively in the extended components. Each global variable is bound to the TAN namespace. Those whose values rely upon the primary input file are constructed under the assumption that the primary input file is a TAN file.
For more specific explanation of individual components see .
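
The effect of validation mode on which library files are loaded can be pictured with a small sketch. This is a language-neutral illustration in Python, not TAN's XSLT mechanism. The function name select_library_files and the file names in the usage note are hypothetical; the only rule taken from the text above is that files whose names include the keyword extended are left out when validation mode is on.

```python
def select_library_files(file_names, validation_mode_on):
    """Return the library files that would be loaded.

    Assumption for illustration: a file counts as "extended" when the
    keyword appears anywhere in its name.
    """
    if not validation_mode_on:
        # Validation mode off (the default): everything is fetched.
        return list(file_names)
    # Validation mode on: skip the extended components.
    return [name for name in file_names if "extended" not in name]
```

For example, given the hypothetical names "strings-standard.xsl" and "strings-extended.xsl", only the first would be kept when validation_mode_on is true.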

Using TAN functions

TAN's extensive function library, which drives the validation process, provides a foundation for application development in XSLT. If you are writing an XSLT application, simply point via <xsl:include> or <xsl:import> to functions/TAN-function-library.xsl. That's it. You now have access to the complete TAN function library.

If you are developing for XQuery, you can access any of the functions via fn:transform(), taking care to set up that function's parameters correctly. See .

Some relatively complex TAN functions may be affected by the settings in the subdirectory parameters. Otherwise, the functions have been designed to be as orthogonal as possible.

There are so many TAN functions, you may not know where to begin. Discovering what is available will take some time and study. You could simply browse the XSLT files that constitute the function library. Or you can use the autocomplete feature in Oxygen's editing mode. Either method will provide a complete but perhaps chaotic experience. These guidelines provide a more accessible starting point. Begin with the grouped index: . Find a topic or function you are interested in, and follow the links.

The mechanics of validation

In many cases, developers will want to work with TAN files, either as input or as output. But TAN files have a number of distinctive constructions: two different methods of inclusion (see ), space-normalization rules (see ), numeration systems (see ), tokenization systems (see ), and pointing systems (see ). You can work directly with raw TAN files, but you run the risk of misinterpreting the file.

Every TAN file is definitively interpreted through the TAN functions that undergird the Schematron validation process (see ). That process is a core part of the standard TAN utilities and applications, and it determines the nature of some of the more important global variables.

Every TAN file is subject to two major transformations, both for validation and for applications.

Resolution

The first transformation resolves the file. The goal is to get the file into a state where it can be understood on its own terms. A resolved TAN file contains all its relevant vocabulary and components. It can be evaluated without having to consult the files referred to by <vocabulary> or <inclusion> dependencies. (See for background on TAN's approach to inclusion.) This process also does some basic file-specific normalization; it will:

Prepare the file. This includes stamping the root element with a base URI (the path location of the file), evaluating <alias>, and inserting into every element a @q that contains an identifier unique to the element. This identifier is used by the Schematron file to match an element with any error messages in the corresponding element in the XSLT output.

Insert required components from <vocabulary>s or <inclusion>s, using the following method:

Relevant external vocabulary items are inserted into the <head>, either as descendants of the appropriate <vocabulary> or, if derived from TAN standard vocabulary, as new <tan-vocabulary> elements immediately following the <vocabulary-key>.

All vocabulary items are imprinted with an <id> corresponding to an @xml:id from any corresponding entry from <vocabulary-key>, to facilitate rapid retrieval of vocabulary.

Any vocabulary <name> that is not normalized is duplicated with a name-normalized copy (signaled by @norm): lower-case, hyphens and underscores changed to spaces, and space-normalized.

Any element with an @include is replaced by the elements of the same name found in the target inclusion document (constructed recursively if need be). In addition, <inclusion> (in the head) is populated with any vocabulary items required to resolve the newly included material (recursively, if need be). This last point is important, because all idrefs must be interpreted in light of the original context. Included idrefs are made available to the host document, so when you use <inclusion> you must ensure there are no id conflicts.

Normalize all numbers in original components (i.e., excluding included elements or vocabulary items) as Arabic numerals.

Files are resolved recursively. That is, no <vocabulary> or <inclusion> components are incorporated or processed until the files pointed to are themselves first resolved. Numerals fall at the end of the process because they might need to be resolved in light of resolved vocabulary and inclusions.

The description above is necessarily generalized. For details consult the function library, particularly the functions/resolution directory. In cases of conflict between the code and the description above, the code should be given priority.
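
The name-normalization rule mentioned above (lower-case, hyphens and underscores changed to spaces, space-normalized) can be sketched in a few lines. This is a Python illustration of the rule as described in these guidelines, not the TAN implementation, which is written in XSLT and may handle additional cases. The function names normalize_name and names_with_normalized_copies are hypothetical, and the boolean flag merely stands in for the @norm signal.

```python
import re


def normalize_name(name: str) -> str:
    """Lower-case a name, turn hyphens and underscores into spaces,
    and collapse runs of whitespace (trimming both ends)."""
    lowered = name.lower()
    spaced = re.sub(r"[-_]", " ", lowered)
    return " ".join(spaced.split())


def names_with_normalized_copies(names):
    """Return (text, is_normalized_copy) pairs.

    Every original name is kept; a name that is not already in
    normalized form is followed by a flagged, normalized duplicate.
    """
    result = []
    for name in names:
        result.append((name, False))
        normalized = normalize_name(name)
        if normalized != name:
            result.append((normalized, True))
    return result
```

A name that is already normalized produces no duplicate, which mirrors the statement that only names that are not normalized are copied.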

Expansion

The second transformation expands the resolved file. You must resolve a TAN file before you try to expand it. The goal behind expansion is to unpack the components of a resolved document and identify any errors along the way (see the master list of errors).

There are three levels of expansion, corresponding to the three levels of Schematron validation: terse, normal, and verbose.

In terse expansion, for each value of an attribute, an element with the attribute's name is placed within the parent (e.g., @type="a b" produces <type>a</type> and <type>b</type>). If the value is an idref, and it points to an alias, a copy is made for the idref of each target vocabulary item. If an idref does not point to a vocabulary item of the expected type, an error message is also copied into the parent. Any values that are ranges are expanded, if need be. Select networked files are checked for basic validity. Class-2 files undergo extra rounds of processing during terse validation: sources are adjusted if need be, and then checked against references in the host class-2 file. (See .) In terse expansion, all pointing mechanisms are checked. Because of this basic requirement, some terse expansion can take a long time on lengthy files, or ones with complex <adjustments>.

Normal expansion builds on terse expansion by interrogating networked files more closely. Any errors that were reported during the terse stage but were suppressed to avoid clutter are enabled.

Verbose expansion generally attends to procedures that are complex, or are not essential parts of a validation report. For example, a <model> of a class-1 file will be checked, to find references that one has but is lacking in the other. A class-1 <redivision> will be analyzed, to make sure that the two transcriptions are identical. A catalog file in the same directory will be checked, to see if it has faulty entries.

Many errors lend themselves to solutions that can be recommended by the TAN function library. Some solutions are returned to the Schematron validation method as Schematron Quick Fixes (SQFs). XML editors that are equipped to handle SQFs (e.g., Oxygen XML Editor) can then prompt users to quickly fix an errant section. For example, if text has not been NFC Unicode-normalized, an SQF will allow a user to make the change in two clicks. Thus, TAN validation does not merely tell you what the problems are; it tries to help fix them.

The term "expansion" describes the process but possibly not the output. If the global parameter $tan:validation-mode-on is true, then in the course of expanding the file the TAN templates will abandon any parts that are no longer needed. The output is normally much smaller than the input file, restricted as it is to the root element, which merely wraps errors, warnings, or fixes. So although during validation the file is really being expanded, at the end only a small portion of the expanded file is returned to the Schematron processor, to expedite validation. But if $tan:validation-mode-on is false (the default value), the entire expanded file and its dependencies are returned. Such output can be very useful in applications.

The preceding description of expansion is necessarily generalized. For details consult the function library, especially functions/expansion.
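
The first step of terse expansion, in which each value of an attribute becomes a child element named after the attribute, can be sketched as follows. This is a Python illustration of that single step under stated assumptions, not TAN's XSLT code. The function name expand_attributes is hypothetical; the sketch assumes that attribute values are split on whitespace, that the original attributes are retained, and that the new elements are placed before the element's existing children. It ignores idrefs, aliases, ranges, and error messages.

```python
import xml.etree.ElementTree as ET


def expand_attributes(element: ET.Element) -> ET.Element:
    """Return a copy of the element in which every whitespace-separated
    attribute value also appears as a child element named after the
    attribute. Existing children are expanded recursively."""
    expanded = ET.Element(element.tag, dict(element.attrib))
    expanded.text = element.text
    # One new child per attribute value, e.g. type="a b" gives two <type>s.
    for attr_name, attr_value in element.attrib.items():
        for value in attr_value.split():
            child = ET.SubElement(expanded, attr_name)
            child.text = value
    # Carry over the original children, expanded in the same way.
    for original_child in element:
        child_copy = expand_attributes(original_child)
        child_copy.tail = original_child.tail
        expanded.append(child_copy)
    return expanded


sample = ET.fromstring('<div type="a b" n="1">text</div>')
result = expand_attributes(sample)
```

Applied to the sample, the result holds two type children and one n child, matching the @type="a b" example given above.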

Using TAN global variables

The global variables in the TAN function library provide quick access to some important material. For a complete list of global variables, with detailed lists of dependencies and dependents, see . That technical appendix does not provide the context needed to identify some of the key features of the global variables, which this section attempts to supply.

As noted above, the primary task of the TAN function library is to drive the validation process, described in the previous section. If you are developing an application that begins with a TAN file, whether as primary or secondary input, it is often best to start with it in its resolved or expanded state. If that TAN file is the primary (catalyzing) input, use the global variables $tan:self-resolved and $tan:self-expanded. If it is secondary input, use tan:resolve-doc() and tan:expand-doc(). You must resolve a TAN file before you attempt to expand it.

For a class-2 file, $tan:self-expanded, or the output of tan:expand-doc(), is a sequence of documents, starting with an expansion of the class-2 file, followed by expansions of its dependencies (TAN-T or TAN-mor). Its expanded class-1 sources will be tokenized where required, and marked with anchors for each reference in the class-2 file. If a token straddles leaf <div>s, the token will be reconstituted by moving the tail of the token up. These expanded sources are excellent candidates for other types of transformation. For example, HTML pages can be created to integrate class-2 annotations and their class-1 sources, in a variety of ways.

Even when the validation mode is turned off (default), the validation phase (terse, normal, verbose) plays a significant role in the results of expansion. At the terse and normal phases, an expanded class-2 file will contain expanded versions of both the host file and its sources. At the verbose level, an expanded TAN-A file will conclude its $tan:self-expanded sequence with one or more documents with a root element <TAN-T_merge>, one file per detected work. A TAN-T_merge file has one <head> per class-1 source that has been merged, and the <body> contains a master set of <div>s that merge all the other sources' <div>s that share the same reference, after all <adjustments> have been made. Each leaf <div> in each source appears in the appropriate place, but as a child of a common <div> that encompasses all other leaf <div>s with the same reference. For each version's leaf div, @type is changed to #version, and other markers signify which source it corresponds to. A TAN-T_merge file is a good basis for building parallel displays (e.g., ) or statistical analyses. These merge files can be created on an ad hoc basis through the function tan:merge-expanded-docs(), applied to individual class-1 files, after expansion.

If your application uses a TAN file as the primary input, you may want to take advantage of some other important global variables (see ):

Global variables for networked files

Element        Raw (first document available)   Resolved                  Expanded
<inclusion>    —                                $inclusions-resolved      —
<vocabulary>   —                                $vocabularies-resolved    —
<source>       —                                $sources-resolved         $self-expanded[tan:TAN-T]
<see-also>     $see-alsos-1st-da                $see-alsos-resolved       —

The column labeled "raw" lists variables that hold the first documents available, without alteration. Variables in the next column hold the resolved form, following the same process described above for $tan:self-resolved. The resolved forms of <inclusion> and <vocabulary> are sufficient for validation; therefore they do not have expanded versions. Expanded sources are always bundled with their class-2 file's $tan:self-expanded.

For most applications, a resolved file is a sufficient starting point. But even then, there will be places where you will want to fetch the vocabulary bound to a particular attribute or element. One of the more important functions to familiarize yourself with is tan:vocabulary(), which can be used to get the IRI + name pattern of a specific node, or to get all the vocabulary available for a given type.

Some developers will find even tan:vocabulary() a hassle to use. Consider setting the global parameter $tan:distribute-vocabulary (default false) to true. If you do, whenever an attribute with an idref appears, it will be imprinted with the corresponding IRI + name pattern for the referred vocabulary item. Exercise this option with care: the expanded document will grow significantly larger.
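
The grouping idea behind a TAN-T_merge file, described earlier in this section, can be illustrated with a small sketch: leaf divisions from several sources are gathered under a common entry whenever they share a reference. This is a Python illustration of the concept only, not the behavior of tan:merge-expanded-docs(). The function name merge_leaf_divs and the input shape (a dictionary from a source identifier to a list of reference and text pairs) are assumptions made for the example, and adjustments are taken as already applied.

```python
def merge_leaf_divs(sources):
    """Group leaf divs from several sources by shared reference.

    sources maps a source id to a list of (reference, text) pairs.
    Returns a list of (reference, [(source_id, text), ...]) entries,
    ordered by the first appearance of each reference.
    """
    merged = {}
    for source_id, leaf_divs in sources.items():
        for reference, text in leaf_divs:
            # All versions of the same reference share one container.
            merged.setdefault(reference, []).append((source_id, text))
    return list(merged.items())
```

Each returned entry plays the role of the common <div> that wraps every version's leaf <div> for one reference, which is the shape that makes parallel displays and per-reference statistics straightforward to build.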

Appendixes