Guide to the Text Alignment Network, Version 2020

Text Alignment Network: Official Guidelines
2015-present Joel Kalvesmaki, kalvesmaki@gmail.com
All software, code, and dependencies (/applications, /functions, /schemas, /vocabularies) are released under a GNU General Public License, https://opensource.org/licenses/GPL-3.0. All other materials (such as this document), unless otherwise specified, are licensed under a Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0/
Latest stable version: http://textalign.net/release/TAN-2020/guidelines/. Development version: https://github.com/textalign/TAN-2020/tree/dev
Version 2020 (alpha) 2020-08-13
In case of contradictions, apparent or not, between these guidelines and the core TAN files, priority should be given first to the RELAX-NG schemas (compact syntax), then to the functions, and finally to these guidelines.
General Overview
Introduction

Definition and purpose The Text Alignment Network (TAN) is a suite of highly regulated XML formats designed to maximize the syntactic and semantic interoperability of texts, annotations, and language resources. TAN is particularly suited to aligning texts with multiple versions (copies, translations, paraphrases), and to annotating quotations, translation clusters (word-to-word), and lexicomorphological features. Simple, modular, and networked, the TAN format allows users, working independently and collaboratively, to find, create, edit, study, align, and share their texts and annotations. The extensive validation rules are integrated into a library of functions that definitively interpret the format and provide a foundation for third-party tools and applications. Although expressive of scholarly nuance and complexity, the TAN format has been designed to benefit everyone, scholars and non-scholars alike, and can be used broadly for reading, teaching, publishing, research, analysis, and language learning.

Rationale and Purpose Scholars working with texts frequently need to work with numerous versions. Some texts have been lost in their original form and can be studied only through later translations, paraphrases, or fragmentary quotations. Even when an original survives, its later versions are often worth study, revealing as they do something of how words, concepts, and works were preserved, altered, or combined by generations and cultures who created, read, and circulated the versions. Such textual comparison requires texts whose words, sentences, paragraphs, and other segments are aligned. Such alignment can be challenging. Some versions might be defective, or follow an idiosyncratic sequence. One editor may have divided the text according to a system not easily applied to other versions. Identifying which words or phrases in a translation and its original correspond to each other might result in complex, overlapping spans. And even larger segments such as sentences and paragraphs may not line up well. Further, every version of a text is part of a much larger, complex history of text reuse, and a complete study of that context requires engagement with other works and other languages, and collaboration across projects and fields of study. Text Alignment Network (TAN) XML facilitates the exchange of multiple versions of texts and annotations on those texts. TAN syntax is suitable for humans to read and edit, expressive enough to allow scholars to register doubt and nuance, and sufficiently structured to permit complex computer-based queries across independent datasets. TAN is not a single format, but rather a suite of formats, built modularly. Each format is dedicated to a particular task, requiring editors to declare their views or assumptions about language and texts in a structured manner, so that other users of the data (whether human or computer) can decide whether the data meets their needs. 
Because nearly all TAN data must be expressed in a way that computers can parse, the information can be used in semantic web applications. TAN has been designed to support two kinds of scholarly activity: creation and research. When we create our primary sources or analyze them, we normally want what we create to be useful to our colleagues. TAN was designed to assist scholarly creative activities such as: creating and sharing a transcription of a particular version of a textual work so that it is more likely to align with any other TAN version of that text created by someone else; creating an index of quotations that is semantically rich and can be applied to any other version of the quoting or quoted works; specifying exactly (e.g., word-for-word) where a source and its translation correspond, even with overlapping or ambiguous relationships, or where doubt or alternative possibilities of alignment need to be expressed; and listing the grammatical features of every word in a text or a language in a way that allows it to be compared easily against other languages and texts. Shared TAN files form a decentralized, interoperable corpus of texts, a kind of Internet of primary sources and annotations. As this TAN-compliant corpus spreads into different linguistic, chronological, and geographical regions, third-party tools and applications can expand the repertoire of research questions beyond any single corpus, to help scholars fruitfully investigate broader, comparative questions such as: For classical Greek texts, how were words with the root -ιστημι ("stand") translated into ancient Latin? In what specific ways did the vocabulary of technical terms shift from pre-Christian translations into later, Christian ones? How does the reformed Chinese translation technique for Sanskrit Buddhist texts, attested by Dao An (312-385 CE), compare to reforms in the seventh and eighth centuries of Syriac translations of Greek texts? 
How do Arabic translations of Greek texts from the Abbasid period differ from contemporaneous translations from Sanskrit into Arabic? Can an anonymous English translation of a modern French novel be identified with known translators from that period? How do present-day translations of official United Nations documents differ across languages? Neither the TAN format nor its applications answer such questions, but they can be used to start answering them. TAN differs from other text formats such as HTML, Microsoft Word, PDF, or Docbook. Each of those formats is interoperable only in the sense that any file can be reliably opened and displayed by the same software. Despite such software compatibility, the content, structured by each user, looks very different from one file to the next. If you received from different people two versions of a particular literary work in the same format, there would be little likelihood that you could align them without a lot of extra work. These are presentation formats, designed to let the creator use his or her imagination to shape, structure, and present the material in highly stylized, creative ways. The formats are laissez faire, concerned mainly to ensure that each component is rendered properly, without regard for the meaning of those components. Creating a text in TAN is like opening a word processor and telling it, "I don't care how the text looks. I want to ensure that it is in a meaningful structure that corresponds to any other version of that text. The appearance, which could take thousands of directions, can be worried about later." The closest analogue to the TAN formats is the XML format developed by the Text Encoding Initiative, whose design catalyzed and continues to inspire the development of TAN. TAN adopts and extends the TEI validation rules, to make them more rigorous and penetrating, to support cross-project interoperability. One of the TAN formats is modestly customized TEI. 
(TAN and TEI are compared in more detail elsewhere in these guidelines.) Some other caveats: Although TAN comes with an extensive library of functions and templates, it is not what most people think of as a tool or application. It does not provide a graphic interface to create, edit, or display TAN-compliant files, nor does it dictate how such tools should behave. Rather, it allows programmers (especially XML developers) to create customized applications and tools. If you are working with an XML editor like oXygen, your editing experience will be greatly enhanced by the TAN function library. The TAN formats are specialized. They are not meant to replace other common text formats such as TEI, Docbook, and so forth, or other alignment formats such as XLIFF or TMX. Converting a TAN file into these formats is usually straightforward, but will usually entail loss. Conversely, most conversions from one of these formats into TAN will not entail loss, but will be imperfect or incomplete, because the TAN format requires data that will be missing, or not easily identifiable. Conversion must be given careful thought, and can only be semiautomated. Each TAN format has a restricted field of inquiry, defined and explained in these guidelines. TAN is not suitable for unsupported research interests, e.g., marking a transcription to imitate its presentation in a particular print edition. TAN files are optimized for legibility and readability, and may be inefficient in certain contexts and applications. The extensive TAN validation routines, essential to aiding interoperability, can be taxing to run on numerous or large files. There are work-arounds, explained in the guidelines. Many applications will perform better when TAN files are pre-processed.

Participation Changes are made regularly to TAN, mainly in its development branch. If you have a TAN library, sharing it with other participants, particularly via Git, will help developers test any changes that have been made to the function library, and encourage others to contribute to your project. Participants in testing, using, and developing the Text Alignment Network are welcome. Our core purpose is to develop and maintain, in ascending order of importance, the schemas, functions, guidelines, and applications. Inquiries about participation should be sent to the project director, Joel Kalvesmaki, by email: director at textalign.net. Official announcements are made by email (Google Group) and by Twitter.

Starting off with the TAN Format If you are new to markup languages, or unfamiliar or uncomfortable with acronyms and technical terms such as XML, RDF, XPath, and Unicode, you should start with this chapter, which uses a simple example to illustrate the steps typically taken to create and edit TAN files, and to gently introduce important technical terms. By the end of this chapter, you will have a sense of how to create and edit a small collection of TAN transcriptions and alignments. In the TAN system, a transcription is a plain digital text that replicates a text found somewhere else, usually reproducing its script and spelling. The following, "In pluribus unum", is a (partial) transcription of a United States dollar. The term should be distinguished from a transliteration, which is a transcription rendered in a script other than the original. For example, εν πλουριμπυς ουνεμ would be a Greek transliteration of the previous transcription. The chapter touches on a number of general concepts that are discussed only briefly. If you find a concept new or confusing, follow the prompts for further reading, to get better grounded in a particular topic or technology. If you are already familiar with basic markup concepts, you should nevertheless at least skim through the chapter, because some familiar concepts get handled by TAN in its own special way.

Creating TAN Transcription and Alignment Data Let us take a simple example, that of aligning two English versions of the nursery rhyme Ring-a-ring-a-roses, sometimes known as Ring around the Rosie. Our goal here is to publish two versions of the nursery rhyme in the TAN format so that they are most likely alignable with any other TAN version of the poem that might appear. Although the TAN examples below look much like files in the examples subdirectory of the TAN library, they have been adjusted, to explain the formats better. We begin by finding previously published versions that haven't been digitized. In this case we have taken an interest in the versions published in 1881 and 1987 (one published in the UK and the other, the US). Each of these books has other rhymes, but we've decided to focus upon one nursery rhyme, so we type up (transcribe) that poem and nothing else:

Ring around the Rosie

    1881 (U.K.) version        | 1987 (U.S.) version
    Ring-a-ring-a-roses,       | Ring-a-round the rosie,
    A pocket full of posies;   | A pocket full of posies,
    Hush! Hush! Hush! Hush!    | Ashes! Ashes!
    We're all tumbled down.    | We all fall down.

We must be sure to save each of the two transcriptions as plain text. Do not bother with a word processor (Word, OpenOffice, Google Docs, and so forth), which is too fancy for our needs. Word processors sometimes generate erroneous data, even when you export to plain text. And we are not concerned with italics, colors, fonts, margins, and so forth. We would be better off with a text editor, which opens and saves only text. But even those do not check to see if the rules of the TAN format have been followed. So the best tool is an XML editor, which like a text editor takes and creates only text. An XML editor is designed to follow the rules of XML, and so saves a lot of typing, and prevents many errors. More important, an XML editor will tell us when our TAN file is invalid, and will provide important help as we edit. Software suitable for your needs comes in many styles and prices. In addition to the links in the paragraph above, you may wish to visit the comparative lists published on Wikipedia for both text editors and XML editors. TAN was developed using oXygen, which is very powerful. If you are a new user, you are likely to find it overwhelming. Take advantage of tutorials and documentation associated with the XML editor you have chosen. Our first task is to get these two versions into separate files with the appropriate markup. Each TAN transcription file has two major parts: a head and a body. For now, we focus on only the second part, the body, as well as a few of the necessary preliminary lines that stand at the opening of the file, before both the head and the body. First, the 1881 (U.K.) 
version:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
    <?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
    <TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2020" id="tag:parkj@textalign.net,2015:ring01">
      <head> . . . . . . . </head>
      <body xml:lang="eng">
        <div type="line" n="1">Ring-a-ring-a-roses,</div>
        <div type="line" n="2">A pocket full of posies;</div>
        <div type="line" n="3">Hush! Hush! Hush! Hush!</div>
        <div type="line" n="4">We're all tumbled down.</div>
      </body>
    </TAN-T>

And now the 1987 (U.S.) version:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
    <?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
    <TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2020" id="tag:parkj@textalign.net,2015:ring02">
      <head> . . . . . . . </head>
      <body xml:lang="eng">
        <div type="l" n="1">Ring-a-round the rosie,</div>
        <div type="l" n="2">A pocket full of posies,</div>
        <div type="l" n="3">Ashes! Ashes!</div>
        <div type="l" n="4">We all fall down.</div>
      </body>
    </TAN-T>

The examples above are eXtensible Markup Language (XML). XML lets you take a text or a collection of data and structure it with angle brackets, < and >. In the examples above, such markup is in boldface. Each file begins with a prolog, the first few lines that begin with <?. The first line simply states that what follows is an XML document. The next two lines in each example are processing instructions that point to the schemas: files that will be used to check to see whether or not our XML follows TAN rules, a process called validation. We will skip the details of those first five lines. 
They will be identical, or nearly so, from one TAN file to the next. We can simply cut and paste them when we want to start a new TAN file. After the prolog comes an opening tag, signified by an angle bracket followed by a letter, here <TAN-T>. That opening tag, <TAN-T...>, is answered by a closing tag, </TAN-T>, the last line. An opening tag and a closing tag mark the beginning and the end of one of the most important parts of an XML document, the element. For now, you can think of an element as a chunk of data. Every element is marked by a pair of tags. In this example, <head> is answered by </head>, <body> by </body>, and each <div...> by </div>. Any element that has an opening tag must have a closing tag. If an element doesn't have anything between its opening and closing tags, the two of them can be collapsed into a single tag. That is, <a></a> can be simplified to <a/> (such empty elements are illustrated below). Elements and processing instructions are two of the seven basic XML ingredients, called nodes. The other five node types are text, comment, attribute, namespace, and document, some of which we will meet below. The element is arguably the most important type of node, because you will see it most often, and it is absolutely required for something to be XML. Every XML file must have at least one element. Elements nest within or beside each other, but they never overlap or interlock. That is, you cannot have <a><b></a></b>. The prohibition on overlapping elements is one of the cardinal rules of XML, and is one of its most discussed aspects. The no-overlap rule keeps XML files tidy, and makes it easier for developers to write efficient applications. Any two nearby elements relate to each other, either by one nesting inside the other, or by one being adjacent to the other. 
Because of this, every XML file can be thought of as a tree, with the root at the trunk and the nested elements as branches, terminating in metaphorical leaves—the elements that do not contain elements. It is helpful to use the tree metaphor when we describe the path we take, toward either the leaves or the root. In these guidelines, we may use the terms rootward and leafward when we want to trace movement up and down the levels of hierarchy in an XML document (you may also hear the corresponding terms outermost and innermost). The metaphor is strengthened by the XML rule that there can be but only one root element, i.e., the element that contains all other elements and is contained by none. In our examples above the root element is TAN-T. An XML document tree can also be profitably thought of as a family. Family names provide the most common terminology to describe how elements relate to each other. In our examples above, <TAN-T> is the parent of <body>, and <body> is the parent of the four <div> elements. Likewise, each <div> is the child of <body>, and <body> is the child of <TAN-T>. Distant parental relationships can be described with the terms ancestor and descendant. <TAN-T> is the ancestor of every element it encompasses, and every element encompassed by <TAN-T> is its descendant. Paratactic relationships are also important. <head> and <body> are siblings to each other, and every <div> is a sibling to every other <div>. The terms "following" and "preceding" are the most common ways to describe the relationship of one sibling to another. Inside of the opening tags for the <TAN-T>, <body>, and <div> elements are stretches of text: a word followed by an equals sign, then something within quotation marks. These stretches of text are called attributes. On the left side of the equals sign is the attribute name, and on the right side, within the quotation marks, is the attribute value. 
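The tree, family, and attribute vocabulary above can be made concrete with a few lines of code. The following is a minimal sketch, not part of the TAN function library, that uses Python's standard xml.etree.ElementTree module to walk the 1881 transcription and report each element's depth, name, and attributes. The empty <head/> is a placeholder standing in for the elided head, and the helper names local_name and outline are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace copied from the @xmlns value in the examples above.
TAN_NS = "tag:textalign.net,2015:ns"

# The 1881 transcription; <head/> is a placeholder for the elided head.
SAMPLE = """<TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2020"
    id="tag:parkj@textalign.net,2015:ring01">
  <head/>
  <body xml:lang="eng">
    <div type="line" n="1">Ring-a-ring-a-roses,</div>
    <div type="line" n="2">A pocket full of posies;</div>
    <div type="line" n="3">Hush! Hush! Hush! Hush!</div>
    <div type="line" n="4">We're all tumbled down.</div>
  </body>
</TAN-T>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to element names."""
    return tag.rsplit("}", 1)[-1]


def outline(element, depth=0):
    """Return one (depth, name, attributes) tuple per element, in document order."""
    rows = [(depth, local_name(element.tag), dict(element.attrib))]
    for child in element:  # iterating an element yields its children
        rows.extend(outline(child, depth + 1))
    return rows


root = ET.fromstring(SAMPLE)
rows = outline(root)
for depth, name, attrs in rows:
    # Indentation grows leafward: root at depth 0, <div> leaves at depth 2.
    print("  " * depth + name, attrs)

# All <div> descendants of the root, found by their namespaced name.
divs = root.findall(f".//{{{TAN_NS}}}div")
```

Each level of indentation in the printed outline is one step leafward, and the attribute dictionaries show the name and value pairs described above.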
<TAN-T> has three attributes, @xmlns, @TAN-version, and @id (when in prose we talk about an attribute, we normally preface the name with @). We will skip @xmlns for now. It looks like an attribute, but it's really a pseudo-attribute, because it specifies the namespace of the XML file. Namespaces are an important but advanced topic, not discussed in this chapter. (See .) The value of @TAN-version indicates that the 2020 version of TAN is being used. @id is quite important. Every TAN file has an @id that uniquely names and permanently identifies the document itself. It should not be changed, even if we make edits. If you change the filename or a copy of it winds up being incorporated into another project, a stable @id will be quite important for finding it. An @id should be unique. The only time it should be repeated in a file is when you are referring to another version of the same file. The value of @id must always be what is called a tag uniform resource name (tag URN). A tag URN begins with tag:, followed by an email address or domain name that we own or owned. It is okay to use an obsolete address or domain; its purpose is to allow users to identify you, perhaps centuries from now, not to contact you, although that might be a nice side benefit. After that email address or domain name comes a comma (no spaces) and a date on which we owned it, in the form of numbers for the year, year + month, or year + month + date, each item joined by hyphens, e.g., 2014-12-31. If we leave off a day value, it is assumed to be the first of the month; if we leave off the month value it is assumed to be January. In the examples above, parkj@textalign.net,2015 points to our fictive self, Jenny Park, who owned that particular email address on the stroke of midnight (Coordinated Universal Time) January 1, 2015. After that comes a colon, and then any name we wish to assign to the file. We have anticipated a simple collection of texts, so we've called the files ring01 and ring02. 
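The parts of a tag URN described above can be separated mechanically. The sketch below is an illustration under simplifying assumptions: the regular expression covers only the shape described in this chapter (tag:, an email address or domain, a comma, a date, a colon, a name) and is not a full implementation of the tag URI specification. The function name parse_tag_urn is hypothetical.

```python
import re
from datetime import date

# Simplified pattern: authority, then year with optional month and day, then name.
TAG_URN = re.compile(
    r"^tag:(?P<authority>[^,:]+),"
    r"(?P<year>\d{4})(?:-(?P<month>\d{2})(?:-(?P<day>\d{2}))?)?"
    r":(?P<name>.+)$"
)


def parse_tag_urn(value):
    """Split a tag URN into its authority, ownership date, and name."""
    match = TAG_URN.match(value)
    if match is None:
        raise ValueError(f"not a recognizable tag URN: {value!r}")
    # A missing month means January; a missing day means the first of the month.
    owned_on = date(
        int(match["year"]),
        int(match["month"] or 1),
        int(match["day"] or 1),
    )
    return {
        "authority": match["authority"],
        "owned_on": owned_on,
        "name": match["name"],
    }


ring01 = parse_tag_urn("tag:parkj@textalign.net,2015:ring01")
print(ring01["authority"], ring01["owned_on"].isoformat(), ring01["name"])
```

Because the name is everything after the second colon, it may itself contain commas or other punctuation without confusing the parser.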
If we run out of names, or want to restart, we can simply use a new email-date preface, e.g., parkj@textalign.net,2015-01-02. Or we could change the way we build our tag URNs. Tag URNs are very useful. You do not need permission to create a tag URN. You don't need to register them with anyone. Hundreds of years from now, when that email will be defunct or perhaps owned by someone else, users might still be able to identify who was responsible for creating the file. And that email address or domain can be recycled by the new owners, decades from now, to create their own tag URNs. The element <body> contains our transcription. @xml:lang, required, specifies the principal language of the transcribed text. We use the standard 3-letter abbreviation for English. We could have used en, but 2-letter abbreviations support only a relative handful of languages. Our transcription has been divided into four <div> elements. How we divide up the work is entirely up to us. But we must make sure that every bit of text is enclosed by a leaf <div> (i.e., one that contains no other <div>). Every <div> must be the parent of only other <div>s, or none at all. No <div> may mix text and other elements. An exception is made for text that is nothing but space (the space bar, the tab, or the new line). Space-only text can be mixed with elements as needed, which means that a TAN file can be indented as you like without changing its meaning. The values of @type and @n indicate, respectively, the type of division and the name of the division. We have used line in the first example, but we could easily have also used l (as we did in the second) or ln or any other phrase that we think will make intuitive sense to other users. The value is arbitrary, but leads to meaning that is not arbitrary (we will see how and why below). We have used Arabic numerals for the values of @n, but the value, once again, could have been anything. 
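The leaf rule for <div> elements described above (text only in leaf <div>s, space-only text ignored) can be checked mechanically. The following is a minimal sketch, not the TAN validation routine itself; the function name mixed_content_divs and the second, deliberately broken sample are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace from the @xmlns value in the examples, in ElementTree's {uri} form.
NS = "{tag:textalign.net,2015:ns}"


def mixed_content_divs(body):
    """Return the @n values of <div>s that mix text or non-<div> elements with children."""
    offenders = []
    for div in body.iter(NS + "div"):
        children = list(div)
        if not children:
            continue  # a leaf <div>: free to hold text
        # Text directly inside this <div>: before the first child or after any child.
        own_text = (div.text or "") + "".join(child.tail or "" for child in children)
        has_text = bool(own_text.strip())  # space-only text is permitted
        has_non_div = any(child.tag != NS + "div" for child in children)
        if has_text or has_non_div:
            offenders.append(div.get("n"))
    return offenders


GOOD = ET.fromstring(
    '<body xmlns="tag:textalign.net,2015:ns" xml:lang="eng">\n'
    '  <div type="line" n="1">Ring-a-ring-a-roses,</div>\n'
    '  <div type="line" n="2">A pocket full of posies;</div>\n'
    "</body>"
)

# Invented counter-example: the outer <div> holds both text and a child <div>.
BAD = ET.fromstring(
    '<body xmlns="tag:textalign.net,2015:ns" xml:lang="eng">'
    '<div type="stanza" n="1">stray text'
    '<div type="line" n="a">Ring-a-ring-a-roses,</div>'
    "</div></body>"
)

print(mixed_content_divs(GOOD))
print(mixed_content_divs(BAD))
```

The check treats indentation and line breaks between elements as harmless, which mirrors the exception for space-only text.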
Here we've opted for a reference system that seems intuitive and will most likely apply to multiple versions of the work. But the Arabic numerals are not required. We could have used Roman numerals, or some other numbering or naming scheme that is standard in the field. Aside from the <head> element (discussed later), that's all we need in the TAN-T transcription. We can now move to alignment and annotation. The TAN-A format allows us to align and annotate as many transcriptions as we wish, and to make claims about them. Let's begin, once again temporarily skipping <head>. Significant differences from the previous two TAN-T files are emphasized:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-A.rnc" type="application/relax-ng-compact-syntax"?>
    <?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-A.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
    <TAN-A xmlns="tag:textalign.net,2015:ns" TAN-version="2020" id="tag:parkj@textalign.net,2015:ring-alignment">
      <head> . . . . . . . </head>
      <body/>
    </TAN-A>

In the prolog, the first line is identical to the first line of our transcription files. The second and third lines, the processing instructions, are identical, except that href points to the validation files specific to the TAN-A format. Even the fourth line looks like the two TAN-T files, other than the new name for the root element, <TAN-A>, and the new value for @id. The penultimate line, <body/>, is an empty element, and is equivalent to an opening tag immediately followed by a closing tag, i.e., <body></body>. The alternative form, <body/>, is a shorter and easier way to indicate that an element contains nothing. It will become apparent, when we discuss <head> below, why our <body> can be empty. The other kind of alignment, TAN-A-tok, takes a bit more work, because we must first identify words that correspond with each other. 
Even before we do that, we need to decide what kind of relationship holds between the two texts. Let us pretend, for the sake of example, that the 1987 version is a direct descendant (and therefore variation) of the 1881 one. So our task is to show exactly what words or phrases in the older version correspond to those of the newer one. We will simplify in this case, and assume an interest only in words with letters, and not punctuation (some linguists legitimately treat punctuation as words in their own right). The term word is notoriously difficult to define, so we will call them tokens, to avoid false connotations (hence the name of the file, TAN-A-tok, to refer to alignment of tokens). We now create a TAN-A-tok file:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-A-tok.rnc" type="application/relax-ng-compact-syntax"?>
    <?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-A-tok.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
    <TAN-A-tok xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:TAN-A-tok,ring01+ring02">
      <head>
        . . . . . . .
      </head>
      <body reuse-type="general_adaptation" bitext-relation="B-descends-from-A">

        <align>
          <tok src="ring1881" ref="1" pos="1"/>
          <tok src="ring1987" ref="1" pos="1"/>
        </align>
        <align>
          <tok src="ring1881" ref="1" pos="2"/>
          <tok src="ring1987" ref="1" pos="2"/>
        </align>
        <align>
          <tok src="ring1881" ref="1" pos="3"/>
          <tok src="ring1987" ref="1" pos="3"/>
        </align>
        <align>
          <tok src="ring1881" ref="1" pos="4"/>
          <tok src="ring1987" ref="1" pos="4"/>
        </align>
        <align>
          <tok src="ring1881" ref="1" pos="5"/>
          <tok src="ring1987" ref="1" pos="5"/>
        </align>

        <align>
          <tok src="ring1881" ref="2" val="A"/>
          <tok src="ring1987" ref="2" val="A"/>
        </align>
        <align>
          <tok src="ring1881" ref="2" val="pocket"/>
          <tok src="ring1987" ref="2" val="pocket"/>
        </align>
        <align>
          <tok src="ring1881" ref="2" val="full"/>
          <tok src="ring1987" ref="2" val="full"/>
        </align>
        <align>
          <tok src="ring1881" ref="2" val="of"/>
          <tok src="ring1987" ref="2" val="of"/>
        </align>
        <align>
          <tok src="ring1881" ref="2" val="posies"/>
          <tok src="ring1987" ref="2" val="posies"/>
        </align>

        <align>
          <tok src="ring1881" ref="3" pos="1, 2"/>
          <tok src="ring1987" ref="3" pos="1"/>
        </align>
        <align>
          <tok src="ring1881" ref="3" pos="3 - 4"/>
          <tok src="ring1987" ref="3" pos="2"/>
        </align>
        <align>
          <tok src="ring1881" ref="4" pos="1"/>
          <tok src="ring1987" ref="4" pos="1"/>
        </align>
        <align>
          <tok src="ring1881" ref="4" pos="2"/>
        </align>
        <align>
          <tok src="ring1881" ref="4" pos="3"/>
          <tok src="ring1987" ref="4" pos="2"/>
        </align>

        <align>
          <tok src="ring1881" ref="4" pos="last-1"/>
          <tok src="ring1987" ref="4" pos="last-1"/>
        </align>
        <align>
          <tok src="ring1881" ref="4" pos="last"/>
          <tok src="ring1987" ref="4" pos="last"/>
        </align>
      </body>
    </TAN-A-tok>

Once again, the first four lines, the prolog and root element, should look familiar, with the only significant changes being the names of the validation files, the name of the root element (<TAN-A-tok>), and the value of @id. 
The heart of the data is <body>, which has two key attributes, @reuse-type, which describes the activity that was performed to change one version into the other, and @bitext-relation, which specifies how one book relates to the other. Our two values, general_adaptation and B-descends-from-A, are arbitrary names that we define in the <head> (discussed later). (The concepts behind reuse types and bitext relations are explained elsewhere in these guidelines.) TAN files may also contain lines that begin with <!-- and end with -->. These are comments, and can be placed within or beside any element, and can enclose any text we like, including line breaks. <body> is the parent of one or more <align> elements, each of which correlates a set of tokens in each of the two texts, pointed to by its <tok> children. Each <tok> has, in this example, three attributes. @src takes a nickname (an @id reference) that points to one of the two transcriptions; we have used ring1881 and ring1987 for our two texts, but we could have just as easily used anything else such as a and b, or uk and us. @ref has a value that points to a specific <div> in the source TAN-T transcription; and @pos or @val specify which token is intended, either by word number (@pos) or text of the actual word (@val). Either technique is fine, and @pos and @val can be mixed, as in the example. It is generally a good idea to use @val, because if the underlying transcription changes in that location, @val might help someone repair it; with @pos alone, you can't. You may also notice that the comma and hyphen can be used in @pos to point to multiple words within the same <div>, and that last and last-X (where X is a digit) can be used to point to a token by position counting from the end of a <div>. Each <align> can establish one-to-one, one-to-many, many-to-one, or many-to-many relationships between tokens from the two texts. A token may feature in multiple <align> elements. 
And if an <align> has <tok> elements belonging to only one source, such as in the fourth-to-last <align> above, we have what is called, in these guidelines, a one-sided alignment. This one-sided alignment indicates that the second word of line four of the 1881 version is excluded from the act that we have called adaptation. If this were a translation, it would be as if we were saying that this word was excluded from the translation. (A one-sided alignment containing tokens only of the later source might point to words that the translator added, i.e., what in translation studies is called explicitation.) A one-sided alignment should not be confused with silence. As creators of this file, we make no claim to providing an exhaustive account, and we are under no obligation to indicate every word-for-word correspondence. If we fail to mention certain words, all that can be implied is that we opted not to say anything about them. We could have aligned the two texts in different ways. Perhaps further study will reveal that we were in error to associate the second "ring" with "round" in line 1. We can make corrections, even after publication, and notify other users of our data about the change. There are also ways to express doubt or alternative opinions, and to credit (or blame) the person making the assertion. We can even correlate fragments of tokens (letters, prefixes, infixes, or suffixes). All these more advanced uses are discussed elsewhere in these guidelines.
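To see how @pos values such as "1, 2", "3 - 4", "last-1", and "last" pick out tokens, it can help to resolve them by hand or in code. The sketch below is only an approximation under stated assumptions: the tokenizer treats each run of word characters as a token (which matches the letters-only simplification adopted in this chapter but is not TAN's own tokenization definition), and the function names tokenize and resolve_pos are hypothetical.

```python
import re

# One position: a number, "last", or "last-X".
ATOM = r"(?:last(?:-\d+)?|\d+)"
# One comma-separated item: a single position or a hyphenated range of two.
ITEM = re.compile(rf"^\s*({ATOM})(?:\s*-\s*({ATOM}))?\s*$")


def tokenize(text):
    """Assumed tokenizer: every run of word characters counts as one token."""
    return re.findall(r"\w+", text)


def _to_index(atom, count):
    """Convert one position to a 1-based index, given the token count."""
    if atom == "last":
        return count
    if atom.startswith("last-"):
        return count - int(atom[len("last-"):])
    return int(atom)


def resolve_pos(pos, count):
    """Expand a @pos value into the list of 1-based token positions it names."""
    positions = []
    for item in pos.split(","):
        match = ITEM.match(item)
        if match is None:
            raise ValueError(f"cannot interpret position {item!r}")
        start = _to_index(match.group(1), count)
        end = _to_index(match.group(2), count) if match.group(2) else start
        if not 1 <= start <= end <= count:
            raise ValueError(f"position {item!r} falls outside 1..{count}")
        positions.extend(range(start, end + 1))
    return positions


line4_1881 = tokenize("We're all tumbled down.")
picked = [line4_1881[i - 1] for i in resolve_pos("last-1", len(line4_1881))]
print(line4_1881, picked)
```

Under this tokenizer the apostrophe splits "We're" into two tokens, which is why the one-sided alignment at position 2 of line four has something to point to.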

The Principles of TAN Metadata (<head>) At this point, we have finished four TAN files: two transcriptions (TAN-T), one macro-alignment file (TAN-A), and one micro-alignment file (TAN-A-tok). We've avoided discussing the <head> in each of them until now. Before getting into details, some important concepts need to be covered first. Unlike <body>, which carries the raw data, <head> contains what is oftentimes called metadata. That is, <head> contains data about the data that is in <body>. Because the TAN format is intended primarily to serve scholars, and because the format is heavily regulated (that is, there are numerous validation rules that supplement the standard XML ones), the metadata requirements are stricter than they are for Word documents, HTML, TEI, or other formats you might know better. Scholars who find our file expect to know some things about it before they can responsibly use it. For example, what are the sources we have used? Who produced the data? When? What changes or adjustments have been made? What licenses govern the use of the data? The questions are not difficult to answer, but they require thought, care, and some time to answer. Some metadata questions apply only to one TAN format. For example, in a TAN-A-tok file, we ask what relationship holds between the two sources. But that question makes no sense for a TAN-T file, which is merely a transcription. Some questions apply universally across all TAN files, no matter what kind of data. The TAN formats have been designed so that <head> handles common metadata consistently across each format. This reduces potential confusion, and helps other people using our data to find the information they want. More important, what we write in one file can be referenced by another, without duplication, and so will reduce the chance of errors. 
Another TAN principle is that each <head> should focus exclusively upon the scope of the data in <body>, and not on other things. For example, in a TAN-T file, we are concerned only about the transcription, so our metadata too should be concerned only with the transcription. We should indicate its source, but because our file is not about the source itself, we don't need to describe it further. We are not library catalogers, nor should we be. A TAN-T file is for transcribing, not for curating bibliographical data. Our obligation is merely to point a reader to complete and authoritative information, found elsewhere. TAN was also designed under the principle that all metadata should be useful to both humans and computers. For our example above, we must describe the work we have chosen (Ring around the Rosie) in a way that is comprehensible not just to the reader but to the computer. Take for example the 1881 book we have used for our first transcription. For the human reader we can write something like "Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]". But this human-readable string is too complex and syntactically opaque for computers and algorithms. A more computer-friendly identifier is the International Standard Book Number (ISBN), which distinguishes the 1984 version of Mother Goose illustrated by Kayoko Okumura from the one of the same year illustrated by William Joyce. The ISBNs for the Okumura version, 0671493159, and for Joyce's, 0394865340, can be converted into machine-actionable strings called uniform resource names (URNs), in this case urn:isbn:0-671493159 and urn:isbn:0-394865340. (Our 1881 version was published before the ISBN program was introduced. We will see below another way to name it.) 
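The ISBN-to-URN conversion mentioned above can be sketched in a few lines of Python. This is only an illustration: the function name isbn10_to_urn is hypothetical, the sketch handles ten-digit ISBNs only, and it emits the digits without hyphens, whereas the examples above keep a hyphen after the first digit. Treat that difference as a simplification.

```python
def isbn10_to_urn(isbn: str) -> str:
    """Validate a ten-character ISBN and return it as a urn:isbn string.

    Hypothetical helper for illustration; hyphens and spaces in the
    input are ignored, and the output carries no hyphens.
    """
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        raise ValueError(f"not a ten-character ISBN: {isbn!r}")
    body, check = chars[:9], chars[9]
    if not body.isdigit() or not (check.isdigit() or check == "X"):
        raise ValueError(f"unexpected characters in ISBN: {isbn!r}")
    # The check character "X" stands for the value 10.
    values = [int(c) for c in body] + [10 if check == "X" else int(check)]
    # ISBN-10 checksum: weights 10 down to 1; the total must divide by 11.
    total = sum(w * v for w, v in zip(range(10, 0, -1), values))
    if total % 11 != 0:
        raise ValueError(f"ISBN checksum failed: {isbn!r}")
    return f"urn:isbn:{chars}"
```

Both ISBNs quoted above pass the checksum, so isbn10_to_urn("0671493159") returns urn:isbn:0671493159, while a mistyped final digit raises a ValueError instead of silently producing a name for the wrong book.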
There are different URNs for different things: journals (via ISSNs, urn:issn:...), articles (DOIs, urn:doi:...), movies (ISANs, urn:isan:...), and so forth, which means that anyone can use them to refer unambiguously to a particular kind of thing. URN naming schemes must be registered with the Internet Assigned Numbers Authority (IANA) to ensure permanent, persistent, unique names for various types of things. (See IANA's registry for a complete list of official URN schemes.) All URNs are simply names. They don't tell you where an object is. To provide a unique location, however, we have the perhaps more familiar uniform resource locators (URLs), e.g., http://academia.edu. Like URNs, URLs are also centrally regulated, with individuals or organizations buying the rights to domain names from a central registry (usually through a third-party vendor). Both URNs and URLs can be thought of as the same type of thing, namely, a uniform resource identifier (URI), sometimes called an internationalized resource identifier (IRI). An IRI is a generalization of the URI that allows characters from any script in Unicode, not just Latin. URIs/IRIs are, in essence, nothing more than the set of all URNs and URLs. These four acronyms are easily confused and conflated, even by veterans. URIs and IRIs are basically the same thing, and they encompass URNs and URLs, a relationship and function that can be remembered by the last letter in each acronym: URIs/IRIs Incorporate both Locators (URL) and Names (URN). If those acronyms are confusing, don't worry. For our purposes, they are pretty much all the same, and from this point onward we'll stick with the term IRI (unless we really mean a location to find a file, which we'll call a URL). IRIs are essential to a system frequently called the semantic web or linked (open) data, which relies upon IRIs as the basis for a simple universal data model. The semantic web allows people to make assertions in a way that computers can "understand." 
If people, working independently, happen to use the same IRIs to describe the same things, then computers can be programmed to make associations between disparate, heterogeneous datasets. For example, if one scholar claims through IRIs that X is the mother of Y, and another claims in a different dataset that Y is the mother of Z, a computer can infer that X is the grandmother of Z, without the two scholars being aware of each other's work. When many scholars begin to use IRIs in their data, the result is a network that allows us or anyone else to discover connections across disciplines and projects, and make inferences that transcend any single project. TAN has been designed to be semantic-web friendly, and so requires in its <head> almost all data to be not just human-readable but also computer-readable, normally as an IRI. Our first task, then, in writing the <head> sections of our four TAN files is to look for IRI vocabulary that will be familiar to those most likely to use our files. In trying to find suitable IRIs, we will find that the persons, things, and concepts we want to describe will range from the highly familiar to the unfamiliar. Highly familiar: The two books that provide the basis of our transcription are catalogued and generally well known. A number of services run by librarians provide controlled IRI vocabularies that can be used by anyone to unambiguously identify a particular version of a book. WorldCat (run by OCLC) and the Library of Congress are good examples. In our case, we have found Library of Congress IRIs for both editions of Mother Goose: http://lccn.loc.gov/12032709 and http://lccn.loc.gov/87042504. Observe that these two IRIs are also, perhaps confusingly, URLs (locations). If we paste these strings into our Web browser, we retrieve a record that describes the book. This locator does not lead us to the book itself, only to information about the book. 
Nevertheless, the Library of Congress has decided to make this URL also a name for the book, which means that it does double duty, both as a location for a Web page and a name for a book. Anyone who owns a domain name can designate a URL as a name for an object, a practice that can easily confuse anyone new to the semantic web, because such URLs name in reality two types of things: an entity and a web resource to learn more about that entity. The idea is that hundreds of years from now, when the web page no longer exists, the name will still be valid. In the TAN system, you can apply as many IRIs to a concept as you like. In fact, it is a good practice to find and add as many IRIs as you think worthwhile, just in case someone can't figure out what you're trying to identify. Just make sure that any IRI you copy unambiguously points to the thing you have in mind. We now have IRIs for the sources. Let's now find an IRI for the work, Ring around the Rosie. The work is widely known, and even has a Wikipedia entry. That Wikipedia entry is a benefit. The Universities of Leipzig and Mannheim and Openlink Software have collaborated on a project called DBPedia, which provides a unique IRI for every Wikipedia entry in the major languages. The DBPedia IRI in this case is http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses. Once again, this is both a name and a locator. It names a specific, intangible, abstract work, namely, a nursery rhyme that we've called Ring around the Rosie, no matter what specific version. But if you put that IRI into your browser, you will get back more information about that named object. Familiar to specialists: We will need to have IRIs for some of the people who edited the file. Here we're not interested in the authors of the books we transcribed. We are interested in identifying the people who helped make the TAN file itself. Most people who write and edit TAN files will not be well-known, public figures. 
If they are, and if they are famous enough to have a Wikipedia entry, then a DBPedia IRI could be used. Or if some of the contributors are also published authors, there is a good chance that they are listed in the databases of either VIAF or ISNI, both of which publish unique IRIs for authors, editors, and other persons central to the publications held in the world's libraries. Most contributors to TAN files, however, will not be listed in these databases. In those cases, we can name these participants with an IRI that we "own." We have already done something like this by assigning tag URNs to our four TAN files (the value of @id in the root element). Our editors can do the same thing. If a student Robin Smith has been helping with proofreading, Robin can take an email address (even one that doesn't work any more) and a date when the email address was used and construct a tag URN such as tag:smith.robin@example.com,2012:self. This has a slight drawback in that we cannot type this string into our browser to find out more about this particular Robin, but it at least allows us to assign a name that will not be confused with another Robin Smith, for example the one identified by ISNI as http://isni.org/isni/0000000043306406. (If we want to go a step further, Robin could mint an IRI from a domain name that she owns, and set up a linked data service that offers more information, human- and computer-readable. But this is not required, and it can be a hassle to set up and maintain.) Let's take a more difficult challenge for locating an IRI, that of describing the @bitext-relation in our TAN-A-tok file. @bitext-relation draws from the discipline of stemmatology, which studies how manuscripts were copied from each other, and tries to place these manuscripts in a chain of transmission, a kind of historical stemma (tree). We have to find an IRI that describes the relationship that we claim holds between two text-bearing objects. 
Making that clear is important, because our perspective about the relationship between the two books affects the decisions we make when we align words, and other scholars using our files will want to know the assumptions we had when we aligned the two texts. For the sake of illustration we posit that the version published in the 1987 Mother Goose is a direct but not immediate descendant of the 1881 version. Because no suitable IRI vocabulary yet exists for the relationships between texts, TAN itself has coined an IRI that can be used by anyone wishing to declare that, given two ordered sources, the second descends from the first through an unknown number of intermediaries: tag:textalign.net,2015:bitext-relation:a/x+/b. (The arbitrary symbol / signifies a step from one version to the next, and the x+ represents one or more intermediate versions.) We'll use that one for now. We face a similar issue when thinking about text reuse, @reuse-type. Here we are concerned with creative activities such as translation, paraphrase, adaptation, and so forth. We generally consider the 1987 version to be an adaptation of the 1881 version. And there are no stable, well-published IRI vocabularies for text reuse. So we adopt an IRI that is part of TAN's standard vocabulary, tag:textalign.net,2015:reuse-type:adaptation:general. In the previous two cases, we could have come up with our own vocabulary. But the idea behind the semantic web is to use common, familiar vocabulary whenever possible. That's the same principle that drew us to structure and label the poem in four consecutively numbered lines. We adopt conventions we expect others will likely follow. The built-in TAN vocabulary simply gives us a convenient lingua franca for describing some important but abstract concepts. For other examples of IRIs coined by TAN, see . Generally unfamiliar: Some things or concepts will be unknown to most people, perhaps even to us. 
If we plan to refer to that thing or concept often, it is preferable to coin a tag URN, as described above. But in some cases, we might find that a tag URN we minted for some concept or thing was, in hindsight, misleading or poorly constructed, because we had only superficially thought about the category. If we wish to avoid such situations, we can assign a randomly generated IRI called a universally unique identifier (UUID), e.g., urn:uuid:3fd9cece-b246-4556-b229-48f22a5ae2e0. UUID URNs are very useful. The likelihood that a randomly generated UUID will be identical to any existing UUID is astronomically small, making them reliably unique names for anything (barring someone copying and reusing that UUID URN to name some other object or concept). Numerous free UUID generators can be found online. To humans, a UUID on its own is meaningless, unmemorable, and rather ugly. But it is a start. We always have the option, later, of supplementing it with other IRIs. It's perfectly fine to assign multiple IRIs to one object or concept. But the reverse is never true. One should never use one IRI to identify more than one object or concept.
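The two kinds of self-minted IRIs discussed in this section, tag URNs and UUID URNs, can each be produced in a line or two of Python. The sketch below is illustrative only; the function names tag_urn and uuid_urn are hypothetical, and the date check assumes the year, year-month, or year-month-day forms.

```python
import re
import uuid


def tag_urn(authority: str, date: str, specific: str) -> str:
    """Build a tag URN of the form tag:<authority>,<date>:<specific>.

    The authority is an email address or domain name that the minter
    controlled on the given date (YYYY, YYYY-MM, or YYYY-MM-DD).
    """
    if not re.fullmatch(r"\d{4}(-\d{2}(-\d{2})?)?", date):
        raise ValueError(f"unexpected date form: {date!r}")
    return f"tag:{authority},{date}:{specific}"


def uuid_urn() -> str:
    """Return a randomly generated UUID as a urn:uuid string."""
    return uuid.uuid4().urn
```

Calling tag_urn("smith.robin@example.com", "2012", "self") reproduces the tag URN used for Robin Smith above, and each call to uuid_urn() returns a fresh value of the same shape as the UUID example in this section.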

Creating TAN Metadata (<head>) Now that we have explored various IRI vocabularies for concepts related to our files concerning Ring-a-ring-a-roses, we can complete the metadata in our four TAN files. Let us start with the TAN-T file of the 1881 version: <TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2020" id="tag:parkj@textalign.net,2015:ring01"> <head> <name>TAN transcription of Ring a Ring o' Roses</name> <master-location href="http://textalign.net/release/TAN-2020/examples/ring-o-roses.eng.1881.xml"/> <license licensor="park"> <IRI>http://creativecommons.org/licenses/by/4.0/</IRI> <name>Attribution 4.0 International</name> </license> <work> <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI> <name>"Ring a Ring o' Roses" or "Ring Around the Rosie"</name> </work> <source> <IRI>http://lccn.loc.gov/12032709</IRI> <name>Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]</name> </source> <vocabulary-key> <person xml:id="park"> <IRI>tag:parkj@textalign.net,2015:self</IRI> <name>Jenny Park</name> </person> <div-type xml:id="line"> <IRI>http://dbpedia.org/resource/Line_(poetry)</IRI> <name>line of poetry</name> </div-type> <role xml:id="creator"> <IRI>http://schema.org/creator</IRI> <name xml:lang="eng">creator</name> </role> </vocabulary-key> <file-resp who="park"/> <resp roles="creator" who="park"/> <change when="2014-08-13" who="park">Started file</change> <to-do/> </head> . . . . . . . </TAN-T> <name>, the human-readable counterpart to the @id that is inside the root element, can be anything. And we can supply more than one <name>, in case we wish to provide alternative names of the file in different spellings or languages. One or more <master-location>s provide URLs where master versions of the file are kept (and maintained). We provide this as a courtesy to others who might be using our data. 
Anyone who validates their local copy of the file will be warned if it does not match the master version, and they will be told of the most recent changes. This lets us silently and conveniently notify other users of changes. We do not have to keep track of the users of our file, and users do not have to pester us with questions about what changed when. <master-location> is mandatory only if we are finished with our to-do list, which is specified at <to-do>. If that element is empty, then we imply that we do not know of anything further that should be done to the file. Conversely, any elements in <to-do> specify what remains to be done, and details will be returned to other users. That way you can release data that is useful but not completely perfect, and let users know about its deficiencies. One day the link in <master-location> will be dead. But perhaps a copy of our file will be in circulation in other quarters. The document @id in the root element provides a way to identify and find files, independent of links. <license> specifies the license under which we are releasing our data. This element has nothing to do with the copyright of the source we have used (although, having been published in 1881, the book is clearly in the public domain). That is, we are specifying what rights are attached to the data, not its source, i.e., if we have placed additional strictures on the content in <body>. In this example, we have released the data under a Creative Commons license. The child element <IRI> specifies a Creative Commons IRI, and <name> is the human-readable form. @licensor specifies who has granted the license, in this case our fictive Jenny Park (see below). The conjunction of <IRI> and <name>, the IRI + name pattern, recurs throughout TAN files. They are used to provide identifiers for vocabulary items. In an element that takes the IRI + name pattern, we may include as many children <IRI>s or <name>s as we like. 
But if we do so, we are stating that they are synonymous, i.e., that they all name the same thing. (Once again, an IRI is unique, so it should never be used to identify more than one thing.) <work> uses the IRI + name pattern to name the work we have chosen to transcribe. <source> points, through its IRI + name pattern, to a computer- and human-readable description of the book we have chosen. <vocabulary-key> contains vocabulary that we are using in our file. Inside, we can place more vocabulary items, and attach locally unique ids. For example, an IRI + name pattern is used for <person>, which identifies through a tag URN Jenny Park. The value of @xml:id allows us to use park any time we want to mention Jenny. In fact, we already have, at @licensor. Any mention of park will point to the appropriate item in <vocabulary-key>. There are a few other parts of <vocabulary-key>. <div-type> specifies an IRI + name pattern for line divisions, and the value of @xml:id means that we can use line any time we want to invoke the concept. Similarly we have a <role>. The <IRI> value of <role> comes from the vocabulary of schema.org, which is maintained by Bing, Google, and Yahoo! in conjunction with the W3C (the nonprofit organization dedicated to universal Internet standards), but we could have used Dublin Core or some other IRI vocabulary describing behaviors, responsibilities, and roles. After the <vocabulary-key>, we get into parts of the file that specify who did what, when. First is a <file-resp>, whose value of @who, park, indicates that Jenny Park is the one primarily responsible for the file. <resp> specifies further who was responsible for doing what. If you decide to modify someone else's TAN file, you should credit / blame yourself for the changes. Your first point of order should be to add a <person> to the <vocabulary-key>, identifying yourself. You can then either add a <change> (see below) or a <resp> (you might need to specify a <role> in the <vocabulary-key>). 
You should not change the document's @id, unless your changes are so significant that it becomes altogether a new document. TAN does not try to broker the age-old problem of determining when a thing that undergoes changes becomes something altogether different. Use your best intuition. Remember that <head> is focused on the data, not its sources, so the claim that Jenny Park is the creator pertains only to the data. No inference should be made about who was responsible for the printed source. If someone wants to know anything about the book, they should pursue the IRI identifier we have provided under <source>. <change> has attributes @when and @who to specify who made the change and when. The value of @when is always a date or a date + time, formatted according to the ISO standard syntax: [YYYY]-[MM]-[DD] or [YYYY]-[MM]-[DD]T[HH]:[MM]:[SS]. @who always carries an IDref that points to a person or organization. <change> does not take the IRI + name pattern, or even any children at all. So now we have finished one transcription file's metadata. The next one will look similar, but we'll take a couple of shortcuts:<TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2020" id="tag:parkj@textalign.net,2015:ring02"> <head> <name>TAN transcription of Ring around the Rosie</name> <master-location>ring-o-roses.eng.1987.xml</master-location> <license which="by 4.0" licensor="park"/> <work> <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI> <name>Ring around the Rosie</name> </work> <source> <IRI>http://lccn.loc.gov/87042504</IRI> <name>Mother Goose, from nursery to literature / by Gloria T. 
Delama, 1987.</name> </source> <adjustments> <normalization which="no hyphens"/> </adjustments> <vocabulary-key> <div-type xml:id="l" which="line (verse)"/> <person xml:id="park" roles="creator"> <IRI>tag:parkj@textalign.net,2015:self</IRI> <name xml:lang="eng">Jenny Park</name> </person> </vocabulary-key> <resp roles="creator" who="park"/> <change when="2014-10-24" who="park">Started file</change> <comment when="2014-10-24" who="park">See p. 39 of source.</comment> <to-do/> </head> . . . . . . </TAN-T> In this example, <name>, <master-location>, and <source> have been modified to describe this file. Note, we haven't had to change <work>. <license> looks different, but in reality it is identical to our previous example, and that is because the IRI + name pattern has been replaced with @which. You may replace any IRI + name pattern with @which; its value should match a <name> in customized or standard vocabulary (a TAN-voc file). In TAN's standard vocabulary for licenses (see ) is the following item: <TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2020" id="tag:textalign.net,2015:tan-voc:licenses"> . . . . . . . <body affects-element="license"> <item> <IRI>http://creativecommons.org/licenses/by/4.0/</IRI> <IRI>tag:textalign.net,2015:license:by/4.0/</IRI> <name>by 4.0</name> <desc>attribution 4.0 international</desc> </item> . . . . . . . </body> </TAN-voc> Because the validation rules for TAN-voc files require every <name> to be unique, that element can be treated as a unique identifier, similar to @xml:id. We could have repeated the <license> from the previous TAN-T file. But the @which method is much quicker and cleaner. Before <vocabulary-key> comes a new element, <adjustments>, which contains a <normalization> statement whose @which says no hyphens. 
That too points to a standard TAN vocabulary for normalizations that provides an item with an IRI + name pattern for eliminating discretionary hyphens (see ): <TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2020" id="tag:textalign.net,2015:tan-voc:normalizations"> . . . . . . . <body affects-element="normalization"> <item> <IRI>tag:textalign.net,2015:normalization:hyphens-discretionary-removed</IRI> <name>no hyphens</name> <desc>Discretionary word-break line-end hyphens have been deleted.</desc> </item> . . . . . . . </body> </TAN-voc> As you might have inferred, the element <normalization> specifies how we have changed the data, namely, that we have opted to remove word-break line-end hyphenation. In other transcriptions we could use <normalization> to declare other kinds of changes we felt compelled to make, such as removing editorial comments or footnote signals. A healthy list of <normalization>s is a courtesy to users of our data, some of whom might passionately care about keeping or removing line-end hyphenation. Back to our example. <div-type> has a new value for @xml:id, the letter l, and in it too the IRI + name pattern has been replaced by @which, whose value, line (verse), is a standard vocabulary item (see ). There is also a new <comment> element, which is built much the same as <change>. (A <change>, after all, is just a comment about what has been changed.) That seems to be all there is. But if you've been attentive, you will have noticed that <role> from our first TAN-T file (inside <vocabulary-key>) is missing. That's because we don't need it, based on the same principle that lets us resolve @which. A vocabulary <name> can be invoked not only in @which, but in any attribute that points to values of @xml:id, in this case @roles. There is already a standard TAN vocabulary item with the <name> creator, so we can use it directly without having to go through an intermediate vocabulary item with an @xml:id. 
If we had defined something else in <vocabulary-key> with a @xml:id of creator, that item would take precedence and override the built-in TAN vocabulary. But we haven't, so the standard TAN vocabularies are the default.
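The precedence rule just described, in which a local @xml:id overrides a standard vocabulary <name> of the same value, can be pictured as a two-step lookup. The Python sketch below is an illustration under stated assumptions, not the TAN function library: the function name resolve_idref, the dictionaries, and the example IRIs are hypothetical, and matching is by exact string only.

```python
def resolve_idref(value: str,
                  local_ids: dict[str, str],
                  standard_names: dict[str, str]) -> str:
    """Resolve an attribute value such as the one in @roles to an IRI.

    local_ids maps @xml:id values declared in <vocabulary-key> to IRIs;
    standard_names maps standard vocabulary <name>s to IRIs. Both
    mappings are hypothetical stand-ins for the real vocabularies.
    """
    # A locally declared @xml:id takes precedence.
    if value in local_ids:
        return local_ids[value]
    # Otherwise fall back on a standard vocabulary <name>.
    if value in standard_names:
        return standard_names[value]
    raise KeyError(f"no vocabulary item found for {value!r}")


# Placeholder IRIs, invented for this example only.
STANDARD = {"creator": "tag:example.org,2020:standard:creator"}
LOCAL = {"creator": "tag:example.org,2020:local:creator"}
```

With an empty local mapping, resolve_idref("creator", {}, STANDARD) returns the standard item; passing LOCAL instead returns the local item, which mirrors the override behavior described above.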

Building TAN Vocabulary The first TAN-T transcription had a longer <head> than the second one did, and that is because for the former we used an explicit method, that of specifying every IRI and name, and then in the latter adopted shortcuts that took advantage of TAN vocabulary. TAN vocabularies are meant not merely to be a convenience; they are intended to avoid problems that beset projects that create many files with repeated data patterns. When (not if) you make changes to one file you have to remember all the other places where you might need to make the same changes. The old programmer's adage "Don't repeat yourself" (DRY) is operative here. If there is a repeating data pattern, put it in one master place, and let the other files point to that pattern. When we make changes, we do so only at a single place. The previous examples drew from standard TAN vocabulary, which is written in one of the other TAN formats, TAN-voc. There is a whole collection of standard TAN-voc files in the project subdirectory called vocabularies. We can write our own TAN-voc files, to collect the vocabulary items that we will use repeatedly from one file to the next. 
For example: <?xml version="1.0" encoding="UTF-8"?> <?xml-model href="../../schemas/TAN-voc.rnc" type="application/relax-ng-compact-syntax"?> <?xml-model href="../../schemas/TAN-voc.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?> <TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2020" id="tag:parkj@textalign.net,2015:TAN-voc:standard"> <head> <name>Keywords for TAN files edited by Jenny Park</name> <license licensor="park" which="by 4.0"/> <vocabulary-key> <person which="Jenny Park" xml:id="park"/> </vocabulary-key> <file-resp who="park"/> <resp roles="creator" who="park"/> <change when="2019-10-08" who="park">Started file</change> <to-do> <comment when="2020-01-04" who="park">Need to check files for new vocabulary items.</comment> </to-do> </head> <body> <group affects-element="person"> <item> <IRI>tag:parkj@textalign.net,2015:self</IRI> <name xml:lang="eng">Jenny Park</name> </item> </group> <item affects-element="work"> <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI> <name>Ring a Ring o' Roses</name> <name>Ring Around the Rosie</name> </item> </body> </TAN-voc> In this example case, updates have been made to @id and <name>, and a <comment> has been added to <to-do>. The most significant difference is the <body>, which has two <item>s, one of which is wrapped in a <group>. Each @affects-element specifies one or more names of elements that the enclosed items affect, and the <item>s have the standard IRI + name pattern. <group>s may nest as you like. The difference between a grouped and ungrouped <item> is purely a matter of taste and convenience. The example above illustrates both methods. The <vocabulary-key> has a <person> whose @which points to the body of the first <item>. That is, a TAN-voc file can use its own vocabulary, without repeating it in <vocabulary-key>. Let's return to the <head>s of our two TAN-T files, and see how to incorporate our new TAN-voc vocabulary file. 
<TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2020" id="tag:parkj@textalign.net,2015:ring01"> <head> <name>TAN transcription of Ring a Ring o' Roses</name> <master-location href="http://textalign.net/release/TAN-2020/examples/ring-o-roses.eng.1881.xml"/> <license which="by 4.0" licensor="park"/> <work which="Ring around the Rosie"/> <source> <IRI>http://lccn.loc.gov/12032709</IRI> <name>Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]</name> </source> <vocabulary> <IRI>tag:parkj@textalign.net,2015:TAN-voc:standard</IRI> <name>Vocabulary for TAN files edited by Jenny Park</name> <location href="TAN-voc/park-projects.TAN-voc.xml" accessed-when="2020-01-10"/> </vocabulary> <vocabulary-key> <person xml:id="park" which="Jenny Park"/> <div-type xml:id="line" which="line (verse)"/> </vocabulary-key> <file-resp who="park"/> <resp roles="creator" who="park"/> <change when="2014-08-13" who="park">Started file</change> <to-do/> </head> . . . . . . . </TAN-T> <TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2020" id="tag:parkj@textalign.net,2015:ring02"> <head> <name>TAN transcription of Ring around the Rosie</name> <master-location>ring-o-roses.eng.1987.xml</master-location> <license which="by 4.0" licensor="park"/> <work which="Ring around the Rosie"/> <source> <IRI>http://lccn.loc.gov/87042504</IRI> <name>Mother Goose, from nursery to literature / by Gloria T. Delama, 1987.</name> </source> <vocabulary> <IRI>tag:parkj@textalign.net,2015:TAN-voc:standard</IRI> <name>Vocabulary for TAN files edited by Jenny Park</name> <location href="TAN-voc/park-projects.TAN-voc.xml" accessed-when="2020-01-10"/> </vocabulary> <adjustments> <normalization which="no hyphens"/> </adjustments> <vocabulary-key> <div-type xml:id="l" which="line (verse)"/> <person xml:id="park" which="Jenny Park"/> </vocabulary-key> <resp roles="creator" who="park"/> <change when="2014-10-24" who="park">Started file</change> <comment when="2014-10-24" who="park">See p. 
39 of source.</comment> <to-do/> </head> . . . . . . </TAN-T> In each TAN-T file, a new <vocabulary> points to the project TAN-voc vocabulary file we have just created. Along with the customary IRI + name pattern is a new element, <location>, which specifies where the digital file was accessed and when (through @accessed-when). We may include as many of these <location> elements as we wish, with the most preferred or reliable one at the top. The validation process will consult only the first one that leads to an available document. The @accessed-when value is important, because the validator will look for changes in the file since we last accessed it, and if any changes are found a warning with a summary of the changes will be returned. It is then up to us to determine if the alterations merit any action on our part. Similarly, anyone using or depending upon our file will be notified of any changes we make, through the same validation process. Once the <vocabulary> is in place, we can draw from our predefined vocabulary. Hence, these revised versions of the <head>s are a bit more compact and easier to read. The longer the TAN file, the more noticeable the improvement. And when our library grows into dozens of files, we'll be grateful that a change that affects all the files needs to be made only once. Now that we have created the metadata for our transcriptions, let's turn to the alignment files. Those <head>s will look slightly different, because they are not concerned with transcriptions per se. 
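The two behaviors of <location> described above, taking the first location that leads to an available document and flagging changes made after @accessed-when, can be sketched in Python as follows. This is a rough illustration, not the TAN validator: the function names are hypothetical, availability is delegated to a caller-supplied test, and the sketch assumes that changes are detected by comparing plain-date @when values of <change> elements against @accessed-when.

```python
from datetime import date
from typing import Callable, Iterable, Optional


def first_available(hrefs: Iterable[str],
                    is_available: Callable[[str], bool]) -> Optional[str]:
    """Return the first href, in document order, that can be fetched."""
    for href in hrefs:
        if is_available(href):
            return href
    return None


def changes_since(accessed_when: str,
                  change_dates: Iterable[str]) -> list[str]:
    """List the change dates that fall after the @accessed-when date.

    Assumption: every value is a plain ISO date (YYYY-MM-DD); values
    with a time or time zone are outside the scope of this sketch.
    """
    cutoff = date.fromisoformat(accessed_when)
    return [d for d in change_dates if date.fromisoformat(d) > cutoff]
```

For example, if the target file records changes on 2019-10-08 and 2020-02-01 and our @accessed-when is 2020-01-10, only the second change would be reported back to us.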
We start with the TAN-A file:<TAN-A xmlns="tag:textalign.net,2015:ns" TAN-version="2020" id="tag:parkj@textalign.net,2015:ring-alignment"> <head> <name>div-based alignment of multiple versions of Ring o Roses</name> <master-location href="http://textalign.net/release/TAN-2020/examples/TAN-A/ringoroses.div.1.xml"/> <license which="by_4.0" licensor="park"/> <source xml:id="eng-uk"> <IRI>tag:parkj@textalign.net,2015:ring01</IRI> <name>Transcription of ring around the roses in English (UK)</name> <location href="../ring-o-roses.eng.1881.xml" accessed-when="2015-03-10"/> </source> <source xml:id="eng-us"> <IRI>tag:parkj@textalign.net,2015:ring02</IRI> <name>Transcription of ring around the roses in English (US)</name> <location href="../ring-o-roses.eng.1987.xml" accessed-when="2014-08-13"/> </source> <vocabulary-key> <person xml:id="park" which="Jenny Park"/> </vocabulary-key> <resp who="park" roles="creator"/> <change when="2014-08-14" who="park">Started file</change> <to-do> <comment when="2018-08-09-04:00" who="park">Finish file.</comment> </to-do> </head> . . . . . . </TAN-A> Much of the code above will look similar to the previous two examples. The file's <name> and <master-location> are updated. Just like TAN-T files have <source>s, so TAN-A files do as well, except that those sources are always TAN-T transcription files, and they take the IRI + name + location pattern we saw above in <vocabulary>. Because alignment files take only TAN transcription files as sources, each <source>'s <IRI> always takes the @id value of the target TAN-T transcription file. <name> is arbitrary. It may replicate exactly the title found in the transcription file, or it may be modified, perhaps to harmonize better with the descriptions of the other source names. Our TAN-A file could have any number of <source>s, and not necessarily for the same work. The order in which we put the <source>s does not necessarily mean anything. 
This <head> explains why the <body> of our TAN-A file is allowed to be empty. We have already specified which sources are to be aligned and where they are to be found. Any user or processor of a TAN-A file may assume that every <div> in every source should be automatically aligned upon the basis of shared values of @n. Meanwhile we turn to our fourth file, TAN-A-tok, whose <head> might look like this:<TAN-A-tok xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:TAN-A-tok,ring01+ring02"> <head> <name>token-based alignment of two versions of Ring o Roses</name> <master-location href="http://textalign.net/release/TAN-2020/examples/TAN-A-tok/ringoroses.01+02.token.1.xml"/> <license which="by-nc-nd_4.0" rights-holder="park"/> <token-definition src="ring1881 ring1987" which="letters"/> <source xml:id="eng-uk"> <IRI>tag:parkj@textalign.net,2015:ring01</IRI> <name>Transcription of ring around the roses in English (UK)</name> <location href="../ring-o-roses.eng.1881.xml" accessed-when="2015-03-10"/> </source> <source xml:id="eng-us"> <IRI>tag:parkj@textalign.net,2015:ring02</IRI> <name>Transcription of ring around the roses in English (US)</name> <location href="../ring-o-roses.eng.1987.xml" accessed-when="2014-08-13"/> </source> <vocabulary-key> <bitext-relation xml:id="B-descends-from-A" which="a/x+/b"/> <token-definition src="ring1881 ring1987" which="letters"/> <person xml:id="park" which="Jenny Park"/> </vocabulary-key> <change when="2015-01-20" who="park">Started file</change> </head> . . . . . . </TAN-A-tok> The TAN-A-tok <head> looks similar to the previous examples, except that <vocabulary-key> has some new content. <bitext-relation> states through @which or an IRI + name pattern the stemmatic relationship we think holds between the two sources. 
We have used @which and the value a/x+/b, pointing to a standard TAN vocabulary item for bitext relations: <TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2020" id="tag:textalign.net,2015:tan-voc:bitext-relation"> . . . . . . <item> <IRI>tag:textalign.net,2015:bitext-relation:a/x+/b</IRI> <name>a/x+/b</name> <desc>direct descent, B descends from A, one or more mediaries</desc> </item> . . . . . . </TAN-voc> <token-definition> specifies how we have defined our word tokens. @src has more than one value, specifying that the same tokenization rule should be applied to both sources. @which points to this standard TAN vocabulary item: <TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2020" id="tag:textalign.net,2015:tan-voc:tokenizations"> . . . . . . <item> <token-definition pattern="[\w&#xad;&#x200b;&#x200d;]+"/> <name>letters</name> <name>letters only</name> <name>general word characters only</name> <name>general ignore punctuation</name> <name>gwo</name> <desc>General tokenization pattern for any language, words only. Non-letters such as punctuation are ignored.</desc> </item> . . . . . . </TAN-voc> Up until now, all vocabulary items have taken the IRI + name pattern. The one above does not have an IRI, only a <token-definition> with a @pattern. The value of @pattern, which may look like gibberish, is a regular expression. "Regular" here does not mean ordinary; rather it derives from the Latin regula, rule. Regular expressions are rule-based patterned text searches. This particular pattern says that a token is defined as any contiguous string of word characters (\w), soft hyphens (&#xad;), zero-width spaces (&#x200b;), or zero-width joiners (&#x200d;). This is TAN's default tokenization pattern, and it will be assumed for any TAN-A-tok file that lacks a <token-definition>. TAN adopts this default because in ordinary conversation, when we refer to the nth word in a sentence, we most often ignore punctuation marks. Token definitions are treated in more detail elsewhere in these guidelines. 
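The tokenization rule just described can be tried out in a few lines of Python. This is only a sketch: the names TOKEN_PATTERN and tokenize are illustrative and are not part of the TAN function library, and Python's \w is an approximation of the \w used by the XPath/XSLT regular expressions on which TAN relies, so edge cases (underscores, some symbols) may differ.

```python
import re

# Assumed rendering of the "letters" pattern: word characters plus
# soft hyphen (U+00AD), zero-width space (U+200B), zero-width joiner (U+200D).
# Python's \w only approximates the XPath \w class.
TOKEN_PATTERN = re.compile(r"[\w\u00ad\u200b\u200d]+")


def tokenize(text: str) -> list[str]:
    """Return every contiguous run of token characters, ignoring punctuation."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    # Punctuation and spaces separate tokens; a soft hyphen does not.
    print(tokenize("Ring-a-ring o' roses,"))
    print(len(tokenize("ro\u00adses")))
```

Under this rule the hyphens and the apostrophe in the sample line act as token boundaries, which is why a token-based alignment can refer to "the third word" without counting punctuation.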
In our <vocabulary-key> we could have also included a <reuse-type>, but we have intentionally omitted it here, because we have <body bitext-relation="B-descends-from-A" reuse-type="general_adaptation">. The value for @reuse-type, general_adaptation, corresponds to a <name> in a standard TAN vocabulary item for reuse types. We don't need to invoke a <reuse-type> in the <vocabulary-key> because we have opted not to give it an @xml:id. Notice that general_adaptation has an underscore instead of a space. That's because <reuse-type> can take multiple values, which are signified by spaces. We could have used a hyphen instead of an underscore, if we preferred. The values of <name> are never case-sensitive, and the space, hyphen, and underscore are treated as equivalent.
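The matching rule for <name> values (case-insensitive, with space, hyphen, and underscore treated as equivalent) can be illustrated with a short Python sketch. The function names below are hypothetical and are not taken from the TAN function library; the sketch only shows the comparison logic as described above.

```python
def normalize_name(name: str) -> str:
    """Fold case and treat space, hyphen, and underscore as the same separator."""
    folded = name.lower().replace("-", " ").replace("_", " ")
    return " ".join(folded.split())


def split_attribute_values(value: str) -> list[str]:
    """Spaces separate multiple values, so each value must avoid raw spaces."""
    return [normalize_name(v) for v in value.split()]


if __name__ == "__main__":
    print(normalize_name("general_adaptation") == normalize_name("General Adaptation"))
    print(split_attribute_values("general_adaptation B-descends-from-A"))
```

This is why a multi-word name is written with an underscore or hyphen inside an attribute: a literal space would be read as a boundary between two separate values.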

Aligning across Projects We now have a collection of five TAN files: two TAN-T transcriptions, a TAN-A alignment/annotation file, a TAN-A-tok word-for-word alignment file, and a TAN-voc file for vocabulary shared across the files. Let us imagine what it might be like to connect our TAN collection to a TAN file made by someone else. Let us assume that we have found elsewhere, in a German project, a TAN transcription of a work that looks quite similar to our own:<?xml version="1.0" encoding="UTF-8"?> <?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?> <?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?> <TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:hans@beispiel.com,2014:ringel"> <head> <name>TAN Transkription, Ringelreihen mit Riederfallen</name> <master-location>http://beispiel.com/TAN-T/ringel.xml</master-location> <license> <IRI>http://creativecommons.org/licenses/by/4.0/</IRI> <name>Creative Commons Namensnennung 4.0 International Lizenz</name> <desc>Dieses Werk ist lizenziert unter einer Creative Commons Namensnennung 4.0 International Lizenz.</desc> </license> <licensor who="schmidt"/> <work> <IRI>tag:beispiel.com,2014:texte:holderbusch</IRI> <name>"Die Kinder auf dem Holderbusch"</name> </work> <version> <IRI>urn:uuid:31648039-3dbb-49b9-b66e-9bd2cd11630e</IRI> <name>zweite Version</name> </version> <numerals priority="letters"/> <source> <IRI>http://www.worldcat.org/oclc/4574384</IRI> <name>Franz Magnus Böhme, Deutsches Kinderlied und Kinderspiel: Volksüberlieferungen aus allen Landen deutscher Zunge, gesammelt, geordnet und mit Angabe der Quellen. 
Leipzig, 1897.</name> </source> <adjustments> <normalization> <IRI>tag:kalvesmaki@gmail.com,2014:normalization:hyphens-discretionary-off</IRI> <name>Keine Bindestriche</name> </normalization> </adjustments> <vocabulary-key> <div-type xml:id="Zeile"> <IRI>http://dbpedia.org/resource/Gedichtzeile</IRI> <name>Gedichtzeile</name> </div-type> <div-type which="poem" xml:id="Gedicht"/> <person xml:id="schmidt" roles="Produzent"> <IRI>tag:hans@beispiel.com,2014:selbst</IRI> <name xml:lang="eng">Hans Schmidt</name> </person> <role xml:id="Produzent"> <IRI>http://schema.org/producer</IRI> <name xml:lang="eng">Produzent</name> </role> </vocabulary-key> <file-resp who="schmidt"/> <resp who="schmidt" roles="Produzent"/> <change when="2014-08-13" who="schmidt">Anfang</change> <comment when="2014-08-13" who="schmidt">unten auf der Z. 438, recht</comment> <to-do/> </head> <body xml:lang="deu"> <div type="Gedicht" n="1"> <div type="Zeile" n="a">Ringel, Ringel, Reihe!</div> <div type="Zeile" n="b">Sind der Kinder dreie,</div> <div type="Zeile" n="c">Sitzen auf dem Holderbuch,</div> <div type="Zeile" n="e">Schreien alle: husch, husch, husch!</div> </div> </body> </TAN-T> It seems that this 19th-century German version is quite similar to our two English versions. We have some alignment options open to us. Two more sets of word-for-word alignments would be interesting, but remember, just because we find a text that nicely aligns with others does not mean that we must align them, or that for a given alignment we must align everything. In this case, we choose not to worry about word-for-word alignments, and we focus here only on the TAN-A alignment, so that, for example, we can use the built-in TAN application to display the three versions in parallel, to study more closely the relationships between them. To that end, we first observe some differences between this transcription and our other two. First, the value of <work> is not the one we have given our two versions. 
Second, <numerals> specifies by its value for @priority that any ambiguous numerals should be interpreted as letter numerals, not Roman (that's important, e.g., for a <div> with an @n value c, which could be interpreted to mean 3 or the Roman numeral for 100). Next, the lines are wrapped in a <div> for the whole poem (Gedicht) and they have been lettered instead of numbered. And last, the editor seems to have made a typographical error, making the last line e instead of the expected d. These five differences typify inconsistencies one commonly finds in digital texts from different projects of the same work. There are a few other differences in this third transcription that do not affect our alignment. <version> is used to distinguish different versions of the same work found on the same text-bearing object. That is, if we are transcribing a bilingual edition, we can use <version> to specify which of the two versions we are encoding. Notice that the <IRI> value is a UUID. In this case the editor was not prepared to deploy a formal IRI naming scheme (perhaps using a tag URN) that would be satisfactory for work-versions. Also, the <div-type> is defined as http://dbpedia.org/resource/Gedichtzeile (Gedichtzeile = line of poetry), so it doesn't intersect with our IRIs for the vocabulary item line. But <div-type> is not used to align versions, and validation isn't affected, so we do not concern ourselves here with trying to reconcile the different IRIs. These are points we can easily reconcile in our TAN-A file, which we now expand to include the German version. 
We make the following adjustments (emphasized):<TAN-A xmlns="tag:textalign.net,2015:ns" TAN-version="2020" id="tag:parkj@textalign.net,2015:ring-alignment"> <head> <name>div-based alignment of multiple versions of Ring o Roses</name> <master-location href="http://textalign.net/release/TAN-2020/examples/TAN-A/ringoroses.div.1.xml"/> <license which="by_4.0" licensor="park"/> <source xml:id="eng-uk"> <IRI>tag:parkj@textalign.net,2015:ring01</IRI> <name>Transcription of ring around the roses in English (UK)</name> <location href="../ring-o-roses.eng.1881.xml" accessed-when="2015-03-10"/> </source> <source xml:id="eng-us"> <IRI>tag:parkj@textalign.net,2015:ring02</IRI> <name>Transcription of ring around the roses in English (US)</name> <location href="../ring-o-roses.eng.1987.xml" accessed-when="2014-08-13"/> </source> <source xml:id="ger"> <IRI>tag:beispiel.com,2014:ringel</IRI> <name>Transcription of an ancestor of Ring around the roses in German</name> <location accessed-when="2014-08-22">http://beispiel.com/TAN-T/ringel.xml</location> <location accessed-when="2014-08-22">../TAN-T/ring-o-roses.deu.1897.xml</location> </source> <adjustments src="ger"> <skip div-type="Gedicht"/> <rename n="e" by="-1"/> </adjustments> <vocabulary-key> <person xml:id="park" which="Jenny Park"/> <alias id="ring" idrefs="ger eng-us"/> </vocabulary-key> <resp who="park" roles="creator"/> <change when="2014-08-14" who="park">Started file</change> <change when="2014-08-22" who="park">Added German version.</change> <to-do> <comment when="2018-08-09-04:00" who="park">Finish file.</comment> </to-do> </head> . . . . . . </TAN-A> The first major change is the insertion of a third <source>, pointing to the new file and specifying its name and IRI. Note that two <location>s have been provided, one for the original and another for a local copy we have saved. Validation will take into account only the first document available. 
If we wanted to work primarily off our local copy, we would have put that <location> first. By placing it second, we allow the validation engine to work primarily off the master version and therefore look for updates and changes. If that version is unavailable, validation will be made against the second, local copy. <adjustments> specifies through its @src that only the German version should be adjusted by the contained instructions. The enclosed <skip> says, in effect, to ignore the wrapping <div> for purposes of alignment. The <rename> takes care of the apparent typographical error, and anchors the German version to the U.S. one. Note that the German version uses e, but we have used 5. But we could have used e, or even the Roman numeral v, had we wished to. Every TAN file's numeration system is evaluated locally, independent of any external files. We need not reconcile the a, b, and c @n values in the German version, because these will be automatically treated as equivalent to 1, 2, and 3. The TAN format supports four numeration systems other than Arabic numerals: Roman numerals (uppercase or lowercase), alphabetic numerals (a, b, c, ..., z, aa, bb, ....), and digit-alphabet combinations (e.g., 1a, 1e, 4g) or alphabet-digit combinations (e.g., a4, a5, b5). The last two systems are interpreted as a two-tier numbering system. The second major change, to address the German version's different value of <work>, is the addition of an <alias>, which allows us to assign one or more vocabulary items a common id. Wherever the value ring is used, it stands in for ger and eng-us, which point to the two TAN-T files. Every TAN-T file has only one work and only one written source. So if you wish to make a claim about a particular work or source, you can use a TAN-T's id as a surrogate. So if we make claims in our TAN-A file about a written source or a work, ring would assert the claim to be true for the works pointed to by the German and the U.S. version. 
(We do not need to specifically mention eng-uk in the <alias>, since it has the same work IRI as the U.S. version does.) Alternatively, instead of <alias>, we could simply have adjusted our TAN-voc file, adding the German version's <IRI> value to the appropriate vocabulary item, and use that id. The last major insertion is a new <change>, documenting when we made the alterations. Its @when effectively updates the version of our TAN-A file. With these additions, the German version is now aligned with the other two. We could have made our work simpler just by directly modifying our local copy of the German version. But such changes would not have affected the master copy. What happens when the owner of the German file makes changes? At that point we would struggle to integrate the changes in our forked copy. And we would have to repeat that exercise every time the German file was updated. By keeping our local copy of the German file unchanged, and making simple adjustments in our TAN-A file, we can keep our local copy synchronized with the master file and yet make the adjustments needed to coordinate with ours. The purpose statement in these guidelines says that TAN was "designed to maximize the syntactic and semantic interoperable alignment and exchange of texts, annotations, and language resources across projects." Here we see the importance of the qualifier "maximize." In no world will there ever be (nor should there be, it seems) a single, standard, canonical way to divide a given work. The TAN format does not change that reality. Rather, it provides a convergent ecosystem in which different practices can be easily reconciled, to help editors and authors enhance cross-project interoperability without artificially forcing conformity, or suppressing legitimately different outlooks. Perhaps Hans Schmidt, the producer of the German version, can be contacted (e.g., through his tag URN). We do so, and we suggest that he modify the version to make it align better. 
Perhaps he has reasons for labeling the lines with letters, and perhaps he is reluctant to explicitly identify this poem with Ring around the Rosie. That is within his rights. But the conversation might lead to our pointing out that n="e" should probably be n="d" and that there is an apparent typographic error in the last line. Or perhaps we're the ones in error. (The original, printed book has the poem twice on page 438, one with the spelling "Holderbuch" at line 3, the other, "Holderbusch".) If Schmidt chooses to correct his master file, he can add a new <change>, and thereby tacitly notify anyone else using the file that corrections have been made. At this point we have a network of six TAN files, five from our collection and one from outside. Although simple and small, this network could be extended to address some creative and complex research questions. Applications based on XSLT stylesheets could be used to automatically align the versions for reading and study, or to perform statistical analysis. What you've read so far is only a cursory introduction to TAN features. Study the rest of these guidelines, as well as example TAN libraries, and you will find numerous ways to develop TAN files, and to use them to enhance your research, teaching, and writing.
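The reconciliation steps used in this walkthrough (reading the @n values a, b, c as 1, 2, 3, shifting the mislabeled e back by one, and then aligning on shared @n values) can be sketched in Python. Everything here is an assumption made for illustration: the function names are hypothetical, the English line texts are placeholders, the skipped Gedicht wrapper is modeled simply by listing only the leaf lines, and the doubled-letter rule (aa = 27, bb = 28) is inferred from the sequence "a, b, c, ..., z, aa, bb" given above. The TAN function library remains the authority on how these values are actually resolved.

```python
def alpha_to_int(n: str) -> int:
    """Alphabetic numeral to integer: a=1 ... z=26, aa=27, bb=28 (assumed rule)."""
    n = n.lower()
    if not n or len(set(n)) != 1 or not ("a" <= n[0] <= "z"):
        raise ValueError(f"not an alphabetic numeral: {n!r}")
    return (len(n) - 1) * 26 + ord(n[0]) - ord("a") + 1


def resolve_n(n: str, renames: dict[str, int]) -> int:
    """Resolve an @n value locally, then apply any per-source shift."""
    value = int(n) if n.isdigit() else alpha_to_int(n)
    return value + renames.get(n, 0)


def align(sources: dict[str, list[tuple[str, str]]],
          renames_by_source: dict[str, dict[str, int]]) -> dict[int, dict[str, str]]:
    """Group leaf lines from every source under their resolved @n value."""
    aligned: dict[int, dict[str, str]] = {}
    for src, divs in sources.items():
        renames = renames_by_source.get(src, {})
        for n, text in divs:
            aligned.setdefault(resolve_n(n, renames), {})[src] = text
    return aligned


# Placeholder English lines; German lines as given in the transcription above.
sources = {
    "eng-us": [(str(i), f"(English line {i})") for i in range(1, 5)],
    "ger": [
        ("a", "Ringel, Ringel, Reihe!"),
        ("b", "Sind der Kinder dreie,"),
        ("c", "Sitzen auf dem Holderbuch,"),
        ("e", "Schreien alle: husch, husch, husch!"),
    ],
}
# Mirrors the effect of <rename n="e" by="-1"/> for the German source only.
aligned = align(sources, {"ger": {"e": -1}})

if __name__ == "__main__":
    for n in sorted(aligned):
        print(n, aligned[n])
```

Because the adjustment lives in the alignment step and not in the German data, the source list can be replaced by an updated copy of the master file without redoing the reconciliation.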

Detailed Description This part of the guidelines provides a detailed description of the design and structure of the formats of the Text Alignment Network. The material follows the organization of the schema files (kept in the schemas subdirectory), so both can be read in tandem. The first chapter outlines, in a non-technical way, the principles and technical foundations of the TAN format. The chapters that follow describe each TAN format, by class. Each chapter starts with theoretical or scholarly context before explaining technical points. The chapters in this part are meant to provide a narrative companion to the much more detailed technical appendixes, which are derived from the master schemas and vocabularies. The chapters in this part of the guidelines should be read selectively, not consecutively. They have been written with the assumption that you have already read the previous part and that you have already started to create or edit a TAN collection. Because readers will come from different specialties, all acronyms, abbreviations, and concepts are defined and explained, albeit tersely. Concepts or technologies are discussed only insofar as they affect the use of TAN; suggestions for further reading are provided for those who want a more thorough introduction to a topic. General Underpinnings This chapter retains something of the introductory spirit of the previous one by providing an overview of the fundamental principles and technologies behind TAN. The goal is to explain the principles behind the design of the format. Although this chapter assumes on your part no prior knowledge of any particular technology, it is also not meant to be a tutorial. Links to further reading will take you to good introductory material.

Design Principles The TAN formats have been designed around a few basic principles:
Scholarly habits: Be patient. Simplify. Stay focused. Don't be redundant. Don't state the obvious. Use familiar conventions.
Scholarly freedom: Express doubt. Offer alternatives. Exercise independence. Invite interdependence.
Scholarly responsibility: Declare your assumptions. Make your work citable. Satisfy scholars' expectations: Who did what when? What are your sources? How do you define your terms? What alterations have you made to your sources? What rights do I have to use your material?
General utility: Use stable technology. Keep design predictable, consistent. Make each datum human readable. Make each datum computer actionable.

Format Organization The Text Alignment Network is a modular suite of XML encoding formats, each one designed for a specific type of textual data, divided into three classes: texts (class 1), text alignments and annotations (class 2), and everything else (class 3). Class 1, representations of textual objects, consists solely of transcription files. (See note on transcriptions versus transliterations.) Each transcription file contains the text of a single work from a single text-bearing object (which we term scriptum), whether physical or digital. There are two types of transcription file: a standard generic format (TAN-T) and a customization of TEI All (TAN-TEI). These two types are differentiated by the root element, <TAN-T> and <TEI> respectively. Class 2 files encode claims about class-1 texts, and align them. There are two types of alignment, one for broad, general alignments and another for granular, word-for-word alignments. The former, with <TAN-A> as the root element, aligns any number (one or more) of class-1 files, and allows a wide variety of claims about those files. The latter, <TAN-A-tok>, aligns only pairs of class-1 files. Lexico-morphology files, <TAN-A-lm>, are used to encode the lexical and morphological (or part-of-speech) forms of individual words from a single class-1 file, or of a language in general. Class 3 covers everything else. <TAN-mor> is used to define the grammatical categories or features of a given language and to specify rules for lexico-morphological codes in dependent TAN-A-lm files. <TAN-voc> collects and labels vocabulary items used in other TAN files. TAN catalog files have the root element <collection>, and they index locally available TAN files, and selective parts of their metadata. This modular approach relies upon a stand-off approach to annotation or markup. 
In the alternative method, inline markup, an annotation is inserted directly into a transcription, e.g., <p>He said <quote>"Jump!"</quote></p>, where the inner element <quote> annotates the third word. Most TEI and HTML files rely upon in-line annotation. In stand-off annotation, <p>He said "Jump!"</p> would be left as-is, and somewhere else there would be an annotation that states that the third word is a quotation. If the stand-off annotation is in the same file, it is an internal stand-off annotation. If the annotation is in a different file, it is an external stand-off annotation. TAN depends upon external stand-off annotation, which provides several benefits: An editor can focus on a limited set of closely related questions. A source text without inline annotations is less cluttered, and therefore easier to read, than one with inline annotations. Editors can work on separate annotation files based upon the same master transcription file, even if they have very different research interests. Complementary or competing annotations can be made, even in the same file, and those annotations may point to concurrent or overlapping spans of text (a major problem for in-line annotation, where according to XML rules no element may interlock or overlap with another). A corpus of stand-off external annotation files becomes, collectively, a complex dataset, supporting lines of research that might not have been anticipated by any single project. Editorial labor can be conducted without central coordination, as individuals work at their own pace, independently. When an error is found in a transcription file, it can be corrected in a single place, in the master. Anyone using a copy of that master file will be notified in the validation process of changes that have been made and they can deal with them accordingly. Any data file can be updated independent of any other that points to it, or to which it points. The cross-file links required in stand-off annotation create a network of files, which can then be combined and transformed in any number of ways to produce a wide variety of derivative documents (e.g., collated versions, statistical analysis). The stand-off approach works toward a principle often valued in computer science, that of the disaggregation of data. 
That is, in a master format, data should be simple and not entangled with other data. It can later be reaggregated in all kinds of ways, but that is an end product, not the way data should be managed. It is analogous to the way any well-run kitchen keeps its ingredients separate, until it is time to cook or bake a variety of products, at which time a few disaggregated ingredients can be combined in a variety of ways. Stand-off annotation is not without problems and vulnerabilities. Files might be altered or altogether deleted, rendering pointers in dependent files meaningless. An editor may find that not having the annotated text in the same place as the annotation is an inconvenience. These are important challenges, but TAN validation rules have been designed to mitigate such problems.
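The difference between inline and stand-off annotation can be made concrete with a small Python sketch based on the "He said Jump!" example. The Annotation class and the resolve function are hypothetical and do not reflect the structure of any TAN format; they only show that the transcription stays untouched while a separate record points into it by token position, and that such a pointer fails visibly if the text changes underneath it.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A stand-off claim about one token of a separately stored text."""
    token_position: int  # 1-based position of the annotated word
    claim: str


def resolve(text: str, annotation: Annotation) -> str:
    """Return the token an annotation points to, or fail if the pointer is stale."""
    tokens = re.findall(r"\w+", text)
    if not 1 <= annotation.token_position <= len(tokens):
        raise IndexError(
            f"annotation points to token {annotation.token_position}, "
            f"but the text has only {len(tokens)} tokens"
        )
    return tokens[annotation.token_position - 1]


# The transcription carries no markup about quotations.
transcription = 'He said "Jump!"'
# The claim lives elsewhere, e.g. in a different file.
quotation = Annotation(token_position=3, claim="quotation")

if __name__ == "__main__":
    print(resolve(transcription, quotation), "is a", quotation.claim)
```

Several such annotation records, written by different editors, can point at the same or overlapping tokens without ever touching the transcription itself.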

Assumptions in the Creation of TAN Data All creators and users of TAN files are expected to share a few basic assumptions. First, all TAN-compliant data is to be understood as largely derivative. That is, data files express no originality or creativity independent of their sources (but see below about interpretation). TAN-compliant data must be created with the intent of adhering as closely as possible to some model or archetype. For example, a transcription is assumed to replicate faithfully some earlier digital edition or text-bearing material object (e.g., stone, papyrus, manuscript, printed book for written text; audiovisual media for oral or performative texts). Morphological files and alignment files should describe as clearly and as reliably as possible their source transcriptions. In creating and publishing a TAN file you claim to have offered a good-faith representation or description of something; in using a TAN file, you hold the creator to that expectation. Second, all core TAN files are interpretive. That is, they are permeated by editorial assumptions and opinions that might not be shared by everyone. If there is any semblance of originality or creativity in a TAN file it is in that interpretive outlook. For example, if you edit a transcription file you must decide how to handle unusual letterforms and other visible marks. Your decisions will be influenced by your perspective on the original text and its native writing system, and how you interpret and use Unicode. If you write an alignment file, you must make decisions about what factors caused one text to be transformed into another. Lexicomorphological files require you to commit to one or more grammars and dictionaries, which adopt certain perspectives on language, and you must discern how best to handle cases of vagueness and ambiguity. No TAN file ever stands completely outside the interpretive act. 
In creating and publishing a TAN file you claim to have disclosed as best you can the assumptions behind your interpretive outlook; in using a TAN file, you hold the creator to that expectation. Third, all core TAN files are applicable. That is, the interpretive impulse is assumed to be coupled with an equally strong desire to make the data as useful to as many users as possible, even those who may not share your assumptions or interpretation. A creator of a transcription file, for example, should normalize and segment texts with a minimum of idiosyncrasies, adopting the most widely used reference systems, so as to optimize the alignment process. Morphological files should depend whenever possible upon commonly accepted grammars and lexica. Alignment files should work with comprehensible categories of text reuse. No TAN file will always be applicable to everyone, but it should be as applicable to as many as possible, as often as possible. In creating a TAN file you claim to use common, shared conventions whenever possible, and to note any departures; in using a TAN file, you hold the creator to that expectation. Fourth, TAN data is to be considered accurate, but not necessarily precise or exhaustive. For example, if a TAN-A file claims that the opening of Plato's Republic book 3 quotes from Homer's Iliad, the claim is true and accurate, but is neither precise nor exhaustive. There are parts of the opening of book 3 that are certainly not quotations, and most parts of the Iliad are not quoted in the Republic. A token-for-token alignment of two texts might be selective, and focus only on the points of interest to the editor. Although the TAN formats permit a great deal of both precision and comprehensiveness, neither is mandated, unless explicitly required by a specific part of the TAN specifications. 
In creating a TAN file you claim to make accurate assertions; in using a TAN file, you should hold the creator to that expectation, but you must assess for yourself how precise and complete it is.

Core Technology TAN depends upon a set of relatively stable technologies. Those technologies and the underlying terminology are briefly explained below, with attention paid to interpretive decisions that affect validation rules. References to further reading will lead you to better and more thorough introductions elsewhere.

Unicode

What is it? Unicode is the worldwide standard for the encoding, representation, and exchange of digital texts. The standard is maintained by a nonprofit consortium whose goal is to represent all the world's writing systems, living and historical. The Unicode standard allows us to share texts in any alphabet reliably, regardless of how that text is rendered (e.g., fonts, display). With more than 128,000 characters, Unicode is almost as complex as human writing itself. The entire sequence of characters is divided into blocks, each one reserved, more or less, for a particular alphabet or group of characters. Within each block, characters may be grouped further. Each character is assigned a single number called a codepoint. Codepoints are numbered according to the hexadecimal system (base 16), which uses the digits 0 through 9 and the letters A through F. (The decimal number 10 is hexadecimal A; decimal 11 = hex B; decimal 16 = hex 10; decimal 79 = hex 4F.) It is helpful to think of Unicode as a very long table of sixteen columns, a glyph in each square; this is illustrated nicely in this article. It is common to refer to Unicode characters by their value and perhaps by their name. The value customarily starts "U+" and continues with the hexadecimal value, usually at least four hexadecimal characters. When the official Unicode name is given, it is normally in uppercase. Examples of Unicode characters:
Character | Unicode value | Unicode name
" " (space) | U+0020 | SPACE
® | U+00AE | REGISTERED SIGN
ю | U+044E | CYRILLIC SMALL LETTER YU
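The relationship between a character, its codepoint, and its official name can be checked with Python's standard unicodedata module. This is a small illustrative sketch; the function name describe is hypothetical.

```python
import unicodedata


def describe(char: str) -> str:
    """Format a character as 'U+XXXX NAME', with at least four hex digits."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


if __name__ == "__main__":
    for char in (" ", "®", "ю"):
        print(describe(char))
    # Hexadecimal and decimal are two spellings of the same number.
    print(format(79, "X"), int("4F", 16))
```

The three characters used here are the ones from the examples above, so the printed values can be compared directly against them.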

In an XML file, nearly any Unicode codepoint may be used, either by typing or pasting the character directly, or by using XML entities. An XML entity is a proxy for some other text, marked by an ampersand, some text, and then a semicolon, e.g., &amp; for the ampersand or &lt; for <. To access a specific Unicode character, an entity may start &#x followed by the hexadecimal codepoint and a semicolon (if you prefer the decimal version, leave off the x). For example, the XML entity &#x44E; is a proxy for the Cyrillic small letter yu.

Unicode Normalization Unicode rules provide guidance on how text should be normalized, to identify equivalent variations. For example, the character o (U+006F: LATIN SMALL LETTER O) followed by the combining accent ¨ (U+0308: COMBINING DIAERESIS) should be treated as identical in meaning to the single character ö (U+00F6: LATIN SMALL LETTER O WITH DIAERESIS). There are two codepoints that could be used for the Greek question mark (;), and normalization converts the less preferred codepoint (U+037E GREEK QUESTION MARK) to the other (U+003B SEMICOLON). TAN validation rules require all data to be normalized according to the Unicode NFC algorithm (the most common of the four normalization methods). Any text in a TAN file that is not NFC normalized will be marked as invalid. A supplied Schematron Quick Fix will let users automatically normalize text (for editing environments that support Schematron Quick Fixes).
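
The NFC behavior described above can be tried outside of any TAN tooling. The following is a minimal sketch in Python, using only the standard library; the helper name is_nfc is an assumption made for illustration and is not part of the TAN function library.

```python
import unicodedata

def is_nfc(text: str) -> bool:
    """Return True when the text is already in Unicode NFC form.
    (Hypothetical helper, not a TAN function.)"""
    return unicodedata.normalize("NFC", text) == text

# LATIN SMALL LETTER O followed by COMBINING DIAERESIS (two codepoints)
decomposed = "o\u0308"
# NFC composes the pair into LATIN SMALL LETTER O WITH DIAERESIS
composed = unicodedata.normalize("NFC", decomposed)

# U+037E GREEK QUESTION MARK is converted to U+003B SEMICOLON
greek_question_mark = unicodedata.normalize("NFC", "\u037e")
```

A check like is_nfc mirrors the kind of test a validator would need to make before marking text as invalid, but the authoritative test remains the one in the TAN schemas and functions.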

Unicode characters with special interpretation The characters U+200B ZERO WIDTH SPACE, U+200D ZERO WIDTH JOINER, and U+00AD SOFT HYPHEN placed at the end of a leaf <div>, perhaps followed by space that will be ignored (see below), signal that the text is to be joined with any subsequent text (i.e., the next leaf <div>). Accordingly, any TAN function that needs to extract text from a <div> structure will delete the U+200B, U+200D, or U+00AD character and its trailing space. (By contrast, text from a leaf <div> that does not end this way will be space-normalized, then followed by a single space.) Because these characters are difficult to distinguish visually from spaces and hyphens, any output based on the character mapping of the core functions should replace these characters with their XML entities, &#x200b;, &#x200d;, and &#xad;. Much has been written about the different ways U+00AD SOFT HYPHEN has been or should be used and interpreted. Debate will no doubt continue. In designing TAN, we have adopted the position that the soft hyphen marks a place in a word where a line break has occurred, is allowed to occur, or both. In situations where the text is printed or displayed, any soft hyphen that does not mark a word that breaks across lines should not be displayed.
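
The joining rule for leaf <div>s can be sketched as follows. This is a simplified illustration in Python, not the TAN function library's implementation; the names leaf_text and join_leaves are assumptions, and the input is taken to be a list of raw leaf <div> strings.

```python
import re

# The three characters that signal "join with the next leaf <div>"
JOINERS = ("\u200b", "\u200d", "\u00ad")
# The four XML space characters: space, tab, newline, carriage return
XML_SPACE = re.compile(r"[ \t\n\r]+")

def leaf_text(raw: str) -> str:
    """Space-normalize one leaf; drop a final joiner, else end in one space.
    (Hypothetical helper, not a TAN function.)"""
    text = XML_SPACE.sub(" ", raw).strip(" ")
    if text.endswith(JOINERS):
        return text[:-1]   # delete the joiner and its trailing space
    return text + " "      # ordinary leaf: exactly one trailing space

def join_leaves(leaves):
    """Concatenate the extracted text of consecutive leaf <div>s."""
    return "".join(leaf_text(leaf) for leaf in leaves)
```

For example, join_leaves(["inter\u00ad ", "national law"]) yields "international law " because the soft hyphen at the end of the first leaf suppresses the space that would otherwise separate the two leaves.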

Combining characters At the core level of conformance, Unicode does not dictate whether combining characters (accents, modifying symbols) should be counted independently or as part of a base character, nor do core XML technologies. In most cases, this point is negligible. But it can affect regular expressions and XPath expressions (see below). Two of the class-2 formats allow the counting of characters. Such counting is assumed to be made exclusively of individual non-combining characters (each perhaps followed by one or more combining characters). Therefore one character is defined as the regular expression \P{M}\p{M}*, bound to global variable . Any numerical reference made in a TAN file to an individual character, i.e., through @chars, will be found by counting only non-combining characters. When the nth character is requested, TAN functions will return the nth base character along with any combining characters that immediately follow. TAN rules stipulate that combining characters must have a preceding base character. Any <div> that, after any initial space, starts with a combining character will be marked as invalid. See also .
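
The counting rule (a base character plus any combining characters that follow) can be approximated in Python. The standard re module does not support \p{M}, so this sketch tests the Unicode general category instead; the function names are assumptions for illustration only.

```python
import unicodedata

def base_character_units(text):
    """Split text into units of one base character plus trailing
    combining characters, approximating \\P{M}\\p{M}*.
    (Hypothetical helper, not a TAN function.)"""
    units = []
    for ch in text:
        # Categories Mn, Mc, and Me are the combining marks (\p{M})
        if unicodedata.category(ch).startswith("M") and units:
            units[-1] += ch
        else:
            units.append(ch)
    return units

def nth_character(text, n):
    """Return the nth (1-based) base character with its combining marks."""
    return base_character_units(text)[n - 1]
```

With the three-codepoint string "a\u0301b", this sketch counts two characters, and nth_character returns "b" for n = 2. A leading combining character would form a unit of its own here, whereas TAN marks such a <div> as invalid.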

Unicode points not allowed Because TAN files are not scriptum-oriented (see ), the following characters will generate an error if found in a TAN file:
U+00A0 NO-BREAK SPACE
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
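
A quick pre-validation scan for these characters might look like the sketch below. It is an illustration only; the name find_disallowed and the shape of its return value are assumptions, and TAN's own validation remains authoritative.

```python
# U+00A0 plus the contiguous run U+2000..U+200A listed above.
# U+200B ZERO WIDTH SPACE is deliberately excluded: TAN gives it a special role.
DISALLOWED = {"\u00a0"} | {chr(cp) for cp in range(0x2000, 0x200B)}

def find_disallowed(text):
    """Return (offset, codepoint label) pairs for each disallowed character.
    (Hypothetical helper, not a TAN function.)"""
    return [
        (offset, f"U+{ord(ch):04X}")
        for offset, ch in enumerate(text)
        if ch in DISALLOWED
    ]
```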

Further Reading Unicode Consortium Unicode (Wikipedia)

eXtensible Markup Language (XML)

What is it? Defined by the W3C, the eXtensible Markup Language (XML) is a markup language that can be extended to allow anyone to define the structure and rules of a document type. For a quick, simple introduction to XML see .

Schemas and validation Validation files are found in the schemas subdirectory. Each TAN file is validated by two types of schema files, one dealing with major rules concerning structure and data type, written in RELAX-NG, the other with more complex, detailed rules, written in Schematron. The RELAX-NG rules are written primarily in compact syntax (*.rnc), and then converted to XML syntax (*.rng). For TAN-TEI, the special format One Document Does it All (TAN-TEI.odd) is used to adjust the rules for TEI All. The ODD file is then processed by TEI stylesheets into compact and XML RELAX-NG formats. The Schematron files are generally quite short. The primary work is done by a substantial function library written in XSLT. For the most part, the Schematron files simply point to the TAN function library, and handle its results. For a detailed overview of this process, see and . Some validation engines that process a valid TAN-compliant TEI file may return an error such as:

conflicting ID-types for attribute "who" of element "comment" from namespace "tag:textalign.net,2015:ns"

Such a message alerts you to the fact that by mixing TEI and TAN namespaces, you open yourself up to the possibility of conflicting xml:id values. It is your responsibility to ensure that you have not assigned duplicate identifiers. An XML editor may be configured to ignore this discrepancy. (In the oXygen XML Editor go to Options > Preferences... > XML > XML Parser > RELAX NG and uncheck the box ID/IDREF.)

Space characters and normalization By default in XML, unless otherwise specified, consecutive space characters (space, tab, newline, and carriage return) are considered equivalent to a single space. This gives editors the freedom to format XML documents as they like, balancing human readability against compactness. In XML, space normalization is performed by stripping leading and trailing whitespace and replacing sequences of one or more whitespace characters with a single space (U+0020). All TAN formats assume space normalization, with an extra caveat for leaf <div>s. Initial space is always stripped. If a leaf <div> ends in the zero width space, the zero width joiner, or the soft hyphen (see ), the character is suppressed along with any ending space; otherwise the text ends in a single space character (whether or not there are space characters in the leaf <div> itself). If retention of multiple spaces or spaces of specific sizes is important for your files and research, then you should not be working with the TAN format, which cannot be used to replicate the appearance of a scriptum (see ). Pure TEI (and not TAN-TEI) is a better alternative, since it allows for a literal use of space, and supports the creation of scriptum-oriented XML files. For more on space see guidance in the W3C recommendation.

Mixed, non-mixed, and semi-mixed content In many popular XML formats such as TEI, XHTML, and Docbook some elements allow a mixture of elements and nonspace text as children, e.g., <div>Some <span>text</span></div>. These are called mixed content models. The TAN formats, aside from TAN-TEI, are committed to a non-mixed content model, e.g.,

<div><span>Some </span><span>text</span></div>

Nonspace text nodes and elements are never siblings. The practical effect of this decision is that TAN files may be indented as you like, and whitespace text may be placed anywhere, without altering the meaning. An expanded TAN file (see ) may include what we term a semi-mixed content model, in which any element may have one and only one nonspace text node along with any children elements. That nonspace text node may appear at the beginning or the end of the children nodes.
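
The difference between the mixed and non-mixed models can be tested mechanically. The sketch below uses Python's standard xml.etree.ElementTree; the function names are assumptions, and the check covers only the non-mixed rule (not the semi-mixed model of expanded files).

```python
import xml.etree.ElementTree as ET

def has_nonspace(text):
    """True when a text node contains something other than whitespace."""
    return bool(text and text.strip())

def mixed_content_elements(root):
    """Return tags of elements where nonspace text and child elements
    are siblings. (Hypothetical helper, not a TAN function.)"""
    offenders = []
    for elem in root.iter():
        children = list(elem)
        if not children:
            continue  # a leaf element may freely contain text
        # Text before the first child, plus the text after each child
        text_nodes = [elem.text] + [child.tail for child in children]
        if any(has_nonspace(t) for t in text_nodes):
            offenders.append(elem.tag)
    return offenders

mixed = ET.fromstring("<div>Some <span>text</span></div>")
non_mixed = ET.fromstring(
    "<div>\n  <span>Some </span>\n  <span>text</span>\n</div>"
)
```

Here mixed_content_elements(mixed) reports the outer div, while the indented non-mixed version passes, which illustrates why indentation is harmless under this model.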

Namespaces

What are they? XML allows users to create document types of whatever kind. One person may wish to use the element <band> to refer to a musical group; another might use this element to encode radio frequencies. Perhaps someone wishes to mention a musical group and a radio frequency in the same document, which would entail mixing two very different types of <band>. XML allows users to mix vocabularies, even when those vocabularies use the same element names. Disambiguation is accomplished by associating an element name with a kind of family name. That family name is an IRI (see below). The actual full name of an element, then, is the local name plus the IRI that qualifies its meaning, e.g., band{http://music-example.com/terms/} and band{http://frequency-example.com/terms/}. The IRI (the family name) is called the namespace, a term that is understandably vague or confusing to many, because it has nothing to do with space. Namespaces can be declared in an XML document. When they appear, they look a lot like attributes. (They aren't.) They take the form xmlns="http://music-example.com/terms/" (this defines the default namespace) or xmlns:[PREFIX]="http://frequency-example.com/terms/" (this assigns a namespace to a prefix) placed inside an opening tag. For example, <band xmlns="http://music-example.com/terms/">...</band> declares http://music-example.com/terms/ to be the default namespace for <band> and all descendants, unless explicitly overridden. To return to our example, different <band>s can be combined through namespaces:

<band xmlns="http://music-example.com/terms/">
   <band xmlns="http://frequency-example.com/terms/"> ... </band>
</band>

<band xmlns="http://music-example.com/terms/" xmlns:e2="http://frequency-example.com/terms/">
   <e2:band> ... </e2:band>
</band>

<e1:band xmlns:e1="http://music-example.com/terms/" xmlns:e2="http://frequency-example2.com/terms/">
   <e2:band> ... </e2:band>
</e1:band>
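
To see that the two <band> elements really do carry different full names, one can parse such a document and inspect the expanded names. This is a small sketch using Python's standard library; note that ElementTree writes the namespace first, in the form {namespace}local-name.

```python
import xml.etree.ElementTree as ET

# Same local name, two namespaces: one default, one bound to the prefix e2
doc = """<band xmlns="http://music-example.com/terms/"
      xmlns:e2="http://frequency-example.com/terms/">
  <e2:band>...</e2:band>
</band>"""

root = ET.fromstring(doc)
outer_name = root.tag      # the musical <band>
inner_name = root[0].tag   # the radio-frequency <band>
```

The prefix e2 itself does not appear in the parsed names: prefixes are only local shorthand, and the namespace IRI is what distinguishes the elements.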

TAN namespace and prefix The TAN namespace is tag:textalign.net,2015:ns. The recommended prefix is tan. The namespace does not change from one version of TAN to another. The TAN-TEI format uses as its default the TEI namespace, http://www.tei-c.org/ns/1.0, normally given the prefix tei. But in a TAN-TEI file, the head and its descendants are in the TAN namespace.

The Text Encoding Initiative

What is it? The Text Encoding Initiative (TEI) is a consortium of scholars and scholarly organizations that maintains the rules and documentation behind a collection of XML formats intended for encoding texts. TEI files have been widely used by libraries, museums, publishers, and individual scholars to prepare and publish texts for online research, teaching, and preservation. In addition to the guidelines themselves, the Consortium provides a variety of resources and training events for learning TEI, information on projects using the TEI, a bibliography of TEI-related publications, and software. TEI gave the impetus for the creation of TAN, and continues to inspire its development. TEI was designed to be highly customizable, to suit the needs of individuals or communities of practice. One of the TAN formats, TAN-TEI, is one such customization, based on an ODD file that is in the same directory as the rest of the schemas. TAN-TEI schemas are generated on the basis of the official TEI All schema that is available at the time of release. TAN-TEI files and standard, out-of-the-box TEI All files are not automatically interchangeable. TAN-TEI expects all metadata to be human- and computer-readable, whereas TEI metadata is geared primarily to human readability. TAN-TEI tightly regulates the structure of the text, whereas TEI allows for a variety of structures. In any conversion process to and from TEI and TAN-TEI, some human intervention is required, and conversion in either direction may entail loss. For more about the strictures placed upon the TEI All schema see . See also and .

Further reading Text Encoding Initiative

Data types Being written purely in XML technologies, TAN uses data types defined in the W3C's official specifications, e.g., strings, booleans, integers. The following data types require some special comments.

Languages TAN adopts for language identification Best Current Practice (BCP) 47, which standardizes identifiers for languages and scripts. For most users of TAN, this will be a simple two- or three-letter abbreviation, sometimes supplemented with a hyphen and a subtag designating a script or region. For example, eng, eng-GB, and eng-Cyrl-GB refer, respectively, to English (in general), English from the United Kingdom, and English from the United Kingdom written in the Cyrillic script. As a general rule, values of this type should begin with a three-letter language code, preferably lowercase. (The two-letter codes cover only a few dozen languages; the three-letter codes support thousands of them.) ISO codes for human languages appear in @xml:lang and <for-lang>. The former states what language the enclosed text is in. The latter is an empty element that simply points to a specific language. For example, <for-lang> in the context of a TAN-mor file indicates which languages the file was written for. TAN has several global variables and functions useful for working with language codes. See . For more information, see one of the following: BCP 47 official specifications; BCP 47 technical details

Dates and times For dates and dates + times, TAN adopts the corresponding XML data types, which follow ISO syntax. That syntax begins with years (the largest unit) and ends with days, seconds, or fractions of seconds (the smallest). The simplest date takes this form: YYYY-MM-DD. If a time is included, it is specified by continuing the string, first with a T (for time) then the form hh:mm:ss.sss(Z|[-+]hh:mm). For example, 2016-09-20T20:38:27.141-04:00 is an ISO date-time for Tuesday, September 20, 2016 at 8:38 p.m., Eastern Time Zone. More reading: W3C specification; Wikipedia entry on ISO 8601
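
The example date-time above can be unpacked with any ISO-aware parser. The following sketch uses Python's standard datetime module (version 3.7 or later is assumed for fromisoformat); it only illustrates the syntax and is not how TAN itself processes @when values.

```python
from datetime import date, datetime, timedelta

# Date plus time, with milliseconds and a UTC offset of -04:00
stamp = datetime.fromisoformat("2016-09-20T20:38:27.141-04:00")

# The simplest form: a date only
day = date.fromisoformat("2016-09-20")

offset = stamp.utcoffset()   # distance from UTC as a timedelta
weekday = stamp.weekday()    # Monday is 0, so Tuesday is 1
```

Because the largest unit comes first, ISO dates in this form also sort chronologically when compared as plain strings, provided they share the same time zone offset.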

Identifiers and Their Use (IRIs, URIs, URLs, URNs, UUIDs) TAN makes extensive use of the following identifiers:
IRI: Internationalized Resource Identifier, a generalization of the URI system, allowing the use of Unicode; defined by RFC 3987.
URI: Uniform Resource Identifier, a string of characters used to identify a name or a resource; defined by RFC 3986.
URL: Uniform Resource Locator, a URI that identifies a Web resource and the communication protocol for retrieving the resource.
URN: Uniform Resource Name, a term that originally referred to persistent names that used a bare urn: scheme, but is now applied to a variety of systems that have registered with the IANA. URNs are generally best thought of as a subset of URIs.
UUID: Universally Unique Identifier, a computer-generated 128-bit number that may be attached as an identifier to any entity. UUIDs can be built into a URN by prefixing them with urn:uuid:.
See also .
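
As a brief illustration of the last item, a UUID and its URN form can be generated with Python's standard uuid module. This is a sketch for orientation only; nothing here is specific to TAN.

```python
import uuid

# A random (version 4) UUID: a 128-bit number
identifier = uuid.uuid4()

# The URN form prefixes the hexadecimal text with "urn:uuid:"
urn = identifier.urn

# The URN can be read back into an equal UUID object
round_trip = uuid.UUID(urn)
```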

Resource Description Framework (RDF) and Linked Open Data

What are they? Identifiers are used in many contexts for many purposes. One such purpose is called Linked Open Data (LOD), also known as the Semantic Web, which aims to network data across projects. It relies upon a very simple data model called Resource Description Framework (RDF), recommended by the World Wide Web Consortium (W3C). The term "Resource" (the R in RDF) refers to any person, place, or concept: anything at all, whether you think of it as a resource or not. "Description" is overly specific, too, since RDF was designed to support general assertions, descriptive or not. Perhaps it is easiest to think of RDF as a standardized way to make assertions, as if the name were simply "Assertion Framework." The RDF data model rests upon the concept of a statement, made of three parts: subject, predicate, and object. Subjects and predicates take identifiers that name things. The object may take an identifier or just data. As people independently identify concepts with the same URLs, they create RDF datasets that can be combined, synthesized, and compared. RDF statements found across the web allow inferences no individual project could ever anticipate. The Semantic Web recommends the use of URLs as identifiers. That way, if a computer encounters a URL naming a concept, it can be programmed to go to the web resource and retrieve other RDF statements, recursively. So URL identifiers look like a web page address (e.g., http://...), but they are first and foremost names for things. Ideally, those URLs will still name those things after the domain name expires and the web resource cannot be found. Although RDF statements must be made of only three components, it is possible in a roundabout way to create more complex assertions. In one technique, the assertion itself is given a URL, and then RDF statements are made about the assertion. Such assertions are in some cases not easily integrated with other RDF statements. 
Users who query an RDF database will not find relevant complex RDF statements unless they build their queries to anticipate such situations (or the query engine has been customized).

TAN Claims and RDF Much of TAN can be converted to RDF statements. In fact, TAN may be one of the most human-friendly ways to read and write RDF. For example, consider how one might express "Person X's name is 'Dave Smith'." Compare this snippet (taken from ), written in Turtle, the RDF syntax generally regarded as the most human-readable, ...

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://biglynx.co.uk/people/dave-smith> rdf:type foaf:Person ;
   foaf:name "Dave Smith" .

...with the TAN equivalent:

<person>
   <IRI>http://biglynx.co.uk/people/dave-smith</IRI>
   <name>Dave Smith</name>
</person>

These TAN and RDF expressions are interchangeable. But in more complex claims, it is, at this time, not clear whether all assertions in TAN can be losslessly converted to the RDF model. This happens most often in the context of the TAN-A <claim>, which is designed to allow scholarly assertions and claims that are difficult or impossible to express in RDF. For example, RDF does not allow one to say "Person X is not the author of text Y," but TAN does. TAN claims can also be quite complex. Whereas the standard RDF claim consists of three components (subject, predicate, object), most TAN claims have more. Every TAN claim must have at the minimum: a claimant (no RDF counterpart; the person, organization, or algorithm that asserts the claim), a subject (counterpart to RDF subject), and a verb (counterpart to RDF predicate). Verbs can be defined to permit, require, or disallow other claim components, such as adverbs or objects, many of which are permitted by default. Most TAN claims involve more than three components, so converting a TAN claim to RDF requires creating a complex RDF statement (see previous section). Many TAN claims involve textual subjects or objects. References to parts of text can be quite complex, and they must be made with reference to other entities. 
It is doubtful whether a given specific textual subject or object can be satisfactorily reduced to an unambiguous IRI, because such an IRI would need to include a mechanism to resolve the meaning of the syntax. Such an IRI must not only explain the work's reference system, but also identify the chosen version, scriptum, and perhaps token definition and numeration system. Many texts have more than one "canonical" reference system, so an IRI might point to two different textual passages, thereby breaking a cardinal rule of IRIs: although an entity may be given multiple IRIs, it is never acceptable for an IRI to be ambiguous. For more details see and <claim>.
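
For the simple case discussed above (the Dave Smith example), the correspondence between a TAN <person> and RDF triples can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function person_to_triples is hypothetical, the mapping to foaf:Person and foaf:name is taken from the Turtle snippet rather than from any TAN conversion tool, and the <person> is written without a namespace, as in the snippet, whereas in a real TAN file it would be in the TAN namespace.

```python
import xml.etree.ElementTree as ET

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"

def person_to_triples(person_xml):
    """Turn a namespace-free <person> into (subject, predicate, object)
    tuples. (Hypothetical helper; assumes the FOAF mapping shown above.)"""
    person = ET.fromstring(person_xml)
    triples = []
    for iri in person.findall("IRI"):
        subject = iri.text.strip()
        # Each IRI names the person, so each becomes an RDF subject
        triples.append((subject, RDF_TYPE, FOAF + "Person"))
        for name in person.findall("name"):
            triples.append((subject, FOAF + "name", name.text.strip()))
    return triples

dave = """<person>
   <IRI>http://biglynx.co.uk/people/dave-smith</IRI>
   <name>Dave Smith</name>
</person>"""
triples = person_to_triples(dave)
```

Nothing comparable is attempted here for <claim>, where, as noted above, a lossless conversion is not assured.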

Further reading W3C recommendation Linked Data Linked Open Vocabularies

Tag URNs TAN files make extensive use of tag URNs (see ). In fact, TAN's namespace is a tag URN (tag:textalign.net,2015:ns). A tag URN has two parts:
Namespace. tag: + an e-mail address or domain name owned by the person or organization that has authorized the creation of the TAN file + , + an arbitrary day on which that address or domain name was owned + :. The day is expressed in the form YYYY-MM-DD, YYYY-MM, or YYYY. A missing MM or DD is implicitly assigned the value of 01.
Name of the TAN file. An arbitrary string (unique to the namespace chosen) chosen by the namespace owner as a label for the entire file and related versions. It can be the same as the filename, but it is a good practice not to do so, because filenames . You should pick a name that is at least somewhat intelligible to human readers.
Although you may use any tag URN coined by someone else, you may create a tag URN only in namespaces you own. Great care must be taken in choosing the name, because you are the sole guarantor of its uniqueness. It is permissible for something to have multiple identifiers, but never acceptable for an identifier to name more than one thing. It is a good practice to keep a master checklist of tag URNs you have created. If you find yourself forgetting, or think you run the risk of creating duplicate tag URNs, you should start afresh by creating a new namespace for your tag URNs, if only by changing the date in the tag URN namespace.
Tag URNs
tag:jan@example.com,1999-01-31:TAN-T001
tag:example.com,2001-04:hamlet-tan-t
tag:evagriusponticus.net,2014:tan-a-lm:Evagrius_Praktikos_grc_Guillaumonts
tag:bbrb@example.org,1995-04-01:pos-grc
The first example comes from someone who owned the email address jan@example.com on January 31, 1999 (at the stroke of midnight, Coordinated Universal Time). The other examples follow a similar logic. The namespaces of the second and third examples are tied to the owners of specific domain names. 
The 2014 in the third example is shorthand for the first second of January 1, 2014. TAN has adopted tag URNs over URLs for several reasons:
Permanence. Authors of TAN data are creating files that are meant to be relevant for decades and centuries from now, well after most domain names today have changed ownership or fallen into obsolescence, and well after the creators are dead. URLs are not built for such permanence.
Responsibility. The TAN format requires every piece of data to be attributable to someone (a person, a group of persons, or an algorithm). A tag URN connects the identifier with the responsible person or group. URLs cannot provide such support.
Accessibility. Tag URNs have almost no barriers. They can be created by anyone who has an email address. No one has to register with a central authority. You can begin naming anything you want, any time you want, without seeking anyone's approval, and without paying anything.
Ease. Tag URNs are easy to use. Many potential TAN authors never have owned a domain name, and never will. Further, many of those who do own domain names cannot or do not wish to configure, populate, maintain, and troubleshoot servers with the referral mechanisms recommended by Semantic Web advocates (see ).
Scholarly citation norms. In the Semantic Web, the conflation of URL qua name with URL qua location is considered by many a virtue because the single string does double duty, both naming the resource and pointing to a location where more can be learned. Although the combination is elegant from an engineering perspective, it is confusing to others: URLs are commonly thought to be purely locations for data, not names for things. It also goes against an important principle in scholarly bibliographies, namely, that the name of a cited publication should always be distinguished from where it might be found.
Further reading: RFC 4151, the official definition of tag URNs
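
The two-part structure of a tag URN, including the implicit 01 for a missing month or day, can be illustrated with a small parser. The sketch below is a deliberate simplification of the RFC 4151 grammar (it does not check the full syntax of e-mail addresses or domain names), and the name parse_tag_urn is an assumption made for this example.

```python
import re
from datetime import date

# Simplified pattern: tag:<authority>,<YYYY[-MM[-DD]]>:<name>
TAG_URN = re.compile(
    r"^tag:(?P<authority>[^,:]+),"
    r"(?P<year>\d{4})(?:-(?P<month>\d{2})(?:-(?P<day>\d{2}))?)?"
    r":(?P<name>.+)$"
)

def parse_tag_urn(urn):
    """Split a tag URN into (authority, day of ownership, name).
    (Hypothetical helper; a simplified reading of RFC 4151.)"""
    match = TAG_URN.match(urn)
    if match is None:
        raise ValueError(f"not a tag URN: {urn!r}")
    # A missing month or day is implicitly 01
    owned_on = date(
        int(match["year"]),
        int(match["month"] or 1),
        int(match["day"] or 1),
    )
    return match["authority"], owned_on, match["name"]
```

Applied to tag:evagriusponticus.net,2014:tan-a-lm:Evagrius_Praktikos_grc_Guillaumonts, the sketch returns the domain name, January 1, 2014, and the name tan-a-lm:Evagrius_Praktikos_grc_Guillaumonts (the name itself may contain a colon).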

Regular Expressions Regular expressions are patterns for searching text. The term regular here does not mean ordinary. Rather, alluding to the Latin root regula (rule), it refers to a rule-based method of finding and replacing text through patterns. Regular expressions come in different flavors, and have several layers of complexity. TAN regular expressions adhere closely to the recommendation of XSLT 3.0 (XML Schema Datatypes plus some extensions), as outlined in XPath Functions 3.0. XML Schema Datatypes defines regular expressions differently than does Perl, one of the most common forms of regular expression. For example, the pipe symbol, |, is treated as a word character in XML regular expressions (\w), but the opposite is true for Perl. For convenience, here are the word classifications for codepoints U+0020..U+00FF according to XML (and therefore TAN): Word characters (\w):

$ + 0 1 2 3 4 5 6 7 8 9 < = > A B C D E F G H I J K L M N O P Q
                           R S T U V W X Y Z ^ ` a b c d e f g h i j k l m n o p q r s t u v w x y z
                           | ~ ¢ £ ¤ ¥ ¦ ¨ © ª ¬ ® ¯ ° ± ² ³ ´ µ ¸ ¹ º ¼ ½ ¾ À Á Â Ã Ä Å Æ Ç È É Ê Ë
                           Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð
                           ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

Non-word characters (\W):

! " # % & ' ( ) * , - . / : ; ? @ [ \ ] _ { } ¡ § «  ¶ · »
                           ¿

Some of these decisions about what is word-like and what isn't may seem counterintuitive or wrong. But at this point complaining will not change the conventions. The distinction is a legacy that will endure. Just familiarize yourself with decisions that look admittedly arbitrary. A regular expression search pattern is treated just like a conventional search pattern until the computer reaches a special character:

. [ ] \ | - ^ $ ? * + { } ( )

Here is a brief key to how those special characters behave in regular expressions when they are first found. (Some of these special characters change their meaning if they are found inside square brackets; on this point, see the recommended reading below.)

Special characters in regular expressions
Symbol            Meaning
.                 any character
|                 or (union)
?                 zero or one
*                 zero or more
+                 one or more
[ ]               a class of characters
( )               a group
\w                any word character
\W                any nonword character
\s                any of the four standard spacing characters: space (U+0020), tab (U+0009), newline (U+000A), carriage return (U+000D)
\S                anything not a spacing character
\d                any digit (0-9)
\D                anything not a digit
\p{IsGujarati}    any character from the Unicode block named Gujarati
^                 beginning of a line or string (doesn't capture any characters)
$                 end of a line or string (doesn't capture any characters)
\\                backslash (an escaped escape character)
\^                a caret sign (must be escaped with the \)
\$                dollar sign (escaped)
\(                opening parenthesis (escaped)
\[                opening square bracket (escaped)

Some examples:

Examples of Regular Expressions
Each row gives the expression, its meaning, and what the expression matches when applied to "Wi-fi, good. A_hem* isn't!"
^.+$          one whole line of characters          "Wi-fi, good. A_hem* isn't!"
[ae]          a or e          "e"
[a-e]         a, b, c, d, or e          "d", "e"
[^ae]+        one or more characters that are anything except a or e          "Wi-fi, good. A_h", "m* isn't!"
.i            any character followed by i          "Wi", "fi", " i"
(.i)          when a character followed by an i is found, treat it as a capture group (used only in a search pattern)          "Wi", "fi", " i"
[aeiou]\w*    any lowercase vowel along with every word character that follows          "i", "i", "ood", "em", "isn"
[t*].         any t or * and the following character (note that the asterisk, if inside a character class, represents itself)          "* ", "t!"
\s+           one or more space characters          " ", " ", " "
\w+           one or more word characters (the underscore is a nonword character in XML)          "Wi", "fi", "good", "A", "hem", "isn", "t"
\W+           one or more nonword characters          "-", ", ", ". ", "_", "* ", "'", "!"
[^q]+         one or more characters that are not a q          "Wi-fi, good. A_hem* isn't!"
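
Because many regular expression engines define \w differently from XML, it can be useful to test the XML definition directly. In XML Schema Datatypes, \w covers every character except those in the Unicode categories for punctuation (P), separators (Z), and other (C). The Python sketch below approximates that definition with the standard unicodedata module (whose Unicode version may differ from that of a given XSLT processor); the function names are assumptions for illustration.

```python
import unicodedata

def is_xml_word_char(ch):
    """Approximate XML's \\w: any character not in categories P, Z, or C.
    (Hypothetical helper, not a TAN function.)"""
    return unicodedata.category(ch)[0] not in "PZC"

def xml_word_runs(text):
    """Return the maximal runs of XML word characters, like \\w+."""
    runs, current = [], ""
    for ch in text:
        if is_xml_word_char(ch):
            current += ch
        elif current:
            runs.append(current)
            current = ""
    if current:
        runs.append(current)
    return runs

sample_runs = xml_word_runs("Wi-fi, good. A_hem* isn't!")
```

In line with the lists of word and non-word characters given earlier, the pipe and the dollar sign count as word characters under this definition, while the underscore does not, so A_hem is split into A and hem.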

The examples above provide a taste of how regular expressions are constructed and read. Regular Expressions and Combining Characters A regular expression might be ambiguous in the context of combining characters. Suppose we have a string of three characters: a, then U+0301 COMBINING ACUTE ACCENT, then b (displayed as áb, with the acute accent over the a). The regular expression a. will in some search engines include the b and in others not. Unicode has differentiated three levels of support for regular expressions (see official report). TAN guarantees only level-one conformance. Combining characters fall in level two. In TAN, character counts depend exclusively upon base characters, not combining ones (see ). TAN includes several functions that usefully extend XML regular expressions. See . Further reading:
Various tutorials on Regular Expressions
Wikipedia, Regular Expressions
Regular Expressions in XSLT 3.0
Unicode and Regular Expressions
XML Schema Datatypes
A New \u: Extending XPath Regular Expressions for Unicode

Patterns and Structures Common to All TAN Encoding Formats This chapter provides general background to the elements and attributes that are common to all TAN files. For more detailed discussion, see . This chapter does not discuss TAN catalog files, on which see .

Common Patterns

IRI + name Pattern Both humans and computers need to read and write TAN metadata. Very often what is readable to humans is unreadable to computers, and vice versa. So the TAN format requires that all metadata be provided whenever possible in both forms. Although this rule may appear to introduce redundancy and therefore opportunities for error, the clarity is critical. It is the only way at present to ensure that any person or algorithm that approaches the data can parse and use it. In addition, doubly expressed metadata provides a safeguard much like a checksum: human- and computer-readable descriptions should comport. Any discrepancy signals an error that should be checked. Some metadata, such as that inside <comment> or <change>, are neither easily nor profitably translated into a computer-actionable string. In such cases only the human-readable form is required. Other metadata involve regular expressions (e.g., @pattern) or ISO-compliant dates (e.g., @when), both of which are well formed and are usually human-legible. Such data are not repeated, although they may be explained via <desc> or <comment>. Those exceptions aside, all other metadata takes what is called the IRI + name pattern: one or more <IRI>s and <name>s and zero or more <desc>s. This is the core pattern for nearly all TAN vocabulary items.
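
As a rough illustration of the IRI + name pattern, the following Python sketch checks that an element has at least one <IRI> and at least one <name> child in the TAN namespace. It is not a substitute for the RELAX-NG and Schematron validation; the function name is hypothetical, and the <person> element reuses the Dave Smith example from the RDF discussion.

```python
import xml.etree.ElementTree as ET

TAN_NS = "tag:textalign.net,2015:ns"

def follows_iri_name_pattern(element):
    """True when the element has one or more <IRI>s and <name>s.
    (Hypothetical helper; <desc> is optional, so it is not checked.)"""
    iris = element.findall(f"{{{TAN_NS}}}IRI")
    names = element.findall(f"{{{TAN_NS}}}name")
    return len(iris) >= 1 and len(names) >= 1

complete = ET.fromstring(
    '<person xmlns="tag:textalign.net,2015:ns">'
    "<IRI>http://biglynx.co.uk/people/dave-smith</IRI>"
    "<name>Dave Smith</name>"
    "</person>"
)
# Computer-readable only: the human-readable half is missing
incomplete = ET.fromstring(
    '<person xmlns="tag:textalign.net,2015:ns">'
    "<IRI>http://biglynx.co.uk/people/dave-smith</IRI>"
    "</person>"
)
```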

Digital Entity Metadata Pattern Some entities identified by the IRI + name Pattern will be digital resources. In those cases, the IRI + name Pattern is extended. There must be one or more <location>s, with @href and @accessed-when, which signal where the resource is and when it was last consulted. In validation, only the first document available will be used. Extra <location>s might prove helpful for applications. There may be an optional <checksum>, to more accurately specify which version of a file was consulted. If the entity is a TAN file, then <IRI> (one and only one) must be a valid tag URN that matches the @id value of the TAN file being referred to. If the entity is not a TAN file, then any IRI may be used, including its resolved URL. @accessed-when indicates when a file was last accessed. During validation, the target file will be checked. Any changes before that date will be ignored, and any after will be reported, normally as warnings. See . All these requirements may seem excessive, since in other contexts (HTML, TEI), one needs simply a link, via @href or @src. But TAN files are meant to be valid long after their creation, when an @href may point to a broken link. An <IRI> might allow one to find a missing file, and it also provides a check in case the original file has been deleted and another, with a different name, has taken its place.

Edit Stamp Most TAN elements allow for an optional edit stamp, an @ed-who and an @ed-when, stating who created or edited the enclosed data and when. Neither attribute is allowed without the other. @ed-when is one of the attributes that help determine a file's version. See . An edit stamp is much like a <change> without a description. The attributes simply mark the element where a change has been made. If a description of the alteration is considered necessary, <change> should be used, perhaps in addition to the edit stamp.

Overall Structure All TAN-compliant files, no matter the type or class, follow a common basic structure: (1) a prolog with at least two processing instruction nodes; (2) a root element; and (3) a head, a body, and an optional teiHeader and tail. Prolog and processing instruction nodes: The standard prolog of every XML file should begin:

<?xml version="1.0" encoding="UTF-8"?>

XML version 1.1 is a permissible alternative, and encoding="UTF-8" is optional. After that come two processing instructions specifying the two schema files required for validation:

<?xml-model href="[PATH]/[ROOT-ELEMENT-NAME].rn[g OR c]"?>

<?xml-model href="[PATH]/[ROOT-ELEMENT-NAME].sch"?>

The first processing instruction node points to the RELAX-NG schema that declares the major, structural rules. The second points to the finely tuned rules, written in Schematron. Both processing instructions are required, except in systems where those processing instructions are implicitly understood (e.g., an oXygen project or framework). [PATH] represents the pathname to the schema file, whether local or on a server, and [ROOT-ELEMENT-NAME] stands for the name of the file's root element (the element that is the ancestor of all other elements in the document and the descendant of none). It is your choice whether you use .rnc or .rng as the extension for the RELAX-NG schema. The former is the compact syntax and the latter, the XML format. They are equivalent. The schemas are written initially in the compact syntax, then converted to the XML format. TAN files permit three different levels of validation: terse, normal, and verbose. A phase may be specified with a pseudoattribute phase in the prolog, e.g.,

<?xml-model href="TAN-A.sch" phase="verbose"?>

It is customary, however, not to specify the phase, since most users will want to pick the level of validation desired at a given time. Verbose takes the longest time, and terse the shortest. Verbose provides the most feedback, terse the least. But some files will not show any difference in results from one phase to the next. For more on validation, see . Root element: The name of the root element identifies the type of TAN file:

Root TAN elements
Root element name    Type of data                                  TAN class
<TAN-T>              plain text transcriptions                     1
<TEI>                TEI transcriptions                            1
<TAN-A>              division-based alignments and annotations     2
<TAN-A-tok>          token-based alignments                        2
<TAN-A-lm>           lexico-morphological annotations              2
<TAN-mor>            part of speech / morphology patterns          3
<TAN-voc>            glossaries                                    3
<collection>         catalog of TAN files                          3

<collection> is provided here only to complete the table. None of the material in this chapter applies to this special class 3 format. See . Each root element takes a mandatory @id and @TAN-version. On @id, see below. @TAN-version must be 2020, the current version of TAN. All TAN elements take the namespace tag:textalign.net,2015:ns. In most cases, this value is placed in the root element. (The only exceptions are TAN-TEI transcription files, which take as a default namespace http://www.tei-c.org/ns/1.0 everywhere but in /TEI/head, which takes the TAN namespace.) For more about namespaces, see . Root element children: Most root elements take two mandatory children: <head> and <body>, the latter containing data and the former, metadata (data about the data). Root elements of TAN-TEI files take three children: <teiHeader>, <head>, and <text>. The apparent duplication of a head element is necessary: the <teiHeader> does not satisfy TAN metadata requirements. See . All TAN files may take one final optional child, <tail>, a private use element that allows any well-formed XML. It was introduced initially to experiment with methods in improving the efficiency of validation and applications, but it can be used for a variety of tasks or applications. Nothing in a TAN file should be dependent upon the <tail>. That is, if you are editing a TAN file and you add a <tail>, assume that it will be disregarded by other users. Similarly, you may delete any TAN file's <tail> without consequence.
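Putting the pieces of this section together, the overall shape of a TAN-T file can be sketched as follows. This is a skeleton only, not a valid TAN file: the schema paths and the @id value are hypothetical, and the required contents of <head> and <body> are omitted.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="schemas/TAN-T.rnc"?>
<?xml-model href="schemas/TAN-T.sch"?>
<!-- Skeleton only: schema paths and @id are illustrative assumptions -->
<TAN-T xmlns="tag:textalign.net,2015:ns"
       id="tag:example.com,2020:sample-text"
       TAN-version="2020">
   <head>
      <!-- metadata (data about the data) -->
   </head>
   <body>
      <!-- data -->
   </body>
   <tail>
      <!-- optional, private use; any well-formed XML; other users may disregard it -->
   </tail>
</TAN-T>
```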

Identifying TAN files: @id Every TAN file requires in its root element an @id, which must take the form of a tag URN (see for syntax). The file's @id is the primary way other TAN files will refer to it, and it may be used in RDFa, JSON-LD, and linked open data (see ). A tag URN begins with a namespace component and concludes with the identifying string. The namespace of @id must match at least one other tag URN namespace from the <IRI> of a <person> identified by <file-resp>. See . In choosing a value for @id you might borrow the filename, but you probably should not, since files are frequently renamed, often with good reason. A TAN file's @id should not be changed, especially after public release. The name should remain permanent and stable, even after updates. On occasion during editing, it will become clear that revisions are so deep that the file is altogether a different kind of thing. If a previous version has been published, then coining a new @id is advised, to make a clean break. You may always document the connection by supplying <predecessor>, which establishes a line of ancestry. If you take someone else's data and alter it, then you should not change the @id. To ensure that you are credited with any revisions you make to the file (if you are allowed; see <license>), you should add yourself as a <person> and then document your alterations through <change> or @ed-when and @ed-who. You might also add a <predecessor> element, pointing to a version of the file that predates your intervention. The @id is the only file-specific metadatum positioned outside <head>. It is placed as rootward in the document as possible to emphasize that it names the entire document.

TAN file versions The version of a TAN file is identified by the most recent date in a file's @when, @ed-when, and @accessed-when. Whenever you change a TAN file that has already been published, provide at least an edit stamp in the part of the file you changed or in a <comment> or <change>, so that anyone validating a TAN file dependent upon yours will be warned that changes have been made. The user may then either continue to process the file (the changes may be minor or inconsequential) or pause and see if anything on their end needs to be changed.

Attribute inheritability and priority Some attributes affect not merely their parent element but all their parent's descendants. This phenomenon is called inheritability. Most attributes are non-inheritable. That is, the attribute relates only to the parent element. Examples: @xml:id, @flags. If TAN schema documentation for an attribute does not state anything about the inheritability of its values, it should be treated as non-inheritable. Most inheritable attributes are weakly inheritable. That is, inheritance stops at any descendant that has the same attribute. For example, an element with @xml:lang set to eng specifies that its text nodes are in English, but it might contain another element whose @xml:lang is set to lat. In that case, the text of the inner element will be marked as Latin, not English. Other inherited attributes are cumulative. That is, their values somehow combine. For example, if an element with @cert wraps another, and each one has a @cert value of 0.5, it means that the wrapped element qualifies any claim it participates in as being only 25% certain (0.5 × 0.5 = 0.25), compounded perhaps by other elements in the claim that are not completely certain. @n in a <div> is indirectly cumulative for the purposes of resolving values of @ref. Any given <div> has one or more implied references, formed by all permutations of concatenating values of inherited @ns. Cumulative inherited attributes are infrequent. The documentation must be studied to understand how each one behaves. Some attributes have priority over other attributes. This is important for interpretation. @claimant, for example, has priority over @cert. That is, the two attributes in the same element are to be interpreted to mean: "@claimant has @cert confidence about the following claim:...." (It does not mean that one is uncertain whether the claimant made such-and-such a claim.)
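The weak inheritance of @xml:lang and the indirectly cumulative behavior of @n can be sketched with nested divisions. The division types and the text are invented for illustration.

```xml
<!-- Hypothetical fragment of a transcription body -->
<div n="1" type="chapter" xml:lang="eng">
   <!-- inherits xml:lang="eng"; implied reference built from the inherited @n values: 1 then 1 -->
   <div n="1" type="section">This text is marked as English.</div>
   <!-- inheritance of eng stops here; implied reference: 1 then 2 -->
   <div n="2" type="section" xml:lang="lat">Hic textus Latinus est.</div>
</div>
```

The second inner <div> overrides the language it would otherwise inherit, while its @n still combines with the @n of its parent to form its reference.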

Defining Words and Tokens At the heart of interaction between class-1 and class-2 files is the need to identify words. This poses a problem at the outset. The term word is notoriously difficult to define, no matter the context or language. For example, "New York" and "didn't" can each be justifiably claimed to be one or two words. Furthermore, some scholars consider punctuation to be words (e.g., commas in modern prose, representing "and"), whereas others ignore them as being anachronistic or capricious (e.g., commas inserted by a medieval scribe or a modern scholar into ancient Greek and Latin texts). In the end, the number of meanings for "word" reflects the diversity of scholarship. TAN follows the field of corpus linguistics and avoids word in favor of the proximate term token—one or more characters defined not according to grammar but according to a regular expression (see ). In TAN, a token is purely a string definition, used as a segmenting and pointing mechanism. To define a token in TAN does not entail any linguistic commitments. Neither editors nor users of TAN data should infer that a <tok> points to a morpheme, a lexeme, or any other linguistic entity. There will frequently be a fortuitous correlation between the two, but it is not guaranteed. TAN was developed with a concern for ancient literature, where punctuation is generally ignored as being late or not central to the text. Happily, even in contemporary use, most people ignore punctuation when they count words. Therefore the default <token-definition> defines a token as being any continuous string of word characters (\w), the soft hyphen, the zero-width space, or the zero-width joiner, formally defined by : <token-definition regex="[\w‍]+"/> This pattern closely resembles what is ordinarily thought of as words, but perhaps with some surprises (see above, ). If no <token-definition> is explicitly given, the default token definition above will be used. 
If you are working with modern texts, where punctuation might be important to name and number, try the built-in keyword letters and punctuation: <token-definition regex="[\w‍]+|[^\w‍\s]"/> This expression defines a token as a sequence of word characters or any single character that is neither a word character nor a space. The string (I go!) would have five tokens: ( I go ! ). For other standard TAN token definitions see <token-definition>s. You may customize your own <token-definition>. But keep in mind that TAN files were meant to be shared across fields and disciplines. You should define tokens in a way users of your texts expect. Specialized definitions make it difficult to compare the data in your TAN file with that in others. Two class-2 files annotating the same class-1 file cannot be easily compared or synthesized if they use different token definitions.

Metadata (<head>) No matter how much one TAN format differs from another, the metadata follows the same basic structure. Anyone getting a TAN file, no matter its class or type, is assumed to want to know, and therefore to find easily and predictably, the following: the stable name of the file; its version; its sources; other files upon which it depends or with which it otherwise has an important relationship; the most significant moments in the editorial history; the linguistic or scholarly conventions that have been adopted in creating and editing the data; the license, i.e., who holds what rights to the data, and what kind of reuse is allowed; and the persons, organizations, or entities that helped create the data, and the roles played by each. To answer these questions completely, consistently, and predictably, the <head>, a mandatory child of the root element, takes a common pattern across all TAN formats, making work across large numbers and types of TAN files predictable. The TAN <head>, intended to be concise and focused, compels you to provide metadata for the data that is governed by <body>, but it does not accommodate metadata for the metadata. That is, your metadata should focus on the data itself and not on other things. For example, <head> requires that you name the people who helped create or edit the data, but you are not expected to tell us about them. Merely give good <IRI>s to point to authoritative sources that provide background information. The principles above explain why the TEI extension of TAN requires two heads, one for TEI and the other for TAN. The <teiHeader> supports the creation of metadata that has little to no relevance to the content of <body>, has its own unique structure, has very few metadata that are required, and is not designed to incorporate IRIs. Although <teiHeader> and TAN's <head> overlap, they cannot be mapped onto each other. Each one serves its own purpose, so both must be retained.
In what follows we provide a general overview of the TAN <head>, focusing on its general structure, and some of the principles that affect other parts of the TAN ecosystem.

Key Information Key information about the file as a whole is the first section of a <head>. This includes <name>, perhaps one or more <desc>s, and perhaps one or more <master-location>s, which point to locations for authoritative versions. <master-location> is optional, but not if <to-do> (see below) is empty.

Key Declarations Each <head> in a TAN file has a declaration section, pertaining to how the file should be used: <license> and <numerals>. <license> stipulates the license(s) under which the persons listed in its @licensor are releasing the data. The license applies only to the data in <body>, not to its sources. The distinction is important, and helpful. It is much easier for you to decide and state the rights and license behind your own work than to do so for that of others. Declaring who holds what rights over your source(s) may be not only difficult but risky, and is therefore optional, best handled in a <desc> or <comment>. When using a TAN file, you should investigate the entire chain of rights. You may find discrepancies between the license of a TAN file and that of its sources. For example, you might create a thorough lexico-morphological analysis of a 20th-century novel, and legitimately release the TAN data under a public domain license, even though the novel itself is under copyright. Users must be aware of and respect licenses, and know that the license in a TAN file may not be the license of its sources. TAN adopts the Creative Commons licenses as its default license vocabulary. See . <numerals> may be used to declare whether an ambiguous numeral should be interpreted as an alphabetic numeral or a Roman numeral (default). See the entry for <numerals> as well as the section on numeration systems. Many TAN files allow in this section <token-definition>, which specifies a definition for tokens, perhaps tailored via @src to a specific class-2 file. See and <token-definition>.

Networked Files The third major section of <head> accommodates links and references to other files. Some files are essential to processing the TAN file, while others are less important. The two most critical types of files are marked by <inclusion> and <vocabulary>. The files pointed to by these elements should be considered constituent parts of the dependent TAN file. In the validation process, failure to access one will be treated as a fatal error. <inclusion> and <vocabulary> were developed to reduce duplication (and therefore potential error) in collections of TAN files. Many if not most TAN files are created alongside or in the context of a project, where certain data patterns are repeated. Explicit repetition from one file to the next makes them prone to error. Changes might be made in one file but not in another, introducing version conflicts. <inclusion> and <vocabulary> provide a specialized method of inclusion that leads to cleaner, smaller files. In general, you should first try using <vocabulary>. If that element does not do what you want, then try <inclusion>. It is normally easier to diagnose a complex set of <vocabulary>s than a complex set of <inclusion>s.

Vocabularies Oftentimes, from one file to the next, an editor needs to refer repeatedly to a common set of things, e.g., manuscripts, works of literature, or persons who helped edit the files. Projects are advised to create their own <TAN-voc> files populated with commonly used vocabulary. Once set up, the TAN-voc file must be linked to via a <vocabulary> in the <head> of other TAN files. Vocabulary items can then be invoked either by pointing to <name> values, or by assigning an @xml:id to a vocabulary item placed in the <head>'s <vocabulary-key>. If you draw upon <name>, you may make alterations to capitalization, and hyphens, spaces, and underscores are treated as interchangeable. Capitalization and spelling of @xml:id, however, must be strictly followed. Vocabulary (TAN-voc) files tend to require frequent change and expansion, so it is recommended that you depend upon only those TAN-voc files that are part of your project, and not those from a different project. In the host file, any attribute that takes multiple IDrefs, e.g., @who, @type, @subject, may mix references to vocabulary items via @xml:id or <name>, but because in such attributes spaces are reserved to delimit multiple values, in the case of the latter, any space in a <name> must be replaced with the underscore or hyphen. A @which in the host file, however, can take no more than one value, so using spaces is fine. @id and @xml:id are case-sensitive, and do not allow spaces. @which and therefore <name> are not case-sensitive, and the space, hyphen, and underscore are equivalent. If you point to @id or @xml:id you must respect case and punctuation. If you are pointing to a <name> you can ignore case, and you should probably replace the space with a _. TAN includes a number of standard vocabulary (TAN-voc) files for a variety of concepts commonly used in textual scholarship (see ).
For example, there are more than one hundred types of textual divisions that can be invoked simply by using their names (see ). <vocabulary> itself may take @which, but only to point to one of the extra TAN vocabularies listed in . This restriction avoids some complexity in the validation routine. See on how to use this feature. Files pointed to by <vocabulary> are considered an essential part of any TAN file. Failure to find the target file will throw a fatal error during validation.
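The two ways of invoking vocabulary can be sketched as follows, assuming a project TAN-voc file that defines a person named Jane Doe. The tag URN, file path, ids, and names are hypothetical, and the use of @who on <file-resp> is an assumption based on the attributes mentioned above.

```xml
<!-- Hypothetical fragment of a <head>; all values are illustrative assumptions -->
<head>
   <!-- ... -->
   <vocabulary>
      <IRI>tag:example.com,2020:voc:project</IRI>
      <name>Project vocabulary</name>
      <location href="project-voc.xml" accessed-when="2020-08-13"/>
   </vocabulary>
   <vocabulary-key>
      <!-- @which takes a single value, so the space in the name may be kept -->
      <person xml:id="jd" which="Jane Doe"/>
   </vocabulary-key>
   <!-- a multi-value attribute: one reference by @xml:id (case-sensitive),
        one by name (case-insensitive, space replaced by an underscore) -->
   <file-resp who="jd john_smith"/>
   <!-- ... -->
</head>
```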

Inclusions Whereas vocabularies do not change the host document, inclusions do. Unlike other forms of inclusion you might be familiar with, TAN inclusion is targeted at select elements, never an entire file. TAN inclusion is a two-step process. First, a TAN file is linked to, and therefore made available for inclusion, via <inclusion>s (inside <head>). Like <vocabulary>, an <inclusion> does nothing on its own. It merely points to a file that is eligible for inclusions. No actual inclusions occur until the next step. Second, select parts of the included file are invoked in the dependent file. To do so, insert an element X in a valid location, but with nothing but @include, with one or more values (space-delimited) pointing to the @xml:id values of the <inclusion>s desired. In the validation process, that element will be replaced with every element X found in the inclusion file, resolved recursively, and ignoring duplications (deeply equal elements). For example, a TAN-T file might have a

<div include="poem1"/>

The validation routine will replace that element with every rootmost <div> in the included file called poem1. Any host file that includes elements from another file inherits any associated vocabulary, and along with it @xml:id values. This may result in errors if there are any resultant conflicts in IDrefs. TAN inclusion is very practical for texts. Textual works commonly nest inside each other. By setting up your class-1 files as a series of inclusions, you can reduce validation time, both in the file and in class-2 files that depend upon the transcriptions. See the examples subdirectory for a case of the Gospels including the Sermon on the Mount including the Lord's Prayer. The inclusion technique is also especially useful for vocabulary (TAN-voc) files. A single master TAN-voc file can include other vocabulary files, each devoted to a particular type of item (e.g., one for works, one for scripta). Project files then need to link merely to the master TAN-voc file. You can include a TAN file that itself includes other TAN files. Inclusion is recursive. In any recursive system, circularity is fatal. That is true for TAN inclusion as well, but only within the scope of specified element names. It is perfectly legal for two files to include each other, as long as they do not try to include (directly or indirectly) the same elements, or try to consult each other to resolve any vocabulary. Files pointed to by <inclusion> are considered an essential part of any TAN file. Failure to find the target file will throw a fatal error during validation.
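The two steps of TAN inclusion can be sketched together in one host file. The root attributes, tag URNs, and file path are hypothetical, and the rest of the <head> is omitted.

```xml
<!-- Hypothetical host file; values are illustrative assumptions -->
<TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:example.com,2020:anthology" TAN-version="2020">
   <head>
      <!-- ... -->
      <!-- Step 1: make a file eligible for inclusion; nothing is included yet -->
      <inclusion xml:id="poem1">
         <IRI>tag:example.com,2020:poem1</IRI>
         <name>Sample transcription of poem 1</name>
         <location href="poem1.xml" accessed-when="2020-08-13"/>
      </inclusion>
      <!-- ... -->
   </head>
   <body>
      <!-- Step 2: an element with nothing but @include; during validation it is
           replaced by every rootmost <div> of the file identified as poem1 -->
      <div include="poem1"/>
   </body>
</TAN-T>
```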

Other Related Files Other files can be specified. The more that are mentioned, the richer the network. <predecessor> and <successor> point to versions of the file that precede and postdate it. <source> is another type of related file, but it may or may not link to another file. In class-2 files <source> always points to a class-1 TAN file. In class-1 and class-3 files, <source> may point either to a file or to a scriptum (see ). <see-also> can be used to point to any file that has some relationship to a TAN file. The required @relationship points to one or more <relationship> vocabulary items. There is no standard TAN vocabulary for relationships. Normally, when a file-to-file relationship is considered important, it becomes a full-fledged element. Some TAN formats allow special types of related files (e.g., <redivision> and <model> for class-1 files). See metadata descriptions under specific classes or formats.

Adjustments The fourth major section of <head>, which is optional, consists of <adjustments>, which specifies changes that have been made (class 1), or should be made (class 2), to the sources. In class-1 files, these consist of <normalization>s and <replace>s; see . Class-2 files allow <skip>, <rename>, <equate>, and <reassign> as adjustments; see .

Local vocabulary items and ID assignments: <vocabulary-key> The fifth major part of <head>, <vocabulary-key>, allows you to declare any specific vocabulary items relevant for the file. It also allows you to take vocabulary items existing in other TAN-voc files (whether defined in <vocabulary> or standard TAN vocabulary), and assign them @xml:ids that are valid only in the current file. Anything in <vocabulary-key> will overwrite default TAN vocabulary, but not any TAN-voc files pointed to via <vocabulary>. These id assignments can be supplemented with <alias>es, which are used to assign a single id to one or more other ids. This practice resembles what text editors do when naming groups of manuscripts. Each manuscript is given a siglum, say a single lowercase Greek or Latin letter, and the manuscripts are grouped together into families, with each family given its own siglum, say an uppercase letter. If the editor wishes to indicate that a whole family of manuscripts departs from a particular reading, the family siglum is all that is needed. An <alias> works much the same way, and can be used for any vocabulary items. For example, if a textual division can be legitimately called both a rubric and a heading, you could assign rubr and hd as ids in the <vocabulary-key> to the vocabulary items for the rubric and the heading, and then insert

<alias xml:id="rubrichead" idrefs="rubr hd"/>

Then, in that file, <div n="1" type="rubrichead"> would identify that <div> as being both a rubric and a heading. Unlike other similar attributes, the @idrefs of an <alias> cannot point to the <name> value of vocabulary items. They can point only to the id references of locally defined instances of @xml:id. This restriction reduces confusion, and avoids some complexity in the resolution and validation of a TAN file. <alias>es may recurse, as long as there is no circularity. That is, @idrefs in an <alias> may refer to any @xml:id or @id, not only to a vocabulary item but to another <alias>. In most cases <alias> should refer to items of the same type. In a few situations mixed groups do not pose a problem, for example mixing <person>s, <algorithm>s, and <organization>s. TAN validation will indicate whether mixed typology introduces errors. Because @xml:id may not contain certain types of characters, such as common punctuation marks, and because <alias> must be able to coin unusual ids (especially for grammatical features), @id may be used instead of @xml:id in <alias>.
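A sketch of grouping and recursion with <alias>, using persons rather than division types. The ids and names are hypothetical, and placing <alias> inside <vocabulary-key> is an assumption drawn from the description above.

```xml
<!-- Hypothetical fragment; ids and names are illustrative assumptions -->
<vocabulary-key>
   <person xml:id="jd" which="Jane Doe"/>
   <person xml:id="js" which="John Smith"/>
   <person xml:id="kp" which="Kim Park"/>
   <!-- an alias groups locally defined ids, never <name> values -->
   <alias xml:id="transcribers" idrefs="jd js"/>
   <!-- aliases may point to other aliases, provided there is no circularity -->
   <alias xml:id="team" idrefs="transcribers kp"/>
</vocabulary-key>
```

An attribute that accepts person ids could then take team to stand for all three persons.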

Responsibility The sixth section of a <head> declares who is responsible for the file. It consists of a <file-resp> and one or more <resp>s. The persons, organizations, or algorithms pointed to in <file-resp> must include at least one who has a tag URN whose namespace matches the namespace in the tag URN of the root element's @id. This requirement strengthens the effort to make sure that each TAN file is associated with the person or persons who are or were responsible for the file. <person>s so identified by <file-resp> are called primary agents, and are bound to the global variable $primary-agents. If a claim is made in a TAN file, and no @claimant is explicitly declared, it is assumed that the $primary-agents are making the claim.
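The namespace requirement can be illustrated with a root @id and a responsible person whose tag URNs share the namespace tag:example.com,2020. All values are hypothetical, and the @who attribute on <file-resp> is assumed rather than taken from this section.

```xml
<!-- Hypothetical file; the shared tag namespace is tag:example.com,2020 -->
<TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:example.com,2020:iliad-grc" TAN-version="2020">
   <head>
      <!-- ... -->
      <vocabulary-key>
         <person xml:id="jdoe">
            <!-- same tag namespace as the root @id -->
            <IRI>tag:example.com,2020:person:jane-doe</IRI>
            <name>Jane Doe</name>
         </person>
      </vocabulary-key>
      <!-- jdoe becomes a primary agent and the default claimant -->
      <file-resp who="jdoe"/>
      <!-- ... -->
   </head>
   <body>
      <!-- data -->
   </body>
</TAN-T>
```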

Change log The change log, the seventh section of the <head>, consists of one or more <change>s, which provide a partial history of the file. The entire history is calculated from every attribute that has a date or dateTime value, which can be fetched via the function tan:get-doc-history() or the global variable $doc-history. The change log is an effective way to communicate to those who might use your files. In all likelihood, a user will download from the master location a local copy. You might make changes or updates to your master copy. Anyone depending upon a copy will be warned, during Schematron validation, of each <change> that postdates the value of their @accessed-when. If you have introduced an important or disruptive change, you can mark your <change> with @flag, which allows the following values: warning (default value), error, info, fatal. By marking a change as info, you lower the level of a change's importance; error raises it. The value fatal will halt the validation process in the dependent file altogether. If you receive change messages during validation, and you want to stop them, merely update the value of @accessed-when to the current date.
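A flagged change-log entry might be sketched like this. The @when and @flag attributes are named in this chapter; the @who attribute and all values are assumptions made for illustration.

```xml
<!-- Hypothetical entry; users whose @accessed-when predates 2020-08-13 would see it as an error -->
<change who="jdoe" when="2020-08-13" flag="error">Renumbered chapters 3 through 5. References in dependent files must be updated.</change>
```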

Pending work The last section of a <head> lists all pending tasks that yet need to be applied to a file. These are itemized as a list of <comment>s in <to-do>. A file with an empty <to-do> is assumed to be no longer in progress, so there must be a <master-location> provided. Like the change log, the <to-do> effectively communicates cautionary notes to those who might use your files. Anyone depending upon a copy will be warned, during Schematron validation, of each item in the list. The report is not dependent upon when the file was last consulted (@accessed-when), because these are standing, unresolved issues. One benefit of <to-do> is that a clear account of what remains to be done will encourage people to release their material earlier than normal, because other users will have fair warning about what is imperfect or incomplete.
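A <to-do> with one pending task could be sketched as follows. The @who and @when attributes on <comment> and their values are assumptions; only the nesting of <comment> inside <to-do> comes from the description above.

```xml
<!-- Hypothetical pending task, reported to dependent files on every validation -->
<to-do>
   <comment who="jdoe" when="2020-08-13">Chapters 4 to 6 still need proofreading.</comment>
</to-do>
```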

Class-1 TAN Files, Representations of Textual Objects (Scripta) This chapter provides general background to class-1 TAN files and their elements and attributes. For detailed discussion see . Class 1 TAN files preserve segmented transcriptions of books, manuscripts, papyri, stones, or any other objects with writing on them—collectively termed here scripta (sg. scriptum). Class-1 files are the foundation of any TAN project. No TAN-A-tok or TAN-A-lm file can be created without at least one class-1 file. There are two types of class-1 formats, identified by the root element. <TAN-T> is a simple, generic format, as close as one can get to plain text. <TEI> (also referred to in this manual as TAN-T(EI)), on the other hand, can be complex and highly expressive. Because the two formats function almost identically, the generic TAN-T format is described first, followed by supplemental comments on TAN-TEI.

Principles and Assumptions

General (For more general principles and assumptions applying to all TAN files, not just class 1, see .) Class-1 formats are designed for faithful but judiciously normalized digital transcriptions. Each TAN-T(EI) file is devoted exclusively to a single version of a single work found in a single scriptum (text-bearing object), segmented and uniquely labeled with a (preferably familiar) reference system. Editors of TAN-T(EI) files should be able to read, write, and proofread texts in the languages of the transcriptions. They should understand the texts well enough to segment them and label them according to the conventions used for those works. They should be able to distinguish the text of a primary source from its editorial apparatus. They should be familiar with normalizing conventions for texts from the period, language, and culture. They should know how the transcription might be used in other contexts, especially translation studies or a study of quotations. Editors need not understand everything about their texts, and they need not have any specialized skill in grammar or lexicography. They need not know the morphology of individual words, or how individual parts of the text have been translated. Those skills are more profitably spent editing other TAN formats. TAN-T(EI) editors stand at the foundation level of the Text Alignment Network. Because other files will depend upon them, careful proofreading is important. Eliminating as many typographical errors as possible before publication will maximize the utility of a TAN-T(EI) file. On the other hand, TAN has been designed with the assumption that most files in circulation have typographical errors that can and should be corrected as they are found. If you are aware that a text needs proofreading, but you still want to make it available, simply leave a <comment> in the <to-do> part of the <head>. 
If you are creating a TAN-T(EI) file, you are doing so primarily to facilitate alignment and annotation, which requires use of a suitable reference system (see reference systems). Transcription files should be segmented and labeled according to a reference system that is familiar and can be easily applied to other versions of the same text in other languages. If possible, semantic mileposts (clauses, sentences, paragraphs, chapters) should be prioritized over visual ones (lines, columns, pages, volumes). Any transcription can be furnished with multiple reference systems, but it is advisable to do so on the basis of separate files, linked by <redivision>s in the <head>. See .

Domain model Contributors and users of TAN files must sharply distinguish between a scriptum (text-bearing object) and a conceptual work, e.g., between a specific printed copy of the Iliad and the Iliad conceived generally. The former has materiality (digital files are treated here as being material) and the latter does not. Even though both are constitutively necessary for any transcription, the two are always differentiated in the TAN format: <source> and @src point to physical exemplars; <work>, @work, and <version> to the conceptual. Adherence to this distinction is quite important. Some readers may be reminded at this point of the domain model defined by the Functional Requirements for Bibliographic Records (FRBR), which identifies in its Group 1 (Products of intellectual & artistic endeavor) four types of entities: work, expression, manifestation, and item. A work is "a distinct intellectual or artistic creation" and an expression is the conceptual, immaterial realization of a work. Both work and expression are terms for conceptual, non-material entities. A manifestation, on the other hand, is "the physical embodiment of an expression" and an item is a single exemplar of a manifestation. Quotations in this section come from International Federation of Library Associations and Institutions, Functional Requirements for Bibliographic Records: Final Report, amended and corrected (February 2009), .

Examples of FRBR Group 1 Entities
Work: Iliad
   Expression: Caroline Alexander's English translation of the Iliad
   Manifestation: the print run identified with ISBN 978-0062046284
   Item: a specific copy
Work: The Psalms
   Expression: the (Hebrew) Masoretic Psalter
   Manifestation: the 1820 printing of George Offor's edition of the Hebrew Psalms
   Item: Biblioteca Palatina Cod. Parm. 1699
Work: A River Runs Through It
   Expressions: Norman Maclean's original version; the 1992 film version
   Manifestations: print run ISBN 0226500608; Blu-ray disc UPC code 004339632533
   Items: author's personal print copy; reference print CGB 7432-7438 (deposited in the Library of Congress)

TAN's domain model differs slightly. The most important difference is abandonment of FRBR's expressions, which was considered problematic in the development of sample TAN data. The term expressions was intended to describe a conceptual, non-material entity, but the FRBR guidelines defined and explained it in vague or material terms. "Expression encompasses, for example, the specific words, sentences, paragraphs, etc. that result from the realization of a work in the form of a text....defined, however, so as to exclude aspects of physical form, such as typeface and page layout, that are not integral to the intellectual or artistic realization of the work as such." (ibid., p. 19, emphasis added) That is, expression includes integral aspects of physical form (e.g., typeface that is integral to the realization). "Inasmuch as the form of expression is an inherent characteristic of the expression, any change in form (e.g., from alpha-numeric notation to spoken word) results in a new expression." (p. 20, emphasis added) Even the very term expression and FRBR's preferred synonym, realization, imply materiality (without which nothing can be expressed or realized). Further, FRBR's expression does not easily handle creative adaptations of works that are themselves arguably works in their own right. For example, Euripides' Medea was adapted several centuries later by Seneca the Younger. Seneca's Medea is arguably merely an expression, but has itself been subject to various editions and performances, i.e., expressions. But FRBR does not accommodate expressions of expressions. If Seneca's Medea is treated as a work in its own right, its expression relationship to Euripides' original is lost, since FRBR does not accommodate works that are expressions of other works. In the TAN domain model, expression is altogether dropped. There is only one type of conceptual, non-material entity, namely, a work. 
The term version in TAN is applied to a work that substantially follows but varies from another work, e.g., translations and adaptations. But such versions are themselves still works. One work is indicated to be the version of another in a class-1 file through the <work> and <version> declarations. As for material entities, FRBR's manifestation and item are combined in TAN through the term scriptum. A scriptum is a text-bearing object, e.g., book, manuscript, pamphlet, tombstone, traffic sign, digital file (digital media is interpreted as being material). When scriptum is used in a TAN file, it points either to a single physical item or to a set of physical items that are for all intents and purposes indistinguishable (i.e., a scriptum reproduced mechanically). A scriptum that points to a manuscript points only to that one particular manuscript. But a scriptum that points to a printed book or a digital file is understood as applying to all copies of that printed book or digital file. There is at present no formal mechanism to specify whether a scriptum points to one object or a set of objects. The distinction must be inferred from a scriptum's IRI + name pattern. In cases of potential ambiguity, it is up to creators of a TAN file to assign to the scriptum IRIs that avoid confusion. For example, to point to Edward Gibbon's personally annotated copy of the 1763 edition of Herodotus (now held by the Wren Library, Trinity College, Cambridge University), one should not use or , which point to the set of all copies. In this case, one may need to mint one's own IRI, based on the Wren Library's acquisition number, RW.50.15. In summary, the TAN domain model defines two kinds of entities: works and scripta. Works, which are immaterial, conceptual entities, may contain other works, or they may be versions of other works (or work-versions). Scripta, which are material entities, may contain other scripta, and they may refer either to a single object or to a set of copies. 
A work may be instantiated in many scripta, and similarly, any scriptum may contain many works. Most work-scriptum relationships can be inferred from the <head> of a class-1 file, and they may be expressed in a <TAN-A> file. Examples of TAN Entities Work Scriptum Iliad Caroline Alexander's English translation of the Iliad. the print run identified with ISBN 978-0062046284 a specific copy The Psalms The (Hebrew) Masoretic Psalter The 1820 printing of George Offor's edition of the Hebrew Psalms Biblioteca Palatina Cod. Parm. 1699 Norman Maclean's A River Runs Through It The 1992 film A River Runs Through It Print run ISBN 0226500608 Author's personal print copy Blu-ray disc UPC code 004339632533 Reference print CGB 7432-7438 (deposited in the Library of Congress)
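The two-entity model summarized above can be pictured as a small data structure. The following Python sketch is only an illustration, not part of TAN: the class names, field names, and the tag URNs under example.com are all hypothetical. It deliberately has no field recording whether a scriptum is a single object or a set of copies, since the guidelines state that no formal mechanism exists for that.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Work:
    """A conceptual, immaterial entity (hypothetical modeling class)."""
    iri: str
    name: str
    version_of: Optional[Work] = None            # a version is itself a work
    parts: list[Work] = field(default_factory=list)  # works may contain works

    def root_work(self) -> Work:
        """Follow version_of links back to the work that is no one's version."""
        work = self
        while work.version_of is not None:
            work = work.version_of
        return work


@dataclass
class Scriptum:
    """A material, text-bearing object or set of indistinguishable copies."""
    iri: str
    name: str
    parts: list[Scriptum] = field(default_factory=list)  # scripta may contain scripta
    works: list[Work] = field(default_factory=list)      # a scriptum may carry many works


# Illustrative IRIs only; real files should use catalogue URLs, tag URNs, or UUIDs.
iliad = Work("tag:example.com,2020:work:iliad", "Iliad")
alexander = Work(
    "tag:example.com,2020:work:iliad:ver:alexander",
    "Caroline Alexander's English translation of the Iliad",
    version_of=iliad,
)
print_run = Scriptum(
    "tag:example.com,2020:scriptum:isbn-978-0062046284",
    "Print run identified with ISBN 978-0062046284",
    works=[alexander],
)
```

Note that the translation is modeled as a Work, not as a separate kind of entity, which is the point at which TAN departs from FRBR.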

One Version, One Work, One Object, One Reference System Every TAN-T(EI) file must be restricted to a transcription of a single version of a single work found on a single scriptum, segmented and labeled according to a single reference system. The principle above is critical to the success of the network. It reduces the risk of confusion and simplifies the files. It follows the generally advisable principle that master data should be disaggregated.

One Scriptum Each TAN-T(EI) file must transcribe one and only one text-bearing object or scriptum. It may be a digital file, a book, a manuscript, a stone, a sign, or a bottlecap. If the object you've chosen has been made mechanically and is virtually indistinguishable from other objects created by the same process (e.g., copies of a printed book or copies of a digital file), then the entire set of copies (what some librarians call a manifestation) is to be regarded as the scriptum. Identifying and naming a scriptum might require an editor's discernment and judgment. For example, some manuscripts have been split up, their parts now residing in multiple libraries around the world; other manuscripts are composites, made of several manuscripts. In such cases, you may need to define your scriptum in a way that might not match the way others define it. But the decision is your prerogative, not theirs. You have both the right and responsibility to define your object in the way that you think will most benefit users of your files. The scriptum is declared via <source>, which either takes the IRI + name pattern, or points to a <scriptum> vocabulary item. It is a good idea to name your scriptum with an <IRI> value in the form of an http URL that points to a detailed entry in a library catalogue. Doing so allows users to retrieve extensive, structured bibliographical information. You also save yourself the hassle of having to write a detailed, structured bibliographical description. If a URL cannot be found for <IRI>, you may simply coin a tag URN or a UUID. Alternatively, if you find another TAN file that uses the same scriptum-source, incorporate its <name>s and <IRI>s with your own (multiple <name>s and <IRI>s are a virtue). If you need to specify exactly where on a scriptum a work-version appears (e.g., page range), <comment> or <desc> should be used.

One Work The transcription must be restricted to a single creative work, identified by <work> (part of the declarations section of <head>). Many scripta have more than one work. Identifying the creative work you transcribe is, once again, your prerogative. Suppose the scriptum you have is a Bible. You define the work. Perhaps you wish to encode the entire Bible and treat it as a single work. Or maybe you wish to treat only the New Testament as the work, or the Tetraevangelion, or the Gospel of Matthew, or a specific episode in that gospel, or merely the Beatitudes. Use whichever work you like, but make sure that the TAN-T(EI) file contains nothing but the work you have declared. It should be a complete representation of what is found on the object, even if only partially preserved, and respect as far as is practical the order of the text in the scriptum. The requirement to provide the entirety of the work-version on the scriptum is a significant departure from the fourth principle of . Users should be able to assume that the transcription in a class-1 file covers the entirety of the work-version chosen, within the particular scriptum. If you are aware that the transcription is incomplete, leave a <comment> to that effect in the <head>'s <to-do>, identifying which portions are missing from the transcription. Well-known works may have a suitable IRI already assigned to them, say by means of a DBpedia entry. Most works have not been assigned IRIs or are named in IRI vocabularies that are not well known. You may assign any work your own URN, through a UUID or a tag URN.

One Version The transcription must be restricted to a single version of the creative work, identified perhaps by <version> (part of the declarations section of <head>). In most cases, <version> is unnecessary, because <work> in conjunction with <source> is usually sufficient to identify a particular work-version. But if the source carries multiple versions (e.g., a bilingual edition of a text), then <version> should be included, to specify which version has been transcribed. <version> can also be used to declare explicitly that the work mentioned in <version> is a version of the work mentioned in <work>. If you have a scriptum with multiple versions of a work, and you wish to transcribe them all, each version should be in its own separate TAN-T(EI) file. There may be cases where individual textual divisions are repeated, not so much because they represent a different version, but because they are variants that are integral to the work-version chosen. Creating a separate file for such individual cases would be both impractical and misleading. Standard TAN vocabulary for div types includes as a standard item variant, which may be used to wrap every variant in its own <div>, e.g., . . . . . <div type="title" n="title"> <div type="variant" n="orig">The Place</div> <div type="variant" n="subscript" xml:lang="grc">Ὁ Τόπος</div> </div> . . . . . Notes should be included only if they are an integral part of the primary work (i.e., by the same author, not by a later editor). If you think the notes to a work are important, and legitimately a work in their own right, consider putting them in their own TAN-T(EI) file, or converting them to claims in a TAN-A file. Very few work-versions have IRIs. It is advisable to assign a tag URN or a UUID. If the IRI you have used for <work> is in a namespace that you own or control, then you are entitled to modify it, and you may wish merely to add a suffix to the work IRI. 
For example, you might have tag:urn:example.com,2001:work:a defined for the work; a 1987 German translation might be specified as tag:urn:example.com,2001:work:a:ver:1987:deu.
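As a rough illustration of the two options just described (coining a UUID, or suffixing a work IRI you control), here is a minimal Python sketch. The function names are hypothetical, and the `ver` segment simply mirrors the example above; TAN does not prescribe any particular suffix pattern.

```python
import uuid


def coin_uuid_urn() -> str:
    """Coin a fresh, random UUID expressed as a URN (urn:uuid:...)."""
    return uuid.uuid4().urn


def version_iri(work_iri: str, *qualifiers: str) -> str:
    """Derive a version IRI by suffixing a work IRI that you own or control.

    The ':ver:' segment and the colon-delimited qualifiers follow the example
    in the guidelines; they are a convention, not a TAN requirement.
    """
    return ":".join([work_iri, "ver", *qualifiers])


# The work IRI is copied verbatim from the example in the guidelines.
german_1987 = version_iri("tag:urn:example.com,2001:work:a", "1987", "deu")
print(german_1987)
print(coin_uuid_urn())
```

Suffixing keeps the relationship between work and version legible to human readers, but software should never infer that relationship from the shape of the IRI alone; it is declared through <work> and <version>.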

One Reference System Every TAN transcription must be segmented into a hierarchy of labeled divisions, defined in the <body> through <div>s and their @n values. Those divisions, whenever possible, should align with the reference system that prevails for the work across different versions or translations, in what is sometimes called a canonical reference system. Because even the most familiar reference system admits degrees and dispute, the term canonical is problematic. It is avoided in these guidelines; we refer simply to a work's reference system. If you have your choice, preference should be given to reference systems that follow the semantic contours of the work, not the physical features of a particular scriptum. Chapter, paragraph, and sentence numbers are preferable to volume, page, and line numbers, because other versions of the work (e.g., translations, paraphrases) will only roughly, if at all, follow a reference system based on features found in a particular scriptum. Sometimes a scriptum-based reference system is inescapable, or is the most common reference system for a work (e.g., Porphyry's commentary on the Categories). It is perfectly acceptable to adopt that system, but it may entail more labor during the alignment process. If a given work has more than one common reference system (e.g., the works of Plato and Aristotle, which have two reference systems, logical and scriptum-oriented, both of which are standard and important), then the recommended practice is to create two class-1 files with identical transcriptions, each one structured by its own reference system. Place in each file a <redivision> pointing to the other. Under verbose validation, you will be notified if there are textual discrepancies between the transcriptions, and Schematron Quick Fixes will allow you to automatically update one text to match the other. Having two or more alternatively divided editions can be quite useful. 
They could serve as the basis for reference cross-indexes, or to help convert other versions of the work from one reference system to the other. If there is a good reference system, but the divisions are overly lengthy, you may introduce subdivisions. But there is no guarantee that the provisional subdivisions you introduce will be adopted by other editors who create or edit TAN versions of the same work. Editors working independently upon the same text and subdividing it will likely produce discordant schemes. Class-2 formats provide a mechanism via <adjustments> to reconcile some basic differences. But a discordant scheme might be best handled simply by creating a copy, and restructuring it according to the preferred system, making sure related files refer to each other through <redivision>. If a work does not have a reference system, or if you think that the ones that exist are inadequate or misguided, create one of your own. If you develop your own reference system, be sure to design it so that it can be easily applied to any version of the work, including translations. Prefer logical divisions of text over scriptum-based divisions. TAN supports five major methods of numeration in reference systems: Arabic numerals. 1, 2, 3, etc. Roman numerals. Values up to 5000, utilizing i, v, x, l, c, d, and m, uppercase or lowercase, with liberal syntactic rules (within a roman numeral, any digit preceding one of a higher value will be deducted from the total value; all others are added). Alphabetic sequences. The 26-letter Latin alphabet, with numbers higher than 26 (or any multiple of 26) beginning with the letter a incrementally repeated, e.g., y (25), z (26), aa (27), bb (28), … aaa (53). Uppercase or lowercase allowed. (Note, this is not the hexavigesimal (base 26) system, where a is 0, b is 1, z is 25, aa is 00, ab is 01, etc.) Arabic numerals + alphabetic sequences. Arabic numerals followed immediately by an alphabetic sequence. 
The second item is to be calculated as a subsequence of the first item, with the lack of a second item taking highest priority. E.g., 4, 4a, 4b, 4c.... Alphabetic sequences + Arabic numerals: As above, but with alphabetic sequence preceding Arabic numerals. See tan:letter-to-number() and references there to TAN functions for converting numbering systems. The TAN validation process attempts to convert all values of @n to Arabic numerals. Some values are ambiguously Roman numerals or alphabetic sequences. For example, c could mean 3 (alphabetic sequence) or 100 (Roman numeral). Such numerals are assumed to be Roman, unless you supply a <numerals> and assign @priority to specify letters (or roman).
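The numeration rules above can be sketched in a few lines of Python. This is an illustration only and does not reproduce the behavior of the TAN function library (for that, consult tan:letter-to-number() and the functions it references). All function names are hypothetical, and the sketch assumes that "preceding" in the Roman numeral rule means "immediately preceding".

```python
import re

ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def alphabetic_to_int(value):
    """a=1 ... z=26, aa=27, bb=28, aaa=53. None if not one letter repeated."""
    s = value.lower()
    if not re.fullmatch(r"([a-z])\1*", s):
        return None
    return (len(s) - 1) * 26 + (ord(s[0]) - ord("a") + 1)


def roman_to_int(value):
    """Liberal Roman numerals: a digit directly before a higher one is deducted."""
    s = value.lower()
    if not s or any(ch not in ROMAN for ch in s):
        return None
    digits = [ROMAN[ch] for ch in s]
    total = 0
    for i, d in enumerate(digits):
        following = digits[i + 1] if i + 1 < len(digits) else 0
        total += -d if d < following else d
    return total if total <= 5000 else None  # guidelines: values up to 5000


def n_to_int(value, priority="roman"):
    """Resolve an @n-like value; ambiguous letters follow `priority`."""
    if re.fullmatch(r"[0-9]+", value):
        return int(value)
    order = [roman_to_int, alphabetic_to_int]
    if priority == "letters":
        order.reverse()
    for convert in order:
        result = convert(value)
        if result is not None:
            return result
    return None


def arabic_alpha_key(value):
    """Sort key for 4, 4a, 4b ...: a missing letter part sorts first."""
    m = re.fullmatch(r"([0-9]+)([a-z]*)", value.lower())
    if not m:
        return None
    sub = alphabetic_to_int(m.group(2)) if m.group(2) else 0
    return None if sub is None else (int(m.group(1)), sub)
```

With these definitions, `n_to_int("c")` yields 100 under the default Roman priority and 3 when `priority="letters"`, which is the ambiguity that <numerals> and @priority are meant to settle.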

Extra @n vocabulary If you are using @n to label the names of books of the Bible or Surahs of the Qur'an, you will run into the issue of different conventions for @n. To avoid this long-standing problem, you may want to use extra TAN vocabulary for @n. If you include in the head of your TAN file <vocabulary which="bible eng"/>, then any non-numeric values of @n will be checked against the corresponding TAN-voc file (in this case, the TAN-voc file at /vocabularies/extra/n.bible.eng.tan-voc.xml). This, in turn, will allow other files to refer to that <div> by any other <name> that is a synonym. For example, in a class-1 file pointing to the TAN English Bible vocabulary above, a <div type="book" n="matt">...</div> would be regarded as containing the work the Gospel of Matthew. Any class-2 file that refers to that class-1 file as a source may use any synonym listed in the extra vocabulary file n.bible.eng.tan-voc.xml, i.e., Mt, Mat, Matt, or Matthew (or their lowercase equivalents). An extra benefit of this method is that such <div>s are also marked as the works, identified by the <IRI>s of the target TAN vocabulary items. If you use extra TAN vocabulary, it is recommended you include in the declarations section of your <head> an <n-alias>. This element, along with its @div-type, specifies exactly which types of <div>s are eligible for this kind of aliasing on @n. Supplying this element considerably speeds the validation process on long files. The goal behind the extra vocabularies is to eliminate the need to worry about what abbreviations are used to name well-known, unnumbered <div>s. It is hoped that in future releases of TAN these extra vocabularies will grow in number and quality. Extra TAN @n vocabularies:

Normalizing Transcriptions You should declare how you have normalized the transcription via <adjustments> and its children, e.g., <normalization> or <replace>. (For suggestions on values of <IRI> for <normalization> see .) Generally speaking, normalization entails the suppression of things extraneous to or separable from the work-version you have chosen. You are encouraged to omit parenthetical editorial insertions (especially quotation references), stray handwritten remarks, discretionary word-breaking hyphens, editorial comments, inserted cross-references, and reference numerals (page numbers, section numbers, etc.). If chapter 4 of a text begins "4." or "IV" then leave out that labeling numeral: you've already indicated it in @n, so there's no need to clutter the transcription with it. Remember, scholars who use your file will be concerned with things like word-for-word alignments and lexico-morphological analysis, and putting in a modern editor's "4" might contaminate research results. For the same reason, you should resolve ligatures and correct unintended typographical errors. The goal is a transcription whose text is free of the interpretive voice of later editors. You should remove from the text anything that is not part of the work proper and would interfere with detailed word-for-word alignment, or would require extra preprocessing or postprocessing work for other users. If you are breaking a transcription into individual lines, and you are required to break a word, do so with either the soft hyphen (U+00AD), the zero-width space (U+200B), or the zero-width joiner (U+200D). TAN processors that handle the text within a leaf <div> will automatically normalize its space. If either of those two characters is found at the end then it will be deleted and the text from the next leaf <div> (if there is one) will immediately follow without intervening space; if those two characters do not occur at the end, then a space (U+0020) will be added, and all other space will be normalized. 
For more details, see . In a digital source, variable lengths of special spacing marks (e.g., General Punctuation U+2000..U+200B) should be converted to ordinary spaces (see ), and superscript combining Roman letters (U+0363..U+036F) should probably be converted to their non-combining counterparts. All Unicode must be normalized to NFC forms (see ). Variant readings should not be transcribed. For example, a manuscript may have correctors' marks. Or a set of footnotes (or apparatus criticus) might provide an alternative reading. In those cases, each set of corrections should be moved to a separate TAN-T file, or rewritten as <claim>s of a TAN-A file. In some ambiguous areas, you can use TAN-TEI both to normalize and to preserve what is in the scriptum. Suppose, for example, a manuscript has reference numerals that are sui generis. That is, these reference numbers do not correspond to the "canonical" reference scheme, and are scribal adjustments to the text's structure (sometimes mistaken). On the one hand, such reference numerals are metadata, and should arguably be deleted; on the other, they are part of the text, and witness to how a text was read and changed over time. A middle-ground approach would move these references to TAN-TEI's <milestone rend="[TEXT]">, substituting [TEXT] for the reference text. In that way, the numerals are properly removed from the main text, but the information is retained. Generally speaking, TEI's @rend is an excellent way to remove something from a transcription while keeping it in the file. Overall, normalization is a difficult, understudied topic. Scholars are not in the habit of documenting everything they normalize, and sometimes have so internalized a set of normalizations that they are unaware of them. Not all decisions will be clear-cut. You may justly hesitate before normalizing orthography, punctuation, accentuation, or capitalization. Some aspects of Unicode that permit different conventions may need special consideration. 
You may need to deliberate on whether an unusual or rarely used Unicode character might be misinterpreted or hinder searches. Document any decisions in the <adjustments>. Whether you use <normalization> or <replace> is up to you. The former can be used to apply a class of changes to a vocabulary item. The latter provides a precise, regular-expression-based method of describing exactly what has been changed, and the order in which those changes took place. Note, a <replace> might help one to reconstruct the path that led from the input to the output, but not the reverse. If it is important to document exactly what the pre-normalized version of a text was like, use <predecessor> or a similar element available in the key links section of the <head> (see ) to point to the original. If you find it very difficult to bring yourself to normalize to the depth advised above, try first making a (non-TAN) TEI file, and create the transcription you have in mind as the ideal. Once that is finished, create a second, TAN version, and be more aggressive in your normalization, with <see-also> pointing to the first approach. Users of your TAN transcription will be more interested in your TAN version than the TEI version, but you will have at least satisfied your craving to avoid normalizing.
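The leaf <div> joining rule and the NFC requirement described in this section can be illustrated with a short Python sketch. It is not the TAN implementation. The function name is hypothetical, and because the passage above names three word-breaking characters but then speaks of "two", the set of special end characters used here (all three) is an assumption that you should check against the TAN function library.

```python
import re
import unicodedata

# Assumption: all three characters named in the guidelines act as joiners.
JOINERS = ("\u00ad", "\u200b", "\u200d")  # soft hyphen, zero-width space, ZWJ


def join_leaf_divs(leaf_texts):
    """Concatenate the text of consecutive leaf divs into one string.

    Each text is normalized to NFC and its runs of white space collapsed.
    A trailing joiner is deleted so the next div follows with no space;
    otherwise a single space is appended.
    """
    pieces = []
    for raw in leaf_texts:
        text = unicodedata.normalize("NFC", raw)
        text = re.sub(r"\s+", " ", text).strip()
        if text.endswith(JOINERS):
            pieces.append(text[:-1])      # word continues in the next div
        else:
            pieces.append(text + " ")     # ordinary word boundary
    return "".join(pieces)


lines = ["This is an exam\u00ad", "ple   of a broken", "word."]
print(repr(join_leaf_divs(lines)))
```

The practical consequence is that a line-by-line transcription and a paragraph-by-paragraph transcription of the same text can yield the same string, provided every word broken across a line ends with one of the special characters.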

Normalizing Annotations The footnotes or endnotes in a scriptum should be normalized. Many, most, or all should likely be deleted. Before deciding, distinguish those that are an intrinsic part of the work you're transcribing from those that aren't. Those that aren't can be removed, or they can be put into a separate TAN-T(EI) file, perhaps linking the two through <see-also>, and hopefully structuring both files with the same reference system, to facilitate alignment. Another way to approach the task is to convert some or all of the notes you're removing into <TAN-A> <claim>s. Footnotes, endnotes, glosses, or marginalia that are intrinsic parts of the work present special challenges for encoding in general, and normalization in particular. First is the issue of connecting an annotation to the text annotated. When we encounter a superscript number (a note signal) while reading the text of a printed book, we infer that we are being invited to find a companion footnote, and that footnote comments on the text we have just read. But specifically what text? Is it only the preceding word? Is it a word or phrase that occurs earlier in the sentence? Does the annotation cover earlier sentences, the entire paragraph, or even prior paragraphs? For some notes, identifying the text being annotated requires interpretation. In a digital file, connecting an annotation to its text cannot be so vague; it requires a decision and a commitment. Here are three possible ways to approach annotations in a TAN file: Use the <note> feature of TAN-TEI (see related TEI documentation). This will allow you to connect the annotation to merely an anchor in the text, i.e., to no text whatsoever. 
<div n="1" type="p"> <p>The process occurred in New York, among other places.<ref rend="1"/> <note><p><ref rend="1"/>On New York, see: X.</p></note> </p> </div> Move each annotation into a <div> with a @type that implies that it is an annotation (e.g., scholium) and place it immediately after the <div> it annotates. <div n="1" type="p">The process occurred in New York, among other places.</div> <div n="n1" type="footnote">On New York, see: X.</div> Note in the example above that n1 is used to make sure that 1 unambiguously points to only one <div>. As #2, but also write a <TAN-A> file that more precisely connects each annotation to the text it annotates.<claim verb="annotates"> <subject src="text" ref="n1"/> <object src="text"> <from-tok ref="1" val="The"/> <through-tok ref="1" val="York"/> </object> </claim> The first option is expeditious, and will allow you to be as precise or imprecise as you like. Validation is not affected, but you should be aware that the <note> will be treated as a constituent part of its parent <div>. The second option is also relatively easy, but it entails a decrease in precision. The third option provides immense precision, permits multiple annotations on the same text range, and allows notes to target overlapping ranges of text. But the task could be time-consuming, if only because you will need to determine the range of text targeted by each annotation, and the targeted text might be quite messy or vague. You will need to take stock of how precise and comprehensive you choose to make your connections. (See also accuracy, precision, and comprehensiveness.) Remember that the note signals in the main text and in the footnote area are metadata meant to help readers link corresponding passages of texts, and in the spirit of normalizing should be deleted. In a TAN-TEI file you can replace a note signal with <ref> (see above).

Class 1 Metadata The <head> of a class-1 file is much like that of other formats, with some extra options. In the key declarations area (see ), class-1 files may allow <n-alias>. See for context on how to use this element. In the section devoted to links to other digital resources (see ), class-1 files allow several extra types of files. One <model> is allowed, to point to another class-1 file that has the model reference system. The model should be the same work. It may be in a different language, or come from a different source/scriptum. During verbose validation, any differences between a class-1 file and its model will be presented as warnings, since small differences are nearly always inevitable. Zero or more <redivision>s are allowed, to point to an alternative transcription that follows a different reference system. A class-1 file and any redivisions must have identical text in the <body>, and draw from the same version of the same source/scriptum. <redivision> is an important alternative to the knotty, longstanding problem that besets texts that admit multiple reference systems. In a traditional TEI file, one must adopt a primary reference system, and add other reference systems through milestone-like anchors. This can result in transcriptions that are difficult to read. TEI anchors do not have the semantic underpinnings needed to cycle through the milestones from one primary reference system to another. TAN's design principles call for simplicity and disaggregation, hence the stand-off annotation model. So the ideal TAN approach is to encode the same transcription in multiple files, one per reference system, linked through <redivision>. This may appear to contradict another principle, that one should not repeat oneself. But that is the easier principle to repair. During verbose validation, <redivision> transcriptions will be checked against the host, and specific areas that differ will be flagged. 
Should users wish, a Schematron Quick Fix will provide an automatic update of a text to a <redivision>, without changing the reference structure. Zero or more <annotation>s point to class-2 files that use the file as a <source>. This type of linked resource is helpful for keeping track of key alignments and annotations. Zero or more <companion-version>s point to different versions of the same work in the same scriptum. This feature is useful for correlating multiple versions of a work that appear in a single scriptum, e.g., the original text and a facing translation in a bilingual edition. The adjustment section of the <head> (see ) allows zero or more <normalization>s and <replace>. See .
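To make the <redivision> requirement concrete (identical text in the <body>, different division structure), here is a simplified Python sketch of such a comparison. It is not how TAN validation is implemented, which relies on the TAN function library and Schematron. The function names and sample texts are hypothetical, namespaces are omitted, and every <div> boundary is treated as a word boundary, ignoring the special end-of-div characters.

```python
import xml.etree.ElementTree as ET


def body_text(xml_string):
    """Return the text of a <body>, with all white space collapsed."""
    body = ET.fromstring(xml_string)
    return " ".join(" ".join(body.itertext()).split())


def first_difference(a, b):
    """Return the offset of the first differing character, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


# Hypothetical sample: the same text divided logically and by line.
logical = """<body xml:lang="eng"><div type="ch" n="1">
  <div type="s" n="1">In the beginning was the word,</div>
  <div type="s" n="2">and the word was good.</div>
</div></body>"""

by_line = """<body xml:lang="eng"><div type="page" n="1">
  <div type="line" n="1">In the beginning</div>
  <div type="line" n="2">was the word, and the</div>
  <div type="line" n="3">word was good.</div>
</div></body>"""

offset = first_difference(body_text(logical), body_text(by_line))
print("identical" if offset is None else f"texts diverge at offset {offset}")
```

Reporting the offset of the first divergence, not merely a yes/no answer, mirrors the guidelines' description of validation flagging the specific areas that differ.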

Class 1 Data The sole purpose of the <body> of a class-1 file is to contain an ordered, segmented transcription of a single version of a single work from a scriptum. <body> must take @xml:lang, specifying the predominant language of the text. If a change in language occurs in a descendant <div>, ensure that its @xml:lang also changes. <body> takes one or more <div>s, each of which governs either other <div>s, or text (or TEI elements), but never both. TAN files adopt a non-mixed content model (see ). The term leaf div refers to those <div>s that contain only text, and not other <div>s. Within this treelike structure of <div>s, the concatenation of @n values, starting from the most rootward <div>, provides the reference system used by class-2 files to refer to parts of TAN-T(EI) files. A given <div> may have more than one reference, if its @n or any @n it inherits has multiple values. Every permutation is calculated, and they are treated as synonymous ways to refer to that <div>. In previous versions of TAN, there was a requirement that each leaf <div> should have a unique reference. That requirement has been downgraded to a warning, because there are cases where non-unique leaf <div>s are required. Some scripta are encoded such that leaf divs are broken up (see Bodéüs's edition of Aristotle's Categories, at 2a35, 2b5, and 2b6b). And some translations must be encoded so that leaf divs interleave. Further, one TAN-T's leaf divs might easily become another TAN-T's non-leaf divs, and vice versa. The distinction between leaf and non-leaf div is arbitrary, so both types should be expected to adhere to the same kind of rules for the reference system. For any two <div>s that share the same reference, it is not allowed that one be a leaf <div> and the other not (to do otherwise would entail a mixed content model). It is also further assumed that all <div>s that share the same reference are consecutive, constituent parts of the same <div>. 
That is, any two <div>s with the same @n are not alternatives to each other, but are rather disjoint parts. For true alternatives, see discussion above on using variant in @type.
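The way references arise from nested @n values, including the permutations produced by multi-valued @n, can be sketched as follows. This Python illustration is not the TAN implementation. The function names are hypothetical, namespaces are omitted, and two details are assumptions: that multiple @n values are separated by spaces or commas, and that the parts of a reference are joined with a period.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def div_references(body, sep="."):
    """Return (reference, is_leaf) pairs for every <div>, in document order."""
    results = []

    def walk(parent, inherited):
        for div in parent.findall("div"):
            values = [v for v in re.split(r"[\s,]+", div.get("n", "")) if v]
            # Every inherited reference combines with every local @n value.
            refs = [prefix + [v] for prefix in inherited for v in values]
            is_leaf = div.find("div") is None
            results.extend((sep.join(ref), is_leaf) for ref in refs)
            walk(div, refs)

    walk(body, [[]])
    return results


def mixed_references(pairs):
    """References shared by both a leaf and a non-leaf <div> (not allowed)."""
    kinds = defaultdict(set)
    for ref, is_leaf in pairs:
        kinds[ref].add(is_leaf)
    return sorted(ref for ref, seen in kinds.items() if len(seen) == 2)


sample = ET.fromstring(
    '<body xml:lang="eng">'
    '<div type="book" n="1 i"><div type="ch" n="2">Text of the chapter.</div></div>'
    "</body>"
)
for ref, is_leaf in div_references(sample):
    print(ref, "leaf" if is_leaf else "non-leaf")
```

In the sample, the book-level <div> carries two @n values, so its single child chapter acquires two synonymous references, one for each inherited value.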

Transcriptions Using the Text Encoding Initiative (<TEI>) This section is to be read in conjunction with and , which address related technical issues. Some creators and editors of transcriptions will find the rather stripped-down TAN-T format inadequate. Some may wish to mark up the text further. Some may already have a library of transcriptions whose annotations are desirable to keep, even if uninteresting to every user. In these cases, you should use TAN-TEI, a customization of the Text Encoding Initiative (TEI) format, which is well known for its expressiveness, its stability, its flexibility, and its widespread use in textual scholarship. TEI was designed to be maximally expressive and flexible, to serve the detailed needs of scholars in the humanities. In serving this mission, TEI has come to define more than five hundred different elements, and more than two hundred attributes (roughly six times more than are defined in TAN). Of course, any given TEI file uses only a small subset of those elements and attributes, and TEI itself comes in different flavors, from TEI Lite, which uses only 75 attributes and 140 elements, to TEI All, which opens up almost the entire library. Although TEI XML is oftentimes described as a standard, it lacks characteristics one normally expects of a standard. It is very flexible, admits flavors and interpretation, and is best used when it is customized. Individuals and projects may define their own subset of TEI elements, to constrict or expand the allowable rules as they see fit. TAN-TEI is one of those customizations, based on TEI All. The major difference between TEI All and TAN-TEI is that the latter imposes extra strictures, to ensure that transcriptions are maximally likely to be interchangeable with other TAN-TEI files. All TEI files are validated against a TEI-conformant schema, normally expressed as an XML DTD, RELAX NG, or W3C Schema. 
TAN's TEI-conformant schema is based upon the TAN-TEI.odd file in the schemas directory, converted to RELAX-NG files, TEI.rnc and TEI.rng, which define the structural rules of TAN-TEI files. There is an additional layer of validation, through the related Schematron process (TEI.sch), which performs detailed validation not possible in a TEI-conformant schema. In the discussion below, it is important to distinguish between structural validation and Schematron validation. See . TAN's customization of the TEI can be summarized as follows (the default namespace in this section is the TEI namespace, http://www.tei-c.org/ns/1.0):

Synopsis of TAN-TEI customization (TEI element: strictures)

<TEI>: must have @id with tag URN; must have @TAN-version; takes a new child element, <head>, placed between <teiHeader> and <text>; it and its descendants must be in the TAN namespace, xmlns:tan="tag:textalign.net,2015:ns".

<text>: There are no extra strictures, but during Schematron validation (not RELAX-NG), this element and any children <front> and <back> will be ignored. Of its children, only <body> will be Schematron validated.

<body>: must take @xml:lang; any non-<div> children will be ignored during Schematron validation (most often only <div>s should be children); contents must be restricted to a single version of a single work; any and all text nodes will be treated as part of the transcription.

<div>: may encompass a textual division of whatever size you like (TEI defines <div> as being larger than block-like or paragraph-like textual divisions; TAN's <div> is much more like HTML's); must take elements, and either they all are <div>s (perhaps interleaved with anchors such as <pb>) or none of them are <div>s (non-mixed model); must take @type and @n (or only @include); @type may take multiple values, space delimited, pointing via IDref to a vocabulary item; @n must consist of word characters, periods, or underscores, conforming to the following regular expression:

[\w\._]+([\- ,]+[\w\._]+)*

If @n is to be given more than one value, those items must be separated by a space or a comma. A hyphen-minus, - (U+002D, the most common form of hyphen), always has special meaning in @n, specifying a range. This feature is useful for cases where a <div> straddles more than one standard reference number (e.g., a translation of Aristotle that cannot be easily tied to Bekker numbers). If you need to use a hyphen-like character in an @n that does not specify a range of numbers, consider ‐ (U+2010 HYPHEN), ‑ (U+2011 NON-BREAKING HYPHEN), ‒ (U+2012 FIGURE DASH), – (U+2013 EN DASH), or − (U+2212 MINUS SIGN).
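The rule for @n can be tried out with a short sketch. What follows is not part of the TAN function library; it assumes that the separator class of the pattern contains a hyphen-minus, a single space, and a comma, and the function names is_valid_n and has_range are hypothetical. Python's \w already covers the underscore, and its notion of a word character may differ from that of the schema language, so treat the results as approximate.

```python
import re

# Pattern as given in the guidelines, assuming one space inside the
# separator class. This is an illustration, not TAN's own validator.
N_PATTERN = re.compile(r"[\w\._]+([\- ,]+[\w\._]+)*")


def is_valid_n(value: str) -> bool:
    """Return True if the entire @n value matches the pattern."""
    return N_PATTERN.fullmatch(value) is not None


def has_range(value: str) -> bool:
    """A hyphen-minus (U+002D) in @n always signals a range."""
    return "-" in value
```

For example, is_valid_n("4a, 4b") and is_valid_n("1-3") both hold, whereas an empty value or one that begins with a hyphen is rejected.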

TAN-TEI files have two heads, which may strike you as strange. Each head does something different, and was designed for different purposes. Whereas the TAN <head> is meant to be brief and restricted to only those matters relevant to the transcription, the <teiHeader> permits quite an expansive range of metadata, and may be used to encode a variety of things, including those that are tangential or irrelevant to the data. Unlike the TAN <head>, whose data is designed to be both computer- and human-readable, <teiHeader> was designed for data to be read principally by humans; although it can accommodate IRIs, it was not designed around them. Further, a TAN <head> can never be empty and valid; a bare-bones <teiHeader> with no actual text content, such as the following, is considered valid:<teiHeader> <fileDesc> <titleStmt><title/></titleStmt> <publicationStmt><p/></publicationStmt> <sourceDesc><p/></sourceDesc> </fileDesc> </teiHeader> TAN's Schematron validation process ignores the contents of <teiHeader>, since its contents are unpredictable and therefore not reliably parsable. If your <teiHeader> has any kind of metadata that needs to appear in the TAN <head> (see and ), the conversion needs to be performed manually, since (as mentioned above) the two headers are incommensurate, and writing each one requires a different mentality. In a TAN-TEI file, the TAN <head> must be in the TAN namespace, i.e.,

<head xmlns="tag:textalign.net,2015:ns">

(or

<tan:head xmlns:tan="tag:textalign.net,2015:ns">

, but this would require all descendant elements to be prefixed tan:). Within any leaf <div>, you may use whatever TEI markup you wish, to whatever level of depth or complexity. Most users of your TAN-TEI file will be interested in the text; only a subset will care about any markup within leaf <div>s. TEI files are flexible, permitting different approaches to markup. A TAN-TEI file should not be scriptum-oriented, i.e., it should not try to replicate how the text appears or looks on the object. You may have a TEI file that you wish to convert to TAN-TEI. As a matter of practicality, it is helpful to envision the conversion process as falling in three steps: Structure: insert new processing instructions (pointing to files to perform TAN-TEI structural and Schematron validation); adjust root element by supplying a tag URN for @id and @TAN-version. Metadata: create new

<head xmlns="tag:textalign.net,2015:ns">

and populate it. Data: edit <body> to make sure all text nodes are restricted to the content of a single version of a single work; restructure <body> content into nesting <div>s with correct @type and @n values. It has been the experience of those who have made TEI to TAN-TEI conversions that step 2 is the most time-consuming, particularly in finding suitable IRIs. But step 3 should not be underestimated, either. Many people write TEI files with a focus on the original textual object, and they do not normalize to the level expected in a TAN file. Some TEI files have been written with little attention paid to space and space normalization. Some TEI files are so laden with annotations that the text is impossible to read. In general, the more simple the TEI file the better, with annotations pushed to external files. Some TEI markup is already implicit, or is easily calculable (e.g., <w> to mark words, which should already comport with the tokenization declared in the <head>; users of <w> easily lose track of where space is and isn't). Some TEI markup can be expressed in a class-2 file (e.g., lexico-morphological data, which should be expressed in a TAN-A-lm file).

Class-2 TAN Files, Annotations of Texts This chapter provides general background to class-2 TAN files. For detailed discussion of individual elements and attributes see . There are three types of class-2 files: TAN-A files provide broad, macroscopic alignment of multiple versions of any number of works. They also allow annotations of texts, in the form of claims. TAN-A-tok files provide narrow, microscopic alignment of any two class-1 files, annotating word-for-word or character-for-character correspondences between the two texts. TAN-A-lm files express annotations pertaining to lexico-morphology (part-of-speech), for either a single class-1 file or a language in general. In translation studies, it is common to use the term source (or sources) to refer to a translated text and the term target to refer to the translation. TAN, however, has been designed for situations where it may not be clear which text is the target and which is the source. Further, there is a more generic use of source and target that prevails in many other contexts. In these guidelines, therefore, the term target never refers to a text as such (rather, it normally refers to a file that is being pointed to), and when we use the word source, we are referring only to one of the class-1 files upon which a class-2 alignment depends.

Common Elements

Class 2 Metadata (<code><link linkend="element-head" ><head></link></code>) Class-2 files share a few common features in their metadata, mostly to facilitate the human-friendly reference system discussed below. All class-2 files have as their sources nothing other than class-1 files. Therefore each <source> must take the . Editors of class-2 files must be able to name or number word-tokens in a transcription, and to determine an appropriate definition of "token," via an optional <token-definition>. See . The declaration <numerals> at present does not allow you to customize a numeration system for sources. A future release of TAN may support such a feature. Inevitably, some class 1 sources for the same work will differ from each other. Perhaps works or div types were not defined with the same IRIs, or perhaps one version follows an idiosyncratic reference system. If sources need to be reconciled, alterations may be specified in <adjustments>, which stipulates a set of actions that should be applied to the sources that have been named. Adjustment actions: <skip>, to allow you to ignore specific <div>s, deeply or shallowly. <rename>, to allow you to rename specific <div>s. <equate>, to allow you to provide synonyms for @n values. <reassign>, to allow you to split leaf <div>s and move their parts elsewhere in the structure. These adjustment actions allow you to reconcile discordant sources without changing them directly. Skips, renames, and equates are first applied to the source as received. If a particular source <div> is the target of more than one adjustment action, only the first one will be applied according to action priority: <skip>, <rename> based on @ref, <rename> based on @n, then <equate>. Because of this priority order, some actions might not be performed. For example, if you deeply skip a <div>, no renaming adjustments will be made to its children. If you have renamed a div, then want to reassign it, you must do so based on the new name, not the original. 
You should be aware of the consequences of your adjustments. After skips, renames, and equates are applied, <reassign>s are applied to the newly adjusted source. Each adjustment action adds time to the validation routines. On lengthy texts these can become quite time-consuming. Take, for example, the Tanakh / Old Testament in Hebrew, Greek Septuagint, and English (King James Version). Each of these differs from the others in the names of books, and the numeration of some chapters and verses (primarily the books of Psalms, Jeremiah, Joel, and Hosea). To reconcile these three versions, one might write 267 <rename>s and 6 <equate>s. Applying these actions to all three versions can take about a minute (tested on a computer with an Intel i5-8250U, 12 GB RAM), before any other significant validation checks on the <body> of the class-2 file. Normal validation takes about a minute and a half. If such processing times are unacceptable for your needs, you are advised to keep <adjustments> to a minimum or to apply them to relatively small texts. Further, adjustment actions were intended primarily to support the alignment process, and so were designed to apply select changes to sources. If a source must be changed in numerous places to reconcile it with other sources, it might be better to create a new version of the source organized according to the target reference system. Then in both the new and original versions of the class-1 files insert <redivision>, <predecessor>, <successor>, or <see-also> to link the two versions. There is a TAN application that remodels one text in the image of another. See applications/remodel/remodel via TAN-T.xsl. The output of that application requires editing, but it can reduce the amount of work required. TAN tools for oXygen's author mode can also be used to correct that newly segmented text. These and related applications are under development, and may not function as expected. 
Improvement of these tools is scheduled for future releases of TAN.
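The priority order among adjustment actions can be pictured with a minimal sketch. It is not taken from the TAN function library; the labels "skip", "rename-ref", "rename-n", and "equate", and the function first_applicable, are hypothetical names chosen only to illustrate the rule that a <div> targeted by several actions receives only the highest-priority one.

```python
# Hypothetical labels for the four prioritized adjustment actions:
# <skip>, <rename> based on @ref, <rename> based on @n, then <equate>.
PRIORITY = ("skip", "rename-ref", "rename-n", "equate")


def first_applicable(actions):
    """Given the kinds of action that target one <div>, return the only
    one that would be applied, or None if nothing targets the <div>."""
    for kind in PRIORITY:
        if kind in actions:
            return kind
    return None
```

Under these assumptions, a <div> targeted by both an equate and a rename based on @n is only renamed, and a skipped <div> receives no further adjustment.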

Class 2 Data (<code><link linkend="element-body" ><body></link></code>) Data differs greatly between the class 2 formats. However, they all share one thing in common: the <body> consists of a series of claims, and responsibility for those claims should be attributed to the persons, organizations, or algorithms making the claims. Therefore, each <body> may take @claimant and perhaps @claim-when, specifying by IDref who should be credited or blamed with the material. If either attribute is missing, it is assumed that the claims are the responsibility of the persons listed in <file-resp> at the time of the latest date or date-time. The values of @claimant and @claim-when are weakly inheritable.

Class 2 Pointer Syntax: Referencing Texts The class 2 formats have been designed to be human readable, particularly text references. In ordinary conversation, when referring to specific parts of a work, we prefer to use the numbers or names of pages, paragraphs, sentences, lines, words, letters, and so forth, and sometimes relational words (e.g., "first"). We might say, for example, "See page 4, second paragraph, the last four words." Sometimes we quote the very text itself: "See page 4, second paragraph, first sentence, second occurrence of 'pull'." Those familiar conventions are the basis for the TAN pointer syntax, which differs from other pointer systems (e.g., URLs, XPath, and XPointer). TAN pointers apply common reference terminology to four strata of a text: works, divisions, word tokens, and characters. Works, defined above (see ), are declared by the source (which may not have more than one work). Divisions are defined by the <div> structure of each source. Tokens are words of the text in those divisions, defined according to one or more <token-definition>s declared in the class-2 file. And characters are defined as individual base letters in a word token (modifier characters are grouped with the preceding base character; see ). This approach not only makes the syntax human readable but mitigates the effect of changes to the sources. For example, if a <div> is deleted, moved, or changed, the alteration affects only references specific to that section; the rest of the reference system remains intact. The four parts of TAN's reference system are explained below, but you should consult other parts of the guidelines, or TAN examples, to see how they are used in practice.

Referencing Works: <code><link linkend="attribute-work" >@work</link></code> Class-2 files refer to works via meaningful IDrefs that point to the class-1 sources that transcribe the work or work-version, e.g., work="hamlet". The reference is understood to apply not merely to that particular source, but to any TAN-T file that claims to transcribe that work or work-version. (On the relationship between works and work-versions see .) Thus, the id of the source-scriptum becomes a proxy or alias for the work. A vocabulary item <work> may also be used; its @xml:id provides a way to refer to a work without requiring a corresponding source. Because TAN-A-tok and TAN-A-lm files deal with source-specific claims, the data for those formats do not refer to works. Only TAN-A <claim>s refer to works.

Referencing Divisions: <code><link linkend="attribute-ref" >@ref</link></code> Portions of text, i.e., <div>s, perhaps altered if <adjustments> have been invoked (see ), are pointed to via @ref. A @ref is constructed by taking the values of @n in the <div> in question along with its ancestor <div>s, and joining them with non-word characters. For example, @ref="I.1.1" might point to the following: <div type="act" n="1"> <div type="scene" n="1"> <div type="line" n="1"> . . . . . . </div> . . . . . . </div> . . . . . . </div> A @ref can express sequences and ranges of <div>s. In the example ref="1.2-4, 1.5", the hyphen and comma are reserved characters that signify ranges and series. A hyphen always means "from...through" and a comma always means "and". Take note, if you are accustomed to editing conventions that use the comma as a subordinating punctuation mark. In the TAN format, commas are always paratactic, not hypotactic. For example, if referring to Hamlet, ref="I,2,3" is not a reference to a single <div>, act I scene 2 line 3, but rather to three of them: act I, act 2, and act 3 (notice how the commas in the attribute value behave like the commas in the written phrase). The periods (full stops) in @ref="I.1.1" are hypotactic markers, but they are arbitrary, and could be replaced with any mix of non-word characters you like (except the hyphen or comma), including spaces, e.g., ref="I:1 1". The numeral system is also arbitrary. You may use any supported numeral system (see section on numeration systems), even if the source uses a different one. Semantic equivalents to the preceding example are ref="A I i" and ref="1:a:I". Just remember, if you use either the Roman numeral system or alphabetic sequences, include a <numerals> in the <head> to specify which system should prevail in case of ambiguities (e.g., whether c means 3 or 100). Roman numerals are the default, but it is a good idea to be explicit.
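The punctuation rules for @ref can be sketched as follows. This is a simplified illustration, not the parser used by the TAN functions: the names split_steps and parse_ref are hypothetical, numerals are left unconverted, and no attempt is made to complete a shortened range endpoint (the 4 in 1.2-4) from its starting point.

```python
import re


def split_steps(text):
    """Split one reference into its @n steps on any non-word characters."""
    return [step for step in re.split(r"\W+", text) if step]


def parse_ref(ref):
    """Parse a @ref into a list of (start, end) pairs of step lists.
    A comma separates items ("and"); a hyphen joins the two ends of a
    range ("from...through"). For a single reference, end is None."""
    items = []
    for part in ref.split(","):
        ends = [split_steps(piece) for piece in part.split("-")]
        ends = [steps for steps in ends if steps]
        if len(ends) == 1:
            items.append((ends[0], None))
        elif len(ends) == 2:
            items.append((ends[0], ends[1]))
        else:
            raise ValueError(f"cannot interpret {part!r} in @ref")
    return items
```

With this sketch, parse_ref("I,2,3") yields three separate one-step references, whereas parse_ref("I:1 1") yields a single reference of three steps, mirroring the paratactic and hypotactic roles described above.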

Referencing Tokens: <code><link linkend="attribute-pos">@pos</link></code> and <code><link linkend="attribute-val">@val</link></code> To point to a token one normally uses <tok>, with one or more attributes, in three possible configurations: @val or @rgx alone: one or more tokens are pointed to by value. For example, val = "bird" points to every occurrence of the token bird; rgx = "b.+d" finds every word that begins with a b, ends with a d, and has some characters in between. Every value of @rgx is implicitly bound to the beginning and end of the string (see below). @pos alone: one or more tokens are pointed to by numerical position, via one or more digits, or the phrase last or last- plus a digit, joined by hyphens or commas. For example, 2, 4-6, last-2 - last refers to the second, fourth, fifth, sixth, antepenult, penult, and final tokens in a passage. The numerical value to which the keyword last resolves depends upon the context length. @val or @rgx combined with @pos: a combination of the previous two methods. For example, @val="bird" @pos="2, 4" picks the second and fourth occurrences of the token bird. During Schematron validation, if @pos is missing, it is assumed to mean * or 1 - last; if neither @val nor @rgx appears, the assumption is @rgx with value .+ (any characters). That is, @pos by default points to every instance and @val/@rgx by default points to any string. When using @pos make sure you know the context. For example, the attribute combination

val="bird" pos="last-1"

will produce an error if the token bird does not occur at least two times in the given context. It is advisable to use @val, and not merely @pos. If your source's text changes, and there is no @val, it may be difficult to determine the original intent of a claim, to determine whether changes need to be made. Furthermore, @val is generally speaking more efficient to process than is @rgx. A @rgx is more efficient to process only if it replaces numerous instances of @val. @rgx is a regular expression that must match an entire word-token. For example, @rgx="re.d" will match the tokens "rend" and "read" but will not match "already", "rends", or "bread". If you wish to allow for characters at the beginning or end, use ".*re.d.*". For more on regular expressions, see .
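The interplay of @val, @rgx, and @pos can be sketched in a few lines. This is an approximation for illustration only, not the TAN implementation: the function names are hypothetical, the context is represented as a plain list of already tokenized strings, and Python's regular expression dialect stands in for the one TAN actually uses.

```python
import re

# A @pos term is a digit string, "last", or "last-" plus a digit string.
_TERM = re.compile(r"last(?:-\d+)?|\d+")


def _term_value(term, length):
    """Resolve one @pos term against a context of the given length."""
    if term.startswith("last"):
        value = length - (int(term[5:]) if len(term) > 4 else 0)
    else:
        value = int(term)
    if not 1 <= value <= length:
        raise ValueError(f"position {term!r} is outside 1..{length}")
    return value


def resolve_pos(pos, length):
    """Expand a @pos value such as "2, 4-6, last-2 - last" into integers."""
    positions = []
    for part in pos.split(","):
        compact = re.sub(r"\s+", "", part)
        terms = _TERM.findall(compact)
        if not terms or len(terms) > 2 or "-".join(terms) != compact:
            raise ValueError(f"cannot interpret {part!r} in @pos")
        start = _term_value(terms[0], length)
        end = _term_value(terms[-1], length)
        positions.extend(range(start, end + 1))
    return positions


def select_tokens(tokens, val=None, rgx=None, pos=None):
    """Return 1-based positions of the tokens picked by val/rgx and pos."""
    if val is not None:
        matches = [i for i, t in enumerate(tokens, 1) if t == val]
    else:
        # A missing @val and @rgx defaults to @rgx ".+"; the expression
        # must match the entire token, hence fullmatch.
        pattern = re.compile(rgx if rgx is not None else ".+")
        matches = [i for i, t in enumerate(tokens, 1) if pattern.fullmatch(t)]
    if pos is None:  # a missing @pos means every instance
        return matches
    return [matches[p - 1] for p in resolve_pos(pos, len(matches))]
```

Note that when @val or @rgx is combined with @pos, positions are counted among the matching tokens only, which is why pos="last-1" raises an error here when the value occurs fewer than two times.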

Referencing Characters: <code><link linkend="attribute-chars" >@chars</link></code> Individual letters are always specified by @chars, which points to one or more specific positions, e.g., chars="2, 7, last". Combining characters are excluded from these counts; see .
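Counting base characters while skipping combining ones might be sketched as below. This is an illustration under stated assumptions: the names base_clusters and pick_chars are hypothetical, and Unicode's canonical combining class is used as a stand-in for TAN's own definition of modifier characters, which may be broader.

```python
import unicodedata


def base_clusters(token):
    """Group each combining character with the preceding base character."""
    clusters = []
    for ch in token:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def pick_chars(token, positions):
    """Return the clusters at the given 1-based positions; "last" is allowed."""
    clusters = base_clusters(token)
    picked = []
    for p in positions:
        index = len(clusters) if p == "last" else int(p)
        if not 1 <= index <= len(clusters):
            raise ValueError(f"position {p!r} is outside 1..{len(clusters)}")
        picked.append(clusters[index - 1])
    return picked
```

In a token written as e, combining acute accent, t, e, combining acute accent, the second character under this scheme is t, and the last is the final e together with its accent.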

Division-Based Annotations and Alignments (<code><link linkend="element-TAN-A" ><TAN-A></link></code>) TAN-A is the format for macroscopic, division-based alignment and annotations. It is dedicated to aligning any number of versions of any number of works on the basis of <div>s in its sources. The A also stands for annotations, because the TAN-A format allows you to make general assertions, usually but not necessarily about texts. TAN-A is a type of advanced RDF for textual scholarship (see ).

Root Element and Header The root element of a TAN division-based alignment file is <TAN-A>. TAN-A's <head> has zero or more <source>s. Any concepts that will be mentioned in the <claim>s (the only children of <body>) need to be supplied in <vocabulary-key>.

Data (<code><link linkend="element-body"><body></link></code>) The <body> of a TAN-A file takes, in addition to the customary optional attributes (see ), @claimant, @object, @subject, or @verb, stipulating the default values for the enclosed claims. The rest of the body consists of zero or more <claim>s, each of which represents one or more claims. Claims can be used for a variety of purposes, e.g., to list quotations and allusions; to indicate which passages deal with what general subjects and topics; to connect commentary or notes from one source to another; or to indicate where other scripta have different readings (apparatus criticus). <claim>'s data model is inspired by the Resource Description Framework (RDF; see ), where each statement consists of three items termed a subject, a predicate, and an object. The first and third are thought of as nodes, and the second as a connector (or edge) between the nodes. RDF follows a graph model, where the connector (edge) always links exactly two nodes. RDF is adequate for but a limited range of scholarly assertions. An RDF statement lacks context or qualifiers. No RDF statement can indicate who made the assertion, or when, or if it was uttered with any doubt or nuance. Sometimes we wish to claim a bare negation, e.g., "Aristotle was not the author of De mundo," which cannot be expressed in RDF. TAN's <claim> extends the graph RDF model into a hypergraph, where the connector (edge) links two or more nodes. The following adjustments are made:

1. Every claim must have at least one claimant, some person, organization, or algorithm to be credited/blamed for the assertion.
2. Every claim must have at least one subject, the topic of the claim.
3. Every claim must have at least one verb (in RDF called predicate), specifying something about the subject.
4. Every claim may have one or more adverbs, qualifying the verb.
5. Every claim may assert a level or range of certainty, between zero and one, reflecting how certain the claimant is of the claim.
6. Every claim may have one or more objects, each an entity or value expected by the verb.
7. Every claim may have one or more temporal qualifiers, restricting the claim to a specific time.
8. Every claim may have one or more locative qualifiers, restricting the claim to a specific geographical region.
9. Every claim may have other components, if so defined by the verb. Currently, this entails for select verbs a language qualifier (@in-lang, <in-lang>) and a reference qualifier (<at-ref>).

Items 1-3 above are required parts of any claim. Items 4-9 may be rendered as being required, optional, or disallowed by a <verb>'s definition. For example, a <verb> representing an idea that in normal discourse is intransitive (e.g., sleep) can be defined such that <object> is not allowed. Furthermore, a <verb> may be defined to restrict what kinds of objects or subjects are allowed. For example, the standard TAN verb lacks_text_at (see vocabularies/verbs.TAN-voc.xml) is defined to allow only scripta as a subject. An object is not allowed. A <claim> with this verb expects one or more <at-ref>s, which restrict the claim to a particular passage in a TAN-T file. A <verb> can specify that an object must be data, and it can also define the type of data allowed and its permitted lexical form. Claims may refer to other claims. That is, <claim>s can nest inside each other (e.g., X claims that Y claims that Z claims that...). Or a <claim> may take an @xml:id, whose value can then be cited as the object or subject of any other <claim>. If a <claim> is about a work or source in general, as a whole, one or more IDrefs may be placed in @subject or @object. But if the claim is about a specific part of the textual object, then more information is needed, so the attributes cannot be used. 
Such textual references come in three flavors: assertions pertaining to a work, assertions pertaining to a work in only some versions, and assertions pertaining to scripta. In the first case, <subject> or <object> must take @work, with IDrefs pointing to vocabulary items for <work>s. In the second case, @src is used, pointing by IDref to the applicable <source>s. In the third case @scriptum is used, pointing to vocabulary items for <scriptum>. Remember, you may combine commonly grouped IDrefs in an <alias>. A @work means that the claim applies to any versions of the work, whether a source or not; a @src specifies that the claim applies only to the specific <source>. In each case, <subject> or <object> may be given more attributes and elements to restrict the claim to specific parts of the work or source, with @ref, <tok>, @val, @pos, and @chars, following the conventions used in pointing to parts of texts (see ). If a <subject> or <object> points via @scriptum to a scriptum, specifying the claim necessarily takes a different approach than that used for @work or @src. Bear in mind that these guidelines encourage you to avoid scriptum-oriented methods of dividing class 1 files. Therefore, clarifying a portion of a scriptum (e.g., a particular manuscript folio number) requires an apparatus that likely does not correspond to a TAN file. For this reason, a <subject> or <object> with a @scriptum can be restricted through descendant <div>s that specify via @n and @type a specific region on the scriptum. These scriptum filters, unlike TAN-T <div>s, are always empty; their sole purpose is to point in native terms to a specific region on a scriptum. Multiple values in any component of a <claim> are distributed, which means that one <claim> might contain multiple assertions. For example,

<claim subject="A B" verb="taught promoted" object="X Y Z"/>

has within it twelve claims (the combinatory permutations of the three attributes' individual values). The exception to this general rule is @adverb, whose multiple values are taken as ampliative and restrictive. For example,

<claim subject="A" adverb="probably not" verb="taught" object="X"/>

is a single claim, not two, even though @adverb has two values. A limited set of verbs have been defined in standard TAN vocabulary; see . The strictures defined in these verbs are checked during Schematron validation. For a brief discussion on defining your own verbs in a TAN-voc file see .
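The distribution of multiple values within a <claim> can be pictured with a brief sketch. The function expand_claim and its dictionary output are hypothetical conveniences, not part of TAN; the sketch assumes only what is stated above, namely that the values of subject, verb, and object are combined exhaustively while the values of @adverb stay together.

```python
from itertools import product


def expand_claim(subject, verb, obj, adverb=None):
    """Distribute space-delimited values of subject, verb, and object
    into individual claims. The adverb value is kept whole, because its
    multiple values jointly qualify the verb."""
    return [
        {"subject": s, "verb": v, "object": o, "adverb": adverb}
        for s, v, o in product(subject.split(), verb.split(), obj.split())
    ]
```

Applied to the first example above, the sketch produces 2 × 2 × 3 = 12 claims; applied to the second, it produces a single claim whose adverb remains "probably not".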

Token-Based Annotations and Alignments (<code><link linkend="element-TAN-A-tok" ><TAN-A-tok></link></code>) TAN-A-tok files facilitate the microscopic alignment of two related sources. The format is intended to allow you to specify exactly where, how, and why two transcriptions align, and to do so on the most granular level possible. TAN-A-tok files also allow you to express levels of confidence or alternative opinions. The two class-1 sources should be two different versions of the same work. Most often, one will be a translation of the other, but the format can be used for two versions of the text in the same language, e.g., paraphrase, revision. Creators and editors of TAN-A-tok files should be able to read the languages of their sources and to explain as precisely as possible the relationship between the two sources. They should be prepared to think about and specify types of textual reuse. TAN-A-tok files tend to be more demanding to create and edit than are TAN-A files because of the level of detail involved. To simplify the file, token alignment is restricted to two texts, referred to jointly as a bitext. Each half of the bitext must be a TAN-T(EI) file. It is assumed that those two sources share some special relationship, direct or indirect, and relate through one or more types of textual reuse: translation, paraphrase, commentary, and so forth. Some of these bitexts, such as literal translations, may line up quite nicely word for word. Others, such as paraphrases, may line up sporadically, vaguely, ambiguously, or, in places, not at all. Annotating a bitext is oftentimes not easy, and requires you to consider and declare assumptions you have made in two key areas: the relationship that holds between two scripta and the types of reuse that were involved in turning one version into the other (or a common ancestor into both). Relationship of sources' scripta. What is the physical relationship or history that connects the two sources' scripta? Is one a direct descendant (copy) of the other? If not, what common ancestor do they share? Here you should consider the material aspect of the bitext, because you are trying to answer how object A's text relates to object B's. See . Types of reuse. What categories of text reuse do you consider operative? Users of your data should be informed of the paradigm you bring to your analysis. You may wish to keep your categories nondescript and somewhat vague, using generic terms such as translation, paraphrase, quotation, without much specificity. On the other hand, you may subscribe to a detailed view of text reuse. Perhaps you have adopted field-specific categories such as obligatory explicitation, optional explicitation, pragmatic explicitation, or translation-inherent explicitation. You may also wish to declare secondary types of reuse that may have intervened, such as scribal omission or dittography. You must declare at least one type of reuse; you may define your own or use those that are built into the TAN format. See .

Root Element and Header The root element of a token-based alignment file is <TAN-A-tok>. The TAN-A-tok header builds upon the core and class 2 headers (see and ). TAN-A-tok files take exactly two <source>s. The sequence is arbitrary. Each <source> must take an @xml:id. <vocabulary-key> takes, in addition to all the elements allowed in class-2 files (see ), two elements unique to TAN-A-tok: <bitext-relation> and <reuse-type>. The former describes the genealogical relationship between each source's scripta. The second attends to the qualitative aspect of the bitext relationship. See above.

Data (<code><link linkend="element-body"><body></link></code>) The <body> of a TAN-A-tok file takes, in addition to the customary optional attributes (see ), required @bitext-relation and @reuse-type, which take one or more IDrefs from <bitext-relation> and <reuse-type>, indicating the default values that govern the alignment. <body> has only one type of child: one or more <align>s, each of which collects sets of <tok>s from one or both sources, known collectively as a token cluster. Clusters may overlap, to handle translations in which words fall in one-to-one, one-to-many, many-to-one, and many-to-many relationships. The independence of token clusters allows you to register differences of opinion about the same set of tokens. An <align> may take an @xml:id, in case you or someone else wishes to refer to a particular <align>. Nothing should be inferred from silence in a TAN-A-tok file. There is no requirement that everything in a source must be encoded or described. In writing and editing a TAN-A-tok file you do not commit yourself to saying everything possible about the bitext. You might choose to encode only a few token clusters. Tokens that are not referred to should not be interpreted as gaps in a translation. All that can be inferred is that the creators and editors of the TAN-A-tok file have said nothing about the tokens. (See discussion on comprehensiveness.) In fact it is oftentimes preferable to have a TAN-A-tok file that points to only a selection of tokens; a file with tens of thousands of <align>s could take a very long time to validate. Any token may be the object of as many <align>s as you like. In fact, this is preferred if you wish to register competing claims or alternatives. If you wish to declare that one or more words in a source were omitted from a translation or inserted into one—that is, words in one source have no match in the other—you must do so through a one-sided alignment, i.e., a token cluster that has tokens from only one source. 
A one-sided alignment implies insertions or omissions. If there are multiple values in @reuse-type or @bitext-relation, the intersection, not the union, of those values is to be understood. For example, reuse-type="translation paraphrase" would indicate that the token cluster results from an activity that is both translation and paraphrase. Commonly, <tok>s include @ref, pointing to a leaf <div>. But this is not required. The @ref may point to a <div> that takes other <div>s, or @ref may be altogether absent.
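The notions of a one-sided alignment and of jointly applicable attribute values can be illustrated with a minimal sketch. It is not part of the TAN function library; the `Tok` tuple, the function names, and the source ids `grc` and `eng` are hypothetical stand-ins for a parsed <align>.

```python
from typing import NamedTuple


class Tok(NamedTuple):
    """Hypothetical stand-in for a <tok>: source id, reference, token value."""
    src: str
    ref: str
    val: str


def sources_in_cluster(cluster):
    """Return the set of source ids represented in a token cluster."""
    return {tok.src for tok in cluster}


def is_one_sided(cluster):
    """A cluster is one-sided when all of its tokens come from a single source."""
    return len(sources_in_cluster(cluster)) == 1


def joint_values(attribute_value):
    """Split a space-delimited attribute value.

    Every value listed is understood to apply at once (the intersection of
    the concepts), so the result is a set that must be satisfied as a whole.
    """
    return frozenset(attribute_value.split())


# Illustrative data only.
two_sided = [Tok("grc", "1.1", "logos"), Tok("eng", "1.1", "word")]
one_sided = [Tok("eng", "1.1", "the")]
```

Here `is_one_sided(one_sided)` is true, which under the guidelines implies an insertion or omission, whereas `two_sided` draws on both sources. `joint_values("translation paraphrase")` yields both values together, reflecting that the cluster is claimed to be both translation and paraphrase.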

Lexico-Morphology (<TAN-A-lm>) TAN-A-lm files are used to annotate the lexical and morphological character of individual tokens or morphemes. These files have two kinds of dependencies: a class-1 source (optional) and the grammatical rules defined in one or more TAN-mor files. This section therefore should be read in close conjunction with the section on TAN-mor files. TAN-A-lm files are either source-specific or language-specific. Source-specific TAN-A-lm files depend exclusively upon one class-1 source. Language-specific TAN-A-lm files depend upon an unknown number of sources. Some language-specific TAN-A-lm files might be based upon a small, specific corpus, others upon a vast, general one. Source-specific TAN-A-lm files are useful for analyzing closely one particular text. Language-specific ones are useful for building language resources for computer applications.

Principles and Assumptions Editors of TAN-A-lm files should understand the vocabulary and grammar of the languages of their sources. They should have a good sense of the rules established by the lexical and grammatical authorities adopted. They should be familiar with the conventions and assumptions of the TAN-mor files being used. Although you must assume the point of view of a particular grammar and lexicon, you need not hold to a single one. In addition, you may bring to the analysis your own expertise and supply lexical headwords unattested in published authorities. Although TAN-A-lm files are simple, they can be laborious to write and edit, more than any other type of TAN file. They can also be hard to read if the morphological codes are cryptic. It is customary for an editor of a TAN-A-lm file to use tools to create and edit the data.

Root Element and Header The root element of a lexico-morphological file is <TAN-A-lm>. If the file is source-specific, <source> points to the one and only TAN-T(EI) file that is the object of analysis. If the file is language-specific, <for-lang> is used in the declarations section of the <head> to indicate the languages that are covered. For language-specific TAN-A-lm files, this part of the <head> may also include <tok-starts-with> and <tok-is>, which improve performance when validating and processing numerous or large files. There is at present no mechanism for automatically reconstructing the corpus that underlies a language-specific TAN-A-lm file. Such a mechanism may be provided in a future version of TAN. <vocabulary-key> takes the elements other class-2 files take. It also permits two elements unique to TAN-A-lm: <lexicon> (optional) and <morphology> (mandatory). Any number of lexica and morphologies may be declared; the order is inconsequential. There is, at present, no TAN format for lexica and dictionaries. So even if a digital form of a dictionary is identified through the IRI + name pattern, the Schematron validation routine will not attempt to check the TAN-A-lm data against the lexical authorities cited. Because you or other TAN-A-lm editors are likely to be authorities in your own right, <person> can be treated as if it were a <lexicon>, and be referred to by @lexicon.

Data (<body>) The <body> of a TAN-A-lm file takes, in addition to the customary optional attributes found in other TAN files, @lexicon and @morphology, to specify the default lexicon and grammar. <body> has only one type of child: one or more <ana>s (short for analysis), each of which matches one or more tokens (<tok>) to one or more lexemes or morphological assertions (<lm>, which takes <l>s and <m>s). An <ana> may take a @tok-pop, to specify the number of tokens that the assertion applies to. This is particularly helpful for language-specific files based upon a limited corpus of texts, where the underlying data for the assertion might be difficult or impossible to retrieve. The token population can be used to assign levels of certainty, or to compare statistical profiles of one TAN-A-lm file against another. If you wish to point to a linguistic token that straddles more than one token, you should use multiple <tok>s, wrapping them in a <group>. Any token may be the object of as many <ana>s as you like. In fact, this is preferred if you wish to register competing claims or alternatives. Claims within an <ana> are distributed. That is, every combination of <l> and <m> (governed by <lm>) is asserted to be true for every <tok> or <group>. Many TAN-A-lm files will be generated by an algorithm that automatically lists all possible morphological values of each token. It is advised that such automatic calculations always include in their output @cert, with weighted values. That is, if an algorithm identifies two possible lexico-morphological profiles for a word, but one occurs nine times as often as the other, then it is advised that this be reflected in the two resultant elements, e.g.: <lm cert="0.9">...</lm> and <lm cert="0.1">...</lm>. If an algorithm is written with a more sophisticated way to weigh possibilities, then adjust the value of @cert accordingly.
Be certain that the <algorithm> is credited in the <vocabulary-key> and in a <resp>. As with TAN-A-tok files, not every word needs to be explained or described. In fact, this is often undesirable, to avoid files that are overly long and time-consuming to validate or process. A TAN-A-lm file is rendered more efficient when claims can be grouped. If a particular token always has a particular lexico-morphological profile, this can be declared once, in a <tok> that does not have @ref, or it can be specified through a compound @ref. You do not need to provide a <tok> for every leaf <div>. In fact, such an approach can result in inefficient validation and processing. For example, in early versions of TAN, the lexico-morphological values of the Greek Septuagint (8.3 MB) were converted to a TAN-A-lm file of 407,811 <tok>s grouped in 52,703 <ana>s (25.8 MB). Early 2020 validation routines took about 25 minutes (2018 validation routines took hours). That particular TAN-A-lm file itemized every single token in the text. It was revised to be more declarative along the lines advocated above. If a particular token had only one lexico-morphological profile throughout the text, then every instance was reduced to a single <ana>, with no @ref in <tok>. When a particular token value had different lexico-morphological profiles, @ref targeted the rootmost <div>. This revision resulted in a smaller file (15.8 MB; 158,376 <tok>s in 54,335 <ana>s) that validated in about a third of the time (8.5 minutes). In general, there is always a trade-off between convenience and efficiency. If your priority is speed, you should break a large file into several smaller ones, perhaps recombining them in a master file via <inclusion>.
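Two of the points above, weighted @cert values and the distribution of claims within an <ana>, can be sketched in a few lines. This is only an illustration under simplifying assumptions: the function names and the profile labels are hypothetical, relative frequency is just one possible weighting, and a single <lm> is assumed to govern the combinations.

```python
from itertools import product


def cert_weights(counts):
    """Map each lexico-morphological profile to its relative frequency.

    `counts` maps a profile label to the number of times it was observed.
    The returned values could be written out as @cert on each <lm>.
    """
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("at least one observation is required")
    return {profile: round(n / total, 4) for profile, n in counts.items()}


def distribute_claims(toks, lexemes, morphs):
    """Every combination of <l> and <m> is asserted for every <tok>."""
    return list(product(toks, lexemes, morphs))


# A profile seen nine times against an alternative seen once.
weights = cert_weights({"profile-a": 9, "profile-b": 1})

# Two tokens, one lexeme, two morphological codes: four distributed claims.
claims = distribute_claims(["tok1", "tok2"], ["lexeme1"], ["code1", "code2"])
```

An algorithm with a more refined model of likelihood would replace `cert_weights` with its own calculation; the distribution of claims is unaffected by how the weights are derived.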

Class-3 TAN Files, Varia This chapter provides general background to class-3 TAN files, which are devoted to formats that do not fit the other two classes. Detailed discussion of specific elements and attributes is given elsewhere in these guidelines.

Vocabulary (TAN-voc) All too often, a project has a set of vocabulary it draws from time and again. To repeat the IRI + name pattern can be both tedious and treacherous. If a project with hundreds of TAN files decides to change or augment its vocabulary, it could take a long time to find and make all the changes, everywhere and consistently. The TAN-voc format addresses that problem. It is intended to allow a project to define, edit, and augment the IRI + name patterns for recurrent vocabulary. TAN supplies several standard TAN-voc files under the subdirectory vocabularies, supporting commonly used concepts such as token definitions, div types, licenses, and many more. A complete list of predefined TAN keywords is given elsewhere in these guidelines. It is quite common for a person or team to build vocabulary items in the course of developing a corpus, which means that TAN-voc files tend to change as the project progresses. You can organize your vocabulary in whatever manner makes sense. You might create one large TAN-voc file for all vocabulary or one file per type of vocabulary, each independent of the other. Each approach has strengths and weaknesses. The latter, one TAN-voc file per type of vocabulary, can create quite a bit of extra work. Every TAN file that draws from the vocabulary must insert one <vocabulary> for each relevant TAN-voc file. The best approach we have found is to have one relatively small master TAN-voc file, which includes other TAN-voc files via <inclusion>s (along with <group include="[IDREFS]"/> or <item include="[IDREFS]"/>, pointing to the IDrefs of the included TAN-voc files). More details on how this format relates to other TAN formats are given elsewhere in these guidelines.

Root Element and Head A TAN-voc file has <TAN-voc> as the root element. The <vocabulary-key> of a TAN-voc file takes, in addition to core vocabulary items, any number of <group-type>s. A TAN-voc file may draw directly from the vocabulary in its body, as if it were referring to itself via <vocabulary>.

Data (<body>) The <body> of a TAN-voc file consists simply of <item>s or <verb>s, perhaps gathered into groups via <group> or @group. These groups have, at present, no effect upon other TAN files that use them, but they have been valuable in certain applications. For example, the standard TAN-voc file for <div-type> (vocabularies/div-types.TAN-voc.xml) groups textual division types into a rudimentary typology that allows applications to decide programmatically whether a particular division should be treated as a block or inline element, or whether it should be indented. @affects-attribute and @affects-element, both weakly inheritable, define the scope of the vocabulary items, i.e., which elements or attributes the items can legitimately be used for. A vocabulary item will be eligible only for the specified attributes or elements. Nearly all <item>s in a TAN-voc file contain the IRI + name pattern. The only exceptions are <item>s pertaining to token definitions, which instead of <IRI>s take <token-definition>s. <verb> includes, in addition to the IRI + name pattern, the option to have <constraints> added. Those constraints define what components are permitted in any <claim> that uses the <verb>. At this time, verb constraints are at an early phase of development. Only those constraints that mirror standard TAN vocabulary for verbs, vocabularies/verbs.TAN-voc.xml, will be supported during validation. Study that file for examples of how to build a <verb>. The use of verbs in a TAN-A file is discussed elsewhere in these guidelines.

Morphological Concepts and Patterns (TAN-mor) TAN-mor files are used to delineate the morphological characteristics or features of a given language, to assign codes to those features, and to define rules governing the application of those codes. The format is a kind of Schematron for the grammar of human languages. It allows specificity, flexibility, and responsiveness. Grammatical rules may be constructed to return warnings and error messages to users who use a code or pattern incorrectly, or not in accordance with best practices. Such rules may be qualified, or made contingent upon certain conditions. This chapter should be read in close conjunction with the section on TAN-A-lm files.

Principles and Assumptions Certain assumptions and recommendations are made regarding morphology files, complementing the more general ones given elsewhere in these guidelines. TAN-mor files are restricted exclusively to describing the categories and rules for the grammar of a natural language. Editors of these files should be well versed in the grammar of the languages they are describing, and generally acquainted with how the grammars of comparable languages work. The TAN-mor format has been designed with the assumption that patterns of word inflection and formation can be categorized, classified, named, and described. It has also been assumed that scholars may reasonably differ, perhaps radically, on how categories should be defined and applied. TAN-mor allows scholars to declare clearly their operative assumptions and views. It is up to other users to decide whether or not to adopt them. The TAN-mor format has also been designed to cater to two different approaches to morphological codes: categorized or uncategorized. Categorized codes are interpreted according to position. a b c would mean something different than c b a. For example, Perseus adopts categorized codes for morphological analysis of Greek, Latin, and other highly inflected languages. Every code has ten positions, each one corresponding to a major grammatical category, with the first two being the major and minor parts of speech, and the subsequent categories devoted to person, number, tense, and so forth. Each word that is analyzed must have a value in every position, even if a hyphen or null. A d in one position means something different from a d in another. Uncategorized codes, on the other hand, assign one unique code to each grammatical feature. In this approach, codes may be combined and arranged at will. a b c would be identical to c b a.
This approach is viable for any language (including highly inflected ones such as Greek or Latin), but it is in practice most often found serving languages that are not highly inflected, e.g., the Brown and Penn sets for English. TAN-mor morphological codes may not include either the space or the hyphen, and unlike IDrefs, they are case insensitive. The codes NOUN and noun are interchangeable.
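The difference between the two approaches, along with the case-insensitivity of codes, can be shown by comparing how a space-delimited code string might be reduced to a comparison key. This is a sketch only; the function names are hypothetical and nothing here reproduces the TAN function library.

```python
def normalize_code(code):
    """Morphological codes are case insensitive, so compare them in lower case."""
    return code.lower()


def categorized_key(codes):
    """Categorized codes: position matters, so keep the order as a tuple."""
    return tuple(normalize_code(c) for c in codes.split())


def uncategorized_key(codes):
    """Uncategorized codes: order is irrelevant, so reduce to a set."""
    return frozenset(normalize_code(c) for c in codes.split())


same_when_uncategorized = uncategorized_key("a b c") == uncategorized_key("c b a")
same_when_categorized = categorized_key("a b c") == categorized_key("c b a")
same_despite_case = uncategorized_key("NOUN") == uncategorized_key("noun")
```

Under these assumptions `same_when_uncategorized` and `same_despite_case` are true, while `same_when_categorized` is false, because the first position of a categorized code holds a different value in each string.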

Root Element and Header The root element of a morphological rule file is <TAN-mor>. Zero or more <source>s describe the grammars or related works that account for the morphological rules. If the categories, codes, and rules are not based upon any published work, then <source> may be omitted. Any TAN-mor file without a source may be inferred to be based upon the personal knowledge of the persons or organizations identified in <file-resp>. <vocabulary-key> is populated with <feature>s, the grammatical concepts allowed in the language, which are assigned codes via @xml:id. Because a grammatical feature is not allowed in a TAN-mor file until it is explicitly declared in a <feature>, @xml:id might simply repeat the value of @which. TAN has a standard vocabulary file for grammatical features: vocabularies/features.TAN-voc.xml. This vocabulary file encodes 746 vocabulary items corresponding to core grammatical features declared in the OLiA Reference Model for Morphology, Morphosyntax and Syntax. If you wish to incorporate into your codes characters that are not allowed in @xml:id, e.g., $ or :, you should create an <alias>, whose @id allows such values. <alias> can of course also be used to assign multiple grammatical features to a single id.

Data (<body>) The <body> of a TAN-mor file takes the customary optional attributes found in other TAN files. Within <body>, you begin with a language declaration: one or more <for-lang>s. After the language declaration come rules: zero or more <where>s declaring rules to be followed for the feature codes. <where> has attributes that establish the context under which its enclosed rules are operative. Those rules are found in the enclosed <assert>s or <report>s, which declare rules that must be followed, or must never be followed, by any dependent TAN-A-lm file. An <assert> or <report> will be checked only if the conditions declared by the attributes in the enclosing <where> are met in the context of a given <m>: @m-matches (regular expression): <m> matches the pattern; @tok-matches (regular expression): one of the values of <tok> in the given <ana> matches the pattern; @m-has-features (space-delimited strings): <m> has the specified features; @m-has-how-many-features (integer): <m> has the given number of features. An <assert> also has one or more of the truth conditions above. If the test proves false in a given <m> then the <m> will be marked as erroneous and the message included by the <assert> will be returned. <report> has the same effect, but the test looks for the opposite boolean value: the error and message will be returned only if the test proves true. After the rules comes a structure declaration (if relying upon structured codes): zero or more <category>s. Each one sorts <feature>s into groups, assigning them @code values that are unique within the <category>. Sequence is important. The first <category> defines the features allowed in the first code position, the second in the second, and so forth. See sample TAN-mor files in the examples directory.
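The interplay of <where>, <assert>, and <report> can be modeled roughly as follows. The sketch is illustrative and rests on several assumptions that the guidelines do not state: conditions on a single element are treated as jointly required, Python's `re.search` stands in for the regular expression engine actually used, and the class names, function names, and rule messages are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Conditions:
    """The four condition attributes shared by <where>, <assert>, and <report>."""
    m_matches: Optional[str] = None
    tok_matches: Optional[str] = None
    m_has_features: FrozenSet[str] = frozenset()
    m_has_how_many_features: Optional[int] = None

    def hold(self, m: str, toks: List[str]) -> bool:
        """True if every declared condition is met (assumed conjunctive)."""
        features = m.lower().split()
        if self.m_matches is not None and not re.search(self.m_matches, m):
            return False
        if self.tok_matches is not None and not any(
            re.search(self.tok_matches, t) for t in toks
        ):
            return False
        if not {f.lower() for f in self.m_has_features} <= set(features):
            return False
        if (self.m_has_how_many_features is not None
                and len(features) != self.m_has_how_many_features):
            return False
        return True


@dataclass(frozen=True)
class Rule:
    kind: str          # "assert" or "report"
    test: Conditions
    message: str


def check(where: Conditions, rules: List[Rule], m: str, toks: List[str]) -> List[str]:
    """Return the messages triggered for one <m> in the context of its tokens."""
    if not where.hold(m, toks):
        return []      # the rules are inoperative outside the <where> context
    messages = []
    for rule in rules:
        result = rule.test.hold(m, toks)
        # An assert complains when its test is false; a report when it is true.
        if (rule.kind == "assert" and not result) or (rule.kind == "report" and result):
            messages.append(rule.message)
    return messages


# Hypothetical rules for an invented tagging scheme.
where_verb = Conditions(m_has_features=frozenset({"verb"}))
verb_rules = [
    Rule("assert", Conditions(m_has_how_many_features=3),
         "A verb code should list three features."),
    Rule("report", Conditions(m_matches=r"\bdual\b"),
         "The dual is not used in this hypothetical scheme."),
]
```

With these rules, `check(where_verb, verb_rules, "verb pres", ["runs"])` returns only the assert's message, whereas a code such as `noun dual` returns nothing, because the enclosing condition is not met and neither rule is consulted.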

TAN Catalog Files (collection) TAN catalog files are used to locate relevant TAN files and to support the XSLT function collection(). They catalog or index any TAN files within a local directory and perhaps its subdirectories. These catalog files must always be named catalog.tan.xml. They depart from all other TAN files in their structure. They have no namespace. They have neither body nor head. Rather, they are patterned off the catalog.xml description provided by Saxonica. Any XML file passed to the stylesheet applications/create/create TAN catalog file.xsl will automatically generate one of these files, cataloging all the files in the local directory. The root element of a catalog file is <collection>, with children <doc>s that hold simple metadata about the TAN files that are in a directory and its subdirectories. Only TAN files may be registered in a <doc>. A <doc> may include other material such as each file's resolved <head>, but this is not mandated.
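To give a sense of the shape of a catalog file, the following sketch builds a <collection> of <doc>s for a directory tree. It is not the supplied stylesheet and makes several assumptions: the attributes `stable` and `href` follow the general pattern of Saxon collection catalogs, a file is treated as a TAN file merely because its root element name begins with `TAN-` (TAN-TEI files, whose root is a TEI element, are therefore overlooked), and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def local_name(tag):
    """Strip any '{namespace}' prefix that ElementTree adds to a tag."""
    return tag.rsplit("}", 1)[-1]


def build_catalog(directory):
    """Return a <collection> element listing TAN files under `directory`."""
    root_dir = Path(directory)
    collection = ET.Element("collection", {"stable": "true"})
    for path in sorted(root_dir.rglob("*.xml")):
        if path.name == "catalog.tan.xml":
            continue                      # do not catalog the catalog itself
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            continue                      # skip files that are not well-formed
        if local_name(root.tag).startswith("TAN-"):
            href = path.relative_to(root_dir).as_posix()
            ET.SubElement(collection, "doc", {"href": href})
    return collection


def write_catalog(directory):
    """Write the catalog under its required name, catalog.tan.xml."""
    tree = ET.ElementTree(build_catalog(directory))
    target = Path(directory) / "catalog.tan.xml"
    tree.write(target, encoding="utf-8", xml_declaration=True)
    return target
```

For actual projects the supplied stylesheet remains the better choice, since it knows how to recognize every TAN format and can add each file's metadata to its <doc>.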

Working with the Text Alignment Network Working with TAN Files This chapter presents ways to manage, create, edit, and share TAN files. The material discussed here is non-normative. That is, these are suggestions based upon the experience of TAN users. Descriptions in this chapter are both brief and general. To understand better the underlying framework, study the files in the subdirectory functions, or their reformatted versions elsewhere in these guidelines.

Local Setup TAN can be downloaded from the project's master data repository. The project has been developed using the version-control software Git. Whether you download the files directly or use Git, place the TAN code wherever is most convenient on your computer. The TAN files you create may be set up in whatever structure you want. Because TAN files are meant to be shared and interlinked, it is beneficial to develop predictable directory structures. In the 2018 version of TAN, advice was given on how to organize directories and files. But experience with a variety of projects, each with its own needs and preferences, has shown that such advice is shortsighted. One point does still seem valid, however: keep your TAN libraries separate from the core TAN files. Many TAN projects will find it necessary to work with dozens of versions of a particular work, and it is easy to get confused as to what file does what. In projects with many text versions, it is recommended that your names for class-1 files (the filename, not the @id) start with an acronym or short abbreviation for the author and work, followed by the language code, the last name of the editor/author of the scriptum, and the date when the scriptum was created or published. If you have multiple TAN files that refer to each other via <redivision>, because each has a different reference system, you may need to include that in the filename. Some examples: ar.cat.grc.1949.minio-paluello.ref-logical.xml (Aristotle's Categories, in Greek, 1949, edition by Minio Paluello, following a reference system based on semantic units [paragraphs, sentences, independent clauses]). apocr.eng.kjv.1760.xml (apocrypha, English, King James Version, 1760 edition). tlg0059.tlg031.perseus-grc1-Pl.Ti.xml (Plato's Timaeus in Greek). This filename has some duplication in that tlg0059 already implies Pl and tlg031, Ti, but only die-hard users of the Thesaurus Linguae Graecae know the meaning of the numerical codes.
pl.ti.grc.1905.burnet.stephanus.xml (Plato's Timaeus in Greek). This filename is an alternative way to construct the previous example. Class-2 files are tougher. They bring together multiple files and concepts, so filenames could become very long or unpredictably structured, especially if trying to express which class-1 sources they use. At this time, the best recommendation is to make sure that each class-2 file is put into its own subdirectory, separate from class-1 files, and given a brief but meaningful name that points to the research question that motivated its creation. Some examples: ar.cat.grc.1949.minio-paluello-sem-TAN-LM-sample.xml (a sample of lexico-morphological data for Aristotle's Categories, in Greek). nt.grc-syr.selections.TAN-A-tok.xml (a selection of word-for-word correspondences between the Syriac and Greek New Testaments). plato.general.TAN-A.xml (a general alignment and annotation file on Plato's works). Class-3 filenames are a bit easier. It is recommended that TAN-mor files begin with the language code, then an acronym for the person or group responsible for creating the features. TAN-voc files are written generally to serve a specific project or collection, so the collection name and the type of vocabulary should suffice. Examples: eng.example.com,2014.1.xml (tagging scheme #1 for English, by the owner of the domain example.com in 2014). ar.cat.general.TAN-voc.xml (general vocabulary items for a project for Aristotle's Categories). If you have a local copy of someone else's TAN collection, and you wish to create TAN files that depend on them, you are in all likelihood going to use relative URLs pointing to copies of the files stored on your local drive. It is recommended that you also point to the master versions through absolute URLs in extra <location>s. The validation routine checks only the first document available. From time to time, you might comment out the first <location> and run the validation process again.
This will tell you if there have been any updates since you last accessed the file. Alternatively, occasionally validate the other TAN files you have downloaded. If the <master-location> is intact, you will be notified of any updates. In a given project, you are likely to repeat basic information, particularly <person>, <role>, and <work>. If you find yourself repeating such elements, especially ones with the IRI + name pattern, consider moving them to a project TAN-voc file. It is almost always preferable to develop TAN-voc files before resorting to <inclusion>s. <inclusion>s are powerful, but they can quickly become complex and confusing to navigate.

Using TAN with Oxygen XML Editor If you use an advanced XML editor such as oXygen, you can set up a project so that TAN validation files can be easily located and validation can be automatically applied. A sample oXygen project file is included within the TAN library to get you started. You may wish to create a copy of that project file for yourself before developing it. TAN also includes select oXygen framework files, which provide editing tools for oXygen's Author mode. The Author mode includes a variety of editing tools, primarily for class-1 files. After opening the supplied oXygen project file, tan.xpr, use Author mode to view a sample TAN file and look at the options in the menu, the toolbars, and the context-click menu, to see what is possible. Both the project file and the framework files are in their infancy, and are therefore incomplete and imperfect. They have tremendous potential for development, slated for future versions of TAN.

Creating and populating TAN files TAN is a representational format. Every TAN file models some source. If those sources are non-digital, it is a relatively straightforward task to create and populate a TAN file. You just start editing everything by hand. In some cases, you might get a head start with an algorithm. For example, optical character recognition (OCR) on an edition might give you a dirty but useful start for a TAN-T file. Applying OCR to a printed index of quotations might get you the basic start to a TAN-A file. Despite the computer's assistance, the majority of the task will be spent in correcting any conversions. Thoughtful, scholarly attention is critical to making these files suitable for use. In many other cases, you are trying to take something that already exists digitally and convert it into a TAN format. If you find a Word file, a web page, or a plain text file that can serve as the basis for a TAN file, a common first impulse is to copy the desired content, paste it into the body of an empty TAN file, then manually correct the material. That solution is quick and easy, but short-sighted. You may find that you made a major mistake, and you have done so much work that you cannot backtrack. Perhaps you have accidentally deleted all punctuation when you didn't mean to. Or you eliminated line breaks that you didn't realize at the time were useful signals about where <div>s should be separated. Even if all goes well, after all that hard work you might find out that the pre-TAN data sources you started out with have been updated, with other things corrected. If any significant time has elapsed, you may have forgotten what procedure you followed to convert the data. And even if you remember, you will have to repeat the steps again, and dread the day when those pre-TAN sources are updated yet again. In these cases, it is advised to think not about fixing the files, but rather about developing a system to fix the files.
Your goal is to create a digital pipeline/workflow that can be applied when needed, so that changes to those pre-TAN versions can be channeled into your TAN library. If you or a project member has experience in XSLT, it is a good idea to develop stylesheets to convert the data to TAN. When you find mistakes such as those described above, no harm is done. You can simply adjust and re-run your process, each time getting better and better results. An XSLT-based approach requires extra work, initially. Establishing a stable transformation process can be time consuming, since it requires repeated sequences of trial, error, and diagnosis. But the investment pays off in the long run, especially if you are dealing with dozens, hundreds, or thousands of files. The routines you write for one set of files might be useful for the next. Here is one approach. Create a template skeleton TAN file that resembles your desired output. Develop an XSLT stylesheet that does the following: Fetches the pre-TAN file (main input). Puts the main input in an XML tree, then applies select alterations. Fetches the template TAN file. Pushes the altered pre-TAN content into the template file. Saves the infused template, either as the primary output, or as a result document. One of the challenges to this method is that the pre-TAN input might not be XML, in which case it cannot be the initial, catalyzing input to the XSLT file. But that is fine. For such conversions, you can make your XSLT file a MIRU (main input resolved uris) stylesheet. A MIRU stylesheet has as its catalyzing input any XML file, including itself. That initial, catalyzing input is unimportant, because a MIRU stylesheet, through global parameters and variables that point to resolved uris, fetches the main input. For an example of a MIRU stylesheet, see applications/compare/compare TAN class 1 files.xsl.
The method described above has been used successfully to handle several different kinds of conversion, including ones where the source files are updated very frequently. In such scenarios, the traditional cut-paste-and-edit method is not only unproductive; it is foolish. Writing transformations can be laborious at first. Finding the best way to handle and manipulate a pre-TAN file is an intellectual challenge with multiple solutions. But there is a good chance that some of the labor you have in mind has already been done for you in a TAN function or application (see the subdirectories functions and applications).
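The template-infusion steps described above are recommended as an XSLT stylesheet, but their logic can be shown in any language. What follows is a sketch in Python rather than XSLT, under loud assumptions: the template is a drastically simplified, namespace-free skeleton rather than a valid TAN-T file, the @id value and the div type `para` are hypothetical, and the pre-TAN input is taken to be plain text with blank lines between paragraphs.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical skeleton; a real TAN-T file has a namespace and a full <head>.
TEMPLATE = """<TAN-T id="tag:example.com,2020:sample">
  <head><name>Template</name></head>
  <body/>
</TAN-T>"""


def text_to_divs(raw_text):
    """Put the main input into an XML tree: one <div> per paragraph."""
    blocks = re.split(r"\n\s*\n", raw_text)
    divs = []
    for n, block in enumerate((b for b in blocks if b.strip()), start=1):
        div = ET.Element("div", {"type": "para", "n": str(n)})
        div.text = " ".join(block.split())   # collapse line wraps and extra spaces
        divs.append(div)
    return divs


def infuse(template_xml, divs):
    """Fetch the template and push the altered content into its <body>."""
    root = ET.fromstring(template_xml)
    root.find("body").extend(divs)
    return root


def convert(raw_text, template_xml=TEMPLATE):
    """Run the whole pipeline and return the infused template as a string."""
    return ET.tostring(infuse(template_xml, text_to_divs(raw_text)), encoding="unicode")
```

Because every alteration lives in a function, a mistake such as an unwanted normalization is corrected in one place and the conversion is simply run again when the pre-TAN source changes. The same division of labor applies to a stylesheet.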

The TAN Validation Process A TAN file is validated when it, along with its associated TAN schemas, is passed to a validation engine. Validation can be set up either by pointing explicitly to the schemas within a TAN file (via <?xml-model ?> statements in the prolog), or by setting up an oXygen project or framework to automatically apply the schemas to TAN files. There are two types of TAN validation. Structural validation is conducted through RELAX-NG files that define the attributes, elements, and patterns that are allowed or required in a given TAN format. These files are kept in the schemas project subdirectory. If you are editing a TAN-T file, for example, its RELAX-NG schema is schemas/TAN-T.rnc. The RELAX-NG files are written principally in the compact syntax (.rnc), then converted to XML syntax (.rng). The TAN-TEI format is an exception. The schema begins with schemas/TAN-TEI.odd. This file, linked as it is with the other RELAX-NG files, is processed by TEI stylesheets to generate the master TAN-TEI.rnc and TAN-TEI.rng files that validate TAN-TEI files. The ODD file is processed against TEI All, the largest of the TEI formats, in the version available at the time of the release of a given TAN version. The second type of validation uses Schematron to check rules that cannot be expressed in RELAX-NG, e.g., no @when should have a date in the future. More than one hundred types of errors are checked during Schematron validation. A comprehensive list is given elsewhere in these guidelines. Some of these errors can be quite time-consuming for a computer to check. For example, if a class-1 file has a <redivision>, the texts of the two files should be identical. On short texts, the test can be made in seconds; on longer texts it might take minutes. Therefore Schematron validation allows three different levels: terse, normal, and verbose. The names reflect not only how long each phase takes but how much feedback is provided. The Schematron files themselves are rather small.
The majority of the work is done by a large library of XSLT code that takes the file, resolves it, and expands it, inserting errors and help messages along the way. A greatly reduced version of the expanded file is then passed back to the Schematron processor as a global variable. The Schematron processor returns as messages any errors or warnings found in the generated file, and for any suggested corrections (also embedded as children), it returns a Schematron Quick Fix. TAN's Schematron validation is more computationally intensive than its RELAX-NG validation. The longer and more complex your file and its dependencies, the longer its validation will be. Files such as the Ring-a-roses examples in the examples subdirectory will take a split second to validate, but a TAN-T file of the Old Testament of the King James Version has been known to take about 33 seconds to validate in the normal phase (the whole Bible about a minute). A TAN-A-lm file with a full morphological analysis of that long TAN-T file will take a long time to validate. Tests were performed on a TAN-A file that had three very large TAN-T sources (each about 1.6 MB and 8,100 elements). If the TAN-A file had 125 claims, Schematron validation under the normal phase took about 13 seconds (run on oXygen 22.1 on a Windows 10 laptop with an Intel i5-8250U @ 1.60GHz). When the number of claims was expanded to 546, the same process took 63 seconds. When the file had 5,421 claims, the file took 78 minutes, 45 seconds to validate. Much of the increase is due to the Schematron process itself. The XSLT component of the three tests above took up 8.3 seconds, 27.1 seconds, and 23 minutes 57 seconds, respectively. The time taken by the Schematron component grows faster than that of the XSLT component. In future versions of TAN this process will be further optimized (the figures above are a very significant improvement over 2018 figures). For now, you must make decisions that pit speed against convenience.
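The timings just reported can be examined with a little arithmetic. The sketch below uses only the figures quoted above. The one added assumption, which is not a measurement, is that a file of roughly 126 claims would validate in about the 13 seconds observed for the 125-claim file, so that a hypothetical split of the largest file into 43 parts can be estimated. The variable and function names are hypothetical.

```python
# Total Schematron validation time in seconds, keyed by number of claims.
TOTAL_SECONDS = {125: 13.0, 546: 63.0, 5421: 78 * 60 + 45}

# Portion of each run spent in the XSLT component, in seconds.
XSLT_SECONDS = {125: 8.3, 546: 27.1, 5421: 23 * 60 + 57}


def seconds_per_claim(total):
    """Average validation cost of one claim at each file size."""
    return {claims: seconds / claims for claims, seconds in total.items()}


def schematron_share(total, xslt):
    """Fraction of each run not accounted for by the XSLT component."""
    return {claims: (total[claims] - xslt[claims]) / total[claims] for claims in total}


def split_ratio(claims, parts, seconds_per_part, total):
    """Estimated time for `parts` small files relative to one large file."""
    return (parts * seconds_per_part) / total[claims]


per_claim = seconds_per_claim(TOTAL_SECONDS)
share = schematron_share(TOTAL_SECONDS, XSLT_SECONDS)

# 5,421 claims divided among 43 files is about 126 claims per file.
ratio = split_ratio(5421, parts=43, seconds_per_part=13.0, total=TOTAL_SECONDS)
```

The cost per claim rises from roughly a tenth of a second in the smallest file to nearly nine tenths of a second in the largest, and the share of time spent outside the XSLT component rises with it. Under the stated assumption, `ratio` comes to a little under 0.12.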
If you wish to have validation happen quickly, break files into smaller ones, perhaps to be joined later in a single TAN file via <inclusion>s. Validating ten component files each with ten thousand elements will take less time in aggregate than validating one long file with one hundred thousand elements. Had the example TAN-A file mentioned above been split into 43 different files, the entire collection would have been validated in less than 12% of the time.

The process behind Schematron validation can be used not only for validation but for other applications, so it should be explained. Any TAN file that is processed by the TAN XSLT library goes through two major transformations.

The first transformation resolves the file. The goal is to get the file into a state where it can be evaluated without having to consult any <vocabulary> or <inclusion> dependencies. (See for background on TAN's approach to inclusion.) This process also does some basic file-specific normalization; it will:

1. Prepare the file. This includes evaluating <alias>, stamping the root element with a base URI (the path location of the file), and stamping every element with a @q (an arbitrary name), which contains a unique identifier. This identifier is used by the Schematron file to match an element with any error messages in the corresponding element in the XSLT output.
2. Identify those nodes that need to be changed by <vocabulary> or <inclusion> dependencies.
3. Insert required components from <vocabulary>s or <inclusion>s through the following method:
   a. Relevant external vocabulary items are inserted into the <head>, either as descendants of the appropriate <vocabulary> or, if derived from TAN standard vocabulary, as new <tan-vocabulary> elements immediately following the <vocabulary-key>. All vocabulary items are imprinted with an <id>, to facilitate rapid retrieval of vocabulary.
   b. Any vocabulary <name> that is not normalized is given a copy that is name-normalized (signaled by @norm): lower-case, hyphens and underscores changed to spaces, and space-normalized.
   c. Any element with @include is replaced by the elements of the same name found in the target inclusion document. In addition, <inclusion> is populated with any vocabulary items required to resolve the newly included material (recursively, if that inclusion requires other inclusions). This last point is important, because all IDrefs must be interpreted in light of the original context. IDrefs are brought into the host document, so when you use <inclusion> you must ensure there are no id conflicts.
4. Normalize all numbers in original components (i.e., excluding included elements or vocabulary items) as Arabic numerals.

Files are resolved recursively. That is, no <vocabulary> or <inclusion> components are imported until the files pointed to are themselves first resolved. Numerals fall at the end of the process because they might need to be resolved in light of resolved vocabulary and inclusions.

The description above is necessarily generalized. For details consult the function library, particularly . In cases where there is a conflict between the code and the description above, the code is to be interpreted as more current and authoritative.

The second transformation expands the file. The goal is to unpack the components of a resolved document and identify any errors along the way (see the master list of errors). There are three levels of expansion, corresponding to the three levels of Schematron validation: terse, normal, and verbose.

In terse expansion, for each value of an attribute, an element with the attribute's name is placed within the parent (e.g., @type="a b" produces <type>a</type> and <type>b</type>). If the value is an IDref, and it points to an alias, a copy is made for the IDref of each target vocabulary item.
If an id reference does not point to a vocabulary item of the expected type, an error message is also copied in the parent. Any values that are ranges are expanded, if need be. Select networked files are checked for basic validity. Class-2 files include a special set of rounds during terse validation, where their sources are adjusted, and then checked against specific references made in the class-2 file. (See .) In terse expansion, all pointing mechanisms are checked, to make sure they point to a valid location. Because of this basic requirement, some terse expansion can take a long time on lengthy files, or ones with complex <adjustments>.

Normal expansion builds on terse expansion by interrogating networked files more closely. Any errors that were reported during the terse stage but were suppressed to avoid clutter are enabled.

Verbose expansion generally attends to procedures that are complex, or are not critical to validation. For example, a <model> of a class-1 file will be checked, to find references that one file has but the other lacks. A class-1 <redivision> will be analyzed, to make sure that the two transcriptions are identical. A catalog file in the same directory will be checked, to see if it has faulty entries.

Many errors lend themselves to solutions that can be recommended by the TAN function library. Some solutions are returned to the Schematron validation method as Schematron Quick Fixes (SQFs). XML editors that are equipped to handle SQFs (e.g. oXygen XML Editor) can then prompt users to fix an errant section with a quick replacement. For example, if text has not been NFC Unicode-normalized, an SQF will allow a user to make the change in two clicks. Thus, TAN validation does not merely tell you what the problems are; it tries to help fix them.

The term "expansion" describes the process but possibly not the output.
If the global parameter $is-validation is true, then in the course of expanding the file the TAN templates will abandon any parts that are no longer needed. The output is normally much smaller than the input file, restricted as it is to the root element and elements that have been marked with errors, warnings, or fixes. So although during validation the file is really being expanded, at the end only a small portion of the expanded file is returned to the Schematron processor, to expedite validation. But if $is-validation is false (the default value, if the file is not being validated), the entire expanded file and its dependencies are returned. Such output can be very useful in applications.

The description above of file expansion is necessarily generalized. For details consult the function library, particularly .

The validation rules have been tested not only on the files in the examples subdirectory, but more importantly upon the files in functions/errors. The files there attempt to provide at least one example of every error, and they are validated in reverse: a file is valid if and only if every error has a corresponding comment signaling the error.
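
The name normalization applied during resolution and the attribute unpacking performed in terse expansion can be pictured with a small sketch. It is written in Python rather than XSLT, and it neither calls nor reproduces the TAN function library. The function names normalize_name and expand_attributes are hypothetical, and the choice to keep the original attribute alongside the generated children is an assumption that the description above does not settle.

```python
import copy
import xml.etree.ElementTree as ET


def normalize_name(name: str) -> str:
    """Lower-case a name, turn hyphens and underscores into spaces, collapse spaces."""
    lowered = name.lower().replace("-", " ").replace("_", " ")
    # str.split() with no argument trims the ends and collapses internal runs
    # of whitespace, which approximates XPath normalize-space().
    return " ".join(lowered.split())


def expand_attributes(element: ET.Element) -> ET.Element:
    """Return a copy of element with one child per space-separated attribute value."""
    expanded = ET.Element(element.tag, dict(element.attrib))
    expanded.text = element.text
    for attr_name, attr_value in element.attrib.items():
        for token in attr_value.split():
            ET.SubElement(expanded, attr_name).text = token
    # Assumption: the original children follow the generated ones.
    for child in element:
        expanded.append(copy.deepcopy(child))
    return expanded


div = ET.fromstring('<div type="a b"/>')
children = [(c.tag, c.text) for c in expand_attributes(div)]
print(children)
print(normalize_name("Ring-a_Roses   Example"))
```

Here @type="a b" yields two type children, one per value, mirroring the example given for terse expansion.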

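The timings reported above for the test TAN-A file can be checked with a little arithmetic. The sketch below is Python, not part of TAN. It assumes that each of 43 smaller files would validate in roughly the time measured for the 125-claim file, and that whatever is not spent in the XSLT component is spent in Schematron.

```python
# Reported timings for the test TAN-A file, normal phase, in seconds,
# keyed by number of claims.
total = {125: 13.0, 546: 63.0, 5421: 78 * 60 + 45}
xslt = {125: 8.3, 546: 27.1, 5421: 23 * 60 + 57}

# Share of each run spent outside the XSLT component (assumed to be Schematron).
schematron_share = {n: (total[n] - xslt[n]) / total[n] for n in total}

# Assumption: each of 43 smaller files, holding about 5421 / 43 claims,
# validates in roughly the time measured for the 125-claim file.
files = 43
claims_per_file = 5421 / files
split_total = files * total[125]
ratio = split_total / total[5421]

for n in sorted(schematron_share):
    print(f"{n} claims: {schematron_share[n]:.0%} of the time outside XSLT")
print(f"{claims_per_file:.0f} claims per file, {split_total:.0f} s in all, "
      f"{ratio:.1%} of the original")
```

Under those assumptions the split collection needs about 559 seconds against 4,725 for the single file, roughly 11.8%, which agrees with the figure of less than 12%. The share of time spent outside the XSLT component rises with the number of claims, consistent with the observation that the Schematron component grows faster.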
Sharing TAN files

TAN files have been designed to be shared and linked, just like any network of files. Most often, TAN files will be created and distributed as collections, not single files.

One way to distribute a collection is by making it available as a repository via Git or some other version control software (VCS). This approach has many advantages. The files become available to anyone who wants them, and the editorial history is preserved. VCS features and tools are extremely fast and useful, and they allow users to modify TAN collections without impacting the original source.

Collections may also be distributed through shared syncing services (e.g., Drive, Box, or Dropbox), or put on a Web server. In the latter case, it may be difficult for users to browse or download wholesale. In that case, you may wish to expose the collection as a compressed ZIP archive. This saves on your server's bandwidth, and it still exposes the files for XML processing. But a ZIP archive is not suitable for linking from one TAN file to another, nor is it appropriate as a <master-location>. Unpacking a compressed file requires writing to the disk, which is considered to be a security risk during validation, and so is disallowed. Such zipped archives are good ways to distribute a collection, but they should not be used as a primary repository or a master location.

Doing things with TAN files

TAN files are suited for dozens of types of applications, many of which at this point are only imagined or being written. The subdirectory applications is populated with folders named with actions you might want to perform on a TAN file, and they contain XSLT stylesheets that give you but a taste of what is possible.

Because the applications in that directory are still under development, this section is devoted not to the specifics of those applications but to the theoretical background behind practical applications of TAN files. It is aimed particularly at those readers who are comfortable working with XSLT or related XML technologies, and want to do something important and useful with their TAN files.

The Schematron validation process was designed with a view to the next steps in practical applications. The extensive function library upon which validation is based provides a foundation for a variety of applications.

When developing an application, the first order of business is normally to find an entry point in the functions subdirectory to the TAN function library. In that directory, each XSLT file is named after one of the TAN formats. Point via <xsl:include> to the file that most resembles your main input. You could also try to fetch the TAN library via <xsl:import>, but results may be erratic, particularly if you have not put the import command in the right order, or if templates in your master stylesheet override templates in the TAN library. <xsl:include> is always a more certain option.

If you point to , you will have most of the functions and templates used for both class-1 and class-2 files. It tends to be a good default entry point if you are uncertain which master function file to use. It is also common to include , which is the entry point for all TAN functions, global variables, and templates that do not play a role in Schematron validation.
Those extra functions include many global variables that are excluded from the core TAN library, so as not to encumber Schematron validation.

You should also pay attention to the files in the subdirectory parameters. Some of the global parameters there can be used profitably to change the way an application runs.

All XSLT transformations require at least four components:

1. an input XML file
2. an XSLT file
3. a URL for the output
4. an XSLT engine (e.g., Saxon HE) to process #1 against #2 and send the output to #3.

Although #1 is the principal or catalyzing input, it need not be the main input. Sometimes an XSLT application is written with an eye toward non-XML as the main input. In such cases, it is impossible for the main input to be the catalyzing input. Furthermore, although there is only one principal output document, an application may need to create many other output documents. Those are normally created through <xsl:result-document>.

So in any XSLT operation, there are really two possible types of input and two types of output. We use the terms catalyzing input for #1 and secondary input for input that is added during the process. We use the term primary output for #3 and secondary output for any other output created along the way. The terms primary and secondary refer only to their position in the process, not their importance. Indeed, there are XSLT applications where the secondary input and secondary output are far more important than the catalyzing input or primary output. In its documentation, an XSLT file should indicate whether the main input is the catalyzing input, the secondary input, or both, and whether the main output is the primary output, the secondary output, or both.

When developing an application where the main input is a TAN file, it is often best to start with it in its resolved or expanded state. (See on resolving and expanding TAN files.) If that TAN file is the catalyzing input, use the global variables $self-resolved and $self-expanded.
If it is secondary input, use tan:resolve-doc() and tan:expand-doc().

For a class-2 file, $self-expanded or the output of tan:expand-doc() is a sequence of documents, starting with an expansion of the class-2 file itself, followed by expansions of its dependencies (TAN-T or TAN-mor). Its expanded class-1 sources will be tokenized where required, and marked with anchors for each reference in the class-2 file. If a token straddles leaf <div>s, the token will be reconstituted by moving the tail of the token up. These expanded sources are excellent candidates for other types of transformation. For example, HTML pages can be created to integrate class-2 annotations and their class-1 sources, in a variety of ways.

At the verbose level, an expanded TAN-A file will conclude its $self-expanded sequence with one or more documents with a root element <TAN-T_merge>, one file per detected work. A TAN-T_merge file has one <head> per class-1 source that has been merged, and the <body> contains a master set of <div>s that merge all the other sources' <div>s that share the same reference, after all <adjustments> have been made. Each leaf <div> in each source appears in the appropriate place, but as a child of a common <div> that encompasses all other leaf <div>s with the same reference. For each version's leaf div, @type is changed to #version, and other markers signify which source it corresponds to. A TAN-T_merge file is a good basis for building parallel displays or statistical analyses. These merge files can be created on an ad hoc basis through the function tan:merge-expanded-docs(), applied to individual class-1 files, after expansion.

If you are fetching other TAN files as secondary input, and you want to work with them, use tan:resolve-doc() and tan:expand-doc(), which will put the files in their resolved and expanded states. You must resolve a TAN file before you try to expand it.
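
The core idea behind a merge file, gathering the leaf divisions of several versions under the reference they share, can be sketched in a few lines. This is a conceptual illustration in Python, not the behavior of tan:merge-expanded-docs(): it ignores <adjustments>, the <head> of each source, and nested <div> hierarchies, and the function name, source ids, references, and texts are all hypothetical.

```python
def merge_by_reference(sources):
    """Group leaf divs from several sources under their shared reference.

    sources maps a source id to a list of (reference, text) pairs.
    References are kept in order of first appearance.
    """
    merged = {}
    for source_id, leaves in sources.items():
        for reference, text in leaves:
            merged.setdefault(reference, []).append((source_id, text))
    return merged


# Hypothetical sample data: two versions of one work.
sources = {
    "version-a": [("1 1", "first line, version A"), ("1 2", "second line, version A")],
    "version-b": [("1 1", "first line, version B"), ("1 3", "third line, version B")],
}
merged = merge_by_reference(sources)
for reference, versions in merged.items():
    print(reference, [source_id for source_id, _ in versions])
```

A reference present in only one version still gets its own entry, which is what makes such a structure useful for parallel displays: gaps in a version become visible.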
If you wish to create a TAN file as output (whether primary or secondary), it is advised that you prepare ahead of time a skeleton TAN file, introduce that skeleton as secondary input, infuse it with the new content, and let it become the primary or secondary output. Because the application you are using to create a TAN file is responsible for creating that file, and because responsibility for TAN files should be documented, the algorithm used to create that new TAN file should be declared in the <vocabulary-key> and credited with a <resp>, and a <change> should be entered in the change log. Users of the file will be warned, during Schematron validation, that the last change was made by an algorithm.

If you are working with a TAN file as catalyzing input, you may want to take advantage of some other global variables derived from its key files (see ):

Global variables for networked files

                Raw (first document available)    Resolved                  Expanded
<inclusion>     —                                 $inclusions-resolved      —
<vocabulary>    —                                 $vocabularies-resolved    —
<source>        —                                 $sources-resolved         $self-expanded[tan:TAN-T]
<see-also>      $see-alsos-1st-da                 $see-alsos-resolved       —

The column labeled "raw" lists variables that hold the first documents available, without alteration. Variables in the next column hold the resolved form, following the same process described above for $self-resolved. The resolved forms of <inclusion> and <vocabulary> are sufficient for validation, therefore they do not have expanded versions. Expanded sources are always bundled with their class-2's $self-expanded. For relatively simple applications, a resolved file is sufficient. But even then, there will be places where you will want to fetch the vocabulary bound to a particular attribute or element. One of the more important functions to familiarize yourself with is tan:vocabulary(), which can be used to get the IRI + name pattern of a specific node, or to get all the vocabulary available for a given type. Some developers will find even tan:vocabulary() a hassle to use. Consider setting the global parameter $distribute-vocabulary (default false) to true. If that happens, whenever an IDref appears, it will be imprinted with the corresponding IRI + name pattern for the referred vocabulary item. Exercise this option with care: such repetition will result in a document considerably larger than the original.

Using TAN outside the Network

The function library behind TAN is quite powerful, and it can be used in non-TAN applications. Below is a list of some functions that have been extremely helpful. Some of the functions are not central to validation, so must be retrieved through . For a complete list of all functions, see .

tan:batch-replace(): runs a sequence of regular expression replacements on any string. The sequence is prepared by constructing a series of <replace pattern="" replacement="" [flags=""]> whose attributes follow the rules of tan:replace() or fn:replace().

tan:chop-string(): changes a string into a sequence of characters, as defined in TAN (i.e., combining characters are always kept with the base character). It is roughly equivalent to the following XPath expression:

    for $i in fn:string-to-codepoints(.) return fn:codepoints-to-string($i)

tan:collate(): like tan:diff(), but applied to any number of strings. The results are treated much like a collation of manuscript readings, with the output XML fragment tethered to sigla corresponding to the input strings. The function can be used to optimize the order of the input strings, and to compute pairwise similarity of each string.

tan:copy-indentation(): applies the white-space indentation of an element to any other XML fragment. Useful for when you want to insert items in an XML file and preserve/imitate its indentation.

tan:diff(): compares any two strings for differences. Includes an option to mark the changes letter-for-letter, or merely word-for-word (easier to read in some contexts). This function, which was written under the assumption that the input strings would have some resemblance, has been used successfully on pairs of strings as long as 5M characters.

tan:duplicate-items(): like tan:duplicate-values(), but applied to any item. If a node, duplication is determined based on whether it is deeply equal to any other node.

tan:duplicate-values(): finds distinct items in a sequence whose values are repeated in the sequence. This function complements fn:distinct-values().

tan:fill(): repeats a string a given number of times. Helpful for formatting plain-text output.

tan:get-chars-by-name(): retrieves Unicode characters based upon words in their name.

tan:glob-to-regex(): changes a glob-like expression (normally used for filenames) into a regular expression (e.g., *.* becomes .*\..*).

tan:lang-code(): retrieves an ISO 639-3 code for a language of a given name.

tan:lang-name(): finds the name of a language, given its ISO 639-3 code.

tan:median(): retrieves the median from a sequence of numbers.

tan:most-common-item(): from a sequence of items, returns the one that occurs most frequently.

tan:most-common-item-count(): returns the number of times the most common item appears in a sequence.

tan:no-outliers(): removes outliers from a sequence of numbers.

tan:outliers(): returns only outliers from a sequence of numbers.

tan:search-morpheus(): retrieves lexico-morphological data for Greek and Latin from the Morpheus service.

tan:search-wikipedia(): retrieves a set number of records from Wikipedia.

tan:shallow-copy(): returns a copy of a node to a set depth. Useful for messages, to provide feedback on a particular element and its attributes, without any descendants (which would make the message hard to read).

tan:uri-relative-to(): converts an absolute URI to a relative one, based on some context URI.

Some numeral functions might prove useful:

Letter numerals ↔ integers: tan:aaa-to-int(), tan:int-to-aaa()
Roman numerals → integers: tan:rom-to-int() (reverse not available)
Greek numerals ↔ integers: tan:grc-to-int(), tan:int-to-grc()
Syriac numerals → integers: tan:syr-to-int() (reverse not available)
Hexadecimal ↔ decimal: tan:hex-to-dec(), tan:dec-to-hex()
String range ↔ integers: tan:expand-numerical-sequence(), tan:integers-to-sequence()
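
The difference between splitting a string by code point and keeping combining characters with their base character, as described for tan:chop-string(), can be illustrated outside XSLT. The following Python sketch is not the TAN implementation. It assumes that "combining character" means any character in Unicode general category M (Mn, Mc, Me), and the function name chop_string is hypothetical.

```python
import unicodedata


def chop_string(text: str) -> list:
    """Split text into base characters, each followed by its combining marks."""
    chunks = []
    for char in text:
        # Assumption: a combining character is one whose Unicode general
        # category starts with "M" (Mn, Mc, or Me).
        is_mark = unicodedata.category(char).startswith("M")
        if is_mark and chunks:
            chunks[-1] += char
        else:
            chunks.append(char)
    return chunks


sample = "e\u0301a"  # "e" + combining acute accent + "a"
by_codepoint = [chr(cp) for cp in map(ord, sample)]
print(len(by_codepoint), len(chop_string(sample)))
```

The code-point split gives three items for the sample, whereas the sketch gives two, because the accent stays attached to the "e".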

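The conversion described for tan:glob-to-regex() can likewise be approximated in a few lines. This Python sketch is not the TAN function and covers only the simplest cases: the asterisk mapping follows the example given above (*.* becomes .*\..*), while the treatment of the question mark and the escaping of every other character are assumptions. The name glob_to_regex is hypothetical.

```python
import re


def glob_to_regex(glob: str) -> str:
    """Convert a simple glob expression into a regular expression string."""
    parts = []
    for char in glob:
        if char == "*":
            parts.append(".*")  # any run of characters
        elif char == "?":
            parts.append(".")   # assumption: ? stands for one character
        else:
            parts.append(re.escape(char))  # literal, escaped if special
    return "".join(parts)


pattern = glob_to_regex("*.xml")
print(glob_to_regex("*.*"))
print(bool(re.fullmatch(pattern, "TAN-T.xml")), bool(re.fullmatch(pattern, "TAN-T.xsl")))
```

Matching with re.fullmatch anchors the pattern at both ends, so *.xml accepts TAN-T.xml but not TAN-T.xsl.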
Appendixes