The Text Alignment Network: Official Guidelines

Joel Kalvesmaki (kalvesmaki@gmail.com)
2015-present, Joel Kalvesmaki. This document and the files it describes are licensed under a Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0/
Latest version: http://textalign.net/release/TAN-1-dev/guidelines/
Version 1 dev, 2017-05-24. Working draft. Please send corrections to the author (see above).

In case of contradictions, apparent or not, between these guidelines and the core TAN files, priority should be given to the RELAX-NG schemas (compact syntax), then to the functions, and then to these guidelines. Chapters 1-7 and 10 are written by hand, and are relatively accurate. Chapters 8, 9, 11, and 12 are written by an algorithm that selectively reformats normative TAN files. Errors or inconsistencies found in those chapters will be due to the XSLT stylesheets that produce them or to the files upon which they are based.

General Overview

Introduction

Definition and purpose The Text Alignment Network (TAN) is a suite of highly regulated XML formats intended for scholars to align and share texts and textual analysis at a maximal level of syntactic and semantic interoperability. TAN is particularly suited to textual works with multiple versions (translations, paraphrases), and to related datasets on quotations, word-for-word alignments, and lexicomorphological features. TAN files are simple, modular, and networked, allowing users, working independently and collaboratively, to edit, study, and annotate shared files. The extensive validation rules depend upon a library of processing functions that definitively interpret the format, thereby informing and helping editors in research and publication, and providing a basis for developing tools and applications. Although expressive of scholarly nuance and complexity, the TAN format has been designed to benefit everyone, scholars and non-scholars alike, and can be used broadly for multilingual publishing, language learning, and machine translation.

Rationale and Purpose Different versions of texts—translations, quotations, paraphrases, and so forth—are important sources for scholars. Some texts have been lost in their original form and can be studied only through later translations, paraphrases, or fragmentary quotations. Even when an original survives, its later versions are often worth study, revealing as they do something of the genius or idiosyncrasies of those who translated or quoted the original, which in turn sheds light on how words, concepts, and works were preserved, altered, or combined across the generations and cultures who read and circulated the versions. The comparison of versions of texts requires words, sentences, paragraphs, and other text segments to be aligned. Such alignment can be challenging. Some versions might be defective, or follow an idiosyncratic sequence. One editor may have chosen a segmentation system not easily applied to other versions. Identifying which words or phrases in a translation correspond to which words or phrases in the original might result in complex, overlapping spans. And even larger segments such as sentences and paragraphs may not line up well. Further, every version of a text is part of a much larger, complex history of text reuse, and a proper study of that context requires not only multiple versions of different works, but collaboration across projects and fields of study. The Text Alignment Network (TAN) XML format facilitates the exchange and scholarly analysis of multiple versions of texts. TAN files adopt a syntax suitable for humans to read and edit, expressive enough to allow scholars to register doubt and nuance, and sufficiently structured to permit complex computer-based queries across independent datasets. The format is modular, with each module designed to allow an editor to focus on a single set of tasks without having to worry about other related but separable ones. 
The format encourages or requires editors to declare their views or assumptions about language and texts in a structured manner, so that other users of the data (both human and computer) can determine whether the data is suitable for their needs. Because nearly all TAN data must be expressed in a way that computers can parse, the information can be used in semantic web applications.

TAN has been designed to support specific research desiderata such as the following:

  I want to share the transcription of a particular version of a textual work. How do I encode it such that it is most likely to align with any other version of that text created by someone else?
  I have an index of quotations I wish to make available. How do I encode it such that the data is semantically rich and can be applied to other, perhaps unknown versions of the same work?
  How do I align multiple versions of a single work when those versions may not match very well, or when the reason for alignment may be vague or ambiguous?
  How do I publish a word-for-word analysis of a source and its translation, when there may be messy overlapping or ambiguous relationships, and where I might need to express doubt or alternative possibilities of alignment?
  How do I publish a dataset that lists passages in two or more works that share a common feature, such as verbatim text or a parallel topic?
  How can I share my data with others, and notify or warn them when I make corrections or changes to the master version?

The last question is especially significant. As TAN files are published, there emerges a web of primary sources, a decentralized corpus of texts that "talk" to each other.
As this TAN-compliant corpus expands across linguistic, chronological, and spatial boundaries, the interoperability of its parts allows the development of third-party tools and applications to expand the repertoire of research questions beyond any single corpus, to help scholars fruitfully investigate broader, comparative questions such as:

  For classical Greek texts, how were words with the root -ιστημι ("stand") translated into ancient Latin? In what specific ways did the vocabulary of technical terms shift from pre-Christian translations into later, Christian ones?
  How does the reformed Chinese translation technique of Sanskrit Buddhist texts, attested by Dao An (312-385 CE), compare to reforms in the seventh and eighth centuries of Syriac translations of Greek texts?
  How do Arabic translations of Greek texts from the Abbasid period differ from those of Sanskrit?
  Can an anonymous English translation of a modern French novel be identified with known translators of French novels from the same period?
  How do present-day translations of official United Nations documents differ across languages?

Optimism that TAN could be used to address such research questions should be tempered:

  Although TAN comes with an extensive library of functions and templates, it is not a tool per se. It does not provide software or applications to create, edit, or display TAN-compliant files, nor does it dictate the behavior of such tools.
  TAN does not on its own create alignments or answer research questions. It merely lays a framework within which such questions can be investigated.
  TAN has a restricted field of inquiry (defined and explained in these guidelines). The format is not suitable for many lines of inquiry, e.g., reconstructing the format of an original book or article.
  TAN is just one of many formats for texts. It supplements, and does not replace, other common markup formats such as TEI, Docbook, and so forth, or other alignment formats such as XLIFF or TMX. Conversion between TAN and these formats is usually straightforward, but may not be lossless, and should be given some thoughtful planning.
  TAN has not been designed to prioritize computational efficiency. It sacrifices repetition and explicitness in favor of terseness and human readability. The extensive TAN validation routines, essential to aiding interoperability, can be taxing to run on numerous or enormous files. This choice has been made upon the principle that users of the format prioritize quality and readability over speed.

Design Principles To facilitate the research questions mentioned above, the TAN encoding formats and this manual have been designed around a few core principles.

Scholarly freedom: Scholars should be able to create data within their sphere of inquiry simply, expressively, independently, and with fidelity to their guiding lights.

  Given two ways of expressing the same idea, simplicity is better than complexity, expressiveness than silence. Simplicity and expressiveness should be treated as complementary ideals. In cases where one must be chosen over the other, simplicity is to be preferred.
  Editors should be able to register doubt about claims. If in doubt about an assertion, an editor should be able to state alternatives.
  Editors should be able to work on the same material independently but interoperably.
  Editors should work freely within their theories, opinions, and assumptions about language. They should declare those positions, not suppress or alter them.

Scholarly responsibility: Scholars must make their data uniquely citable, and responsibly describe how that data was created.

  Each TAN file should have an expressive, unique, persistent name that can be cited and used independent of the file's location or availability.
  Editors must supply, at the very minimum, the core statements of responsibility that are normally expected in any scholarly work:
    What was done by whom, when.
    What sources have been used.
    Who holds rights over the data, and what reuse is permitted.
    What editorial assumptions and decisions were made in creating the data.

Utility to both computers and humans: Data should be easy for both humans and computers to read and write; the latter should be able to import, process, and create the data reliably, consistently, and interoperably.

  The format should depend upon stable technologies or standards.
  All classes and types of formats in the TAN suite should be structured consistently and predictably.
  As many computable inconsistencies or errors as possible should be flagged by validation rules.
  Every datum should be expressed in both a form that is as human readable as possible and a form that is computer-readable, to make the material suitable for linked data (semantic web) or for processing via an algorithm.
  In a given file, data should not be redundant, irrelevant to the immediate points of inquiry, or more reliably and authoritatively found elsewhere.
  References to textual units or linguistic concepts should be expressed .
  Each TAN file, or collection of files, should be integrally complete and fully useful, independent of any other software such as text processors or version control software.

Participation Participants in testing, using, and developing the Text Alignment Network are welcome. Our core purpose is to develop and maintain the schemas, the guidelines, and the functions and templates. Inquiries about participation should be sent to the project manager, Joel Kalvesmaki, by email: kalvesmaki at gmail.com. Official announcements are made by email (Google Group) and by Twitter.

Starting off with the TAN Format If you are new to markup languages, or if you are unfamiliar with acronyms such as XML, RDF, XPath, or technical terms such as Unicode, you should start with this chapter, which uses a simple example to illustrate the steps typically taken to create and edit TAN files. By the end of this chapter, you will be able to create and edit a simple collection of TAN transcriptions and alignments. If you are familiar with basic markup concepts, you may wish to read through the chapter very quickly, or skip it altogether. The discussion touches on a number of general concepts, some of which may be new. These concepts will be introduced only briefly. Further reading elsewhere will give you better grounding in a particular topic or technology.

Creating TAN Transcription and Alignment Data Let us take a simple example, that of aligning two English versions of the nursery rhyme Ring-a-ring-a-roses, sometimes known as Ring around the Rosie. Our goal here is to publish two versions of the nursery rhyme in the TAN format so that they are most likely alignable with any other TAN version of the poem that someone might create. We begin by finding previously published versions. In this case we have taken an interest in the versions published in 1881 and 1987 (one published in the UK and the other, the US). Each of these books has other rhymes, but we've already decided to focus upon the one particular nursery rhyme, so we transcribe those parts and nothing else:

Ring around the Rosie

    1881 (UK) version              1987 (US) version
    Ring-a-ring-a-roses,           Ring-a-round the rosie,
    A pocket full of posies;       A pocket full of posies,
    Hush! Hush! Hush! Hush!        Ashes! Ashes!
    We're all tumbled down.        We all fall down.

We must be sure to save each of the two transcriptions as plain Unicode text, preferably with .xml at the end of each file name. Do not bother with word processors (Word, OpenOffice, Google Docs, and so forth), because those programs are too sophisticated for our work. They sometimes generate erroneous data, even when you export to plain text. We will be working with raw text, and will not be concerned with italics, colors, fonts, margins, and so forth. Much better for our work is a text editor, which handles nothing but plain text. But even text editors are inadequate, because they do not check to see if the rules of the format have been followed. So the best tool is an XML editor, which does the same thing a text editor does, but with shortcuts that save much typing and prevent syntax errors. More important, an XML editor will tell us when our TAN file is invalid, and will provide information and help in our TAN files.

Software suitable for your needs comes in many styles and prices. You may wish to consult comparative lists of both text editors and XML editors. TAN was developed using oXygen, which is so powerful it may be very confusing to use at first. To avoid exasperation or despair, take advantage of tutorials and documentation associated with the XML editor you have chosen.

Our first task is to get these two versions into separate files with the appropriate markup. Each TAN transcription file has two major parts: a head and a body. For now, we focus on only the second part, the body, as well as a few of the necessary preliminary lines that stand above both the head and the body.
First, the 1881 (UK) version:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
    <?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
    <TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring01">
        <head> . . . . . . . </head>
        <body xml:lang="eng" in-progress="false">
            <div type="line" n="1">Ring-a-ring-a-roses,</div>
            <div type="line" n="2">A pocket full of posies;</div>
            <div type="line" n="3">Hush! Hush! Hush! Hush!</div>
            <div type="line" n="4">We're all tumbled down.</div>
        </body>
    </TAN-T>

And now the 1987 (US) version:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
    <?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
    <TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring02">
        <head> . . . . . . . </head>
        <body xml:lang="eng" in-progress="false">
            <div type="l" n="1">Ring-a-round the rosie,</div>
            <div type="l" n="2">A pocket full of posies,</div>
            <div type="l" n="3">Ashes! Ashes!</div>
            <div type="l" n="4">We all fall down.</div>
        </body>
    </TAN-T>

These are standard eXtensible Markup Language (XML) files. (If you are already familiar with XML you may wish to skip ahead to the next section.) XML is rather simple. It provides a way to take a text or a collection of data and give it some structure through markup. In the examples above, the markup is in boldface. Each file begins with a prolog, marked by the lines that begin with <?. The first line in the prolog simply states that what follows is an XML document.
The next two lines point to the files that will be used to check to see whether or not our data is valid. For now we will skip the specific details of those first three lines, which will be identical, or nearly so, from one TAN file to the next. We can simply cut and paste those lines when we want to start a new one. The fourth line is the opening tag of what is called the root element, here called <TAN-T>. That opening tag, <TAN-T...> is answered by a closing tag, </TAN-T>, the last line. The paired-tag relationship is true for all the other elements in this example. <head> is answered by </head>, <body> by </body> and each <div...> by </div>. These elements nest within or beside each other, but they never overlap. (The prohibition on overlapping elements is one of the cardinal rules of XML.) This relationship means that every XML file can be thought of as a tree, with the root at the trunk and the enveloped elements as branches, terminating in metaphorical leaves. It is helpful to use the tree metaphor when we describe the path we take, toward either the leaves or the root. In this manual, we may use the terms rootward and leafward when we want to trace movement within an XML document. An XML document is also profitably thought of as a family tree, a metaphor that provides commonly used terminology. In our examples above, <TAN-T> is the parent of <body>, and <body> the parent of the four <div> elements. Likewise, each <div> is the child of <body>, and <body> is the child of <TAN-T>. Distant parental relationships can be described with the terms ancestor and descendant. <TAN-T> is the ancestor of every element it encompasses, and every element encompassed by <TAN-T> is its descendant. Paratactic relationships are also important. <head> and <body> are siblings to each other, and every <div> is a sibling to every other <div>. 
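The tree and family vocabulary above can be tried out in a few lines of Python. The following is only a sketch using the standard library's ElementTree module; the helper names local_name and rootward_path are invented for this illustration and are not part of TAN.

```python
import xml.etree.ElementTree as ET

# A cut-down version of the first transcription (prolog and <head> content omitted).
SAMPLE = """
<TAN-T xmlns="tag:textalign.net,2015:ns">
  <head/>
  <body>
    <div type="line" n="1">Ring-a-ring-a-roses,</div>
  </body>
</TAN-T>
"""

def local_name(element):
    """Strip the namespace that ElementTree prefixes to every element name."""
    return element.tag.split("}", 1)[-1]

def rootward_path(root, target):
    """List the names met when walking from an element to the root."""
    # ElementTree keeps no parent pointers, so build a child-to-parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    path = [local_name(target)]
    while target in parent_of:
        target = parent_of[target]
        path.append(local_name(target))
    return path

root = ET.fromstring(SAMPLE)
div = root.find(".//{tag:textalign.net,2015:ns}div")
print(rootward_path(root, div))              # ['div', 'body', 'TAN-T']
print([local_name(child) for child in root])  # the siblings ['head', 'body']
```

Everything after the first item of the rootward path is an ancestor of the <div>; the second print shows that <head> and <body> are siblings under the root.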
Inside of the opening tags for the <TAN-T>, <body>, and <div> elements are pairs of text joined by an equals sign, collectively called an attribute. The left side of the equals sign is the attribute name, and on the right side, within the quotation marks, is the attribute value. <TAN-T> has two attributes, @xmlns and @id (when we discuss an attribute outside its original context, we often preface the name with @). We will skip @xmlns for now; this attribute (actually, a pseudo-attribute) specifies the namespace of the XML file, a somewhat advanced topic. The value of @id, however, is quite important and our first item of business. Every TAN file has an @id that uniquely and permanently identifies the file itself. It is quite similar to the name we give a file when we save it, and to the names we see when we browse the local contents of our computer, except that it should not be changed from one revision to the next. When we want to record changes to our file, we will not alter the @id value, but simply note the change elsewhere in the document (see below). The value of @id is always what is called a tag uniform resource name (tag URN). It always starts with tag:, followed by an email address or domain name that we own or owned. (It is okay to use an obsolete address.) After that email address or domain name comes a comma (no spaces) and a date on which we owned it, in the international standard format of year, month, and day, joined by hyphens, e.g., 2014-12-31. If we leave off a day value, it is assumed to be the first of the month; if we leave off the month value it is assumed to be January. In the examples above, [USER@DOMAIN.NET],2014 indicates that the email address was owned on the stroke of midnight (Coordinated Universal Time) January 1, 2014. After that comes a colon, and then any name we wish to assign to the file. We have anticipated a simple collection of texts, so we've called the files ring01 and ring02.
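The parts of a tag URN described above can be pulled apart mechanically. The sketch below is a simplified illustration in Python, not the pattern used by the TAN schemas; the function name parse_tag_urn and the regular expression are assumptions made for this example.

```python
import re
from datetime import date

# Shape described above: tag:<email or domain>,<date>:<name>
# The date may be YYYY, YYYY-MM, or YYYY-MM-DD. This pattern is deliberately
# loose and is NOT the rule enforced by the TAN schemas.
TAG_URN = re.compile(
    r"^tag:(?P<authority>[^,\s:]+),"
    r"(?P<year>\d{4})(?:-(?P<month>\d{2}))?(?:-(?P<day>\d{2}))?"
    r":(?P<name>\S+)$"
)

def parse_tag_urn(value):
    """Split a tag URN into its authority, ownership date, and file name."""
    match = TAG_URN.match(value)
    if match is None:
        raise ValueError(f"not a tag URN: {value!r}")
    # A missing month is read as January, a missing day as the first.
    owned_on = date(
        int(match["year"]),
        int(match["month"] or 1),
        int(match["day"] or 1),
    )
    return match["authority"], owned_on, match["name"]

authority, owned_on, name = parse_tag_urn("tag:parkj@textalign.net,2015:ring01")
print(authority, owned_on.isoformat(), name)
```

Applied to the @id of the first transcription, this reports the authority parkj@textalign.net, the ownership date 2015-01-01, and the name ring01.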
(If we run out of names, or want to restart, we can simply use a new email-date preface, e.g., [USER@DOMAIN.NET],2014-01-02.) The element <body> contains our transcription. @xml:lang, required, specifies the principal language of the transcribed text. We use the standard 3-letter abbreviation for English. (See later in the guide for more complex language requirements.) By saying that @in-progress is false, we indicate that we have finished our transcription and have no further plans to develop it. It doesn't mean that the file is free of errors. We can make corrections later. It just means that we have no more revisions planned, and any further changes will be restricted to corrections of errors. This attribute is optional. If it is left off, our TAN file is assumed to be a work in progress, and it serves as a kind of warning to anyone who might want to use it. Our transcription has been divided into four <div> elements. How we divide up the work is entirely up to us. But we must make sure that every bit of text is enclosed by a leafmost <div>. That is, every <div> must be the parent of only other <div>s, or none at all. We cannot have a <div> that mixes text with other elements (such as other <div>s). The values of @type and @n indicate, respectively, the type of division and the name of the division. We have used line in the first example, but we could easily have also used l (as we did in the second) or ln or any other phrase that we think will make intuitive sense to other users. The choice is arbitrary (we will see why below). We have used arabic numerals for the values of @n, but the value, once again, could have been anything. We could have used Roman numerals, or some other naming scheme that is standard in the field. Aside from the <head> element (discussed later), that's all we need in the transcription. We can now move to alignment. There are two different types of alignment, one emphasizing breadth, the other, depth.
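The rule that text may appear only in leafmost <div> elements is easy to check by machine. The following Python sketch shows one way to do so; it is not the TAN validation routine. The function name mixed_divs and the stanza divisions in the sample are invented for illustration.

```python
import xml.etree.ElementTree as ET

TAN_NS = "{tag:textalign.net,2015:ns}"

def mixed_divs(body):
    """Return the @n values of <div> elements that mix text with child elements."""
    offenders = []
    for div in body.iter(TAN_NS + "div"):
        children = list(div)
        if not children:
            continue  # a leafmost <div>, which is where text belongs
        # Text can sit directly inside the parent or trail after any child.
        stray = [div.text] + [child.tail for child in children]
        if any(piece and piece.strip() for piece in stray):
            offenders.append(div.get("n"))
    return offenders

# Hypothetical sample: the second stanza wrongly mixes text with a child <div>.
SAMPLE = """
<body xmlns="tag:textalign.net,2015:ns" xml:lang="eng">
  <div type="stanza" n="1">
    <div type="line" n="1">Ring-a-ring-a-roses,</div>
    <div type="line" n="2">A pocket full of posies;</div>
  </div>
  <div type="stanza" n="2">Hush!
    <div type="line" n="1">We're all tumbled down.</div>
  </div>
</body>
"""

print(mixed_divs(ET.fromstring(SAMPLE)))  # ['2']
```

Whitespace used only for indentation is ignored, so the first stanza passes while the second is reported.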
The broad type of alignment, called TAN-A-div, allows us to specify TAN transcriptions of as many versions of as many works as we wish, and to fine-tune the alignment upon the basis of the <div> elements within the transcription. We do not specify why we wish to align the versions. We only declare our interest in doing so. The other type of alignment, emphasizing depth, is called TAN-A-tok and allows us to take any two (and no more) TAN transcriptions, create word-to-word (or better put, token-to-token) relationships, and specify what type of relationship holds between each set of aligned words. TAN-A-div is suitable for work that focuses on the general alignment of multiple versions of one or more works at a single time. TAN-A-tok is for highly detailed, precise alignment of two text versions. For our example, we start with a TAN-A-div file (once again suppressing <head>):

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-A-div.rnc" type="application/relax-ng-compact-syntax"?>
    <?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-A-div.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
    <TAN-A-div xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring-alignment">
        <head> . . . . . . . </head>
        <body/>
    </TAN-A-div>

In the prolog, the first line is identical to the first line of our transcription files. The second and third lines are identical, aside from pointing to the validation files for alignment. Even the fourth line looks like the transcription file, other than the new name for the root element, <TAN-A-div>, and the new value for @id. The penultimate line, <body/>, is what is called an empty element, and is equivalent to <body></body>. Collapsing the opening and the closing tags of the element into a single tag provides a shorthand syntax for elements that contain nothing.
It will become apparent, when we discuss <head> below, why our <body> can be empty.

The other kind of alignment, TAN-A-tok, takes a bit more work, because we must first identify words that correspond with each other. Even before we do that, we need to decide what kind of relationship holds between the two texts. Let us pretend, for the sake of example, that the 1987 version is a direct descendant (and therefore variation) of the 1881 one. So our task is to show exactly what parts of the older version correspond to those of the newer one. We will simplify in this case, and assume an interest only in words, ignoring space and punctuation. We will also adopt the term tokens instead of words (word is notoriously difficult to define, and has connotations lacking from token). We now create a TAN-A-tok file:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-A-tok.rnc" type="application/relax-ng-compact-syntax"?>
    <?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-A-tok.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
    <TAN-A-tok xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:TAN-A-tok,ring01+ring02">
        <head> . . . . . . . </head>
        <body bitext-relation="B-descends-from-A" reuse-type="adaptation" in-progress="false">

            <align> <tok src="ring1881" ref="1" pos="1"/> <tok src="ring1987" ref="1" pos="1"/> </align>
            <align> <tok src="ring1881" ref="1" pos="2"/> <tok src="ring1987" ref="1" pos="2"/> </align>
            <align> <tok src="ring1881" ref="1" pos="3"/> <tok src="ring1987" ref="1" pos="3"/> </align>
            <align> <tok src="ring1881" ref="1" pos="4"/> <tok src="ring1987" ref="1" pos="4"/> </align>
            <align> <tok src="ring1881" ref="1" pos="5"/> <tok src="ring1987" ref="1" pos="5"/> </align>

            <align> <tok src="ring1881" ref="2" val="A"/> <tok src="ring1987" ref="2" val="A"/> </align>
            <align> <tok src="ring1881" ref="2" val="pocket"/> <tok src="ring1987" ref="2" val="pocket"/> </align>
            <align> <tok src="ring1881" ref="2" val="full"/> <tok src="ring1987" ref="2" val="full"/> </align>
            <align> <tok src="ring1881" ref="2" val="of"/> <tok src="ring1987" ref="2" val="of"/> </align>
            <align> <tok src="ring1881" ref="2" val="posies"/> <tok src="ring1987" ref="2" val="posies"/> </align>

            <align> <tok src="ring1881" ref="3" pos="1, 2"/> <tok src="ring1987" ref="3" pos="1"/> </align>
            <align> <tok src="ring1881" ref="3" pos="3 - 4"/> <tok src="ring1987" ref="3" pos="2"/> </align>
            <align> <tok src="ring1881" ref="4" pos="1"/> <tok src="ring1987" ref="4" pos="1"/> </align>
            <align> <tok src="ring1881" ref="4" pos="2"/> </align>
            <align> <tok src="ring1881" ref="4" pos="3"/> <tok src="ring1987" ref="4" pos="2"/> </align>

            <align> <tok src="ring1881" ref="4" pos="last-1"/> <tok src="ring1987" ref="4" pos="last-1"/> </align>
            <align> <tok src="ring1881" ref="4" pos="last"/> <tok src="ring1987" ref="4" pos="last"/> </align>
        </body>
    </TAN-A-tok>

Once again, the first four lines, the prolog and root element, should look familiar, with the only significant changes being the names of the validation files, the name of the root element (<TAN-A-tok>) and the value of @id.
The heart of the data is <body>, which has, in addition to @in-progress, two more attributes, @reuse-type, which specifies the default type of relationship between the two sources, and @bitext-relation, which specifies how the versions relate to each other. Our two values, B-descends-from-A and adaptation, are arbitrary names that we define in the <head> (discussed later). <body> is the parent of one or more <align> elements, each of which correlates a set of tokens in the two texts through its <tok> children. Each <tok> has, in this example, three attributes. @src takes a nickname (an @id reference) that points to one of the two transcriptions; we have used ring1881 and ring1987 but we could have just as easily used anything else such as uk and us. @ref has a value that points to a specific <div> in the source transcription; and @pos or @val specify which token is intended, either by word number (@pos) or text of the actual word (@val). Either technique is fine, and can be mixed, as in the example. You may also notice that the comma and hyphen can be used in @pos to point to multiple words within the same <div>, and that last and last-X (where X is a digit) can be used to point to a word token relative to the last one in a <div>. Each <align> can establish one-to-one, one-to-many, many-to-one, or many-to-many relationships between the two texts. Words may feature in multiple <align> elements (that is, overlapping is permissible). And if an <align> has <tok> elements belonging to only one source, such as in the fourth-to-last <align> above, we have what is called, in these guidelines, a half-null alignment. This half-null alignment indicates that the second word of line four of the 1881 version is excluded from the act that we have called adaptation (which is, as we shall see, defined in the <head>). If this were a translation, it would be as if we were saying that this word was excluded from the translation. 
(A half-null alignment containing only tokens of the later source might point to words that the translator added.) A half-null alignment should not be confused with our own silence. As creators of this file, we are under no obligation to indicate every word-for-word correspondence. If we fail to mention certain words, all that can be implied is that we opted not to say anything about them. We could have aligned the two texts in different ways. Perhaps further study will reveal that we were in error to associate the second "ring" with "round" in line 1. We can make corrections, even after publication, and signal the change to users of our data. There are also ways to express doubt or alternative opinions. We can even correlate fragments of tokens (letters, prefixes, infixes, or suffixes). All these more advanced uses are discussed in the detailed parts of these guidelines.
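Two of the ideas above lend themselves to a short illustration: expanding a @pos value that uses commas, hyphens, last, and last-X, and spotting half-null alignments. The Python sketch below makes several assumptions. The function names resolve_pos and half_null_aligns are invented here, the number of tokens in a <div> is supplied by hand rather than computed, and the official TAN functions may treat edge cases (such as out-of-range positions) differently.

```python
import re
import xml.etree.ElementTree as ET

NS = {"tan": "tag:textalign.net,2015:ns"}

# One term is a digit string, "last", or "last-X"; one item is a term or a
# range of two terms joined by a hyphen.
TERM = r"(?:last(?:-\d+)?|\d+)"
ITEM = re.compile(rf"^\s*({TERM})\s*(?:-\s*({TERM}))?\s*$")

def resolve_pos(expression, token_count):
    """Expand a position expression into a list of 1-based token numbers."""
    def single(term):
        if term.startswith("last"):
            offset = term[5:]  # digits after "last-", or "" for plain "last"
            return token_count - (int(offset) if offset else 0)
        return int(term)

    positions = []
    for item in expression.split(","):
        match = ITEM.match(item)
        if match is None:
            raise ValueError(f"unrecognized position: {item!r}")
        start = single(match.group(1))
        end = single(match.group(2)) if match.group(2) else start
        positions.extend(range(start, end + 1))
    return positions

def half_null_aligns(body):
    """Return (index, source) for each <align> whose <tok>s name one source only."""
    found = []
    for index, align in enumerate(body.findall("tan:align", NS), start=1):
        sources = {tok.get("src") for tok in align.findall("tan:tok", NS)}
        if len(sources) == 1:
            found.append((index, sources.pop()))
    return found

SAMPLE = """
<body xmlns="tag:textalign.net,2015:ns">
  <align>
    <tok src="ring1881" ref="4" pos="1"/>
    <tok src="ring1987" ref="4" pos="1"/>
  </align>
  <align>
    <tok src="ring1881" ref="4" pos="2"/>
  </align>
</body>
"""

print(resolve_pos("1, 2", 4))    # [1, 2]
print(resolve_pos("3 - 4", 4))   # [3, 4]
print(resolve_pos("last-1", 4))  # [3]
print(half_null_aligns(ET.fromstring(SAMPLE)))  # [(2, 'ring1881')]
```

The second <align> in the sample is reported because all of its <tok> children point to the same source. Tokens that no <align> mentions are simply absent from the result, which mirrors the distinction drawn above between a half-null alignment and silence.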

The Principles of TAN Metadata (<head>) At this point, we have finished four TAN files: two transcriptions, one TAN-A-div file, and one TAN-A-tok file. We have suppressed the <head> in all of them until now. Before getting into details, we need first to discuss a few principles that TAN relies upon. Unlike <body>, which carries the raw data, <head> contains what is oftentimes called metadata. That is, <head> contains data that describes the data. Because the TAN format is intended primarily to serve scholars, and because the format is heavily regulated (that is, there are numerous validation rules that supplement the basic ones behind XML), the metadata requirements are stricter than those of other formats. Scholars who use our data really need to know some essential things before they can responsibly use the data we produce. For example, what are the sources we have used? Who produced the data? When? What key assumptions have been made in producing the data? What rights do other people have to use the data? The questions are not difficult to answer, but they are critical, and we should take the time we need to get correct answers. Some of these questions are specific to certain types of data. For example, in a TAN-A-tok file, we ask what relationship the two sources hold to each other. That question makes no sense for a TAN-T file. But other questions apply universally across all TAN files, no matter what kind of data. As we go from one TAN format to the next, we need to deal as much as we can with similar structures and expectations. This reduces any potential confusion in creating and editing a TAN file, and helps other people using our data to find the information they want. More importantly, what we write in one file might save us some work in another.
The rigorous scholarly requirements for TAN metadata are offset somewhat by another principle that was adopted in the design of TAN, namely, that each format's <head> should focus exclusively upon the data in <body> and not other things. That is to say, in a transcription, we should definitely indicate what our source is. But we should not try to write a catalog entry, or even a structured citation, for the book we have used. We are not library catalogers. Our obligation is merely to point somewhere a reader can get more complete information. The <head> is designed to help us to stay focused on the task and data at hand. TAN was also designed with the assumption that all metadata should be useful to both humans and computers. For our example above, we must describe the work we have chosen in such a way that the phrase Ring around the Rosie is comprehensible not just to the reader but to the computer, using syntax that a computer can be programmed to act upon. Take for example the 1881 book we have used for our first transcription. For the human reader we can say simply something like "Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]". But computers need a more controlled, predictable syntax before they can be directed to the correct edition of Mother Goose (or rather to a digital surrogate of the edition). The human-readable string is too complex, and syntactically opaque. A more computer-friendly identifier would be an International Standard Book Number (ISBN), which distinguishes the 1984 version of Mother Goose illustrated by Kayoko Okumura from the one of the same year illustrated by William Joyce. The ISBNs for the Okumura version, 0671493159, and for Joyce's, 0394865340, can each be converted into a machine-actionable string called a uniform resource name (URN), in this case urn:isbn:0-671493159 and urn:isbn:0-394865340. (Our 1881 version was published before the ISBN program was introduced. We will see below other ways to name it.)
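The conversion from an ISBN to a URN mentioned above can be sketched in a few lines of Python. This is only an illustration, not part of TAN: the function names isbn10_is_valid and isbn_to_urn are hypothetical, the sketch handles only ten-digit ISBNs, and it emits the URN without the optional hyphen shown in the prose.

```python
def isbn10_is_valid(isbn: str) -> bool:
    """Check the ISBN-10 checksum (weights 10 down to 1, total divisible by 11)."""
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            value = 10  # "X" stands for 10, allowed only as the check character
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0


def isbn_to_urn(isbn: str) -> str:
    """Return an unhyphenated urn:isbn name for a valid ISBN-10 (hypothetical helper)."""
    if not isbn10_is_valid(isbn):
        raise ValueError(f"not a valid ISBN-10: {isbn!r}")
    return "urn:isbn:" + isbn.replace("-", "").replace(" ", "").upper()


print(isbn_to_urn("0671493159"))
print(isbn_to_urn("0394865340"))
```

Checking the checksum before minting the URN guards against a mistyped digit silently naming a different book, or no book at all.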
URNs are families of formalized naming schemes regulated by a central body (Internet Assigned Numbers Authority, IANA) to ensure that people and organizations can legitimately coin and use permanent, persistent, unique names for various types of things. There are URN schemes for journals (via ISSNs), articles (DOIs), and movies (ISANs), which means that anyone can refer to them unambiguously in a manner that is computer-friendly. All URNs are simply names. They don't tell you where an object is, just what its name is. To provide a unique location, however, we have uniform resource locators (URLs), which might be much more familiar from daily use of the Internet, e.g., http://academia.edu. Like URNs, URLs are also centrally regulated, with individuals or organizations buying the rights to domain names from a central registry (usually through a third-party vendor). Both URNs and URLs can be thought of as the same type of thing, namely, a uniform resource identifier (URI) or, in its internationalized form, an internationalized resource identifier (IRI). An IRI is a generalization of the URI that allows characters from any script in Unicode, not just Latin. URIs/IRIs are, in essence, nothing more than the set of all URNs and URLs. These four acronyms can be easily confused, and it is best to disambiguate them by thinking of the last letter in each. URIs/IRIs Incorporate both Locators (URL) and Names (URN). IRIs are essential to a system frequently called the semantic web or linked (open) data, an agreed way of writing and processing data that relies upon IRIs and a simple data model to connect them. The semantic web allows independent parties to make assertions about things, and if they happen to use the same IRI vocabulary to describe those things, then we can program computers to make associations between disparate, heterogeneous datasets. This allows us to find connections across disciplines and projects, to marshal computers to make inferences we might not make on our own, and to create a network of linked data.
TAN has been designed to be linked-data friendly, and so requires in its <head> almost all data to be representable not just in a human-readable form but also computer-readable, as an IRI. Our first task, then, in writing the <head> sections of our four TAN files is to look for IRI vocabulary that will be familiar to the community of practice most likely to use our files. In trying to find suitable IRIs, we will find that the persons, things, and concepts we want to describe will range from the highly familiar to the unfamiliar. Highly familiar: The two books that provide the basis of our transcription are well catalogued and generally known. A number of services provided by librarians provide a controlled IRI vocabulary that can be used by anyone to describe uniquely a particular version of a book. WorldCat (run by OCLC) and the Library of Congress are good examples. In our case, we have found accurate Library of Congress IRIs for both editions of Mother Goose: http://lccn.loc.gov/12032709 and http://lccn.loc.gov/87042504. Observe that these two IRIs are also, perhaps confusingly, URLs. If we paste these strings into our browser, we retrieve a record that describes the book. This locator does not lead us to the book per se, only to information about the book. Nevertheless, the Library of Congress has decided to coin this URL also as an IRI name for the book. Anyone who owns a domain name can designate a URL as a name for an object. And that allows them to set up their server to also return information about the object the IRI names. This subtle ambiguity—that the URL both names an entity and is a location for a webpage—can sometimes be confusing to those who are new to the semantic web, because such URLs name in reality two types of things: an entity and a location to find out more information about that entity. We now have IRIs for the sources. Let's now find an IRI to name the work, Ring around the Rosie. The work is widely known, and even has a Wikipedia entry. 
That Wikipedia entry is fortuitous. The Universities of Leipzig and Mannheim and Openlink Software have collaborated on a project called DBPedia, which is committed to providing a unique IRI for every Wikipedia entry in the major languages. The DBPedia IRI for the work we have chosen is http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses. Once again, this is both a name and a locator. It names a specific intangible object, namely a nursery rhyme that we've called Ring around the Rosie, no matter what specific version. But if you put that name into your browser, you will get back more information about that named object. Familiar, but only in small circles: We will need to have names for some of the people who edited the file. Here we're not interested in the authors of our books. We are interested in crediting the people who helped make the TAN file. Most people who contribute to the creation of the data file will not be well-known, public figures. If they are, and if they are famous enough to have a Wikipedia entry, then a DBPedia IRI could be used. Or if some of the contributors are also published authors, there is a good chance that they are listed in the databases of either VIAF or ISNI, both of which publish unique IRIs for persons. Many contributors to TAN files, however, will not be listed in these general databases. In these cases, we can assign our own IRI to name these participants. We have already done something like this by assigning tag URNs to our four TAN files (the value of @id in the root element). We can do the same for our editors. If a student Robin Smith has been helping with proofreading, we can take an email address for Robin (even one that doesn't work any more) and a date when the email address was used and construct a tag URN such as tag:smith.robin@example.com,2012:self.
This has a slight drawback in that we cannot type this string into our browser to find out more about Robin, but it at least allows us to assign a name that will not be confused with the Robin Smith identified by ISNI as http://isni.org/isni/0000000043306406. (If we want to go a step further, we could mint an IRI from a domain name that we own, and set up a linked data service that offers more information, human- and computer-readable, about Robin, but this is not required. And it can be a lot of work to maintain.) Another example of field-specific IRIs is the concept of relationship between two text-bearing objects. We are assuming for the sake of illustration that the version published in the 1987 Mother Goose is a direct descendant of the 1881 version. Our assumption is important to declare, because if we had a different view on how one related to the other, it would probably affect the specifics of our word-for-word alignments. Because no suitable IRI vocabulary yet exists for such concepts, TAN has coined an IRI that can be used by anyone wishing to declare that the second of two sources descends from the first through an unknown number of intermediaries: tag:textalign.net,2015:bitext-relation:a/x+/b. We face a similar issue when thinking about text reuse. We generally consider the 1987 version to be an adaptation of the 1881 version. And there are no stable, well-published IRI vocabularies for text reuse. So we adopt a TAN-coined IRI, tag:textalign.net,2015:reuse-type:adaptation:general. For other examples of IRIs coined by TAN, see . Generally unfamiliar: Some things or concepts will be known to very few people, perhaps only to us. If we plan to refer to that thing or concept often, it is preferable to coin a tag URN, as described above. But in some cases, we might find that a tag URN we minted for some concept or thing was, in hindsight, misleading or poorly constructed, because we hadn't taken into account other things that should be named.
So if we wish to avoid these kinds of situations, we can assign a random IRI called a universally unique identifier (UUID), e.g., urn:uuid:3fd9cece-b246-4556-b229-48f22a5ae2e0. These uuid URNs, which are generated by computers through randomizing functions, are very useful. The likelihood that a randomly generated uuid will be identical to any other uuid is astronomically small, making them reliably unique names for anything (barring someone copying and reusing that uuid URN to name some other object or concept). Numerous free UUID generators can be found online. To humans, a UUID on its own is meaningless, and rather ugly. But it is a good start. We always have the option, later, of adding another IRI. It's perfectly fine to give one object or concept multiple IRIs. But the reverse is never true. One should never use the same IRI to identify more than one object or concept.
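The two kinds of self-minted names discussed above, tag URNs and uuid URNs, can both be produced with a few lines of Python. The following is a minimal sketch under stated assumptions: make_tag_urn and make_uuid_urn are hypothetical helper names, and the date check covers only the YYYY, YYYY-MM, and YYYY-MM-DD forms.

```python
import re
import uuid


def make_tag_urn(authority: str, date: str, specific: str) -> str:
    """Build a tag name from an email address or domain, a date, and a local name.

    Hypothetical helper: it checks only the shape of the date, nothing else.
    """
    if not re.fullmatch(r"[0-9]{4}(-[0-9]{2}(-[0-9]{2})?)?", date):
        raise ValueError(f"date must be YYYY, YYYY-MM, or YYYY-MM-DD: {date!r}")
    return f"tag:{authority},{date}:{specific}"


def make_uuid_urn() -> str:
    """Return a randomly generated uuid URN using the standard library."""
    return uuid.uuid4().urn


print(make_tag_urn("smith.robin@example.com", "2012", "self"))
print(make_uuid_urn())
```

A tag URN stays readable to humans, whereas a uuid URN carries no meaning at all, which is exactly what makes it safe when we are unsure how a naming scheme should eventually be organized.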

Creating TAN Metadata (<head>) Now that we have explored various IRI vocabularies for concepts around our versions of Ring-a-ring-a-roses, we can complete the metadata in our four TAN files. Let us start with the TAN-T file of the 1881 version: <head> <name>TAN transcription of Ring a Ring o' Roses</name> <master-location>ring-o-roses.eng.1881.xml</master-location> <rights-excluding-sources rights-holder="park"> <IRI>http://creativecommons.org/licenses/by/4.0/deed.en_US</IRI> <name>This data file is licensed under a Creative Commons Attribution 4.0 International License. The license is granted independent of any rights and licenses that may be associated with the source. </name> </rights-excluding-sources> <source> <IRI>http://lccn.loc.gov/12032709</IRI> <name>Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]</name> </source> <declarations> <work> <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI> <name>"Ring a Ring o' Roses" or "Ring Around the Rosie"</name> </work> <div-type xml:id="line"> <IRI>http://dbpedia.org/resource/Line_(poetry)</IRI> <name>line of poetry</name> </div-type> </declarations> <agent xml:id="park" roles="creator"> <IRI>tag:parkj@textalign.net,2015:self</IRI> <name>Jenny Park</name> </agent> <role xml:id="creator"> <IRI>http://schema.org/creator</IRI> <name xml:lang="eng">creator</name> </role> <change when="2014-08-13" who="park">Started file</change> </head> <name> is the human-readable form of the @id that is inside the root element, <TAN-T>. It can be anything. And we can supply more than one <name>, in case we wish to provide it in different languages or variations. <master-location> is mandatory only if we have claimed through @in-progress that the file is no longer in progress. One or more of these elements provide URLs where master versions of the file are kept (and updated).
They may be absolute URLs, such as an address on the Internet, or they may be relative URLs, in case we are working exclusively on our local computer. We provide this as a courtesy to others who might be using our data. If someone downloads a copy and starts working with it, then whenever they validate the file, if it does not match the one in the master version, a warning is returned, along with a message or a location of the elements that were last changed. This allows users to find out if changes have been made, and it allows us to make corrections and silently notify other users of our alterations. To communicate this, we do not have to keep track of who is using the file. <rights-excluding-sources> contains information about rights to the data we are releasing. This element has nothing to do with the copyright of the source we have used (although, having been published in 1881, the book is clearly in the public domain). This once again gets to the TAN metadata principle of describing our data and not other things. We have the option to describe the license of the source we have used (see the rest of the guidelines for guidance), but we absolutely must declare whether we have placed additional strictures on the dataset we have created. That is, we are declaring the rights attached to the data, not its source. In this example, we have released the data under a Creative Commons license. The child element <IRI> specifies the IRI assigned by Creative Commons, and <name> describes it in human-readable format. The conjunction of <IRI> and <name>, the IRI + name pattern, is a recurrent feature of TAN files. We may include any number of <IRI> or <name> elements in an IRI + name pattern. But if we do so, we are stating that they all name the same thing, not different things. <source> points, through its IRI + name pattern, to a computer- and human-readable description of the book we have chosen.
<declarations> contains data that is specific to TAN file types, to declare the assumptions we have made relevant to the kind of data we have created. In this case, because we are working with transcriptions, we have two major components: <work> and <div-type>. <work> uses the IRI + name pattern to name the work we have chosen to transcribe. <div-type> specifies the type of divisions we have chosen to use to segment the transcription. In a more complex text, there would be several <div-type>s. Each one has an @xml:id, which takes as a value some nickname that we wish to use for @type values of <div>s. The IRI + name pattern is also used for <agent>, which describes who was involved in creating the data, and <role>. We may have as many <agent>s and <role>s as we wish. The agent in this case, Jenny Park, has been given a tag URI. The <IRI> value of <role> comes from the vocabulary of schema.org, which is maintained by Bing, Google, and Yahoo! in conjunction with the W3C (the nonprofit organization dedicated to universal Internet standards), but we could have used Dublin Core or some other IRI vocabulary describing behaviors, responsibilities, and roles. If you decide to modify someone else's TAN file, then you become responsible for changes, not the original person or organization. Your first order of business should be to add an <agent> to the head, identifying yourself. You need not change the document's @id, but you should take responsibility for any changes you make, probably using <change> or an @ed-who and an @ed-when. Otherwise you are incorrectly attributing your changes to someone else. Remember that <head> is focused on the data, not its sources, so the claim that Jenny Park is the creator pertains only to the data. No inference should be made about who created the source. If someone wants that information, or anything else about the source, they should pursue the identifier we have provided under <source>.
<change> has attributes @when and @who that specify who made the change/comment and when. The value of @when is always a date plus optional time formatted according to the standard YYYY-MM-DD + time (optional). @who always carries a value that refers to an agent/@xml:id. Both <change> and <comment> (the latter missing here) lack any IRIs, mainly because the likelihood that the data would ever be reused, repeated, or linked to is altogether too remote to make a mandated <IRI> useful. So now we have finished one transcription file's metadata. The other one will look similar, but we'll also take a couple of nice shortcuts: <head> <name>TAN transcription of Ring around the Rosie</name> <master-location>ring-o-roses.eng.1987.xml</master-location> <rights-excluding-sources which="by-nc-nd_2.0" rights-holder="park"/> <source> <IRI>http://lccn.loc.gov/87042504</IRI> <name>Mother Goose, from nursery to literature / by Gloria T. Delama, 1987.</name> </source> <declarations> <work> <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI> <name>Ring around the Rosie</name> </work> <div-type xml:id="l" which="half-line (verse)"/> <filter> <normalization which="no hyphens"/> </filter> </declarations> <agent xml:id="park" roles="creator"> <IRI>tag:parkj@textalign.net,2015:self</IRI> <name xml:lang="eng">Jenny Park</name> </agent> <role xml:id="creator" which="creator"/> <change when="2014-10-24" who="park">Started file</change> <comment when="2014-10-24" who="park">See p. 39 of source.</comment> </head> One significant difference is that three of the elements that normally take the IRI + name pattern have been replaced with a simpler form that takes merely @which and @xml:id. That is because TAN has predefined vocabulary that can be invoked by calling it (through @which) and giving it an abbreviation to be used elsewhere in the document (@xml:id).
<declarations> has a new child, <filter>, which contains a <normalization> statement that declares, through the name and the IRI in the underlying TAN definition, that we have opted to remove word-break line-end hyphenation. This provides a cautionary note to users of our data who might value line-end hyphenation. Any number of <normalization>s can be used to describe any alterations we might have made in our transcription. In other transcriptions we could use this feature to declare other suppressions, such as editorial comments or footnote signals. Note that the value of div-type/@xml:id here, the letter l, differs from the one in our previous transcription file, line. Even though we have adopted a different nickname, they are treated as equivalent because in each file we have defined l or line with the same IRI, http://dbpedia.org/resource/Line_(poetry). A computer that later looks for files with lines of poetry will not care about l and line, but will look at the underlying IRI that defines these terms. This exemplifies how linked data (see above) can support our work. We are free to use abbreviations and terms that make sense to us, yet we can also tie those abbreviations into the larger infrastructure by means of IRIs. It also means that we can tether our texts to others on the basis of segments that may be generally rare and unfamiliar, or common only within a specific field (e.g., sections of a legal document). Now that we have created the metadata for our transcriptions, we turn to the alignment files. Those <head>s will look slightly different.
We start with the TAN-A-div file: <head> <name>div-based alignment of multiple versions of Ring o Roses</name> <master-location>ringoroses.div.1.xml</master-location> <rights-excluding-sources which="by-nc-nd_4.0" rights-holder="park"/> <source xml:id="eng-uk"> <IRI>tag:parkj@textalign.net,2015:ring01</IRI> <name>Transcription of ring around the roses in English (UK)</name> <location when-accessed="2015-03-10">../TAN-T/ring-o-roses.eng.1881.xml</location> </source> <source xml:id="eng-us"> <IRI>tag:parkj@textalign.net,2015:ring02</IRI> <name>Transcription of ring around the roses in English (US)</name> <location when-accessed="2014-08-13">../TAN-T/ring-o-roses.eng.1987.xml</location> </source> <declarations/> <agent xml:id="park" roles="creator"> <IRI>tag:parkj@textalign.net,2015:self</IRI> <name xml:lang="eng">Jenny Park</name> </agent> <role xml:id="creator" which="creator"/> <change when="2014-08-14" who="park">Started file</change> </head> Much of the code above will look similar to the previous two examples. Every alignment file has only one kind of source, namely TAN transcription files, nothing else. Therefore <source>'s <IRI> always takes the @id value of the corresponding TAN transcription file. <name> is arbitrary. It may replicate exactly the title found in the transcription file, or it may be modified, perhaps to harmonize better with the descriptions of the other texts aligned in the file. <source> also has a child element not seen in the earlier two examples, <location>, which specifies where the digital file was accessed and when (through @when-accessed). We may include as many of these <location> elements as we wish, with the most preferred or reliable location at the top, since the validation process will use the first document that is available.
The @when-accessed value is important, because the validator will look for changes in the file, and if there have been changes since we last accessed the file, it will return a warning with a summary of the number and kind of changes. If such a report is returned, it is up to us to determine if the alterations merit any action on our part. Our TAN-A-div file could have any number of <source>s, and not necessarily for the same work. It also does not matter in which order we put the <source>s. <declarations> is empty, mainly because we have, in this case, no working assumptions to declare. In more advanced uses, this element would not be empty. This <head> explains why the <body> of our TAN-A-div file is allowed to be empty. We have already specified which sources are to be aligned and where they are to be found. All TAN-A-div files assume, by default, that every source that is a version of the same work should be aligned upon the basis of the @n value of <div>s. That is, any user or processor of a TAN-A-div file may assume that all implicit alignments should be made unless otherwise specified. For transcriptions that are already similarly structured and labeled, a TAN-A-div file is unnecessary for alignment. But we will see that the options available in a TAN-A-div's <declarations> and <body> will allow us not only to deal with inconsistencies in source transcriptions but to make important statements, such as indicating where one work quotes from another.
Meanwhile we turn to our fourth file, TAN-A-tok, whose <head> looks like this: <head> <name>token-based alignment of two versions of Ring o Roses</name> <master-location>ringoroses.01+02.token.1.xml</master-location> <rights-excluding-sources which="by-nc-nd_4.0" rights-holder="park"/> <source xml:id="ring1881"> <IRI>tag:parkj@textalign.net,2015:ring01</IRI> <name>Ring o roses 1881</name> <location when-accessed="2015-01-17">../TAN-T/ring-o-roses.eng.1881.xml</location> </source> <source xml:id="ring1987"> <IRI>tag:parkj@textalign.net,2015:ring02</IRI> <name>Ring o roses 1987</name> <location when-accessed="2015-01-17">../TAN-T/ring-o-roses.eng.1987.xml</location> </source> <declarations> <bitext-relation xml:id="B-descends-from-A"> <IRI>tag:textalign.net,2015:bitext-relation:a/x+/b</IRI> <name>B descends directly from A, unknown number of intermediaries</name> <desc>The 1987 version is hypothesized to descend somehow from the 1881 version, mainly for the sake of illustration.</desc> </bitext-relation> <reuse-type xml:id="adaptationGeneral"> <IRI>tag:textalign.net,2015:reuse-type:adaptation:general</IRI> <name>general adaptation</name> </reuse-type> <token-definition src="ring1881 ring1987" which="letters"/> </declarations> <agent xml:id="park" roles="creator"> <IRI>tag:parkj@textalign.net,2015:self</IRI> <name xml:lang="eng">Jenny Park</name> </agent> <role xml:id="creator" which="creator"/> <change when="2015-01-20" who="park">Started file</change> </head> The TAN-A-tok <head> looks similar to the previous examples, except that <declarations> has three children. <bitext-relation> states through an IRI + name pattern the stemmatic relationship we think holds between the two sources. (Stemmatics is the study of the chain of transmission by which a single work eventually became the multiple copies, versions, and editions that are extant; it frequently involves the creation of genealogical-like trees to illustrate the work's version history.)
We have used the entire IRI + name pattern, but we could have substituted it with @which and the value a/x+/b. One or more <reuse-type>s specify how one text has reused another. The IRI we have used shows that we believe that the later text has generally adapted the earlier one. If this were a translation or a quotation or some other kind of text reuse, we might have used a different IRI. A third declaration, <token-definition>, specifies how we have defined our word tokens. @src has more than one value, specifying that the same tokenization rule should be applied to both sources. The value for @which, letters, is a reserved TAN keyword that specifies that a token is any consecutive string of word characters, ignoring spaces and punctuation. Under this token definition the phrase "Hush!" said he would contain three tokens. Had we set the value of @which to the reserved TAN keyword letters and punctuation, we would have six tokens, since each punctuation mark would be defined as a token. <token-definition> is optional. If we leave it out, users are to assume that we mean letters. This is because most often, whenever in ordinary conversation we refer to the nth word in a sentence, we assume people will skip punctuation marks in their counting.
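The difference between the two token definitions described above can be illustrated with a short Python sketch. The regular expressions here are assumptions chosen to reproduce the counts given in the text; the patterns that TAN itself associates with the keywords letters and letters and punctuation may differ (for instance, \w in Python also matches digits and the underscore).

```python
import re


def tokenize_letters(text: str) -> list[str]:
    """Approximate the 'letters' definition: runs of word characters only."""
    return re.findall(r"\w+", text)


def tokenize_letters_and_punctuation(text: str) -> list[str]:
    """Approximate 'letters and punctuation': each punctuation mark is its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


phrase = '"Hush!" said he'
print(tokenize_letters(phrase))                  # three tokens
print(tokenize_letters_and_punctuation(phrase))  # six tokens
```

Because token-based alignments refer to tokens by position, two files that tokenize the same text differently will disagree about what "the second token" is, which is why the definition is declared in the <head>.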

Aligning across Projects We now have a small, tightly knit corpus of TAN files. Let us imagine what it might be like to connect our TAN corpus to another. Let us assume that we have found in a German project a TAN transcription of a work that looks quite similar to our own:<?xml version="1.0" encoding="UTF-8"?> <?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?> <?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?> <TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:hans@beispiel.com,2014:ringel"> <head> <name>TAN Transkription, Ringelreihen mit Riederfallen</name> <master-location>http://beispiel.com/TAN-T/ringel.xml</master-location> <rights-excluding-sources rights-holder="schmidt"> <IRI>http://creativecommons.org/licenses/by/4.0/</IRI> <name>Creative Commons Namensnennung 4.0 International Lizenz.</name> <desc>Dieses Werk ist lizenziert unter einer Creative Commons Namensnennung 4.0 International Lizenz.</desc> </rights-excluding-sources> <source> <IRI>http://www.worldcat.org/oclc/4574384</IRI> <name>Franz Magnus Böhme, Deutsches Kinderlied und Kinderspiel: Volksüberlieferungen aus allen Landen deutscher Zunge, gesammelt, geordnet und mit Angabe der Quellen. 
Leipzig, 1897.</name> </source> <declarations> <work> <IRI>tag:beispiel.com,2014:texte:holderbusch</IRI> <name>"Die Kinder auf dem Holderbusch"</name> </work> <version> <IRI>urn:uuid:31648039-3dbb-49b9-b66e-9bd2cd11630e</IRI> <name>zweite Version</name> </version> <div-type xml:id="Zeile"> <IRI>http://dbpedia.org/resource/Gedichtzeile</IRI> <name>Gedichtzeile</name> </div-type> <filter> <normalization> <IRI>tag:kalvesmaki@gmail.com,2014:normalization:hyphens-discretionary-off</IRI> <name>Keine Bindestriche</name> </normalization> </filter> </declarations> <agent xml:id="schmidt" roles="Produzent"> <IRI>tag:hans@beispiel.com,2014:selbst</IRI> <name xml:lang="eng">Hans Schmidt</name> </agent> <role xml:id="Produzent"> <IRI>http://schema.org/producer</IRI> <name xml:lang="eng">Produzent</name> </role> <change when="2014-08-13" who="schmidt">Anfang</change> <comment when="2014-08-13" who="schmidt">unten auf der Z. 438, recht</comment> </head> <body xml:lang="deu" in-progress="false"> <div type="Zeile" n="a">Ringel, Ringel, Reihe!</div> <div type="Zeile" n="b">Sind der Kinder dreie,</div> <div type="Zeile" n="c">Sitzen auf dem Holderbuch,</div> <div type="Zeile" n="e">Schreien alle: husch, husch, husch!</div> </body> </TAN-T> It seems clear to us that this 19th-century German version is quite similar to our two English versions. We have some alignment options open to us. Two more sets of word-for-word alignments would be interesting, but remember, just because we find a text that nicely aligns with others does not mean that we must align them, or, if we do choose to make an alignment, that we have to align everything. In this case, we choose not to worry about word-for-word alignments, and we focus here only on the TAN-A-div alignment, so that, for example, we can later generate an HTML report that will allow us to read the three versions in parallel more easily and study their relationships.
To that end, we first observe some differences between this transcription and our other two. First, the value of <work> is not the one we have given our two versions. Second, the <div-type> is defined as http://dbpedia.org/resource/Gedichtzeile (Gedichtzeile = line of poetry). Third, the lines have been lettered instead of numbered. And last, the editor seems to have made a typographical error, making the last line n="e" instead of n="d". These four differences typify some of the inconsistencies that are commonly found in digital texts. There are a few other differences in this third transcription that do not affect our alignment. <version> is used to distinguish different versions of the same work found on the same text-bearing object. That is, if we are transcribing a bilingual edition, we can use <version> to specify which of the two versions we are encoding. Notice that the <IRI> value is a uuid. In this case the editor was not prepared to deploy a formal IRI naming scheme (perhaps using a tag URN) that would be satisfactory for work-versions. These are points we can easily reconcile in our TAN-A-div file, which we now expand to include the German version.
We make the following adjustments (in boldface):

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-A-div.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-A-div.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TAN-A-div xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring-alignment">
   <head>
      <name>div-based alignment of multiple versions of Ring o Roses</name>
      <master-location>ringoroses.div.1.xml</master-location>
      <rights-excluding-sources which="by-nc-nd_4.0" rights-holder="park"/>
      <source xml:id="eng-uk">
         <IRI>tag:parkj@textalign.net,2015:ring01</IRI>
         <name>Transcription of ring around the roses in English (UK)</name>
         <location when-accessed="2015-03-10">../TAN-T/ring-o-roses.eng.1881.xml</location>
      </source>
      <source xml:id="eng-us">
         <IRI>tag:parkj@textalign.net,2015:ring02</IRI>
         <name>Transcription of ring around the roses in English (US)</name>
         <location when-accessed="2014-08-13">../TAN-T/ring-o-roses.eng.1987.xml</location>
      </source>
      <source xml:id="ger">
         <IRI>tag:beispiel.com,2014:ringel</IRI>
         <name>Transcription of an ancestor of Ring around the roses in German</name>
         <location when-accessed="2014-08-22">http://beispiel.com/TAN-T/ringel.xml</location>
         <location when-accessed="2014-08-22">../TAN-T/ring-o-roses.deu.1897.xml</location>
      </source>
      <declarations/>
      <agent xml:id="park" roles="creator">
         <IRI>tag:parkj@textalign.net,2015:self</IRI>
         <name xml:lang="eng">Jenny Park</name>
      </agent>
      <role xml:id="creator" which="creator"/>
      <change when="2014-08-14" who="park">Started file</change>
      <change when="2014-08-22" who="park">Added German version.</change>
   </head>
   <body>
      <equate-works src="eng-uk ger"/>
      <equate-div-types>
         <div-type-ref src="ger" div-type-ref="Zeile"/>
         <div-type-ref src="eng-uk" div-type-ref="line"/>
      </equate-div-types>
      <realign>
         <anchor-div-ref source="ger" ref="5"/>
         <div-ref source="eng-us" ref="4"/>
      </realign>
   </body>
</TAN-A-div>

The first major change is the insertion of a new <source>, identifying the name and location of the third example. Note that two locations have been provided, one for the original location and another for the copy saved locally into our project folder. Validation will occur at the first document available. If we wanted to work primarily off our local copy, we would have put it first. By placing it second, we allow the validation engine to look for updates and changes in the master version. If that version is unavailable, validation will be made against the second, local copy. The second major insertion is a new <change>, documenting when we made the alterations. The value of @when effectively updates the version of our TAN-A-div file. The third major change populates the <body> with elements that calibrate the new version to the other two. <equate-works> says that, for the sake of this alignment, the works defined in the UK version and the German version are to be considered equivalent. We did not mention the US version because we do not need to. TAN rules specify that all alignments are transitive unless otherwise specified. If A and B are already defined to be the same work, and we equate A and C as the same work, then B and C will be equated as well. Note, we are not committing ourselves to the proposition that they are in reality the same work. We are making this statement only provisionally, to facilitate the alignment. <equate-div-types> declares that what the German version calls Zeile is, for the sake of this alignment, equivalent to what the UK version calls line. Transitivity means that Zeile is inferred to be equivalent to what the US version calls l. This element is completely optional. If we left it out, the alignment, which is based upon references, not division types, would not be affected. But by creating it, we assist users who may care about textual divisions.
A <realign> takes care of the apparent typographical error, this time anchoring the German version to the US one. Any <div-ref> in a <realign> is wrested from automatic alignment and attached to an <anchor-div-ref> and, by the law of transitivity, anything that aligns to it, in this case the UK version. Note that we have used 5 and not e to point to the stray reference in the German version. We could have used e, or even the Roman numeral v, had we wished to; but we should choose a single numbering system we are comfortable with for our TAN-A-div file, and stick with it. Every TAN file's numeration system is evaluated locally, independent of any companion files. That way a single TAN file can use a single kind of numbering to access multiple TAN documents that may each use different numerals. Therefore we do not need to reconcile the letter labels a, b, and c in the @n values in the German version, because these will be automatically treated as equivalent to 1, 2, and 3. The TAN format allows four numeration systems other than Arabic numerals: Roman numerals (uppercase or lowercase), alphabetic numerals (a, b, c, ..., z, aa, bb, ....), and digit-alphabet combinations (e.g., 1a, 1e, 4g) or alphabet-digit combinations (e.g., a4, a5, b5). The last two systems will be converted to hyphen-joined Arabic numerals before comparison (e.g., 1-1, 1-5, 4-7, 1-4, 1-5, 2-5). With these changes, the new version is completely synchronized with the other two. Our work might have been simpler had we just modified the German version ourselves. But such changes would have affected only our local copy, not the master one. Changing only our local copy would not allow us to connect our work to other TAN files that may be depending upon the same master file. But the format has also been designed to anticipate a living, growing network. Perhaps Hans Schmidt, the producer of the German version, can be contacted. We do so, and we suggest that he modify the version to make it align better.
In the case of <div-type>, he need merely add another element: <IRI>http://dbpedia.org/resource/Line_(poetry)</IRI>. This line, in addition to the preexisting <IRI>, specifies that the two IRIs are equivalent. Perhaps he has reasons for labeling the lines with letters, and perhaps he is reluctant to explicitly identify this poem with Ring around the Rosie. That is within his rights. (Remember, TAN is meant to provide a framework within which opinions can be registered, even counterintuitive ones.) But the conversation might lead to our pointing out that n="e" should probably be n="d" and that there is an apparent discrepancy in the last line. (The original, printed book has the poem twice on page 438, one with the spelling "Holderbuch," the other, "Holderbusch"). If Schmidt chooses to correct his master file, he can add a new <change>, and thereby tacitly notify anyone else using the file that corrections have been made. At this point we have a network of five TAN files, four in our corpus and one from outside. Although simple, the network could be the basis for some creative and complex research questions. Stylesheets could be used to automatically align the versions for reading and study, or to perform statistical analysis. Study of the rest of these guidelines, as well as example TAN libraries, will suggest numerous ways to create, manage, share, and use TAN files.
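The numeration systems described in this walkthrough (Roman numerals, alphabetic numerals, and digit-alphabet or alphabet-digit combinations) can be illustrated with a small conversion routine. What follows is only a sketch in Python, not the TAN function library itself (which is written in XSLT); the function names are hypothetical, and the sketch makes no attempt to decide whether an ambiguous label such as c or v is Roman or alphabetic.

```python
import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral (uppercase or lowercase) to an integer."""
    values = [ROMAN_VALUES[ch] for ch in numeral.lower()]
    total = 0
    for i, value in enumerate(values):
        # A smaller value placed before a larger one is subtracted (iv = 4).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def alpha_to_int(label: str) -> int:
    """Convert a, b, ..., z, aa, bb, ... to 1, 2, ..., 26, 27, 28, ..."""
    match = re.fullmatch(r"([a-z])\1*", label.lower())
    if match is None:
        raise ValueError(f"not an alphabetic numeral: {label!r}")
    letter_value = ord(match.group(1)) - ord("a") + 1
    # Each full repetition of the alphabet adds 26.
    return (len(label) - 1) * 26 + letter_value


def combo_to_hyphenated(label: str) -> str:
    """Convert labels such as 1a or a4 to hyphen-joined Arabic numerals."""
    lowered = label.lower()
    match = re.fullmatch(r"([0-9]+)([a-z]+)", lowered)
    if match:
        return f"{int(match.group(1))}-{alpha_to_int(match.group(2))}"
    match = re.fullmatch(r"([a-z]+)([0-9]+)", lowered)
    if match:
        return f"{alpha_to_int(match.group(1))}-{int(match.group(2))}"
    raise ValueError(f"not a digit-alphabet or alphabet-digit label: {label!r}")
```

Under these assumptions, roman_to_int("v") and alpha_to_int("e") both yield 5, which is why 5, e, and v can all point to the same stray reference in the German version.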

Detailed Description This part of the guidelines provides a detailed description of the formats of the Text Alignment Network. The material is organized according to the structure that governs the schema files, so both can be read in tandem. The first chapter outlines, in a non-technical way, the principles and technical foundations of the TAN format. The chapters that follow comprehensively describe all the TAN formats. Each of them covers preliminary theoretical or scholarly considerations, discussing how the features of each TAN format are meant to be interpreted as a whole. The first of two very long chapters provides a comprehensive, detailed explanation of the rules for every element and attribute, as well as the patterns into which they fall. This chapter includes a thorough list of relevant validation rules and examples. It has been written using a stylesheet that traverses the official TAN schemas, functions, and examples. The second lists all the vocabulary items that have already been defined as a core part of the format. This chapter is, essentially, a re-presentation of the TAN-key files that are in the TAN-key folder. The chapters in this part of the guidelines should be read selectively, not consecutively. They have been written with the assumption that you have already read the previous part and that you have already started to create and edit a TAN collection. Because readers will come from different specialties, all acronyms, abbreviations, and concepts are defined and explained, albeit tersely. Concepts or technologies are discussed only insofar as they affect the use of TAN; suggestions for further reading are provided for those who want a more thorough introduction to a topic. General Underpinnings This chapter retains something of the introductory spirit of the previous one by providing an overview of the fundamental principles and technologies behind TAN.
The overall goal of this chapter is to document the definitions, assumptions, and other matters that have shaped the design of the format. Although this chapter assumes on your part no prior knowledge of any particular technology, it is also not meant to be a tutorial. Links to further reading will take you to more adequate introductory material.

The Big Picture The Text Alignment Network is a modular suite of XML encoding formats. Each TAN format is designed for a specific type of textual data, divided into three classes: transcriptions (class 1), annotations of transcriptions (class 2), and everything else (class 3). Class 1, representations of textual objects, consists solely of transcription files. Each transcription file contains the text of a single work from a single text-bearing object, whether physical or digital (an object we sometimes term scriptum). There are two types of transcription file: a standard generic format and a TEI extension. Both are TEI conformable. These two types are differentiated by the root element, <TAN-T> and <TEI> respectively. In the future, class 1 may expand to include formats intended to segment (and therefore align) visual, audio, or audiovisual files; it may also expand to include a customized form of HTML. Class 2, annotations of class 1 files, encodes data concerning alignment, lexico-morphology, and other textual claims. There are two types of alignment, one for broad, general alignments and another for granular, word-for-word alignments. The former, with <TAN-A-div> as the root element, aligns any number (one or more) of class 1 files, and permits assorted claims about those files. The latter, <TAN-A-tok>, aligns only pairs of class 1 files. Lexico-morphology files, <TAN-LM>, are used to encode the lexical and morphological (or part of speech) forms of individual words in a single class 1 file. In the future, class 2 may expand to include syntax (treebanking). Class 3 covers everything else. <TAN-mor> declares the grammatical categories or features of a given language and stipulates rules for tagging words. <TAN-key> collects and defines terms frequently used in other TAN files. <TAN-c> supports assertions (in a syntax inspired by RDF) to provide context to other TAN files. Class 3 may expand in the future to include transliteration, lexicography, and syntax.
Inclusions: Any TAN file may include any other TAN file, no matter the class of either the including or the included files. Inclusions in TAN behave differently from other kinds of inclusions in markup languages. For example, in XSLT, if file A includes file B, all of B's first-tier children are copied into the root element of A before A is processed. In XML Inclusions, inclusion pertains either to the entire file or to a specific element, named through XPointer. In both cases, mutual inclusion is not allowed because of its inherent circularity. In TAN, inclusion is a two-step process. First, the included file B is declared by means of an <inclusion> in the <head> of document A. Second, certain elements in document A may include an @include, specifying that the host element should be replaced by all elements of the same name found in document B. Because of this behavior, the prohibition on circular inclusion pertains only to select element names. That is, A and B may validly invoke each other as inclusions, or share inclusions, as long as there is no circularity in the elements that are included. TAN files that refer to or are referred to by other TAN files form a kind of network. Alignment files become the principal point of connection. Below is an illustration of how an ecosystem of independently curated TAN files might interrelate, with arrows showing lines of dependency. In this hypothetical example, Editor 1 has transcriptions of four different high medieval works and she wants simply to make them available to anyone who wants to use them, and posts them on Server 1. Editor 2 (= Server 2), interested primarily in Old French morphology, finds three versions in Server 1 that are in that language and publishes a morphological analysis of them. Editor 3 has provided a small collection of two early interrelated medieval Latin works.
Editor 4 has found an Old English version missing from Editor 1's collection, and has decided to provide not only a word-for-word correspondence between it and a key Old French version, but to create a morphological analysis of that Old English version, as a counterpart to Editor 2's work on the Old French version. (He is interested in computing the morphological differences between the Old French and Old English versions.) Editor 5 is interested primarily in showing where Server 1's collection quotes from the works on Server 3, and so merely puts together an alignment of quotations. This approach adopts what is sometimes called stand-off annotation (or stand-off markup), in contrast to in-line annotation, in which a transcription and its alignments, morphology, and other annotations are placed in a single file. (Most TEI and HTML files rely upon in-line annotation.) In the TAN format, stand-off annotation has been extended into a modular design, with each module designed to be simple and to complement the other modules. (In fact, the combined number of elements and attributes across the TAN modules is roughly equal to the number of elements in HTML.) Modular stand-off annotation has been adopted for several reasons: An editor can work on a file with minimal distraction, focusing on a limited set of closely related questions. (Editors 2 and 5 can work off the same master files provided by Server 1, even though they have very different research interests.) Complementary or competing annotations can be made, even if those annotations overlap (a major problem for in-line annotation, where according to XML rules no element may interlock or overlap with another). (Editor 5 may choose to incorporate or ignore the alignments that Editor 3 has made of her collection.) Annotations can be made concurrently with any others that may already exist, allowing for rich and complex analyses.
After a TAN collection is published, any other TAN files that it refers to, or any TAN files referring to it, can be aggregated into much larger and more complex datasets, which can then be queried to answer questions that might not have been anticipated. Editorial labor can be conducted without central coordination, as individuals work at their own pace, independently, on separate files. When errors are found, they can be corrected in master files. Anyone depending upon that master file as a source will be notified of changes that have been made and they can deal with them accordingly. (Editor 1 can post typographical corrections, and if she logs the change with a time-date stamp, anyone using the file, upon validating their files, will be sent information or a warning about the change. Similarly, Editors 2 and 4 can let Editor 1 know about their work, and Editor 1 can update the Old French versions with cross-references.) Any data file can be released, circulated, and used independent of any other that points to it, or to which it points. Connected files can be combined and transformed in any number of ways to produce a wide variety of derivative documents (e.g., collated versions, statistical analysis). A transformation created for one set of TAN documents will work identically on other TAN documents of the same format. (If someone creates a tool to synthesize a transcription and an associated TAN-LM file, it can be applied to both Editor 2's and Editor 4's work.) The TAN family of formats can be expanded to allow other types of linguistic data, and therefore other lines of research. Stand-off annotation is not without its liabilities. Files might be altered or altogether deleted, rendering dependent files meaningless. An editor may find that not having the annotated text in the same place as the annotation is an inconvenience. These are significant challenges, but TAN validation rules have been designed to mitigate them somewhat.

Assumptions in the Creation of TAN Data All creators and users of TAN files are expected to share a few basic assumptions. First, all TAN-compliant data is to be understood as largely derivative. That is, data files have no originality or creativity independent of their sources (but see below about interpretation). TAN-compliant data is to be created with the intent of adhering as closely as possible to some model or archetype. For example, a transcription should replicate faithfully some earlier digital edition or text-bearing material object (e.g., stone, papyrus, manuscript, printed book for written text; audiovisual media for oral or performative texts). Morphological files and alignment files should describe as clearly and as reliably as possible their source transcriptions. In creating and publishing a TAN file you claim to have offered a good-faith representation or description of something; in using a TAN file, you hold the creator to that expectation. Second, all core TAN files are interpretive. That is, they are permeated by editorial assumptions and opinions that might not be shared by everyone. If there is any originality or creativity in a TAN file, it is in that interpretive outlook. For example, if you edit a transcription file you must decide how to handle unusual letterforms and other visible marks. Your decisions will be informed by how you view the original text and its native writing system, and how you interpret and use Unicode. If you write an alignment file, you must make decisions about what factors caused one text to be transformed into another. Lexicomorphological files require you to commit to one or more grammars and dictionaries, and you must discern how best to handle cases of vagueness and ambiguity. As a general rule, the TAN classes go from least interpretive (class 1) to most (class 3). But no matter which class, no TAN data file ever stands completely outside the interpretive act.
In creating and publishing a TAN file you claim to have disclosed as best you can the assumptions behind your interpretive outlook; in using a TAN file, you hold the creator to that expectation. Third, all core TAN files are useful. That is, the interpretive impulse is assumed to be coupled with an equally strong desire to make the data as useful to as many users as possible, even those who may not share your assumptions or interpretation. A creator of a transcription file, for example, should normalize and segment texts with a minimum of idiosyncrasies, adopting when possible reference systems that are widely used so as to optimize the alignment process. Morphological files should depend whenever possible upon commonly accepted grammars and lexica. Alignment files should work with comprehensible categories of text reuse. No TAN file will always be useful to everyone, but it should be as useful to as many as possible, as frequently as possible. In creating a TAN file you claim to use common, shared conventions whenever possible, and to note any departures; in using a TAN file, you hold the creator to that expectation.

Core Technology TAN depends upon a core set of relatively stable technologies. Those technologies and the underlying terminology are very briefly defined and explained below, as far as they affect the TAN format. References to further reading will lead you to better and more thorough introductions. The central goal of this section is to highlight any decisions made in the design of TAN that significantly affect how anyone might create or interpret TAN-compliant data.

Unicode

What is it? Unicode is the worldwide standard for the consistent encoding, representation, and exchange of digital texts. Stable but still growing, Unicode is intended to represent all the world's writing systems, living and historical. Maintained by a nonprofit organization, Unicode is the basis upon which we can create and edit text in mixed alphabets and reliably share that data with other people, independent of individual fonts. Any Unicode-compliant text is in general semantically interoperable on the character level and can be exchanged between users and systems, no matter what font might be used to display the text. If some software tries to display some Unicode-compliant text in a particular font that does not support a particular alphabet, and ends up displaying boxes, the underlying data is still intact and valid. Styling the text with a font that does support the alphabet will reveal this to be the case. With more than 128,000 characters, Unicode is almost as complex as human writing itself. The entire sequence of characters is divided into blocks, each one reserved, more or less, for a particular alphabet or a set of characters that share something in common. Within each block, characters may be grouped further. Each character is assigned a single codepoint. Because computers work on the binary system, it was considered ideal to number the characters or glyphs in Unicode with a related numeration system. Codepoints are therefore numbered according to a hexadecimal system (base 16), which uses the digits 0 through 9 and the letters A through F. (The number 10 in decimal is A in hexadecimal; decimal 11 = hex B; decimal 16 = hex 10; decimal 79 = hex 4F.) To find Unicode codepoint values it is therefore helpful to think of the corpus of glyphs as a very long ribbon sixteen squares wide. This is illustrated nicely in this article.
Each position along the width is labeled with a hexadecimal number (0-9, A-F) that always identifies the last digit of a character's code point value. It is common to refer to Unicode characters by their value or their name. The value customarily starts "U+" and continues with the hexadecimal value, usually at least four digits. The official Unicode name is usually given fully in uppercase.

Examples: Unicode characters

Character      Unicode value    Unicode name
" " (space)    U+0020           SPACE
®              U+00AE           REGISTERED SIGN
ю              U+044E           CYRILLIC SMALL LETTER YU
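The "U+" convention and the official names can be reproduced with Python's standard unicodedata module. This is only an illustrative sketch; the helper names describe and ribbon_column are hypothetical.

```python
import unicodedata


def describe(char: str) -> str:
    """Return the customary 'U+XXXX NAME' label for a single character."""
    # ord() gives the codepoint as an integer; 04X pads it to four hex digits.
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


def ribbon_column(char: str) -> str:
    """Return the last hexadecimal digit of the codepoint (0-9, A-F)."""
    return format(ord(char) % 16, "X")


for ch in (" ", "\u00AE", "\u044E"):
    print(describe(ch), "| column", ribbon_column(ch))
```

The same formatting also converts ordinary numbers: format(79, "X") gives 4F, and format(10, "X") gives A.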

Normalization TAN validation rules require all data to be normalized according to the Unicode NFC algorithm. Any text in a TAN body that does not comply will be marked as invalid. Validation engines that support Schematron Quick Fixes will allow users to easily convert non-normalized to normalized Unicode.
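A check for NFC compliance can be sketched with Python's standard library. This illustrates the normalization form only; it is not the Schematron rule used by TAN validation, and the function names are hypothetical.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """True if the text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


def to_nfc(text: str) -> str:
    """Return the NFC-normalized form of the text."""
    return unicodedata.normalize("NFC", text)


# "e" followed by U+0301 COMBINING ACUTE ACCENT is not NFC,
# because a precomposed character (U+00E9) exists for the pair.
decomposed = "e\u0301"
composed = to_nfc(decomposed)
```

Here is_nfc(decomposed) is false, while composed is the single character U+00E9 and passes the check.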

Unicode characters with special interpretation The TAN format allows the following characters anywhere, but assigns them special meaning in certain contexts: U+200D ZERO WIDTH JOINER U+00AD SOFT HYPHEN When one of these characters occurs at the end of a leaf <div>, perhaps followed by white space that will be ignored (see below), processors will assume that the character is to be deleted and that, when the text is combined with the next leaf div, no intervening space should be allowed. Furthermore, because these characters are difficult to discern from spaces and hyphens, any output based on the character mapping of the core functions should replace these characters with their XML character references, &#x200D; and &#xAD;.

Combining characters At the core level of conformance, Unicode does not dictate whether combining characters (accents, modifying symbols) should be counted independently or as part of a base character, nor does the family of XML languages. In most circumstances, this point is negligible. But it affects regular expressions and XPath expressions (see below). Two of the class 2 formats allow the counting of characters. Such counting is assumed to be made exclusively of non-combining characters, defined as the regular expression [^\p{M}]. Any numerical reference made in a TAN file to an individual character will be found by counting only non-combining characters, and will return that base character combined with all combining characters that immediately follow. Any <div> that starts with a combining character will be marked as invalid. See also .
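The counting rule just described can be sketched as follows. Python's built-in re module does not support \p{M}, so this sketch assumes that testing for a Unicode general category beginning with "M" is an acceptable stand-in; the function names are hypothetical and are not taken from the TAN function library.

```python
import unicodedata
from typing import List


def is_combining(char: str) -> bool:
    """True for characters in the Unicode mark categories (Mn, Mc, Me)."""
    return unicodedata.category(char).startswith("M")


def split_characters(text: str) -> List[str]:
    """Group each base character with the combining characters that follow it."""
    clusters: List[str] = []
    for char in text:
        if is_combining(char):
            if not clusters:
                # Mirrors the rule that text may not start with a combining character.
                raise ValueError("text starts with a combining character")
            clusters[-1] += char
        else:
            clusters.append(char)
    return clusters


def nth_character(text: str, n: int) -> str:
    """Return the n-th character (1-based), counting only non-combining characters."""
    return split_characters(text)[n - 1]


# q + COMBINING DOT BELOW + COMBINING TILDE counts as one character.
sample = "aq\u0323\u0303b"
```

With this sample, the text counts as three characters, and nth_character(sample, 2) returns the q together with both of its combining marks.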

Deprecated Unicode points Because TAN is not concerned with appearance, the following characters will generate an error if found in a TAN file:

U+00A0 NO-BREAK SPACE
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
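A simple scan for the characters listed above might look like the following sketch. It is an illustration only, with a hypothetical function name; actual TAN validation is carried out by the Schematron rules and XSLT functions.

```python
from typing import List, Tuple

# U+00A0 plus the contiguous range U+2000 through U+200A.
DEPRECATED_CODEPOINTS = {0x00A0} | set(range(0x2000, 0x200B))


def find_deprecated(text: str) -> List[Tuple[int, str]]:
    """Return (position, 'U+XXXX') for every deprecated character in the text."""
    return [
        (index, f"U+{ord(char):04X}")
        for index, char in enumerate(text)
        if ord(char) in DEPRECATED_CODEPOINTS
    ]
```

An empty list means that none of the deprecated characters are present.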

Further Reading Unicode Consortium Unicode (Wikipedia)

eXtensible Markup Language (XML)

What is it? Defined by the W3C, the eXtensible Markup Language (XML) is a machine-actionable markup language that facilitates human readability. At its heart, XML is rather simple. It begins with an opening line that declares that what otherwise would look just like plain text is an XML file. It then proceeds to the data, which must be marked by one or more pairs of tags. An opening tag looks like <tag> and a closing like </tag> (or if the tags contain no data, this can be collapsed into one: <tag/>). A pair of matching tags is called an element. Elements must nest within each other. They cannot overlap. For example:

<?xml version="1.0" encoding="UTF-8"?>
<p>A paragraph about
   <name>
      <first>Mary</first>
      <last>Lee</last></name>.</p>

This nesting relationship of elements means that an XML document can be pictured as a tree, a metaphor that provides a host of technical names for the relationships that hold between elements: root, parent, child, sibling, ancestor, and descendant. In the example above, the root element <p> is the parent of <name> and the ancestor of <name>, <first>, and <last>. The element <first> is a child of <name> and a descendant of both <name> and <p>. <first> and <last> are siblings to each other. The opening tag of an element might have additional nodes called attributes, recognized by a word, an equals sign, and then some text within quotation marks (single or double), e.g., id="self". An element may have many attributes, and those attributes can appear in any order. Attributes can be thought of as leaves on an XML tree. They are intended to carry simple data (usually metadata about the data contained by the element), because they cannot govern anything else.

<?xml version="1.0" encoding="UTF-8"?>
<p n="1" id="example">A paragraph about <name><first>Mary</first> <last>Lee</last></name>.</p>

The two examples above are functionally equivalent. The first takes up several lines whereas the second has only two. But they're still equivalent.
That is because in most XML projects extra lines, spaces, and indentation are effectively ignored by processors, to give human editors the flexibility they need to optimize indentation for readability. Therefore, continuous strings of multiple spaces, tabs, and newline/carriage return are to be treated as a single space. (See below.) XML allows for other rules to be added, if an individual or group so wishes. These rules, called schemas, can allow great flexibility or be very strict. The TAN schemas tend to the latter.

Schemas and validation Validation files are found here: http://textalign.net/release/TAN-1-dev/schemas/. Each TAN file is validated by two types of schema files, one dealing with major rules concerning structure and data type (written in RELAX-NG), the other with very detailed rules (written in Schematron). The RELAX-NG rules are written primarily in compact syntax (.rnc), and converted to the XML syntax (.rng). For TAN-TEI, the special format One Document Does it all (.odd) is used to alter the rules for TEI All. The Schematron files are generally quite simple, acting as a conduit to a large function library written in XSLT. For more on this process, see . Some validation engines that process a valid TAN-compliant TEI file may return an error something like this:

conflicting ID-types for attribute "who"
                        of element "comment" from namespace "tag:textalign.net,2015:ns"

Such a message alerts you to the fact that by mixing TEI and TAN namespaces, you open yourself up to the possibility of conflicting xml:id values. It is your responsibility to ensure that you have not assigned duplicate identifiers. Very often, it is possible for you to configure an XML editor to ignore this discrepancy. (In oXygen XML editor go to Options > Preferences... > XML > XML Parser > RELAX NG and uncheck the box ID/IDREF.)

White space In any XML file, unless otherwise specified, consecutive space characters (space, tab, newline, and carriage return) are considered equivalent to a single space. This gives editors the freedom they need to format XML documents as they like, for either human readability or compactness. All TAN formats assume data will be pre-processed with space normalization, as defined by the standard XPath function fn:normalize-space(), which trims space from the beginning and end of a text node or string, and replaces consecutive space marks with a single space. Some space is assumed to exist between adjacent leaf <div>s, even if no space intervenes (unless the first <div> ends in the soft hyphen or the zero width joiner; see ). What type of space is not dictated by the TAN format. It is up to processors to analyze the relevant <div-type> to interpret what kind of white-space separator is appropriate. If retention of multiple spaces is important for your research, then TAN may not be an appropriate format, since it is not intended to replicate the appearance of a scriptum. Pure TEI (and not TAN-TEI) might be a practical alternative, since it allows for a literal use of space, and encourages XML files that try to replicate the appearance of a scriptum. For more on white space see the W3C recommendation.
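The behavior described above can be approximated in a few lines. The sketch below mimics fn:normalize-space() by treating only space, tab, carriage return, and newline as white space, and then joins leaf divisions with a separator unless a division ends in the soft hyphen or zero width joiner. The function names and the default separator (a single space) are assumptions made for illustration; as noted above, TAN leaves the choice of separator to the processor.

```python
import re
from typing import Iterable

# The four characters that XML and XPath treat as white space.
_XML_SPACE = re.compile(r"[ \t\r\n]+")

# Characters with special meaning at the end of a leaf div.
JOINERS = ("\u200D", "\u00AD")  # ZERO WIDTH JOINER, SOFT HYPHEN


def normalize_space(text: str) -> str:
    """Collapse runs of XML white space to one space and trim both ends."""
    return _XML_SPACE.sub(" ", text).strip(" ")


def join_leaf_divs(divs: Iterable[str], separator: str = " ") -> str:
    """Concatenate leaf div texts, honoring a trailing joiner character."""
    pieces = []
    for raw in divs:
        text = normalize_space(raw)
        if text.endswith(JOINERS):
            # Delete the marker and allow no space before the next div.
            pieces.append(text[:-1])
        else:
            pieces.append(text + separator)
    result = "".join(pieces)
    # Drop the separator that was added after the final div.
    if separator and result.endswith(separator):
        result = result[: -len(separator)]
    return result
```

For example, three divisions containing "  Ring  a" plus a soft hyphen, "round the" (with an internal line break), and "roses" would be joined as "Ring around the roses".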

Non-mixed content Many familiar text formats such as TEI, HTML, and Docbook allow what is called mixed content, i.e., elements and nonspace text nodes may be combined as siblings. The TAN formats, aside from TAN-TEI, are committed to a non-mixed content model. Nonspace text nodes and elements are never siblings. The practical effect of this policy is that indentation may be applied to a TAN file as one wishes, and space text nodes may be inserted between any two adjacent elements, without affecting the meaning. To specify in a class 1 file that two adjacent leaf <div>s should have no intervening space, see .

Namespaces

What are they? XML allows users to develop vocabularies of elements as they wish. One person may wish to use the element <bank> to refer to financial institutions, another to rivers. Perhaps someone wishes to mention both rivers and financial institutions in the same document. XML was designed to allow users to mix vocabularies, even when those vocabularies use identical element names. This means that anyone using <bank> must be able to specify exactly whose vocabulary is being used. Disambiguation is accomplished by associating IRIs (see below) with the element names. The actual full name of an element is the local name plus the IRI that qualifies its meaning, e.g., bank{http://example1.com/terms/} and bank{http://example2.com/terms/}. The relationship between the element name and the IRI is analogous to that between a person's given name and family name. The IRI (the family name) is called the namespace. If the term sounds like meaningless jargon, you may find it easier to think of it as the name of a group of elements. Namespace declarations look a lot like attributes (they aren't). They take the form <bank xmlns="http://example1.com/terms/">...</bank>, which states, in effect, not only which namespace governs <bank>, but what the default namespace will be for any descendants. If we wish to combine the two types of <bank> elements, we can assign abbreviations to select namespaces, then prepend those abbreviations to the element names, separated by a colon. Here are three ways to say the same thing, showing the use of prefix abbreviations and default namespaces:

<bank xmlns="http://example1.com/terms/">
   <bank xmlns="http://example2.com/terms/"> ... </bank>
</bank>

<bank xmlns="http://example1.com/terms/" xmlns:e2="http://example2.com/terms/">
   <e2:bank> ... </e2:bank>
</bank>

<e1:bank xmlns:e1="http://example1.com/terms/" xmlns:e2="http://example2.com/terms/">
   <e2:bank> ... </e2:bank>
</e1:bank>
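That the three forms are equivalent can be confirmed with any namespace-aware parser. The sketch below uses Python's standard xml.etree.ElementTree, which reports each element's full name in the form {namespace}local-name; the variable and function names are hypothetical, and the elided content (...) is replaced by empty elements.

```python
import xml.etree.ElementTree as ET
from typing import List

E1 = "http://example1.com/terms/"
E2 = "http://example2.com/terms/"

# Three spellings of the same two-element document.
VARIANTS = [
    f'<bank xmlns="{E1}"><bank xmlns="{E2}"/></bank>',
    f'<bank xmlns="{E1}" xmlns:e2="{E2}"><e2:bank/></bank>',
    f'<e1:bank xmlns:e1="{E1}" xmlns:e2="{E2}"><e2:bank/></e1:bank>',
]


def qualified_names(xml_text: str) -> List[str]:
    """Return the namespace-qualified name of every element, in document order."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]


for variant in VARIANTS:
    print(qualified_names(variant))
```

Each variant yields the same pair of names: an outer bank in the first namespace and an inner bank in the second, regardless of which prefixes or defaults were used.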

TAN namespace and prefix The TAN namespace is tag:textalign.net,2015:ns. The recommended prefix is tan. The namespace is expected to remain the same from one version to the next. The TAN-TEI format uses as its default the TEI namespace, http://www.tei-c.org/ns/1.0, normally given the prefix tei.

The Text Encoding Initiative

What is it? The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. Its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics. Since 1994, the TEI Guidelines have been widely used by libraries, museums, publishers, and individual scholars to present texts for online research, teaching, and preservation. In addition to the Guidelines themselves, the Consortium provides a variety of resources and training events for learning TEI, information on projects using the TEI, a bibliography of TEI-related publications, and software developed for or adapted to the TEI. Taken from the TEI website , accessed 2017-05-21. Any TAN-T module can be easily cast into a TEI file, although much of the computer-actionable semantics will be lost in the process. Likewise, a TEI file can be converted to TAN-T, but there is a greater risk of loss of content, particularly in the header, since the TAN format is intentionally restricted to an important but small subset of TEI tags. The TAN-TEI module is a TEI extension to the format, based on an ODD file that is in the same directory as the rest of the schemas. TAN-TEI schemas are generated on the basis of the official TEI All schema that is available at the time of release. For more about the strictures placed upon the TEI All schema see . See also and .

Further reading Text Encoding Initiative

Data types Being written purely in XML technologies, TAN adopts its data types, e.g., strings, booleans, and so forth, from the official specifications made by the W3C. The following data types require some special comments.

Languages TAN adopts Best Current Practice (BCP) 47 for language identification, which standardizes with high precision the way languages are identified. For most users of TAN, this will be a simple three-letter abbreviation, sometimes supplemented with hyphen-separated subtags designating a script or region. For example, eng, eng-GB, and eng-Cyrl-GB refer, respectively, to English generally, English from the United Kingdom, and English from the United Kingdom written in the Cyrillic script (BCP 47 places the script subtag before the region subtag, and the region subtag for the United Kingdom is GB). As a general rule, values of this type should begin with a three-letter language code, preferably lowercase. ISO codes for human languages appear in @xml:lang and <for-lang>. The first indicates the principal language of the text enclosed by the parent element. The second indicates that some statement or claim is being made about a specific language. For example, <for-lang> in the context of a TAN-mor file indicates languages for which the encoded morphological rules are appropriate. For more information, see one of the following: BCP 47 official specifications BCP 47 technical details
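The shape of the simpler language values can be checked mechanically. The following Python sketch is only an illustration, not part of TAN: the pattern is a deliberate simplification (an assumption of this example) that accepts a three-letter language code optionally followed by a four-letter script subtag and a region subtag, in the order BCP 47 prescribes. Full BCP 47 permits many other forms that this pattern rejects.

```python
import re

# Simplified, hypothetical pattern: three-letter language code, then an
# optional four-letter script subtag, then an optional region subtag
# (two letters or three digits). Real BCP 47 tags can be more elaborate.
SIMPLE_TAG = re.compile(
    r"[a-z]{3}(-[A-Za-z]{4})?(-(?:[A-Za-z]{2}|[0-9]{3}))?"
)

def looks_like_simple_language_tag(value: str) -> bool:
    """Return True if the value has the simplified shape described above."""
    return SIMPLE_TAG.fullmatch(value) is not None

for candidate in ("eng", "eng-GB", "eng-Cyrl-GB", "english"):
    print(candidate, looks_like_simple_language_tag(candidate))
```

A check like this tests only the shape of a value; it does not confirm that the subtags are registered.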

Dates and times TAN adopts the standardized ISO form of dates and date-times, as interpreted by XML data types. These begin with years (the largest unit) and end with days, seconds, or fractions of seconds (the smallest). This standard allows for easy sorting. The simplest date takes this form: YYYY-MM-DD. If a time is included, it is specified by continuing the string, first with a T (for time) then the form hh:mm:ss.sss(Z|[-+]hh:mm). For example, 2016-09-20T20:38:27.141-04:00 is an ISO date-time for Tuesday, September 20, 2016 at 8:38 p.m. in the Eastern Time Zone (daylight time, four hours behind UTC). More reading: W3C specification Wikipedia entry on ISO 8601
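As a small illustration of these properties, the Python sketch below parses the date-time given above with the standard library and shows that plain ISO date strings sort chronologically when sorted as text. The list of dates is invented for the example.

```python
from datetime import datetime, timedelta

# The date-time used as an example in the text above.
stamp = "2016-09-20T20:38:27.141-04:00"
parsed = datetime.fromisoformat(stamp)

print(parsed.year, parsed.month, parsed.day)   # calendar date
print(parsed.weekday())                        # Monday is 0, so Tuesday is 1
print(parsed.utcoffset())                      # offset from UTC

# Because the largest unit comes first, sorting the strings as plain text
# gives the same order as sorting the dates themselves.
dates = ["2016-09-20", "2015-12-31", "2016-01-05"]  # hypothetical values
sorted_as_text = sorted(dates)
sorted_as_dates = sorted(dates, key=datetime.fromisoformat)
print(sorted_as_text == sorted_as_dates)
```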

Identifiers and Their Use The acronyms for identifiers, and the meanings of those acronyms, can be mystifying. Here is a synopsis:
IRI: Internationalized Resource Identifier, a generalization of the URI system, allowing the use of Unicode; defined by RFC 3987.
URI: Uniform Resource Identifier, a string of characters used to identify a name or a resource; defined by RFC 3986.
URL: Uniform Resource Locator, a URI that identifies a Web resource and the communication protocol for retrieving the resource.
URN: Uniform Resource Name, a term that originally referred to persistent names using the urn: scheme, but is now applied to a variety of systems that have registered with the IANA. URNs are generally best thought of as a subset of URIs.
UUID: Universally Unique Identifier, a computer-generated 128-bit number used to assign identifiers to any entity. UUIDs can be built into a URN by prefixing them with urn:uuid:.
The TAN format generally prefers to refer to IRIs. See also .
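To make the relationship between UUIDs and URNs concrete, here is a brief Python sketch using the standard uuid module. It generates a random UUID and prints its URN form. It is offered only as an illustration; TAN itself does not require UUIDs.

```python
import uuid

# Generate a random (version 4) UUID; the value differs on every run.
identifier = uuid.uuid4()

# A UUID is a 128-bit number, i.e., 16 bytes.
bit_length = len(identifier.bytes) * 8

# The standard library exposes the URN form directly.
urn = identifier.urn

print(bit_length)
print(urn)
```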

Resource Description Framework (RDF) and Linked Open Data

What are they? Identifiers are used in many contexts for many purposes. One of the key purposes close to those of TAN involves what is called variously Linked Open Data (LOD) or the Semantic Web. These technologies rely upon a very simple data model called Resource Description Framework (RDF), a family of World Wide Web Consortium (W3C) specifications originally designed as a data model for metadata. The foundation of the model is the concept of a statement, made of three parts: subject, predicate, and object. Subjects and predicates take identifiers that act as names of things, as does the object, which also allows for data type. The practical impetus to LOD is that if we use URLs as identifiers for things, then we can create web pages at those URLs that provide humans and computers with related, linked information. And as we begin to use the same URLs for the same concepts, then independently created datasets can be combined and compared into a whole that admits inferences not possible with the parts alone. These URL identifiers look like a web page address (e.g., http://...), but are first and foremost names for things (the "Resource" behind RDF is a clumsy term pointing to person, place, concept—anything at all). Ideally, those URLs will still name those things after the domain name expires and the web resource cannot be found. But ordinary users may be forgiven for not knowing whether the URL is a web page or a name for something else.

TAN and RDF Many parts of TAN map nicely onto RDF and vice versa. In fact, TAN tends to be easier for humans to read and write than does RDF, even in its most straightforward syntax. Compare, for example, this snippet (taken from ), written in Turtle syntax, ...

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://biglynx.co.uk/people/dave-smith>
   rdf:type foaf:Person ;
   foaf:name "Dave Smith" .

...with the TAN equivalent:

<person xml:id="dsmith">
   <IRI>http://biglynx.co.uk/people/dave-smith</IRI>
   <name>Dave Smith</name>
</person>

In this case TAN and RDF are converted losslessly. But in many cases, TAN statements cannot be reduced to the RDF model. This happens most often in the context of <claim>, which is designed to allow scholarly assertions and claims that are difficult or impossible to express in RDF. For example, RDF does not allow one to say "Person X is not the author of text Y." TAN claims have been designed specifically to cater to such common scholarly expressions. For more details see .
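The correspondence in the simple case can be sketched in a few lines of Python. The function below is hypothetical, not part of the TAN function library. It reads the <person> snippet (written, as above, without a namespace declaration; in a real TAN file the elements would be in the TAN namespace) and emits subject-predicate-object triples as plain tuples. The mapping of <person> to foaf:Person and of <name> to foaf:name is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"

PERSON_XML = """<person xml:id="dsmith">
   <IRI>http://biglynx.co.uk/people/dave-smith</IRI>
   <name>Dave Smith</name>
</person>"""

def person_to_triples(xml_text):
    """Turn one <person> element into (subject, predicate, object) tuples."""
    person = ET.fromstring(xml_text)
    names = [name.text.strip() for name in person.findall("name")]
    triples = []
    # Every <IRI> is treated as a subject naming the same person.
    for iri in person.findall("IRI"):
        subject = iri.text.strip()
        triples.append((subject, RDF_TYPE, FOAF + "Person"))
        for name in names:
            triples.append((subject, FOAF + "name", name))
    return triples

triples = person_to_triples(PERSON_XML)
for triple in triples:
    print(triple)
```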

Further reading W3C recommendation Linked Data Linked Open Vocabularies

Tag URNs TAN files make extensive use of tag URNs (see ). In fact, TAN's namespace is a tag URN (). A tag URN has two parts:
Namespace. tag: + an e-mail address or domain name owned by the person or organization that has authorized the creation of the TAN file + , + an arbitrary day on which that address or domain name was owned. The day is expressed in the form YYYY-MM-DD, YYYY-MM, or YYYY. A missing MM or DD is implicitly assigned the value of 01.
Name of the TAN file. : + an arbitrary string (unique to the namespace chosen) chosen by the namespace owner as a label for the entire file and related versions. It need not be the same as the filename stored on a local directory. You should pick a name that is at least somewhat intelligible to human readers.
Great care must be taken in choosing the IRI name, because you are the sole guarantor of its uniqueness. It is permissible for something to have multiple IRIs, but never acceptable for an IRI to name more than one thing. It is a good practice to keep a master checklist of IRI names you have created. If you find yourself forgetting, or think you run the risk of creating duplicate IRI names, you should start afresh by creating a new namespace for your tag URNs, easily done just by changing the date in the tag URN namespace. That is, if tag:textalign.net,2015:... seems to be overly cluttered, you may start a new set of names with something else, e.g., tag:textalign.net,2015-01-02:....
TAN IRI names
tag:jan@example.com,1999-01-31:TAN-T001
tag:example.com,2001-04:hamlet-tan-t
tag:evagriusponticus.net,2014:tan-lm:Evagrius_Praktikos_grc_Guillaumonts
tag:bbrb@example.org,1995-04-01:pos-grc
The first example comes from someone who owned the email address jan@example.com on January 31, 1999 (at the stroke of midnight, Universal Coordinated Time). The other examples follow a similar logic. The namespaces of the second and third examples are tied to the owners of specific domain names, not those of email addresses. 
The 2014 in the third example is shorthand for the first second of January 1, 2014. The TAN encoding format has chosen tag URNs over URLs for several reasons:
Permanence. Authors of TAN data are creating files that are meant to be relevant for decades and centuries in the future, well after specific domain names have changed ownership or fallen into obsolescence, and well after the creators are dead. To mint names according to URLs is inadequate for long-term use, since it has no built-in mechanism to identify who owned the domain name in question when the name was minted.
Responsibility. The TAN format requires every piece of data to be attributable to someone (a person, organization, or some other agent). Tag URNs attach the responsibility for naming objects to a particular person or organization that owned the tag namespace at the specified time.
Accessibility. Tag URNs are available to anyone who has an email address. No one has to register with any central authority. You can begin naming anything you want, any time you want, without seeking anyone's approval.
Ease. Tag URNs are easier to use than, say, http-form URLs, as recommended by RDF (see ). Many potential TAN authors never have owned a domain name, and never will. Further, many of those who do own domain names cannot or do not wish to configure and maintain servers that will administer the referral mechanisms upon which the semantic web depends.
Disambiguation of name and location. In the semantic web, conflation of name with a location to resolve it is considered a virtue because a single string answers two questions: what is the resource and where can I find out more about it. But this conflation is unhelpful for those who use the TAN formats, who are encouraged to distribute their TAN files widely, and not rely upon a single location. And URLs are in common parlance interpreted as locations for data, not as names for things. 
TAN-compliant tag URNs ensure that the names of concepts and objects do not look like locations, maintaining a distinction that has always been a foundational principle in scholarly citation, namely, that one should always distinguish the name of a resource from where it might be found. Further reading: RFC 4151, the official definition of tag URNs
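The two-part structure of a tag URN, and the rule that a missing month or day defaults to 01, can be sketched in Python as follows. The pattern and the function name are assumptions made for this illustration. The pattern is looser than the full grammar in RFC 4151 and is not the validation routine used by TAN.

```python
import re
from datetime import date

# Hypothetical, simplified pattern: "tag:" + authority (email address or
# domain name) + "," + date (YYYY, YYYY-MM, or YYYY-MM-DD) + ":" + name.
TAG_URN = re.compile(
    r"tag:(?P<authority>[^,:\s]+),"
    r"(?P<year>\d{4})(?:-(?P<month>\d{2})(?:-(?P<day>\d{2}))?)?"
    r":(?P<name>\S+)"
)

def parse_tag_urn(value):
    """Split a tag URN into (authority, date, name); raise if malformed."""
    match = TAG_URN.fullmatch(value)
    if match is None:
        raise ValueError("not a tag URN: " + value)
    # A missing month or day is implicitly 01.
    minted = date(
        int(match["year"]),
        int(match["month"] or 1),
        int(match["day"] or 1),
    )
    return match["authority"], minted, match["name"]

print(parse_tag_urn("tag:jan@example.com,1999-01-31:TAN-T001"))
print(parse_tag_urn(
    "tag:evagriusponticus.net,2014:tan-lm:Evagrius_Praktikos_grc_Guillaumonts"
))
```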

Regular Expressions Regular expressions are patterns for searching text. The term regular here does not mean ordinary. Rather, it means rules (Latin regula), and points to a rule-based syntax that provides expressive power in algorithms that search and replace text. Regular expressions come in different flavors, and have several layers of complexity. So these guidelines are restricted to a synopsis that illustrates very common uses that conform to the definition of regular expressions found in the recommendation of XSLT 3.0 (XML Schema Datatypes plus some extensions), and outlined in XPath Functions 3.0. XML Schema Datatypes define regular expressions differently than does Perl, one of the most common forms of regular expression. For example, the pipe symbol, |, is treated as a word character in XML regular expressions (\w), but the opposite is true for Perl. For convenience, here is how codepoints U+0020..U+00FF are categorized according to XML (and therefore TAN): Word characters (\w):

$ + 0 1 2 3 4 5 6 7 8 9 < = > A B C D E F G H I J K L M N O P Q
                           R S T U V W X Y Z ^ ` a b c d e f g h i j k l m n o p q r s t u v w x y z
                           | ~ ¢ £ ¤ ¥ ¦ ¨ © ª ¬ ® ¯ ° ± ² ³ ´ µ ¸ ¹ º ¼ ½ ¾ À Á Â Ã Ä Å Æ Ç È É Ê Ë
                           Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð
                           ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

Non-word characters (\W):

! " # % & ' ( ) * , - . / : ; ? @ [ \ ] _ { } ¡ § «  ¶ · »
                           ¿
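The division above follows from the definition of \w in XML Schema Datatypes: every character except those in the Unicode general categories for punctuation (P), separators (Z), and other (C). The Python sketch below reproduces that rule with the standard unicodedata module. The function name is invented for the example, and results for a handful of characters may vary with the Unicode version bundled with your Python. Note that Python's own \w is different: it accepts the underscore and rejects the dollar sign.

```python
import re
import unicodedata

def is_xml_word_char(ch):
    """Approximate XML Schema \\w: anything not in categories P, Z, or C."""
    return unicodedata.category(ch)[0] not in ("P", "Z", "C")

for ch in "$|_-A ":
    print(repr(ch), unicodedata.category(ch), is_xml_word_char(ch))

# Contrast with Python's regular expression flavor.
python_accepts_underscore = re.fullmatch(r"\w", "_") is not None
python_accepts_dollar = re.fullmatch(r"\w", "$") is not None
print(python_accepts_underscore, python_accepts_dollar)
```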

Some of these choices may seem counterintuitive or wrong. But at this point it does not matter. The distinction is a legacy that will remain in place. It is advisable to familiarize yourself with decisions that, in some respect, are arbitrary. A regular expression search pattern is treated just like a conventional search pattern until the computer reaches a special escape character:

. [ ] \ | - ^
                     $ ? * + { } ( )

Here is a brief key to how characters behave in regular expressions, provided they are not in square brackets (on which see the recommended reading below):

Special characters in regular expressions

Symbol           Meaning
$                end of line
.                any character
|                or (union)
^                start of line
?                zero or one
*                zero or more
+                one or more
[ ]              a class of characters
( )              a group
\w               any word character
\W               any nonword character
\s               any of the four standard spacing characters: space (U+0020), tab (U+0009), newline (U+000A), carriage return (U+000D)
\S               anything not a spacing character
\d               any digit (0-9)
\D               anything not a digit
\p{IsGujarati}   any character from the Unicode block named Gujarati
\\               backslash (the backslash alone suggests that the next character is a special character)
\$               dollar sign
\(               opening parenthesis
\[               opening square bracket

Some examples:

Examples of Regular Expressions (each expression is applied to "Wi-fi, good. A_hem* isn't!")

Expression    Meaning    What the expression matches
^.+$          one whole line of characters    "Wi-fi, good. A_hem* isn't!"
[ae]          a or e    "e"
[a-e]         a, b, c, d, or e    "d", "e"
[^ae]+        one or more characters that are anything except a or e    "Wi-fi, good. A_h", "m* isn't!"
.i            any character followed by i    "Wi", "fi", " i"
(.i)          when a character followed by an i is found, treat it as a capture group (used only in a search pattern)    "Wi", "fi", " i"
$1            first capture group (used only in a replacement pattern, and corresponds to the sequence of capture groups in the search pattern)    In the example above, each match corresponds to $1
[aeiou]\w*    any lowercase vowel along with every word character that follows    "i", "i", "ood", "em", "isn"
[t*].         any t or * and the following character (note that the asterisk, if inside a character class, acts as itself)    "* ", "t!"
\s+           one or more space characters    " ", " ", " "
\w+           one or more word characters (the underscore is a nonword character in XML regular expressions)    "Wi", "fi", "good", "A", "hem", "isn", "t"
\W+           one or more nonword characters    "-", ", ", ". ", "_", "* ", "'", "!"
[^q]+         one or more characters that are not a q    "Wi-fi, good. A_hem* isn't!"
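Several of these examples can be tried in Python, with two caveats. Python's re module is a different flavor from the XML one described here, so the sketch avoids \w and \W, whose membership differs. And in a replacement pattern Python writes the first capture group as \1 rather than $1. The variable names are invented for the illustration.

```python
import re

SAMPLE = "Wi-fi, good. A_hem* isn't!"

# Character classes and negated classes.
not_a_or_e = re.findall(r"[^ae]+", SAMPLE)
a_through_e = re.findall(r"[a-e]", SAMPLE)

# Any character followed by i.
char_then_i = re.findall(r".i", SAMPLE)

# Inside a character class the asterisk acts as itself.
t_or_star = re.findall(r"[t*].", SAMPLE)

# Runs of spacing characters.
spaces = re.findall(r"\s+", SAMPLE)

# Capture group in the search pattern, reused in the replacement.
# Python writes \1 where the XML flavor writes $1.
bracketed = re.sub(r"(.i)", r"[\1]", SAMPLE)

print(not_a_or_e, a_through_e, char_then_i, t_or_star, spaces, bracketed)
```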

The examples above provide a taste of how regular expressions are constructed and read. For further examples especially relevant to TAN see <filter>. Regular Expressions and Combining Characters Regular expressions come in many different flavors, and each one deals with some of the more complex issues in Unicode in its own manner. This ambiguity will be most keenly felt in the use of combining characters in Unicode. Given the string áb, encoded as a + U+0301 (combining acute accent) + b (i.e., an acute accent over the a), a search pattern a. will in some search engines include the b and in others not. Unicode has differentiated three levels of support for regular expressions (see official report). TAN guarantees only level one conformance. Combining characters fall in level two. If you find the need to count characters, and you are working with a language that uses combining characters, you should count only base characters, not combining ones. In fact, TAN assumes that in cases where characters are identified with a numeral, the numeral excludes combining characters. See . Further, any regular expressions with wildcard characters cannot be expected to be treated uniformly. TAN includes several functions that usefully extend XML regular expressions. See tan:regex, tan:matches(), tan:replace(), tan:tokenize(). Further reading: Various tutorials on Regular Expressions Wikipedia, Regular Expressions Regular Expressions in XSLT 3.0 Unicode and Regular Expressions XML Schema Datatypes
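The advice to count only base characters can be illustrated with a short Python sketch. It treats any character in a Unicode mark category (M) as a combining character, which is an assumption of this example rather than the rule implemented by the TAN functions. It also shows what one particular engine, Python's re, does with the pattern a. on a decomposed string.

```python
import re
import unicodedata

def count_base_characters(text):
    """Count characters that are not combining marks (categories Mn, Mc, Me)."""
    return sum(
        1 for ch in text if not unicodedata.category(ch).startswith("M")
    )

decomposed = "a\u0301b"  # a + combining acute accent + b

total_codepoints = len(decomposed)
base_characters = count_base_characters(decomposed)

# In Python's re, the dot matches the combining accent, not the b.
dot_match = re.findall(r"a.", decomposed)

print(total_codepoints, base_characters, dot_match)
```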

Interpretation of multiple values The interpretation of an element with multiple child elements, which occur frequently in TAN files, or an attribute with multiple values can be quite unclear. Do those multiple values represent intersection, union, or distribution? For example, attribute="A B" could be interpreted to mean, using the diagram below, one instance in y (intersection), one instance in the region of x or y or z (union), or one instance in x or y and one instance in y and z (distribution).

[Figure: Venn diagram.jpeg]

The interpretation of multiple values in any TAN element or attribute is based upon perceived common usage in ordinary English language. For example, any element that takes the IRI + name pattern allows multiple <IRI>s. If entity j has <IRI>s A and B, and entity k has <IRI>s B and C, can j be inferred to be the same entity as k? Because people commonly use the same term while meaning different things, TAN can answer only the first half of this question. The IRI + name pattern is to be interpreted as union. But the TAN schemas cannot predict how people will interpret the extent of those two unions, or for that matter how they will interpret a single IRI. The TAN schemas interpret the meaning of multiple values in an element or attribute in one of three ways:
Intersection. Qualifications of claims, e.g., @adverb, @claimant. For example, "...probably not..." does not mean "...probably..." and "...not..." Not a transitive property (for j = A, B; k = B, C, nothing can be inferred about the relationship between j and k).
Union (default). Anything that takes the IRI + name pattern, <equate-works>, @when <when>, @where. For example, "entity j is [urn:A], [urn:B]" means that entity j is urn A, urn B, or both. TAN interprets this property as being transitive (for j = A, B; k = B, C; l = C, D, one may infer j = k = l). The interpretation of union as being transitive may result in inferences you disagree with. It is your responsibility to interrogate inferences in the TAN files you are using.
Distribution. @affects-element, @object, <object>, @src, @subject, <subject>, @verb. For example, "[Source A], [source B], are Z" means "Source A is Z" and "Source B is Z." This property is not transitive.
The above has ignored the important question of range. If entity x is said to be A, does it mean that it is true for all of x and all of A, or just some part of each? If the entity is one or more word tokens, then the statement is assumed to hold over the entire entity. 
If the claim is being made of a range of text, that assumption cannot be made. For example, to say that passage x quotes from passage y should not be interpreted to mean that the entirety of x quotes the entirety of y. At present, TAN does not address this ambiguity, and leaves judgment, based on common sense, to you.
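The transitive reading of union can be sketched in Python as follows: entities that share any IRI are merged, and merging continues until no two groups overlap. The function and the sample data are hypothetical, and the sketch is not the algorithm used by the TAN function library.

```python
def merge_by_shared_iris(entities):
    """Group entity names transitively by shared IRIs.

    entities maps an entity name to the set of IRIs assigned to it.
    Returns a list of sets of names; each set is one inferred entity.
    """
    groups = []  # list of (names, iris) pairs with mutually disjoint IRIs
    for name, iris in entities.items():
        merged_names, merged_iris = {name}, set(iris)
        untouched = []
        for group_names, group_iris in groups:
            if group_iris & iris:
                # Shared IRI: absorb the existing group into the new one.
                merged_names |= group_names
                merged_iris |= group_iris
            else:
                untouched.append((group_names, group_iris))
        untouched.append((merged_names, merged_iris))
        groups = untouched
    return [names for names, _ in groups]

sample = {
    "j": {"urn:A", "urn:B"},
    "k": {"urn:B", "urn:C"},
    "l": {"urn:C", "urn:D"},
    "m": {"urn:E"},
}
inferred = merge_by_shared_iris(sample)
print(inferred)
```

In this sample j, k, and l collapse into one entity even though j and l share no IRI directly, which is exactly the kind of inference you may wish to interrogate.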

Patterns and Structures Common to All TAN Encoding Formats This chapter provides general background to the elements and attributes that are common to all TAN files. For detailed discussion see .

Common Patterns

IRI + name Pattern Both humans and computers need to read and write TAN metadata. Very often what is readable to humans is unreadable to computers, and vice versa. So the TAN format requires that all metadata be provided whenever possible in both forms. Although this rule may appear to introduce redundancy and therefore new opportunities for error, the clarity is critical. It is the only way at present to ensure that anyone who approaches the data, computer or human, can parse and use it. In addition, doubly expressed metadata provides a safeguard much like a checksum: human- and computer-readable descriptions should correspond. Any discrepancy is a signal that an error should be diagnosed and fixed. Some metadata, such as comments, are neither easily nor profitably translated into a computer-actionable string. In such cases only the human-readable form is required. Other metadata involve regular expressions or ISO-compliant dates, both of which are well formed and are usually human-legible. In those cases the data is not repeated. In cases where a datum is not understandable to humans, such as a complex regular expression, a <comment> may be provided. Those exceptions aside, all other metadata takes what is called the IRI + name pattern: one or more <IRI>s and <name>s and zero or more <desc>s. If the thing being described is a digital file, then the IRI + name pattern is part of a larger pattern, the Digital Entity Metadata Pattern.

Digital Entity Metadata Pattern Some entities identified by the IRI + name Pattern will be digital resources. In those cases, the IRI + name Pattern is extended in two different ways, according to whether the entity is a TAN file or not. If the entity is a TAN file, then <IRI> (one and only one) must be a valid tag URN that matches the @id value of the TAN file being referred to. This may seem excessive, since in other contexts (HTML, TEI), one needs only the @href or @src. This extra measure has been introduced because TAN files are meant to be valid long after their creation, when they may be separated from their original context, or when a server no longer has the files referred to. Without the @id value, recovering the file referred to would be difficult or impossible; with it, recovery is easier, and perhaps possible. If the entity is not a TAN file, then any IRI may be used. If you choose to use the digital resource's URL as its name (and as its location; see below), then it will be inferred that you mean to identify the digital resource that appeared at that URL at the date or time you accessed it. In either case, the pattern adds to the IRI + name pattern one or more <location>s and an optional <checksum>.

Edit Stamp Most TAN elements allow for an optional edit stamp, an @ed-who and an @ed-when, stating who created or edited the enclosed data and when. Neither attribute is allowed without the other. @ed-when, @when, and @when-accessed are the attributes through which a TAN file's version is calculated. The latest date serves as the version number. An edit stamp performs the same function as <change>, except that no description can be provided, and it points precisely to the element where a change has been made. If a description of the alteration is necessary, <change> should be used.
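The version calculation described above can be sketched in Python. This is an illustration under stated assumptions, not the calculation performed by the TAN function library: the sample document is invented and is not claimed to be schema-valid, and values that lack a time zone are treated as UTC so that dates and date-times can be compared.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

DATE_ATTRIBUTES = ("when", "ed-when", "when-accessed")

# Hypothetical sample; element placement is illustrative only.
SAMPLE = """<TAN-T xmlns="tag:textalign.net,2015:ns"
      id="tag:example.com,2017:sample" TAN-version="1 dev">
   <head>
      <source when-accessed="2017-01-15"/>
      <change when="2017-03-01" who="ed">Started file</change>
   </head>
   <body>
      <div n="1" ed-who="ed" ed-when="2017-05-24T10:15:00-04:00">text</div>
   </body>
</TAN-T>"""

def to_datetime(value):
    """Parse an ISO date or date-time; assume UTC if no zone is given."""
    parsed = datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed

def latest_stamp(xml_text):
    """Return the most recent @when, @ed-when, or @when-accessed value."""
    root = ET.fromstring(xml_text)
    stamps = [
        element.get(attribute)
        for element in root.iter()
        for attribute in DATE_ATTRIBUTES
        if element.get(attribute) is not None
    ]
    return max(stamps, key=to_datetime) if stamps else None

version = latest_stamp(SAMPLE)
print(version)
```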

Overall Structure (root) All TAN-compliant files, no matter the type or class, follow a common basic structure: (1) at least three processing instruction nodes, (2) a namespace node, and (3) a root element. Processing instruction nodes: The first of three required processing instruction nodes is the standard declaration made in every XML file's prolog: <?xml version="1.0" encoding="UTF-8"?> After that come two more processing instruction nodes specifying the two schema files required for validation:

<?xml-model href="[PATH]/[ROOT-ELEMENT-NAME].rn[g OR c]"
                           type="application/relax-ng-compact-syntax"?>

<?xml-model href="[PATH]/[ROOT-ELEMENT-NAME].sch"
                           type="application/xml"
                           schematypens="http://purl.oclc.org/dsdl/schematron"?>

The first processing instruction node points to the RELAX-NG schema that declares the major, structural rules. The second points to the finely tuned rules, written in Schematron. Both processing instructions are required. [PATH] represents the pathname to the schema file, whether local or on a server, and [ROOT-ELEMENT-NAME] stands for the name of the root element (the element that is the ancestor of all other elements in the document and the descendant of none). An exception to this rule is that a TAN-LM file may alternatively point to TAN-LM-lang.rng, TAN-LM-lang.rnc, and TAN-LM-lang.sch. These are cases where the TAN-LM file is not based on a particular source but on a language in general. See . It is your choice whether you use .rnc or .rng as the extension for the RELAX-NG schema. The former is the compact syntax and the latter, the XML format. They are equivalent. The schemas are written primarily in the compact syntax, then converted to the XML format. Some files admit different levels of validation, sorted into what Schematron calls phases. TAN-A-div phases are termed basic and verbose, and are chosen by specifying the phase in the prolog, e.g.,

<?xml-model
                  href="TAN-A-div.sch" phase="basic" type="application/xml"
                  schematypens="http://purl.oclc.org/dsdl/schematron"?>

The verbose version makes extra calculations that go beyond mere validation, and analyzes the differences between source files. In most cases, if you have not specified which phase you prefer in the prolog, you will be prompted for a choice when you validate your file. Master files are kept at the TAN git repository and website, but anyone may cache, save, serve, and use copies of the TAN schema files anywhere. Namespace node: All TAN elements take the namespace tag:textalign.net,2015:ns. In most cases, this value is placed in the root element. (The only exceptions are TAN-TEI transcription files, which take as a default namespace http://www.tei-c.org/ns/1.0 everywhere but in /TEI/head, which takes the TAN namespace.) For more about namespaces, see . Root element: The name of the root element identifies the type of TAN file:

Root TAN elements

Root element name    Type of data                            TAN class
<TAN-T>              plain text transcriptions               1
<TEI>                TEI transcriptions                      1
<TAN-A-tok>          token-based alignments                  2
<TAN-A-div>          division-based alignments               2
<TAN-LM>             lexico-morphological analysis           2
<TAN-mor>            part of speech / morphology patterns    3
<TAN-key>            glossaries                              3
<TAN-c>              claims                                  3

Each root element takes a mandatory @id and @TAN-version. The root element takes only two mandatory children: <head> and <body>, the latter containing data and the former, metadata (data about the data). The only exceptions to this rule are TAN-TEI files, which take three children: <teiHeader>, <head>, and <text>, because the TEI header is inadequate for TAN purposes. See . All TAN files may take one final optional child, <tail>, a private use element that allows any well-formed XML. Nothing in a TAN file should be dependent upon the <tail>. That is, if you are editing a TAN file and you add a <tail>, assume that it will be disregarded by other users. Similarly, you may delete any TAN file's <tail> without consequence.
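The root-level requirements just described can be spot-checked with a few lines of Python. The sketch below is a hypothetical convenience, not a substitute for validation against the RELAX-NG and Schematron schemas. It covers only non-TEI files, and the sample documents are invented.

```python
import xml.etree.ElementTree as ET

TAN_NS = "tag:textalign.net,2015:ns"

def check_root(xml_text):
    """Return a list of problems found in the root of a non-TEI TAN file."""
    root = ET.fromstring(xml_text)
    problems = []
    if not root.tag.startswith("{" + TAN_NS + "}"):
        problems.append("root element is not in the TAN namespace")
    for attribute in ("id", "TAN-version"):
        if root.get(attribute) is None:
            problems.append("missing @" + attribute)
    # Strip the namespace from each child's tag to get its local name.
    child_names = [child.tag.split("}")[-1] for child in root]
    for required in ("head", "body"):
        if required not in child_names:
            problems.append("missing <" + required + ">")
    return problems

GOOD = """<TAN-T xmlns="tag:textalign.net,2015:ns"
      id="tag:example.com,2017:sample" TAN-version="1 dev">
   <head/>
   <body/>
   <tail><anything/></tail>
</TAN-T>"""

BAD = """<TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="1 dev">
   <head/>
</TAN-T>"""

print(check_root(GOOD))
print(check_root(BAD))
```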

@id and a TAN file's IRI Name Every TAN file requires in its root element an @id. Its value, termed the TAN file's IRI name, must take the form of a tag URN (see for syntax). The file's IRI name is the primary way other TAN files will refer to it. The namespace of the current file's IRI name must match at least one namespace in one <agent>'s <IRI> value. This helps tie the responsibility for the TAN file to at least one person. The first such <agent> is called the key agent. In choosing a value for @id you might borrow the filename, but you do not have to. Indeed, it is probably not a good idea, since files are frequently renamed, often with good reason. A TAN file's IRI name should not be changed, especially after publication, because the name is supposed to be permanent and stable. On occasion during editing, it will become clear that revisions are so deep that the file is substantially different from how it began. If a previous version has been published, then coining a new IRI name is advised, to dissociate the file from its ancestry. You may always document the connection by supplying a <see-also> element in the <head>, specifying the <relationship> between the two. If you take someone else's data and alter it, you should not change the IRI name, not even its namespace. To avoid suggesting that the owner of that namespace is responsible for the revised file, you should add yourself as an <agent> and then document your alterations through <change> or @ed-when and @ed-who. You should also probably add a <see-also> element, pointing to a version of the file that predates your intervention. The name of the version of a TAN file is identified by the most recent date in a file's @when, @ed-when, or @when-accessed. 
It is important, therefore, whenever you change a TAN file that has already been published to provide at least an edit stamp () in the part of the file you changed or in a <comment> or <change>, so that anyone validating a TAN file dependent upon yours will be warned that changes have been made. The user may then either continue to process the file (the changes may be minor or inconsequential) or investigate the changes before deciding what to do. Because the IRI name is stable, it is suitable for use outside of TAN, in, for example, RDFa, JSON-LD, and linked open data (see ). The IRI name kept at @id is the only metadatum positioned outside <head>. It is placed as rootward in the document as possible to emphasize that it names the entire document. @TAN-version must be 1 dev, indicating that the files have been made in light of the development files of version one.

Metadata (<head>) No matter how much one TAN format differs from another, the metadata are quite similar. Anyone getting a TAN file, no matter its class or type, is assumed to want to know, and therefore find easily and predictably, the following: the stable name of the file; its version; its sources; other files upon which it depends or with which it otherwise has an important relationship; the most significant parts of the editorial history; the linguistic or scholarly conventions that have been adopted in creating and editing the data; the license, i.e., who holds what rights to the data, and what kind of reuse is allowed; and the persons, organizations, or entities that helped create the data, and the roles played by each. To answer these questions completely, consistently, and predictably the <head>, a mandatory child of the root element, takes a common pattern across all TAN formats, thus allowing anyone to work easily and predictably across large numbers and types of TAN files. The TAN <head>, intended to be concise and focused, compels you to provide metadata for the data that is governed by <body>, but it does not accommodate metadata for the metadata. That is, your metadata should focus on the data itself and not other things. For example, <head> requires you to name the people who helped create or edit the data, but you are not expected to tell us about them. You merely refer through <IRI> to other authoritative sources that can provide background information. The principles above explain why the TEI extension of TAN requires two heads, one for TEI and the other for TAN. Because of its design principles, the <teiHeader> is impossible to map onto a TAN <head>. But that <teiHeader> has valuable, sometimes critically important, information, and should be retained. Or it may be left empty. Detailed descriptions of <head> and its components are in . Here we provide a summary, general description of TAN metadata. 
To describe the current file, <head> takes one or more <name>s, zero or more <desc>s and <master-location>s, and one <rights-excluding-sources>. Next comes a list of files upon which the file depends: zero or more <inclusion>s, zero or more <key>s, zero or more <source>s, and zero or more <see-also>s. All editorial assumptions are placed in <declarations>, whose contents differ from one TAN format to the next. Finally comes the responsibility section stating who did what when: one or more <agent>s, <role>s, and <change>s, and zero or more <agentrole>s.

Rights and Licenses Two TAN elements cover rights and licenses: <rights-excluding-sources> (mandatory in every TAN file) and <rights-source-only> (optional, and never allowed in class 2 files, because a statement on rights is required in each source). The first element covers the work specific to a given TAN file. The second pertains to the rights for the sources. The distinction is important, and helpful. It is much easier for you to decide and state the rights and license behind your own work than to do so for that of others. Declaring who holds what rights over your source(s) may be not only difficult but risky, and is therefore optional (see below). As an editor, you are strongly encouraged in the <desc> element of <rights-excluding-sources> to emphasize the distinction between the rights you have over your data and the rights held by others over your source, for the benefit of those who may not be familiar with the TAN format. A statement something like this is recommended:

<desc>The data in this file, only insofar as it constitutes an
                     independent work, is licensed exclusive of any licenses held by parties over
                     the source or sources listed below.</desc>

When using a TAN file, you should investigate the entire chain of rights. If you find a discrepancy between the two licenses (that of a TAN file and that of its sources), you should respect the more restrictive license. If a TAN file has a very liberal, open license for the data, this does not necessarily mean that the material upon which it depends is in the public domain. The TAN file's source may be under tight restrictions. It is recommended that you not declare who owns what rights over your source unless you are quite certain. Copyright laws differ from one country to another, and they change. A source may be protected by copyright in one place and simultaneously be in the public domain in another. (At the time of this writing, dozens of scholarly editions of ancient texts are in the public domain in Germany, where copyright of a new edition lasts forty years, but not in the U.S. or Canada, where there is no explicit legislation on this issue.) Some copyright statements in books are false, or cannot be proven. Some persons or entities who claim rights over a source may have no legal basis for the claim, at least in some jurisdictions. Furthermore, if you mischaracterize the rights that are held over a source, you may be held liable by a putative rights holder. It is safer to use the <IRI> of <source> (described below) to point the user to a publisher or some other entity that has greater authority and specificity about who owns what rights. TAN adopts the Creative Commons licenses as its default key vocabulary. See .

Copyright Law versus Contract Law Some third-party services, such as the Thesaurus Linguae Graecae for Greek texts, require users to agree not to copy and reuse the texts in the service's databases. Such agreements fall under the area of contract law and not copyright law. That is, many of these third parties have no intellectual property rights (or only derivative rights) over the texts they store. Therefore, they should normally not be credited in any <rights-source-only>.

Inclusions and Keys Many if not most TAN files are created alongside or in the context of a project, where certain elements will be repeated. Such repetition makes the files prone to errors, where editorial corrections made in one place are mistakenly not made everywhere. TAN has two features that help avoid duplication, reduce the likelihood of incomplete editing, and lead to cleaner, smaller files.

Keys Most often, an editor wants a simple, shorthand reference to an entity commonly referred to from one file to the next in a single project, e.g., the person who is the principal editor. Writing individual IRI + name patterns can be time-consuming, and if a change needs to be made, it is easy to be inconsistent or incomplete. Vocabulary commonly used in a project may be kept in a <TAN-key> file. This file is made accessible to any other TAN file via <key>. The key vocabulary is then invoked by using @which, whose value should match a <name> value in the TAN-key file. A number of standard keys have already been predefined, documented in . It is strongly recommended that you not depend upon the supplementary TAN-key files of a different project. Rather, you should develop your own. You may also wish to create a workflow where the TAN-key is used for private editing, but the published versions have their keywords resolved to their full value.

Inclusions More powerful than TAN-keys are inclusions. Unlike other forms of inclusion you may be familiar with, TAN inclusion involves only select elements, never an entire file. As with keys, TAN inclusion is a two-step process. First, a TAN file is made available for inclusion by invoking <inclusion>s (inside <head>). Like <key>, an <inclusion> does nothing on its own. It merely indicates a file that may be used for patterned inclusions. Inclusions are acted upon only in the second step. Many elements allow @include, which points to the @xml:id reference of an included file. In the validation process, those elements will be replaced with every element of that name found in the inclusion file, checked recursively (see below), and ignoring duplicated elements. <inclusion>s are critically important to the content of the TAN file, so any file with <inclusion>s that cannot be located will be regarded as being in fatal error. Because of the importance of access to included files, it is strongly recommended that inclusions be limited to files locally available, in the same project. Inclusions are recursive. If a TAN file A has

<x include='B'>

and file B has <x include='C D E'> then the validator for file A will replace the element with all <x>s found in B, C, D, and E. In any recursive activity, circularity is fatal. That is true for TAN inclusion as well, but only within the domain of a given element name. It is perfectly legal for two files to include each other, as long as they do not try to include elements of the same name. TAN inclusion removes elements from their original context, which means that values that must be interpreted locally are converted before the elements are included. For example, @which must be interpreted in light of the included document's keys, not those of the including document. Similarly, different numeration systems, e.g., Roman numerals, must be interpreted locally and converted, before inclusion (see ).
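The recursive behavior described above can be sketched in Python. This is only an illustration under stated assumptions, not the TAN validator: the in-memory model (a dictionary from file name to a list of elements), the function name `resolve`, and the payload strings are all hypothetical, and the indirection through the @xml:id of an <inclusion> is collapsed into a plain file name.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical model: each file is a list of (element name, include targets, payload).
# An element with a non-empty include list stands for <x include='...'>.
Element = Tuple[str, List[str], str]
Files = Dict[str, List[Element]]


def resolve(files: Files, file_id: str, name: str,
            active: FrozenSet[str] = frozenset()) -> List[str]:
    """Collect the payloads of every `name` element reachable from `file_id`."""
    if file_id in active:
        # Circularity is fatal, but only within the domain of one element name.
        raise ValueError(f"circular inclusion of <{name}> through {file_id}")
    if file_id not in files:
        # An inclusion that cannot be located is treated as a fatal error.
        raise LookupError(f"included file {file_id} cannot be located")
    active = active | {file_id}
    found: List[str] = []
    for elem_name, include, payload in files[file_id]:
        if elem_name != name:
            continue  # elements of other names are never followed
        candidates = [payload] if not include else [
            p for target in include for p in resolve(files, target, name, active)
        ]
        for p in candidates:
            if p not in found:  # duplicated elements are ignored
                found.append(p)
    return found


files: Files = {
    "A": [("x", ["B"], ""), ("y", [], "a-y")],
    "B": [("x", ["C", "D", "E"], ""), ("x", [], "b-x"), ("y", ["A"], "")],
    "C": [("x", [], "c-x")],
    "D": [("x", [], "d-x"), ("x", [], "c-x")],
    "E": [("x", [], "e-x")],
}
print(resolve(files, "A", "x"))
print(resolve(files, "B", "y"))
```

In this toy data, A and B include each other, which is legal because A takes only <x> from B and B takes only <y> from A. Tracking the active chain per element name is one simple way to model that rule.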

Distinguishing <source>s and <see-also>s Creating and editing a class 1 TAN file frequently involves working with non-TAN digital files. In the course of editing, and making the material TAN-compatible, you will likely start to correct errors, to normalize conventions, or to bring the transcription closer to an earlier version. At such times it may be unclear how to credit the digital files. To answer this, first determine a class 1 file's <source>s. Everything else is then a <see-also>. If you find that you are changing the material to go back to the source of your source, then that earlier version should be the <source> and the file you were using should be credited under a <see-also>. But beware, lest using a particular source (such as the TLG) puts you in violation of contract law (see ).

Interpretation of inheritable attributes Some attributes are inheritable attributes, in that they affect not only the host element but all descendants as well. Some inheritable attributes in co-occurrence fall into an interpretive sequence. That is, in any given element, some attributes must be interpreted before others. @claimant falls first in the sequence, and @cert second. Each attribute qualifies the data governed by the element it modifies. Put another way, the two attributes are to be interpreted to mean: "@claimant has @cert confidence about the following data:...." Suppose you are encoding claims made by someone else, and you are not certain whether you are faithfully representing their point of view. In those cases, your doubt should be registered in a @claimant and @cert on a parent of the secondary claim you are representing. If @claimant is missing, it is to be assumed that the assertion is being made by the key <agent> (see ). If @cert is missing, it is to be assumed that the data is asserted with full confidence.
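As a rough illustration of inheritance with defaults, the following Python sketch looks up the nearest value of an attribute on an element or its ancestors. It is a simplification under stated assumptions: the element names `claims`, `claim`, and `detail`, the agent identifiers, the value `0.5` for @cert, and the two placeholder defaults are all hypothetical, and a nearest-ancestor lookup may not capture every nuance of how TAN layers a claim about a claim.

```python
import xml.etree.ElementTree as ET

# Hypothetical placeholders for the two documented defaults.
KEY_AGENT = "(key agent)"
FULL_CONFIDENCE = "(full confidence)"

# Hypothetical markup: an editor registers doubt on a parent element,
# and a nested claim names a different claimant.
SAMPLE = """
<claims claimant="editor-a" cert="0.5">
  <claim claimant="scholar-b"><detail/></claim>
  <claim><detail/></claim>
</claims>
"""


def inherited(root, target, attr, default):
    """Return the nearest value of `attr` on `target` or any ancestor."""
    parent = {child: p for p in root.iter() for child in p}
    node = target
    while node is not None:
        if attr in node.attrib:
            return node.attrib[attr]
        node = parent.get(node)  # None once the root has been passed
    return default


root = ET.fromstring(SAMPLE)
for detail in root.findall(".//detail"):
    claimant = inherited(root, detail, "claimant", KEY_AGENT)
    cert = inherited(root, detail, "cert", FULL_CONFIDENCE)
    # @claimant is read first, then @cert.
    print(f"{claimant} has {cert} confidence about the following data")
```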

Defining Words and Tokens At the heart of interaction between class 1 and class 2 files is a reference system that counts or names words. This poses a problem at the outset. The term word is notoriously difficult to define, no matter the language. In different contexts, for example, "New York" and "didn't" can each be justifiably defined as one or two words. Furthermore, some scholars consider punctuation to be words (e.g., commas in modern prose, representing "and"), whereas others ignore them as being anachronistic or capricious (e.g., ancient Greek and Latin). In the end, the number of meanings for "word" reflects the rich variety of scholarly disciplines. TAN adopts the proximate term token, a word that is defined not linguistically but computationally, according to a regular expression (see ). A TAN token is a reference pointer, not a linguistic marker. To define a token in TAN does not entail any linguistic commitments. Neither editors nor users of TAN data should infer that a <tok> points to a morpheme, a lexeme, or any other linguistic entity. There will frequently be a fortuitous correlation between the two, but it is not guaranteed. In TAN, a token is purely a method of reference. TAN requires all class 2 files that handle tokens to define them, either implicitly through TAN defaults, or explicitly by using <token-definition>. TAN was developed in service of ancient literature, where punctuation is anomalous, or of little use. Furthermore, even in contemporary use, most people ignore punctuation when they count words. Therefore the default <token-definition> defines a token as being any continuous string of word characters, the soft hyphen, the zero-width space, or the zero-width joiner, formally defined: <token-definition regex="[\w&#xAD;&#x200B;&#x200D;]+"/> This pattern will result in a close resemblance to what is ordinarily thought of as words, but perhaps with some surprises (see above, ).
If no <token-definition> is invoked for a particular source, the pattern above will be assumed. It may also be explicitly called through @which (see ). If you are working with modern texts, where punctuation might be important to name and number, try the built-in keyword general (or letters and punctuation): <token-definition regex="\w+|[^\w\s]"/> This expression defines a token as a sequence of word characters or any single character that is neither a word character nor a space. The string "(I go!)" (the text inside the quotation marks) would have five tokens: ( I go ! ). Above are the two built-in, TAN-defined <token-definition>s. You may customize your own <token-definition> to suit your needs. But keep in mind that TAN files were meant to be shared across fields and disciplines. You are encouraged to define tokens in a manner customary to users of the text. Specialized definitions make it less likely that your TAN file will be able to mesh well with other TAN files. Two class-2 files annotating the same class-1 file cannot be easily compared or synthesized if they use different definitions of token. Given those caveats, consider a specialized case, where you wish to prepare your transcriptions such that certain Unicode characters precisely delimit tokens that are synonymous with a particular linguistic category, say lexeme. Say, for example, you use specialized control characters (e.g., U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER) to mark word boundaries within the text of your class 1 file. You might then create a <token-definition> like this: <token-definition regex="[^\p{Cf}\s]+"/> The statement defines a token as any consecutive sequence of characters that are neither whitespace nor format control characters. Such customized approaches may make the technique unwieldy or impossible to use, thereby limiting your TAN file's interoperability and utility. If you use format control characters or other invisible special characters, it is recommended that you write them as XML character references, e.g., &#x200D;, so that they can be seen in your file.
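The effect of the two built-in definitions can be tried out with a short Python sketch. It is only an approximation: Python's `\w` is not identical to the `\w` of the XML Schema and XPath regular expressions that TAN relies on (Python's version, for instance, also matches the underscore), and the names `DEFAULT`, `GENERAL`, and `tokenize` are invented for this illustration.

```python
import re

# Default definition: word characters plus soft hyphen (U+00AD),
# zero-width space (U+200B), and zero-width joiner (U+200D).
DEFAULT = re.compile("[\\w\u00AD\u200B\u200D]+")

# "general" definition: a run of word characters, or any single
# character that is neither a word character nor whitespace.
GENERAL = re.compile(r"\w+|[^\w\s]")


def tokenize(text, pattern=DEFAULT):
    """Return the tokens of `text` in document order."""
    return pattern.findall(text)


print(tokenize("(I go!)"))            # ['I', 'go']
print(tokenize("(I go!)", GENERAL))   # ['(', 'I', 'go', '!', ')']
print(tokenize("didn't"))             # ['didn', 't']
```

The last call shows one of the surprises mentioned above: because the apostrophe is not a word character, "didn't" yields two tokens under the default definition.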

Class-1 TAN Files, Representations of Textual Objects (Scripta) This chapter provides general background to the elements and attributes that are common to all class 1 TAN files. For detailed discussion see . Class 1 TAN files preserve segmented transcriptions of books, manuscripts, papyri, stones, or any other objects with writing on them—collectively termed here scripta (sg. scriptum). Files of this class are the foundation of any project. No class 2 files (e.g., alignment, morphology) can be created without class 1 files. Transcriptions come in two different formats, identified by the root element. <TAN-T> is a simple, generic format, as close as one can get to plain text. <TEI> (also referred to in this manual as TAN-TEI), on the other hand, can be complex and highly expressive. Because the two types function almost identically, the generic TAN-T format is described first, followed by supplemental comments on TAN-TEI.

Principles and Assumptions

General (For more general principles and assumptions applying to all TAN files, not just class 1, see .) Class 1 formats are designed for faithful but judiciously edited digital transcriptions. Each TAN-T(EI) file is devoted exclusively to a single version of a single work found in a single scriptum (text-bearing object), segmented and uniquely labeled with a common reference system. Editors of TAN-T(EI) files should be able to read, write, and proofread texts in the languages of the transcriptions. They should understand the texts well enough to segment them and label them according to the conventions used for those works. They should be able to distinguish the primary source from its editorial apparatus. They should be familiar with normalizing conventions for texts from the period, language, and culture. They should know how users of the transcription might use it in other contexts, especially translation studies or a study of quotations. Editors need not understand everything about their texts, and they need not have any specialized skill in grammar or lexicology. They need not know the morphology of individual words, or how individual parts of the text have been translated. Those skills are better used in other TAN formats. TAN-T(EI) editors stand at the beginning of a larger workflow for text alignment. It is critical that work be published not hastily but only after careful proofreading, especially of white space. Many transcriptions, especially those of long texts, have typographical errors. Eliminating as many as possible before publication will maximize the utility of a TAN-T(EI) file. On the other hand, TAN has been designed with the assumption that all our files have typographical errors that we need to correct as they are found. If you are creating a TAN-T(EI) file, you are doing so primarily to serve text alignment.
To align is to correlate texts that are similar because of copying, translating, paraphrasing, revising, quoting, summarizing, and so forth. In all these processes, one or more texts, usually called the source (or sources), serve as the basis for a new text, oftentimes called the target. In many cases, the target and source bear little resemblance to each other. Therefore the best transcription files are those whose structures look to an archetype, not a particular version. Editors of TAN transcriptions should not worry about preserving the appearance of their source (i.e., the transcription should not be a diplomatic edition), and they should structure the text, when possible, by the most familiar reference system for that work. If possible, semantic mileposts (clauses, sentences, paragraphs, chapters) should be prioritized over visual ones (lines, columns, pages, volumes). See below on reference systems.

Domain model Contributors and users of TAN files must assume a firm distinction between a scriptum (text-bearing object) and a conceptual work, e.g., a specific printed copy of the Iliad versus the Iliad conceived generally. The former has materiality (digital files are treated as having materiality) and the latter does not. Even though both are constitutively necessary for any transcription, the two are sharply differentiated in the TAN format: <source> and @src point to physical exemplars; <work> and @work to the conceptual. The distinction may remind some readers of the domain model defined by the Functional Requirements for Bibliographic Records (FRBR), which identifies four types of entities for what they call Group 1 (Products of intellectual & artistic endeavor): Work, Expression, Manifestation, and Item, the first pair being conceptual, non-material entities and the latter pair material ones. TAN has been designed with a slightly different domain model in mind. FRBR Items are equivalent to what TAN calls scripta. Multiple scripta that for all intents and purposes are indistinguishable (i.e., items reproduced mechanically) are equivalent to FRBR Manifestations, but in TAN no corresponding entity has been defined. It is best to think of TAN scripta as being equivalent to FRBR Items, with FRBR Manifestations being sets of indistinguishable TAN scripta. As for conceptual entities, TAN has been designed with the assumption that most users will find the distinction between Works and Expressions to be unhelpful or false. What one person calls a FRBR Expression another may legitimately call a Work (e.g., the King James Version is more than just a translation of the Bible). TAN assumes that any derivation of a Work (or Works) is itself a Work, which is really shorthand for work-version. Thus, in this manual the term version indicates merely a type of work that is known either to derive from another work or to be the basis for other versions of a work.
TAN avoids altogether the term Expression. Aside from the issues mentioned above, the term implies a medium (without which nothing can be expressed) and therefore materiality.

One version, one work, one object, one reference system Every TAN-T(EI) file must be restricted to a transcription of a single version of a single conceptual work found on a single scriptum, segmented and labeled according to a single reference system. This restrictive principle is critical to the success of the network. It reduces the risk of confusion, simplifies the files, and shifts markup complexity from an individual transcription file to the network in which that file participates.

One scriptum Each TAN-T(EI) file transcribes one and only one text-bearing object or scriptum. It may be a digital file, a book, a manuscript, a stone, a sign, or a bottlecap. If the object you've chosen has been made mechanically and is virtually indistinguishable from other objects created in the same process (e.g., copies of a printed book or copies of a digital file), then the entire set of copies is to be treated as a single object (an entity some librarians call a manifestation). The definition of some scripta requires an editor's discernment and judgment. For example, some manuscripts have been split up, their parts now residing in multiple libraries around the world; other manuscripts have been physically altered. In such cases, you may need to define your scriptum in a way that might not match the way others define it. But the decision is your prerogative, not theirs. You have both the right and responsibility to define your object in the way that you think will most benefit users of your files. It is a good idea to name your scriptum in <source> with an <IRI> value in the form of an http URL provided by a library catalogue. This way you provide a way for others, perhaps through an algorithm, to retrieve extensive, structured bibliographical information. You also save yourself the hassle of writing a detailed bibliographical description that your users would have to tailor to suit their distinctive purposes. If a URL cannot be found for <IRI>, you may simply coin a tag URN or a UUID. Alternatively, if you find another TAN file that uses the same source, it would be a good idea to adopt that name.

One work The transcription must be restricted to a single creative work, identified by <work>. Many scripta have more than one work. Identifying and defining the creative work you transcribe is, once again, your prerogative. Suppose the scriptum you have is a Bible. The work you choose from that object can take whatever contours you wish. Perhaps you wish to encode the entire Bible and treat it as a single work. Or maybe you wish to treat only the New Testament as the work, or the Tetraevangelion, or the Gospel of Matthew, or a specific episode in that gospel, or simply the Beatitudes. Any reasonable definition of a work is permitted, but a TAN-T(EI) file must contain nothing but the work you have defined. It should be a complete representation of what is found on the object (even if only partially preserved), and respect as far as is practical the order found in the scriptum. Well-known works may have a suitable IRI name already assigned to them, say by means of a DBpedia entry. Most works have not been assigned IRIs or are named in IRI vocabularies that are not well known. You may assign any work your own URN, through a UUID or a tag URN. Any IRIs that you mint are free to be used by other people writing TAN files about the same work. Similarly, if you find that another TAN-T file has transcribed a version of your work, you may also use that URN (you don't need to ask permission, since no URN can be copyrighted). As with other parts of the metadata, multiple <IRI>s and <name>s are names for the same work, not individual names for different works.

One version The transcription must be restricted to a single version of the creative work, identified by <version> (optional). In most cases, <version> is unnecessary, because <work> in conjunction with <source> is sufficient to identify a particular work-version. But if the source carries multiple versions (e.g., a bilingual edition of a text), then <version> must be included. If you wish to include other versions from a source, each one should have its own separate TAN-T(EI) file. Notes should be included only if they are an integral part of the primary work (i.e., by the same author). Otherwise, you should ask yourself whether the notes are of any real interest. If they are not, ignore them. If they are important, put them in their own TAN-T(EI) file, or convert them to claims in a TAN-A-div file. If you need to specify exactly where on a scriptum a version appears, <desc> or <comment> should be used. Very few work-versions have their own URN names. It is advisable to assign a tag URN or a UUID. If the IRI you have used for <work> is in a namespace that you own or control, then you are entitled to modify it, and you may wish merely to add a suffix to the work IRI to name the version.

One reference system Every TAN transcription must be segmented into a hierarchy of uniquely labeled divisions, defined in the <body> through <div>s and their @type and @n values. Those divisions, whenever possible, should align with the reference system that prevails for the work across versions or translations, what is sometimes called a canonical reference system. Because even the most familiar reference system admits of degrees and dispute, the term canonical is problematic, so reference system is preferred in these guidelines. If you have your choice, preference should be given to systems that follow the semantic contours of the work, not the physical features of a particular object. Chapter, paragraph, and sentence numbers are preferable to volume, page, and line numbers, because other derivative versions of a work (e.g., translations, paraphrases) will only roughly, if at all, follow an object-oriented reference system. Sometimes an object-based reference system is inescapable, or is the most common reference system for a work (e.g., Porphyry's commentary on the Categories). It is perfectly acceptable to adopt that scheme, but it may eventually entail more labor for the alignment process. If a given work has multiple systems (e.g., the works of Plato and Aristotle, which have two reference systems, semantic- and object-oriented, both of which are standard and important), then the recommended practice is to encode the same text twice, placing in each file a <see-also> pointing to the other and a <relationship> with the keyword alternatively divided edition as the value of @which. A pair of alternatively divided editions can usefully serve as the basis for concordances. In fact, the pair can be used as the first step in converting another version of the same work from one reference system to the other. If there is a good reference system, but the divisions are overly lengthy, you may introduce subdivisions. Such subdivided texts are compatible with references to the older system. But there is no guarantee that the provisional subdivisions you introduce will be adopted by other editors who create or edit TAN versions of the same work, and in the end editors working independently upon the same text may produce discordant schemes. The TAN-A-div format was designed to reconcile such differences. If there is no reference system, or if you think that the ones that exist are inadequate or misguided, create one of your own. If you develop your own reference system, be sure to optimize for all versions of the work, whether known or not. In the <declarations>, at least one <div-type> must be supplied, declaring the types of divisions into which the text has been segmented, to be referred to by @type in <div>s. To declare a <div-type> does not require you to use it in the transcription. It is advisable to keep the abbreviation coined in @xml:id brief but meaningful. Well-known division types already have suitable IRI names. See for a list of core TAN vocabulary for division types, both common and uncommon. If you encounter a rare division type, or one that needs specificity not provided for in a well-known URN, you should mint your own, either in the declarations or in a separate TAN-key file. Reference systems have as a central component numbering systems. TAN supports five numeration systems: Arabic numerals. 1, 2, 3, etc. Roman numerals.
Values up to 5000, utilizing i, v, x, l, c, d, and m, uppercase or lowercase, with liberal syntactic rules (within a Roman numeral, any digit preceding one of a higher value is assumed to be a subtraction from the total value; all others are positive values). Alphabetic sequences. The 26-letter Roman alphabet, with numbers higher than 26 (or any multiple of 26) beginning with the letter a incrementally repeated, e.g., y (25), z (26), aa (27), bb (28), … aaa (53). Uppercase or lowercase allowed. Arabic numerals + alphabetic sequences. Arabic numerals followed immediately by an alphabetic sequence. The second item is to be calculated as a subsequence of the first item, with the lack of a second item taking highest priority. E.g., 4, 4a, 4b, 4c.... Alphabetic sequences + Arabic numerals. As above, but with the alphabetic sequence preceding the Arabic numerals. TAN file processors will attempt to convert all values of @n to Arabic numerals. Some values are ambiguously Roman numerals or alphabetic sequences, e.g., c (= 3 or 100), so this conversion takes place within the context of a single document, without reference to any associated files. You may not mix Roman numerals and alphabetic sequences in the same div type. You should also avoid any string labels that would be misinterpreted as a Roman numeral. For example, if you are labeling a book whose title is "Civilizations," you should not use n="Civ", since all values of @n are treated as lowercase and civ would be read as the Roman numeral for 104. There are also tools for other numeration systems, but they have not been implemented in the validation process. See tan:arabic-numerals(), tan:grc-to-int(), and tan:syc-to-int().
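The conversions described above can be sketched in Python. This is not the TAN function library, and the function names are invented for the illustration. One assumption deserves emphasis: the liberal Roman numeral rule is read here as "a digit immediately followed by a higher-valued digit is subtracted", which may differ from the official implementation on unusual inputs. The limit of 5000 is not enforced.

```python
import re

ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(s):
    """Liberal Roman numeral conversion; returns None if s has other letters."""
    s = s.lower()
    if not s or any(ch not in ROMAN for ch in s):
        return None
    values = [ROMAN[ch] for ch in s]
    total = 0
    for i, v in enumerate(values):
        # Assumed reading of the rule: subtract a digit that sits
        # immediately before a digit of higher value.
        if i + 1 < len(values) and values[i + 1] > v:
            total -= v
        else:
            total += v
    return total


def alpha_to_int(s):
    """a=1 ... z=26, aa=27, bb=28 ... aaa=53; None if letters are mixed."""
    s = s.lower()
    if not re.fullmatch(r"([a-z])\1*", s):
        return None
    return (len(s) - 1) * 26 + (ord(s[0]) - ord("a") + 1)


def arabic_alpha_key(s):
    """Sort key for values such as 4, 4a, 4b: a missing letter sorts first."""
    m = re.fullmatch(r"(\d+)([a-z]*)", s.lower())
    if m is None:
        return None
    number, letters = int(m.group(1)), m.group(2)
    if not letters:
        return (number, 0)
    sub = alpha_to_int(letters)
    return None if sub is None else (number, sub)


print(roman_to_int("c"), alpha_to_int("c"))      # 100 3
print(roman_to_int("Civ"))                       # 104
print(sorted(["4b", "4", "4c", "4a"], key=arabic_alpha_key))
```

The first call shows why a value such as c is ambiguous, and the second why a label like "Civ" is risky.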

Normalizing transcriptions You should declare how you have normalized the transcription via <filter> and its children, <normalization>, <transliteration>, and <replace>. (For suggestions on values of <IRI> for <normalization> see .) Generally speaking, normalization entails the suppression of things extraneous to or separable from the work you have chosen. You are encouraged to omit parenthetical editorial insertions, stray handwritten remarks, discretionary word-breaking hyphens, editorial comments, inserted cross-references, and reference numerals (page numbers, section numbers, etc.). The goal is a transcription whose text is free of the interpretive voice of later editors. In addition, you should resolve ligatures and correct unintended typographical errors. (Such orthographic corrections are useful to those users who want to generate lexico-morphological data automatically or semiautomatically.) In a digital source, variable lengths of spacing marks (e.g., General Punctuation U+2000..U+200B) should be converted to ordinary spaces, and superscript combining Roman letters (U+0363..U+036F) should probably be converted to their non-combining counterparts. All Unicode must be normalized to NFC forms (see ). Keep in mind that your transcriptions will be used by other people doing, e.g., word-for-word translation alignments, quotation checking, and syntactical analysis, and they will want transcriptions that are as clean as possible. You should remove from the text anything that is not part of the work proper and would interfere with detailed word-for-word alignment, or would require extra preprocessing or postprocessing work for later users. If you are segmenting a source at line breaks, and you are required to break a word between divisions, you should use either the soft hyphen (&#xAD;) or the zero-width joiner (&#x200D;) at the end of the first <div>.
TAN processors that handle a <div> will automatically normalize the space in the element, then place a space between that <div> and the next unless one of those two characters is present, in which case the character will be deleted and the two <div>s will be joined with no intervening space. For more on issues regarding whitespace, see . If you are working with a text with notes, distinguish those written by the same person who wrote the work you're transcribing from those that are not. Treat the former as part of the work proper and give each note a <div> with a suitable @type and place it after the <div> it annotates. It will be assumed by processors of the data that, absent more specific information, any <div> of an annotating @type is an annotation of the last <div> that is not an annotation. (Alternatively, you may use the <note> feature of TAN-TEI, but bear in mind that this element will be treated by users as part of the leaf div to which it belongs, not separate from it.) If the notes are not part of the work per se (for example, translator's notes in a translation of a primary source), you should treat them as a separate work altogether, and put them in a separate TAN-T(EI) file, perhaps linking the two through <see-also>. You may wish to structure that file so that it mirrors the reference system of the primary source, in which case further alignment between the two is not needed. Or you may wish to use a reference system that reflects how you would cite the note, e.g., page and note number. In this latter case, you would then create a companion TAN-A-div file that establishes links between the primary source and its annotations. Remember that the note signals in the main text and in the footnote area are metadata meant to help readers link corresponding passages of texts, and should be deleted. If the connective function served by the note signal is important, use a TAN-A-div file to link the notes to the main text.
This principle holds true for transcribing texts that have variants to the work integrated into the document. For example, a manuscript may have correctors' marks. Or a set of footnotes (or apparatus criticus) might comment on how and why the main text differs from previous readings. In those cases, each set of corrections might be wholly incorporated into the <claim>s of a TAN-A-div file, perhaps also with a separate TAN-T file. Overall, normalization is a difficult topic, and it is not well studied. Not all decisions will be clear-cut. You may justly hesitate before normalizing orthography, punctuation, accentuation, or capitalization. Some aspects of Unicode that lend themselves to varying conventions may need special consideration. You may need to consider whether an unusual or rarely used Unicode character might be misinterpreted, or a hindrance to other users (especially for parsing word tokens). Describe any decisions that might not be agreeable to everyone who uses the file in the <filter>. In some ambiguous areas, you can use TAN-TEI to your advantage. Suppose, for example, a manuscript has reference numerals that are sui generis. That is, these reference numbers do not correspond to the "canonical" reference scheme. On the one hand, they are metadata, and should arguably be deleted; on the other, they are part of the text, and witness to how a text was read and changed over time. A middle-ground approach would move these references to TAN-TEI's <milestone rend="">. In that way, the numerals are removed from the main text; on the other hand, the information is retained. Generally speaking TEI's @rend is an excellent way to remove something from the main text, without removing it from the file altogether.
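The rule given earlier in this section for joining consecutive <div>s (space-normalize each one, then join with a single space unless the first ends in a soft hyphen or zero-width joiner) can be sketched in Python as follows. The function name `join_divs` is invented for this illustration, and the whitespace normalization shown is Python's, which is assumed here to be close enough to what a TAN processor does.

```python
SOFT_HYPHEN = "\u00AD"
ZWJ = "\u200D"


def join_divs(div_texts):
    """Concatenate the text of consecutive leaf divs into one string."""
    parts = []
    for raw in div_texts:
        text = " ".join(raw.split())          # normalize internal and edge space
        if text.endswith((SOFT_HYPHEN, ZWJ)):
            parts.append(text[:-1])           # drop the marker, add no space
        else:
            parts.append(text + " ")          # ordinary boundary: one space
    return "".join(parts).rstrip()


# A word broken across two line-based divs, marked with a soft hyphen.
print(join_divs(["  Sing, god\u00AD ", "dess, the   anger"]))
# Sing, goddess, the anger
```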

Transcriptions The sole purpose of the <body> of a class 1 file is to contain a segmented transcription of a single version of a single work from a scriptum. <body> may take @in-progress and must take an @xml:lang that names the language of the majority of the text. If a change in language occurs in a descendant <div>, ensure that its @xml:lang value (explicitly or by inheritance) indicates the language that is used. <body> takes one or more <div> elements, each of which governs either other <div> elements, or text (or TEI elements). The term leaf div refers to those <div>s that contain text and therefore no other <div>s. Within this treelike structure of <div>s, the concatenation of @n values, starting from the most ancestral <div>, provides the flat ref, the reference system used by class 2 files to refer to parts of TAN-T(EI) files.

Flattened References, and the Leaf Div Uniqueness Rule One of the most important validation rules is the Leaf Div Uniqueness Rule, which states that the flat ref for each leaf <div> must be unique. This rule applies only to leaf <div>s and not to <div>s in general, since on occasion a major textual unit will be broken by another. For example, chapters 24 and 30 in the book of Proverbs of the Septuagint are split and interleaved (24.1–22e [22a–e are verses not extant in the Hebrew]; 30.1–14; 24.23–34; and 30.15–33).
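The flat ref and the Leaf Div Uniqueness Rule can be illustrated with a minimal Python sketch. It is not the normative TAN validation code: the sample <body> omits the TAN namespace for brevity, the period used as a separator is an arbitrary choice, and the function names leaf_flat_refs and ldur_violations are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample: chapter 24 is interrupted by chapter 30, as in Proverbs (LXX).
SAMPLE = """
<body xml:lang="eng">
  <div type="chapter" n="24">
    <div type="verse" n="1">first</div>
    <div type="verse" n="2">second</div>
  </div>
  <div type="chapter" n="30">
    <div type="verse" n="1">third</div>
  </div>
  <div type="chapter" n="24">
    <div type="verse" n="23">fourth</div>
  </div>
</body>
"""

def leaf_flat_refs(body, sep="."):
    """Return the flat ref of every leaf div, in document order."""
    refs = []

    def walk(div, ancestors):
        path = ancestors + [div.get("n")]
        children = div.findall("div")
        if children:
            for child in children:
                walk(child, path)
        else:
            # A leaf div: concatenate @n values from the most ancestral div down.
            refs.append(sep.join(path))

    for top in body.findall("div"):
        walk(top, [])
    return refs

def ldur_violations(refs):
    """Return the flat refs that occur more than once among leaf divs."""
    return sorted(ref for ref, count in Counter(refs).items() if count > 1)

refs = leaf_flat_refs(ET.fromstring(SAMPLE))
print(refs)
print(ldur_violations(refs))
```

The chapter value 24 appears on two non-leaf <div>s, which is permitted; only the flat refs of the leaf <div>s are compared.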

Transcriptions Using the Text Encoding Initiative (<TEI>) This section is to be read in conjunction with and , which address some technical issues that relate to TAN-compliant TEI to XML and validation generally. Some creators and editors of transcriptions will find the rather stripped-down TAN-T format inadequate. Some may wish to mark up the text further, or already have a library of transcriptions whose annotations are desirable to keep, even if some users may not be interested in them. To serve these needs, you should use TAN-TEI, an extension to the Text Encoding Initiative (TEI) format, which is well known for its expressiveness, its stability, its flexibility, and its widespread use in scholarship. TEI was designed to be maximally expressive and flexible, to serve the detailed needs of humanities scholars. In serving this mission, TEI has come to define more than five hundred different element names, and more than two hundred attributes (roughly six times more than are defined in TAN). Of course, any given TEI file uses only a small subset of those elements and attributes, and TEI itself comes in different flavors, from TEI Lite, which uses only 75 attributes and 140 elements, to TEI All, which opens up almost the entire library. Although the TEI format is often seen as a standard, it lacks some of the characteristics expected in a standard. It is greatly flexible, admits flavors and interpretation, and has been designed to encourage customization. Individuals and projects may define their own subset of TEI elements, to constrict or expand the allowable rules as they see fit. TAN-TEI is one of those customizations. The major difference is that TAN-TEI attempts to impose extra strictures not defined in TEI, to ensure that transcriptions are maximally likely to be interchangeable with other TAN files. 
TAN's customization of the TEI can be summarized as follows (the default namespace in this section is the TEI namespace, http://www.tei-c.org/ns/1.0):

Synopsis of TAN-TEI customization

TEI element: summary of alteration

<TEI>: must have @id with IRI name; should take new namespace declaration, xmlns:tan="tag:textalign.net,2015:ns"; takes a new child element, <head>, placed between <teiHeader> and <text>

<text>: Only the child <body> will be regarded by other TAN users. <front> and <back> will be ignored.

<body>: must take @xml:lang; may take @in-progress; must take exclusively one or more <div>s; any elements or text between <div>s will be ignored; overall contents must be restricted to a single work; any and all text nodes will be treated as part of the transcription

<div>: must take either only <div>s or no <div>s at all; must take @type and @n (@include is not allowed in TAN-TEI, but is allowed in TAN-T)
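Several of the structural constraints in the synopsis can be checked mechanically. The following Python sketch is illustrative only; the RELAX-NG schemas and Schematron rules remain authoritative. The function name check_tan_tei, the sample document, and its @id value are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
TAN = "tag:textalign.net,2015:ns"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tei(name):
    """Qualified ElementTree name for an element in the TEI namespace."""
    return "{%s}%s" % (TEI, name)

# Hypothetical minimal TAN-TEI skeleton; the @id value is a placeholder.
SAMPLE = """
<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xmlns:tan="tag:textalign.net,2015:ns"
     id="tag:example.com,2017:demo">
  <teiHeader/>
  <tan:head/>
  <text>
    <body xml:lang="eng">
      <div type="chapter" n="1">
        <div type="verse" n="1">First verse.</div>
        <div type="verse" n="2">Second verse.</div>
      </div>
    </body>
  </text>
</TEI>
"""

def check_tan_tei(root):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    if not root.get("id"):
        problems.append("<TEI> lacks @id")
    # The TAN <head> must sit between <teiHeader> and <text>.
    expected = [tei("teiHeader"), "{%s}head" % TAN, tei("text")]
    if [child.tag for child in root] != expected:
        problems.append("expected <teiHeader>, TAN <head>, <text> in that order")
    body = root.find(tei("text") + "/" + tei("body"))
    if body is None:
        return problems + ["no <body> found"]
    if not body.get(XML_LANG):
        problems.append("<body> lacks @xml:lang")
    if len(body) == 0 or any(child.tag != tei("div") for child in body):
        problems.append("<body> must take exclusively one or more <div>s")
    for div in body.iter(tei("div")):
        if not (div.get("type") and div.get("n")):
            problems.append("a <div> lacks @type or @n")
        # A div takes either only divs or no divs at all.
        kinds = {child.tag == tei("div") for child in div}
        if kinds == {True, False}:
            problems.append("a <div> mixes <div>s with other elements")
    return problems

print(check_tan_tei(ET.fromstring(SAMPLE)))
```

The sketch inspects element structure only; it does not test whether the contents are restricted to a single work, which requires editorial judgment.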

Like all other TAN files, the root elements of TAN-TEI files must take an @id, the IRI name. See above, . TAN-TEI files have two heads, which may strike you as odd. The TEI head and the TAN head were designed for different purposes. Whereas the TAN <head> is meant to be brief and keyed to both IRIs and human-readable data, the <teiHeader> has been designed principally for human readability, and permits quite an expansive range of metadata, including matters that bear on the transcription only indirectly (e.g., manuscript descriptions). Processors of TAN-TEI files will in general ignore the contents of <teiHeader>, since the contents are unpredictable. If your <teiHeader> has any kind of metadata relevant to TAN users, you will need to adapt it for the standard TAN <head> (see and ). You may find that some of the material you put in <teiHeader> is not suitable for <head> and vice versa. This conversion needs to be performed manually, since the two headers are incommensurate, and writing each one requires a different kind of outlook. In a TAN-TEI file, the TAN <head> must declare the TAN namespace to be its default, i.e., <head xmlns="tag:textalign.net,2015:ns">, or else be written <tan:head> if the prefix tan: has been defined in the root element. Within any leaf <div>, you may use whatever TEI markup you wish, to whatever level of depth or complexity. All users of your TAN-TEI file will be interested in the text; only a subset will care about any markup within leaf <div>s. For this reason, even if you change the value of @xml:lang within a leaf <div>, there is no guarantee that readers or processors of your data will take it into account. TAN-TEI should not be used to try to represent the physical appearance of the text on the object. Write a separate TEI (non-TAN) file first, and then use TAN-TEI to create a more normalized version. You may need to prepare a TEI file to be TAN compliant. 
As a matter of practicality, it is helpful to envision the conversion process as falling into three steps:

1. Structure: insert new processing instructions (TAN-TEI validation files); adjust the root element by supplying the IRI name to @id and the TAN namespace to @xmlns:tan.
2. Metadata: create a new <head> and populate it.
3. Data: edit <body> to restrict the content to a single work; restructure <body> content into nesting <div>s with correct @type and @n values.

It has been the experience of those who have made TEI to TAN-TEI conversions that step 2 is the most time-consuming. The TAN <head> requires one to curate the metadata more carefully than does <teiHeader>. But step 3 should not be overlooked, either. Many people write TEI files with a focus on the original textual object, and they make editorial decisions that look toward the scriptum and not the intertextual ecosystem that TAN supports. It is advisable to trim from the body of your TEI file any elements that would interfere with direct comparison with other versions of the text in the TAN format.

Class-2 TAN Files, Annotations of Texts This chapter provides general background to the elements and attributes that are common to class 2 TAN files. For detailed discussion see . At present, class 2 files are restricted to alignment or lexico-morphology. Alignment files come in two different formats, identified by the root element. TAN-A-div provides macroscopic alignment; TAN-A-tok, microscopic. TAN-A-div aligns one or more class 1 files. It is intended for broad, general alignments of any number of versions of any number of works. The scope of TAN-A-tok is more restricted, to two class 1 files, allowing one to declare alignments with detailed specificity, certainty, and type between words (tokens). TAN-A-div focuses on works, regardless of version; TAN-A-tok focuses on individual versions. Lexico-morphology files (also called part-of-speech files), TAN-LM, are used to encode the lexical headwords and morphological forms of individual words in class 1 files.

Common Elements The class 2 formats have been designed to be human readable, particularly references to class 1 files. In ordinary conversation, when referring to specific parts of a work, we like to cite pages, paragraphs, sentences, lines, words, letters, and so forth. We use relational words (e.g., "first"), and the very text itself. We might say, for example, "See page 4, second paragraph, the last four words." Or, "See page 4, second paragraph, first sentence, second occurrence of 'pull'." The TAN pointer syntax differs from other pointer systems (e.g., URLs, XPath, and XPointer) in that it depends upon a hierarchy of four features: works, divisions, word tokens, and characters. Works, defined above (see ), are defined by the source (which may not have more than one work). Divisions are defined by the <div> structure of each source. Tokens are words of those divisions, defined according to one or more tokenization rules. And characters are defined as non-modifying codepoints in a word token. (A modifying character is treated as of a piece with the non-modifying base character it modifies.) Parts of this fourfold hierarchy (works, divisions, tokens, and characters) are named with vocabulary that the editor of a class 2 file finds most useful. Sources are given a nickname (e.g., xml:id = "hamlet-1741"); divisions are named using the values for @n; tokens are referred to by position, by their actual values, or both (e.g., pos = "1 - 5", pos = "last-1 - last", val = "hath"; see ). Characters are always identified by number (e.g., chars = "2, 7"). This approach not only makes the syntax human readable, it also mitigates any disruptions that corrections or alterations might incur. For example, if an incorrectly duplicated <div> is deleted, disruption to the reference system is isolated and does not affect the rest of the document.

Class 2 Validation Some class 2 files may be time-consuming to validate fully. The length of the <body> could be enormous. Or the number and length of sources may be taxing. Or validation may depend upon time-consuming transformations of the source documents. This problem most often affects TAN-A-div files, so to facilitate editing within an XML editor, where regular validation is essential, Schematron validation falls into one of two phases:

basic: All regular Schematron tests are suspended, and reports are devoted exclusively to assisting in looking for and checking the validity of references in <div-ref> and <tok>.

verbose: Complete testing of class 2 files, including checks on source files to determine whether they adhere to the Leaf Div Uniqueness Rule (LDUR; see ). In addition, information is given on where there are discrepancies in the numeration system across versions of the same work.

If you do not specify in the prolog which phase you intend to be the default, you will be prompted for the phase you wish to use whenever you validate the file.

Class 2 Metadata (<head>) Class 2 files share a few common features in their metadata, mostly to facilitate the human-friendly reference system outlined above. All class 2 files have as their sources nothing other than class 1 files. Therefore each <source> must take the . Because the rights have already been declared in the source files, <rights-source-only> is disallowed. Editors of class 2 files must be able to name or number word-tokens in a transcription, via an optional <token-definition>. See . There may be some cases where a source has a div type that is unnecessary, is confusing, or should be ignored. One or more optional <suppress-div-types>s may be used to specify division types that you wish to suppress in references. Optional <rename-div-ns>s provide a convenient way to provisionally rename @n values. This is useful for cases where you wish to use division labels that are more familiar to users of the class 2 files, or are easier to edit and read. It can also be used to harmonize discordant @n values, which is especially helpful for divs that are named, not numbered, such as the books of the Bible.

Class 2 Data Patterns (<body>) The three types of class 2 files treat different kinds of phenomena, so their data structures look quite different. Nevertheless, a few elements and attributes are shared by at least two class 2 formats. Many class 2 elements take @src and @ref. @src points via ID reference to one or more <source>s and @ref points to one or more <div>s through their flat ref (perhaps substituted with their new values if <rename-div-ns> have been invoked; see ). In the example ref = "1.2-4, 1.5", the periods are arbitrary (but the hyphen and comma, which have special meanings here, are not). You may use any punctuation you wish, or even space, but it is recommended you use what will be most familiar to users. You may use non-Arabic numerals, regardless of the numbering system used by your sources. @chars and @pos follow a useful compact syntax, described below ().
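The distinction between arbitrary and meaningful punctuation in @ref can be sketched in Python. This is a rough illustration, not the TAN function library: the name parse_ref is hypothetical, only Arabic numerals and simple labels are handled, and the sketch assumes that a shortened range end such as the 4 in 1.2-4 inherits its leading steps from the start of the range.

```python
import re

def parse_ref(value):
    """Split a @ref value into (start, end) pairs of @n-value tuples.

    Commas separate items and hyphens mark ranges; any other punctuation
    or space merely separates the steps of a flat ref.
    """
    items = []
    for part in value.split(","):
        ends = [tuple(re.findall(r"\w+", piece)) for piece in part.split("-")]
        start, end = ends[0], ends[-1]
        # Assumption: a short range end borrows the missing leading steps
        # from the start of the range, so "1.2-4" ends at ("1", "4").
        if len(end) < len(start):
            end = start[: len(start) - len(end)] + end
        items.append((start, end))
    return items

print(parse_ref("1.2-4, 1.5"))
print(parse_ref("1 2 - 4, 1:5"))
```

Both calls yield the same result, which is the point of the passage above: the period, colon, and space are interchangeable, while the hyphen and comma are not.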

@pos and @val To point to a token, one of three methods may be used.

@pos alone. Under this method, one or more digits, or the phrase last or last- plus a digit, joined by hyphens or commas indicate one or more token numbers. For example, 2, 4-6, last-2 - last refers to the second, fourth, fifth, sixth, antepenult, penult, and final tokens in a sequence of word tokens. The numerical value to which the keyword last resolves depends upon the context of each source and ref.

@val alone. Under this method, a single token is picked by means of a string value equivalent to the token. For example, @val = "bird" points to the first occurrence of the token bird.

@pos and @val together. Under this method, specific occurrences of a token are picked. For example, @val="bird" @pos="2, 4" picks the second and fourth occurrences of the token bird.

Any time @pos appears in an element and @val doesn't, @val is assumed to allow matches to any word. Vice versa, if @val appears but @pos doesn't, the latter is assumed to equal 1. @pos and @val must be used carefully. For example, the attribute combination val="bird" pos="last-5" will produce an error if the word token bird does not occur at least six times.
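The three methods can be illustrated with a short Python sketch. It is a simplified reading of the rules above, not the TAN function library: the names resolve_pos and pick_tokens are hypothetical, tokens are assumed to be already available as a list of strings, and @val is compared by simple string equality.

```python
import re

TERM = r"(?:last(?:-\d+)?|\d+)"
ITEM = re.compile(rf"\s*({TERM})(?:\s*-\s*({TERM}))?\s*")

def _term(term, count):
    """Resolve one term (a digit, 'last', or 'last-N') to a 1-based position."""
    if term.startswith("last"):
        offset = int(term[5:]) if len(term) > 4 else 0
        value = count - offset
    else:
        value = int(term)
    if not 1 <= value <= count:
        raise ValueError(f"position {term!r} is out of range for {count} tokens")
    return value

def resolve_pos(pos, count):
    """Expand a @pos value into 1-based positions within `count` items."""
    positions = []
    for item in pos.split(","):
        match = ITEM.fullmatch(item)
        if match is None:
            raise ValueError(f"cannot parse {item!r}")
        start = _term(match.group(1), count)
        end = _term(match.group(2), count) if match.group(2) else start
        positions.extend(range(start, end + 1))
    return positions

def pick_tokens(tokens, pos=None, val=None):
    """Return the 1-based token numbers selected by @pos and/or @val."""
    # Without @val, every token is a candidate; with it, only exact matches.
    candidates = [i for i, tok in enumerate(tokens, 1) if val is None or tok == val]
    # Without @pos, the first candidate is assumed.
    chosen = resolve_pos(pos if pos is not None else "1", len(candidates))
    return [candidates[p - 1] for p in chosen]

tokens = "the bird saw a bird near a bird".split()
print(resolve_pos("2, 4-6, last-2 - last", 10))
print(pick_tokens(tokens, val="bird"))
print(pick_tokens(tokens, pos="2, 3", val="bird"))
```

Because last is resolved against the list of candidates, pos="last-5" with val="bird" raises an error here: the sample sentence contains only three occurrences of bird.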

Alignments: Principles and Assumptions TAN alignments attest to acts of translating, paraphrasing, revising, quoting, summarizing, and so forth. All these are treated as types of text reuse, where one or more texts, usually called in translation studies the source (or sources), are transformed into a new text, customarily called the target. Text reuse has chronological directionality and is asymmetrical (a quoted text affects a quoting text but not vice versa). But many times we deal with texts where the original lines of direction are contested or unknown. In those cases, it is hasty or misleading to refer to either of the texts as a source or a target. Indeed, the two texts may in fact derive from a common source, or be only indirectly related, the result of multiple generations of copying and translating. In these guidelines, therefore, we avoid the term target altogether, and when we use the word source, we are referring only to one of the class 1 files upon which a class 2 alignment depends. Thus, the order of <source>s in an alignment file's <head> does not imply chronological precedence. The only implication is that of processing order: the first will be the foundation or base against which subsequent sources will be aligned. It is usually a good idea to list as the first <source> the version that is most complete or most important to a given alignment.

Division-Based Alignments (<TAN-A-div>) TAN-A-div is the format for macroscopic, division-based alignment, and is dedicated to aligning any number of versions of any number of works on the basis of <div>s, or even smaller, ad hoc segments in the sources invoked. A TAN-A-div file provides two major services.

Reconciling structural differences between versions of the same text. Some independently created transcriptions of the same work will, no matter the good intentions of the transcribers, fail to correspond exactly to related versions. Perhaps works or div types were not defined with the same IRIs, or perhaps one version follows a reference system at odds with the majority of other versions. Perhaps a version is interpolated or lacunose. TAN-A-div is used to reconcile such inconsistencies, to make special alignments that a computer might not be able to make accurately, and to refine the alignment of parallel sources, even down to the word level.

Making general claims about a work, or a particular version of a work. Scholars working with texts regularly wish to make claims about those texts, e.g., work A passage b quotes from work X passage c; work A passage b deals with topic M; work A passage b word 7 has a variant reading b' in version A1.

For the first purpose, the motivations of an aligner are opaque. A TAN-A-div file says, in essence, "Please align the following sources," but it does not say why the alignment is requested, and it does not indicate what relationship holds between the various sources. In fact, a TAN-A-div file could be used to align texts that have no apparent relationship (to what end would be unclear). For the second purpose, the aligner makes claims about the texts, and motivations and assumptions are made as clear as possible. Processors of a TAN-A-div file will assume greedy alignment. Alignments will be inferred wherever possible, when not explicitly overridden. 
Alignments are also transitive. If passage A is declared to align with B, then, barring any exceptions, anything that aligns with A will be assumed to align with anything that aligns with B (see ).

Root Element and Header The root element of a TAN division-based alignment file is <TAN-A-div>. TAN-A-div's <head> has some special rules. One or more <source>s must be declared (). That an alignment file would have only a single source may seem strange, but such a scenario could be useful for self-alignment (i.e., to indicate places where a source reuses itself), or to make claims about that text. <declarations> takes zero or more of the declarations common to class 2 files: <token-definition>, <suppress-div-types>, <rename-div-ns>. See . TAN-A-div also allows declarations unique to .

Data (<body>) A TAN-A-div may have an empty <body> because the format by default demands greedy alignment. That is, it effectively states, "Take the list of sources in the header. First group (align) them by work, then by <div>s according to flat refs." A processor will create groups of works according to the <IRI> values under <work> in each source. To those matches will be added any sources you declare to be equivalent to the same work. Then within each group of versions of the same work, the processor will align (group) <div>s based on their flat ref (based on @n), after normalization and after taking into account exceptions declared in the TAN-A-div file. If sources representing different versions of the same work already have <div>s whose flat refs match well, then nothing needs to be declared in a TAN-A-div <body>. A TAN-conformant processor will perform the alignment. Within the <body> of a TAN-A-div file, the first optional procedure, reconciliation, is an up-to-four-step process. Each step is optional and sequence-specific. That is, each statement assumes actions specified by previous siblings have already been implemented. After reconciliation happens, the second optional procedure, claims, is handled.
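The default greedy alignment of an empty <body> can be pictured with a small Python sketch. It is a deliberate simplification: each source is reduced to a single work IRI and a mapping from flat ref to text, no normalization or exceptions are applied, and the identifiers and IRIs below are invented placeholders.

```python
from collections import defaultdict

def greedy_align(sources):
    """Group divs first by work, then by flat ref.

    sources: {source_id: {"work": work_iri, "divs": {flat_ref: text}}}
    Returns {work_iri: {flat_ref: {source_id: text}}}.
    """
    aligned = defaultdict(lambda: defaultdict(dict))
    for src_id, src in sources.items():
        for ref, text in src["divs"].items():
            aligned[src["work"]][ref][src_id] = text
    return aligned

# Hypothetical sources; the IRIs and ids are placeholders.
sources = {
    "grc": {"work": "tag:example.com,2017:work-a", "divs": {"1.1": "alpha", "1.2": "beta"}},
    "eng": {"work": "tag:example.com,2017:work-a", "divs": {"1.1": "one", "1.3": "three"}},
    "lat": {"work": "tag:example.com,2017:work-b", "divs": {"1.1": "unus"}},
}

aligned = greedy_align(sources)
print(dict(aligned["tag:example.com,2017:work-a"]["1.1"]))
```

Here grc and eng are aligned at 1.1 because they share a work and a flat ref, while lat is kept apart even though it also has a 1.1, since it belongs to a different work.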

Process 1, Step 1: Correlate Works In the first step you may declare an ad hoc equivalence between sources that do not already share an <IRI> value for <work>. Each equivalence is made through an <equate-works>, which groups together under @src the ids of sources that should be treated as containing the same work. Transitive alignment holds: <equate-works work="a b"/> means that any sources that share the same works as a and b will also be treated as equivalent. This declaration does not imply that the works are, in reality, one and the same. It merely states that, for the purposes of this alignment, they should be treated as equivalent.
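The transitive effect of equating works can be sketched with a union-find structure in Python. This is only one way a processor might implement the grouping, offered as an illustration; the function name group_sources and the sample ids and IRIs are hypothetical.

```python
def group_sources(work_iris, equations):
    """Group source ids that should be treated as versions of the same work.

    work_iris: {source_id: set of work IRIs declared by that source}
    equations: list of lists of source ids, one list per <equate-works>
    """
    parent = {src: src for src in work_iris}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Sources that share any work IRI already count as the same work.
    first_seen = {}
    for src, iris in work_iris.items():
        for iri in iris:
            if iri in first_seen:
                union(src, first_seen[iri])
            else:
                first_seen[iri] = src

    # Each ad hoc equation merges the groups of all the sources it names.
    for ids in equations:
        for other in ids[1:]:
            union(ids[0], other)

    groups = {}
    for src in work_iris:
        groups.setdefault(find(src), set()).add(src)
    return list(groups.values())

work_iris = {"a": {"w1"}, "b": {"w2"}, "c": {"w1"}, "d": {"w2"}, "e": {"w3"}}
groups = group_sources(work_iris, [["a", "b"]])
print(sorted(sorted(group) for group in groups))
```

Equating a and b pulls in c and d as well, because c shares a work with a and d shares a work with b; e is unaffected.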

Process 1, Step 2: Correlate Division Types The second step does for div types what the first step did for works, with <equate-div-types>. Across all sources, every <div-type> that shares an <IRI> value will be treated as equivalent. But you may augment that automated alignment through an <equate-div-types>, which takes one or more <div-type-ref>s, each of which takes a mandatory @src and @div-type-ref, to point to one or more sources and division types. You must use the @xml:id assigned by the source to that div type. As with <equate-works>, <equate-div-types> assumes a greedy, transitive alignment. The ad hoc declaration does not imply that the two types of division are in reality one and the same; it just correlates them for the sake of the alignment. This step is not likely to be used in most TAN-A-div files, because it has no impact on the steps that follow, or even on alignment proper, since it does not affect the reconciliation of flat refs. It is useful mainly in those cases where you expect users of your file to be interested in comparing division types (e.g., calculating ratios of paragraphs to chapters per version per work).

Process 1, Step 3: Refine Segmentation Suppose you have two transcriptions where a phrase ending one leaf <div> in source A actually corresponds to the beginning phrase of the next leaf <div> in source B. Or suppose that you wish to break down a leaf <div> into smaller constituent parts, to facilitate more exact alignment against another version that is divided more granularly. Before these refined alignments can occur, you must first segment specific leaf <div>s through <split-leaf-div-at>, which contains one or more <tok>s pointing to individual words (see ) that should begin a new segment in each reference in each source. @ref must refer only to leaf <div>s. Any leaf <div> may be split as many times as one wishes, but never at the first token.
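The effect of splitting a leaf <div> can be shown with a minimal Python sketch. It assumes that the <tok>s have already been resolved to 1-based token numbers and that the leaf <div> is available as a list of tokens; the function name split_leaf_div is hypothetical.

```python
def split_leaf_div(tokens, starts):
    """Split a leaf div's tokens into segments.

    starts: 1-based numbers of the tokens that should begin a new segment.
    """
    points = sorted(set(starts))
    # A split may never fall at the first token, nor outside the div.
    if any(p <= 1 or p > len(tokens) for p in points):
        raise ValueError("split points must fall after the first token and within the div")
    bounds = [1] + points + [len(tokens) + 1]
    return [tokens[a - 1 : b - 1] for a, b in zip(bounds, bounds[1:])]

print(split_leaf_div("a b c d e".split(), [3, 5]))
```

Two split points yield three segments; a split at token 1 is rejected because it would create an empty first segment.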

Process 1, Step 4: Realign Versions of the Same Work After step 3, some of the divisions and segments of a work may not be properly aligned. Segments newly created by <split-leaf-div-at>s may need to be realigned. Or perhaps one of the sources uses a reference system that is out of step with the others. <realign> is used to reconcile differences. It is not used for aligning across works. There are two types of realignment: anchored and unanchored, discussed in detail at <realign>.

Process 2: Make Claims At this point, each work should have its versions properly aligned. You are now in a position to indicate other places where one work quotes from another, or make other comments on specific textual passages. In this process, <claim> may be used to indicate such things as: textual passages where one work quotes or alludes to another work or itself (index of quotations and allusions); textual passages that deal with a certain topic (general index); places where notes in one source correspond to main text in another (tethering separated notes to main text); alternative readings of a textual passage (apparatus criticus). These alignments occur through <claim>s whose <subject> or <object> points to passages of text. Any textual <subject> or <object> may take @work or @src. The former takes a single reference to a <source>, but adopts the reference as a proxy to make a claim applicable to all versions of the same work. @src restricts the claim to specific versions, not to the work as a whole. <claim> is most commonly used to create an interoperable index, indicating where one work quotes from another. Such claims should not be taken to apply to the whole (see ). A claim that passage b quotes passage y means only that some part of b quotes from some part of y, not that the whole of b quotes from the whole of y. Specificity must be made on the level of <tok>, a child of a textual <subject> or <object>. Furthermore, if that <tok> is governed by @work and not @src, then two statements are implied, first that the claim pertains to such-and-such a particular range of tokens in a particular source, and second that the claim pertains to other versions of the same work, but at unspecified ranges of words. 
For example:

   <claim verb="quotes">
      <subject work="nt-grc">
         <tok ref="Mk 10:6" pos="last-4 - last"/>
      </subject>
      <object work="lxx">
         <tok ref="Gen 1:27" pos="last-4 - last"/>
      </object>
   </claim>

might correlate the following leaf divs (the matches are the last five tokens of each):

   <div n="27" type="v">καὶ ἐποίησεν ὁ θεὸς τὸν ἄνθρωπον κατ' εἰκόνα θεοῦ ἐποίησεν αὐτόν ἄρσεν καὶ θῆλυ ἐποίησεν αὐτούς</div>
   . . . . .
   <div type="v" n="6">ἀπὸ δὲ ἀρχῆς κτίσεως ἄρσεν καὶ θῆλυ ἐποίησεν αὐτούς·</div>

Even though the claim is about the work in general, the statement provides specificity to only two sources. The claim will be regarded as holding over other versions of the same works, but only on the leaf div level. On the token level, it is up to a processor to determine if and where the relative position of the quote might be found.

Token-Based Alignments (<TAN-A-tok>) TAN-A-tok files provide a microscopic view of how two sources relate to each other. The format is intended to allow you to specify exactly where, how, and why two transcriptions align, and to do so on the most granular level possible. TAN-A-tok files also allow you to express levels of confidence or alternative opinions. Creators and editors of TAN-A-tok files should be able to read the languages of their sources and to explain as precisely as possible the relationship between the two sources. You should be prepared to think about and specify types of textual reuse. TAN-A-tok files tend to be more demanding to create and edit than TAN-A-div files are because they reflect work that is more detailed, and therefore more time-consuming, than simple en masse alignment of sources. Because of the detailed nature of the inquiry, token alignment is restricted to two texts, referred to jointly as a bitext. Each half of the bitext must be a TAN-T(EI) file. It is assumed that those two sources share some special relationship, direct or indirect, and relate through one or more types of textual reuse: translation, paraphrase, commentary, and so forth. Some of these bitexts, such as literal translations, may line up quite nicely word for word. Others, such as paraphrases, may line up sporadically, vaguely, ambiguously, or, in places, not at all. So alignment of a bitext is often not easy, and requires you to think hard about assumptions you have made in two key areas: the relationship that holds from one source's scriptum to the other and the types of reuse that were involved in turning one version into the other (or a common ancestor into both). Relationship of sources' scripta. What is the physical relationship or history that connects the two sources' scripta? Is one a direct descendant (copy) of the other? If not, where is their common ancestor? 
Here you consider the material aspect of the bitext: you are trying to answer how object A's text relates to object B's, because that goes a long way to explaining the relationship that holds between the immaterial texts. Types of reuse. What categories of text reuse do you hold to? Such a declaration tells users of your data what paradigm you bring to your analysis. You may wish to keep your categories nondescript and somewhat vague, using loosely defined concepts such as translation, paraphrase, quotation, and so forth without offering a specific definition. On the other hand, you may have a specific and detailed view of text reuse. Perhaps you have adopted field-specific categories such as obligatory explicitation, optional explicitation, pragmatic explicitation, or translation-inherent explicitation. You may also wish to declare secondary types of reuse that may have intervened, such as scribal omission or dittography. You must declare at least one type of reuse. Or you may use those that are built into the TAN format. See .

Root Element and Header The root element of a token-based alignment file is <TAN-A-tok>. The TAN-A-tok header builds upon the core and class 2 headers (see and ). TAN-A-tok files take exactly two <source>s. The sequence is arbitrary. Each <source> must take an @xml:id. <declarations> takes, in addition to all the elements allowed in class 2 files (see ), two elements unique to TAN-A-tok: <bitext-relation> and <reuse-type>. The former describes the genealogical relationship between each source's scriptum. The second attends to the qualitative aspect of the bitext relationship.

Data (<body>) The <body> of a TAN-A-tok file takes, in addition to the customary optional attributes (see @in-progress and ), required @bitext-relation and @reuse-type, which take one or more id references from <bitext-relation> and <reuse-type>, indicating the default values that govern the alignment. <body> has only one type of child: one or more <align>s, each of which collects sets of <tok>s from one or both sources, known collectively as a token cluster. Each token cluster in a given TAN-A-tok file is valid independent of any other token cluster. Clusters may overlap, to handle translations in which words fall in one-to-one, one-to-many, many-to-one, and many-to-many relationships. The independence of token clusters allows you to register differences of opinion about the same set of tokens. An <align> may take an @xml:id, to facilitate external discussions about an assertion. Nothing should be inferred from silence in a TAN-A-tok file. Unmentioned tokens in either source do not represent gaps in a translation. All that can be inferred is that the creators and editors of the TAN-A-tok file have said nothing about the tokens. If you wish to declare that one or more words in one source were left out of a translation or inserted into one (that is, words in one source have no match in the other), you must do so through a half-null alignment, i.e., a token cluster that has tokens from only one source. A half-null alignment corresponds, to draw from the terminology of translation studies, to implicitation or explicitation of entire words or phrases. A fully aligned bitext may result in a TAN-A-tok file with a very long <body> (in contrast to the typical TAN-A-div file). That does not mean, however, that everything in a source must be encoded or described. In writing and editing a TAN-A-tok file you do not commit yourself to saying everything possible about the bitext. You might choose to encode only a few token clusters. 
If there are multiple IDs in @reuse-type or @bitext-relation, the intersection, not the union, of those values is to be understood. For example, reuse-type="trans para" would indicate that the token cluster results from both translation and paraphrase. If you wish to claim that the token cluster might be a translation or it might be a paraphrase, then you should create two separate alignments, and add @cert.
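The kinds of token cluster described in this section can be told apart with a few lines of Python. The sketch assumes a cluster has already been reduced to two lists of token pointers, one per source; the function name describe_cluster and the pointer strings are hypothetical.

```python
def describe_cluster(tokens_a, tokens_b):
    """Classify a token cluster by how many tokens it takes from each source."""
    if not tokens_a and not tokens_b:
        raise ValueError("a token cluster needs at least one token")
    # Tokens from only one source: an explicit statement that nothing matches.
    if not tokens_a or not tokens_b:
        return "half-null"

    def size(tokens):
        return "one" if len(tokens) == 1 else "many"

    return f"{size(tokens_a)}-to-{size(tokens_b)}"

print(describe_cluster(["1.1 #3"], ["1.1 #2", "1.1 #3"]))
print(describe_cluster(["1.1 #7"], []))
```

Tokens that appear in no cluster at all are simply not returned by any call; as stated above, nothing should be inferred from that silence.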

Lexico-Morphology TAN-LM files are used to associate words or word fragments with lexemes and morphological categories. They are intended primarily to facilitate research that depends upon alignments, but they can be valuable on their own, whether or not there are other versions or alignments. These files rely upon the grammatical rules defined for a given language in a TAN-mor file. Therefore this section should be read in close conjunction with its companion: ).

Principles and Assumptions TAN-LM files are assumed to be applicable to texts in languages whose vocabulary lends itself to grammatical and lexicographical analysis. The two areas are interrelated but independent. If you wish, your TAN-LM file may contain only lexemes or only morphological analyses. As an editor of a TAN-LM file you should understand the vocabulary and grammar of the languages you have picked. You should have a good sense of the rules established by the lexical and grammatical authorities you have chosen to follow. You should be familiar with the conventions and assumptions of the TAN-mor files you have adopted. Although you must assume the point of view of a particular grammar and lexicon, you need not define those authorities, nor hold to a single one. In addition, you may bring to lexical analysis your own expertise and supply lexical headwords unattested in printed authorities. Although TAN-LM files are simple, they can be laborious to write and edit, more than other types of TAN files. They can also be hard to read if the underlying TAN-mor files use cryptic codes. It is customary for an editor of a TAN-LM file to use tools to help create and edit the data.

Root Element and Header The root element of a lexico-morphological file is <TAN-LM>. TAN-LM files are either source-specific or language-specific. In the case of the former, <source> points to the one and only TAN-T(EI) file that is the object of analysis. In the case of the latter, <for-lang> is used to indicate the languages that are covered. If the language-specific option is exercised, the file must point to TAN-LM-lang schema files. See . <declarations> takes the elements common to class 2 files (see ). It takes two other elements unique to TAN-LM: <lexicon> (optional) and <morphology> (mandatory). Any number of lexica and morphologies may be declared; the order is inconsequential. There is, at present, no TAN format for lexica and dictionaries, although this may change in the future. So even if a digital form of a dictionary is identified through the IRI + name pattern, no validation tests will be performed. You may find a non-TAN lexical model to be a suitable supplement to any TAN collections you develop. The TEI supports dictionary encoding, and the Lexical Markup Framework, an ISO standard (ISO-24613:2008), has defined a data model for lexicons and dictionaries. The former is geared toward philology and the latter toward linguistics. You may also devise your own format if neither of these supports aspects of lexicology that you find important. Because you or other TAN-LM editors are likely to be authorities in your own right, <agent> can be treated as if it were a <lexicon>, and be referred to by @lexicon in the <body>.

Data (<code><link linkend="element-body"><body></link></code>) The <body> of a TAN-LM file takes, in addition to the customary optional attributes found in other TAN files (see @in-progress and ), @lexicon and @morphology, to specify the default lexicon and grammar for the file. @lexicon may point either to a <lexicon> id or to an <agent> id (when someone editing the TAN file is an authority). <body> has only one type of child: one or more <ana>s (short for analysis), each of which matches one or more tokens (<tok>) to one or more lexemes or morphological assertions (<lm>, which takes <l>s and <m>s). If due to tokenization a linguistic token must occupy more than one <tok>, you may use @cont to group <tok>s together. Elements within an <ana> are distributed, to allow economically sized files. That is, every combination of <l> and <m> (governed by <lm>) is asserted to be true for every <tok>. Many TAN-LM files will be populated by a stylesheet or other algorithm that automatically calculates the possible morphological values of each token, for example, "down" being marked as an adjective, an adverb, a noun, and a verb. In this case, you do not wish to claim that a word really is every combination generated. But you do wish to leave open the possibility for cases where such ambiguity must be expressed (e.g., "down" in "Get down off a duck." being equally a noun and adverb). It is advised that automatically calculated results always include @cert with weighted values that sum to 1 for each token.
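The distribution rule can be made concrete with a small Python sketch. It is not taken from the TAN function library. It expands one <ana> into every (token, lexeme, morphological code) assertion that the element implies. The ref and pos attributes on <tok>, the codes adv and noun, and the helper names are assumptions made for illustration.

```python
from itertools import product
import xml.etree.ElementTree as ET

# Hypothetical fragment: two tokens share one <lm> with one <l> and two <m>s.
ANA = """
<ana>
  <tok ref="1" pos="2"/>
  <tok ref="3" pos="5"/>
  <lm>
    <l>down</l>
    <m>adv</m>
    <m>noun</m>
  </lm>
</ana>
"""


def expand_ana(ana):
    """Return every (tok, lexeme, code) assertion implied by one <ana>."""
    toks = [(t.get("ref"), t.get("pos")) for t in ana.findall("tok")]
    assertions = []
    for lm in ana.findall("lm"):
        # A file may hold only lexemes or only codes; keep the tuple shape.
        lexemes = [lex.text for lex in lm.findall("l")] or [None]
        codes = [m.text for m in lm.findall("m")] or [None]
        assertions.extend(product(toks, lexemes, codes))
    return assertions


def uniform_certs(n):
    """Equal weights for n machine-generated candidates; they sum to 1."""
    return [1 / n] * n


assertions = expand_ana(ET.fromstring(ANA.strip()))
for assertion in assertions:
    print(assertion)
```

Equal weights are only the simplest choice. An algorithm with better evidence could weight the candidates unevenly, provided the values for each token still sum to 1.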

Class-3 TAN Files, Varia This chapter provides general background to the elements and attributes that are unique to all class 3 TAN files. For detailed discussion see . Class 3 TAN formats are those that do not fit either class 1 or 2. This class, at present, consists of keywords, of RDF-like claims, and of rules pertaining to morphology.

Keyword Vocabulary (<code>TAN-key</code>) All too often, a project has a set of vocabulary it draws from time and again. To repeat the same IRI + name patterns can be not only tedious but treacherous, especially when a project decides to change or augment its vocabulary, and does so inconsistently or incompletely. The TAN-key format is intended to allow a project to define the IRI + name patterns for things that it regularly names, to be applied to any element that takes @which. For example, it is a suitable way to gather the IRI + name patterns for the people who worked on a project, or to define special kinds of div types. TAN-key files are a core part of the TAN schema, defining commonly used concepts in <token-definition>, <div-type>s, and so forth. For a complete list of predefined TAN keywords, see . For more details on how this format relates to other TAN formats, see .

Root Element and Head A TAN-key file has <TAN-key> as the root element. The <declarations> of a TAN-key file will be empty, or have <group-type>s.

Data (<code><link linkend="element-body"><body></link></code>) The <body> of a TAN-key file consists simply of <item>s, perhaps gathered into groups via <group> or @group. These groups have, at present, no effect upon other TAN files that import them. They have been useful, however, in more advanced uses of the format, particularly in the case of the standard TAN-key file for <div-type> (), where common types of divisions have been given a rudimentary typology suitable for transformations into other formats. Most frequently, a TAN-key file will contain items that have the IRI + name pattern. The only exception is when it contains <token-definition>s.

Morphological Concepts and Patterns (TAN-mor) TAN-mor files are used to describe the grammatical morphological features of a given language, to assign codes to those features, and to define rules governing the application of those codes. The format allows specificity, flexibility, and responsiveness. Assertions in the format may be doubted, rules may be expressed as contingent upon other conditions, and warnings and error messages may be sent to users who have used a pattern incorrectly, or not in accordance with best practices. The TAN-mor format is like Schematron for the grammar of human languages. You specify the categories and codes for a given language, then you may create tests to define invalid uses of those codes. Those tests are attached to reports and assertions allowing editors of TAN-LM files to see not only if the rules have been violated, but why, and exactly where. This chapter should be read in close conjunction with .

Principles and Assumptions Certain assumptions and recommendations are made regarding morphology files, complementing the more general ones; see . TAN-mor files are restricted exclusively to describing the categories and rules for the grammar of a natural language. Editors of these files should be well versed in the grammar of the languages they are describing. The TAN-mor format has been designed with the assumption that patterns of word inflection and formation can be categorized, classified, named, and described. It has also been assumed that scholars may reasonably differ, perhaps radically, on those descriptions. TAN-mor is meant to allow those differences to be declared. It is up to other users to decide whether or not to adopt them. The TAN-mor format has also been designed to cater to two different approaches to morphological codes: structured or unstructured. Structured codes begin with a set of major categories used to group morphological features. Structured codes tend to have a set number of code elements, and usually require gaps in the code. For example, the Perseus approach to the morphological categories of Greek, Latin, and other highly inflected languages dictates ten categories, with the first two being the major and minor parts of speech, and the subsequent categories devoted to person, number, tense, and so forth. Each word that is analyzed must have a value for every category, even if null. Unstructured codes do not attempt to categorize grammatical features, but simply give each one a unique code, to be applied in any permitted sequence and combination. This approach is viable for any language (including highly inflected ones such as Greek or Latin), but it is most often found in tagging sets for languages that have little inflection, e.g., the Brown and Penn sets for English.

Root Element and Header The root element of a morphological rule file is <TAN-mor>. Zero or more <source> elements describe the grammars or related works that account for the rules declared in the TAN file. If the rules are not based upon any published work, then <source> may be omitted. Any TAN-mor file without a source will be assumed to be based upon the personal knowledge of the <agent>s who edited the file. <declarations> is empty.

Data (<code><link linkend="element-body"><body></link></code>) The <body> of a TAN-mor file takes the customary optional attributes found in other TAN files (see @in-progress and ). The children of <body> begin with one or more <for-lang>s, followed by any number of <assert>s, <report>s, <feature>s (for unstructured codes), or <category>s (if relying upon structured codes). <category>, used for structured codes, sorts <feature>s into groups. @code values must be unique within a <category>, but may duplicate the @code values of <feature>s from other <category>s. The first <feature> in a <category> describes the category itself, and is not a <feature> like the others. The values and combinations of <feature>s (or rather of the @codes of <feature>s) can be constrained through <assert>s and <report>s, which are used to declare rules that must be followed, or must never be followed, by any dependent TAN-LM file. An <assert> or <report> may be restricted to specific features through @context. If @context is present, then <assert> and <report> declarations will be checked in a TAN-LM file only against values of <m> that invoke the feature; otherwise, all <m>s will be tested. Four kinds of tests are allowed: @matches-m, which indicates a regular expression pattern to be checked against the code in an <m>; @matches-tok, which indicates a regular expression pattern to be checked against the tokens picked by the values of <tok> in a dependent TAN-LM file; @feature-test, which indicates features to be checked in the content of <m>s; and @feature-qty-test, which indicates the number of features to be checked in the content of <m>s. An <assert> indicates that for any <m> in any dependent TAN-LM file, if the test proves false, and if the <m> has a feature declared in @context, then the <m> should be marked as erroneous (or merely a warning should be returned, if @cert is present) and the message included by the <assert> should be returned.
<report> has the same effect, but the role of the test is the opposite: the error and message will be returned only if the test proves true.
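The opposite roles of <assert> and <report> can be summarized in a minimal Python sketch. It is an illustration, not the TAN validation code. It models only the @matches-m test and assumes that the content of an <m> is a space-separated list of feature codes. Python's re module stands in for the XPath regular expressions that TAN itself would use. The rule dictionaries, codes, and messages are hypothetical.

```python
import re


def check_m(rule, m_code):
    """Apply one <assert>- or <report>-like rule to the code of one <m>.

    Returns None when nothing is flagged, else (severity, message).
    """
    features = set(m_code.split())  # assumption: space-separated codes
    context = rule.get("context")
    if context is not None and not (features & set(context)):
        return None  # rule is restricted to features this <m> does not invoke
    test_true = re.search(rule["matches_m"], m_code) is not None
    # <assert> fires when the test proves false, <report> when it proves true.
    fires = (not test_true) if rule["kind"] == "assert" else test_true
    if not fires:
        return None
    # The presence of cert downgrades the error to a warning.
    severity = "warning" if "cert" in rule else "error"
    return severity, rule["message"]


# Hypothetical rules for an invented, unstructured tag set.
no_case_on_verbs = {
    "kind": "report",
    "context": ["verb"],
    "matches_m": r"\bgen\b",
    "message": "verbs are not marked for case",
}
nouns_need_number = {
    "kind": "assert",
    "context": ["noun"],
    "matches_m": r"\b(sg|pl)\b",
    "cert": 0.8,
    "message": "nouns should state number",
}

print(check_m(no_case_on_verbs, "verb gen"))
print(check_m(nouns_need_number, "noun gen"))
```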

Claims and assertions (<code>TAN-c</code>) Many projects using the TAN format will need to include in their workflow general declarations that do not fit one of the TAN formats. In many cases, there are adequate formats that are available. At other times, you may want to encode your information in a format much like your other TAN files. For those cases, an experimental format, TAN-c, is provided. The model is inspired by the Resource Description Framework (RDF; see ). RDF depends upon a simple data model, where each datum consists of three items termed a subject, a predicate, and an object. The first and third are thought of as nodes, and the second as a connector between the nodes. A connector, our preferred term, is frequently elsewhere called an edge, but that metaphor is confusing and misleading. A cylinder, for example, has two edges, but they don't connect anything we might think of as nodes. Furthermore, "edge" implies that what's really of interest is the surface of a three-dimensional object and the void beyond. TAN was designed to serve scholars, who normally find simple declarative sentences (the strength of RDF) highly restrictive, absent any context or qualifiers. Claims always have a claimant. They are made at certain times, and are subject to doubt and nuance. Sometimes our claims are bare negation, e.g., "Aristotle was not the author of De mundo," an assertion not possible to express in RDF. TAN-c is conceived as a slightly more complex version of RDF. Every claim must be assigned to a claimant. The RDF terminology subject + predicate + object is adjusted by TAN-c to subject + verb + object. Furthermore, claims may be tempered by certainty, and verbs may be modified by modals. The entire claim may be restricted to a particular time or place. If the object is data, the data type can be restricted by type and lexical form. Despite being somewhat more complex than RDF, TAN-c syntax is more human readable.

Root Element and Header The root element of a TAN-c file is <TAN-c>. The <declarations> takes <modal>, <person>, <place>, <unit>, <verb>, and <version>, all of which are described more thoroughly at . Collectively, they provide the vocabulary that can be used in the <body> of the file.

Data (<code><link linkend="element-body"><body></link></code>) The <body> takes a required @claimant and @subject, which define the default values for the rest of the data. The rest of <body> consists of a series of <claim>s. <claim>s are allowed to nest. That is, it is possible to claim that X claims that Y claims that Z claims something, and so on, by nesting <claim>s within each other.

Working with the Text Alignment Network Best Practices in Working with TAN Files In this chapter we discuss ways to manage, create, edit, and share TAN files. The material discussed here is non-normative. That is, these are suggestions based upon the experience, particularly the mistakes, of TAN users. The material is written for intermediate or advanced users of XML technology.

Local Setup TAN files may be set up in any kind of structure one wishes, but because those files are meant to be shared, it is beneficial to use similar conventions, to minimize the possibility of breaking relative URLs in shared TAN files. Below is one way to organize the subdirectories of a typical local TAN project:

library
    [collection 1]: TAN-T(EI) files here
        TAN-A-div: TAN-A-div files here
        TAN-A-tok: TAN-A-tok files here
        [etc.]
    [collection 2]
        [etc.]
output: saved results from transformations, tests
pre-TAN: third-party files to be used to populate TAN files
TAN-1-dev: the core TAN files, downloaded from the website or the Git repository
stylesheets: stylesheets you have created
tools: third-party tools

Under this model, any time you decide to develop a collection of TAN files, you create a subdirectory within the library. It is a good idea to try to keep these collections to a manageable size, although it cannot be predicted what the limits might be. If you use Git, each of these collections could be its own Git repository. This is also where you would put other people's TAN collections. Collections inevitably need to "talk" to each other, so it is a good idea to name collection subdirectories as predictably and briefly as possible, preferably a single word in lowercase. For example, scriptural collections could be named simply bible or quran, although you may find a need to add a suffix if you are working with overlapping TAN collections.

When you name class 1 files (the filename, not the IRI name; see ), it is a good idea to start with an acronym for the work, followed by the language code, the editor's last name, and perhaps the date when the underlying scriptum was created or published. Class 2 files are tougher. Because they bring two or more files or concepts together, filenames could become very long or unpredictably structured. At this time, the best recommendation is to make sure that each class 2 file is put into a subdirectory, separate from class 1 files, and given a brief but meaningful name that points to the research question that motivated its creation. Class 3 files are a bit easier. It is recommended that TAN-mor filenames begin with the language code, then an acronym for the person or group responsible for creating the features. TAN-key and TAN-c files are generally written to serve a specific collection, so the collection name and the TAN type should suffice.

If you have a local copy of someone else's TAN collection, and you wish to create TAN files that depend on it, you will in all likelihood depend upon relative URLs to those files. It is recommended that you also include absolute URLs through secondary <location>s. The validation routine checks only the first document available. From time to time, you might comment out the first <location> and run the validation process again. If you share your dependent TAN file with someone else who does not have a local copy of the collection, the second <location>, with the absolute URL, will furnish a copy of the document.
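The idea of the first document available can be sketched in a few lines of Python. This is an illustration of the fallback logic only, not the TAN validation routine. The URLs and the availability check are hypothetical, and the check is passed in as a function so that the sketch needs no network or disk access.

```python
def first_document_available(locations, is_available):
    """Return the first location that can be retrieved, or None.

    locations: URLs in document order (relative first, absolute second).
    is_available: callable that reports whether a URL can be retrieved.
    """
    for href in locations:
        if is_available(href):
            return href
    return None


# Hypothetical <location> values: a relative URL, then an absolute one.
locations = ["../bible/example.xml", "http://example.org/bible/example.xml"]

# A user who has a local copy of the collection resolves the relative URL.
local_copies = {"../bible/example.xml"}
with_local = first_document_available(locations, lambda h: h in local_copies)

# A user without the local copy falls through to the absolute URL.
without_local = first_document_available(
    locations, lambda h: h.startswith("http://")
)

print(with_local)
print(without_local)
```

Because the search stops at the first hit, a broken absolute URL in second place goes unnoticed as long as the local copy exists, which is why it is worth commenting out the first <location> from time to time.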

Creating and maintaining TAN collections As noted in the previous section, it is ideal to group your TAN files through subdirectories in a master library. Those collections should contain files that cohere in some way, but this could be for any number of reasons. TAN is designed to encourage cross-linguistic and intertextual research, so what might hold various TAN files together is unpredictable. In a given project, you are likely to repeat basic information, particularly <agent>, <role>, and <work>. If you find yourself repeating such items, such as elements with the IRI + name pattern, consider moving those to a TAN-key file. It is almost always preferable to develop TAN-keys before resorting to <inclusion>s. Sorting out lines of inclusion can be confusing.

Creating and editing TAN files Converting to TAN from an irregular format can be a chore. Suppose you have a Word file, a web page, or plain text that you intend to serve as the basis for a TAN file. A common first impulse is to copy the desired content, paste it into the body of your TAN file, and then begin to manually correct and change things. Although this is the most common approach, it means that if there are changes made to your source, you may have an enormous task ahead of you to figure out exactly what was changed where. Further, some transformations involve complex processes, and you may find, in the course of correcting the intermediary, that you made a major mistake that cannot, at that point, be undone. Perhaps you have accidentally deleted all punctuation when you didn't mean to. Or you eliminated line breaks that were useful signals about where <div>s should be separated. Even if all goes well, after all that hard work you might find out that the pre-TAN data source has been updated, with errors corrected. If any significant time has elapsed since the last transformation, you may have forgotten what procedure you followed to convert the data. And if you remember, you have to repeat the steps again, and plan for the next time when the pre-TAN source is updated. For all these reasons, it is recommended that data be converted to a TAN file by means of an XSLT stylesheet that analyzes and transforms the digital source into data that is TAN compliant. As you find mistakes such as those described above, no harm is done. You can adjust your algorithm and re-run the process as many times as you need, each time getting better and better results. This approach requires extra initial work. That is, you will need to get to know XSLT (or an alternative) well. Establishing a good transformation process can be time consuming. But the investment pays off in the long run. All or part of what you write for one set of files may work for the next.
Whether or not you use stylesheets to create or populate your TAN files, it is almost always best to begin the process with a sample TAN file that resembles, even if skeletally, your desired output, then populate it with the proper content. If you feed the TAN template along with the pre-TAN data into a stylesheet, the stylesheet becomes an <agent> in its own right. You are encouraged to give your XSLT file a unique identifier, and to stamp the resultant TAN file with an <agent>, a <role>, and a <change> that documents the changes that were made. The XSLT approach to creating and populating TAN files, described above, has been used successfully to handle not only historical documents but living ones as well, e.g., a working, evolving scholarly translation of ancient texts. In those situations, where updates are made very frequently, the traditional cut-paste-and-edit method is not only unproductive; it is foolish. Writing transformations may seem laborious at first, because of how difficult it is to think through how best to handle and manipulate a TAN file. But there is a good chance that the labor you have in mind has already been done for you in the built-in TAN functions (see ).
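The guidelines recommend XSLT for this work. The same principle, that a conversion should be a script you can adjust and re-run rather than a series of manual edits, can be sketched in Python as one alternative. The sketch turns blank-line-separated paragraphs of plain text into numbered <div> elements. The type value p, the use of @n for the paragraph number, and the function name are placeholders chosen for illustration, not requirements of the TAN-T schema.

```python
import re
import xml.etree.ElementTree as ET


def text_to_body(raw_text):
    """Turn blank-line-separated paragraphs into numbered <div> elements."""
    body = ET.Element("body")
    # Split on blank lines; the pre-TAN source itself is never modified.
    paragraphs = [p for p in re.split(r"\n\s*\n", raw_text) if p.strip()]
    for n, paragraph in enumerate(paragraphs, start=1):
        div = ET.SubElement(body, "div", {"type": "p", "n": str(n)})
        # Collapse line breaks inside a paragraph; punctuation is untouched.
        div.text = " ".join(paragraph.split())
    return body


# Hypothetical pre-TAN source: two paragraphs, the first wrapped over lines.
SOURCE = "First paragraph,\nwrapped over two lines.\n\nSecond paragraph."

body = text_to_body(SOURCE)
print(ET.tostring(body, encoding="unicode"))
```

If the source is later corrected, or if you discover that a rule (say, the way paragraphs are split) was wrong, you change the function and run it again. Nothing done by hand is lost.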

Sharing TAN files TAN files have been designed to be shared. Although individual TAN files are likely to be valuable on their own, even when removed from their context (e.g., via an email attachment), they may be critically crippled without their dependencies. As a result, TAN files are most likely to be distributed or published in groups, as collections. One way to distribute a collection is by making it available as a repository via Git or some other version control software (VCS). This approach has many advantages. The files become available to whoever wants them, and the editorial history is preserved, so that a change one person makes to TAN files used by another need not necessarily be written in stone. VCS features and tools are extremely fast and useful. Collections may also be distributed through shared syncing services (e.g., Drive, Box, or Dropbox), or put on a server. In the latter case, it may be difficult for users to browse a collection, so you may wish to expose the collection as a compressed ZIP archive. This saves on your own bandwidth, and it still exposes the files for XML processing. But a ZIP archive is not suitable for linking from one TAN file to another, nor is it appropriate as a <master-location>. Unpacking a compressed file requires writing to the disk, which is a security risk, and so is disallowed during validation. Such zipped archives are excellent ways to distribute collections, but they should not substitute for a primary repository.

Doing Things with TAN Files (Stylesheets and the Function Library) The TAN format is not an end in itself. Indeed, there is no point to any file format, unless you can do things with it. TAN was designed primarily so that users could do unusual and interesting things. /do things, a major subdirectory in the project files, is populated with folders named after actions you might want to perform on a TAN file, and they contain XSLT stylesheets that fall into that area of activity. Those stylesheets are the front end of a long process that begins with TAN validation. Whenever you validate a TAN file, the Schematron validation file (the companion to the RELAX-NG validation file) is invoked. But that Schematron file is very small, and does very little work during validation, other than to look for errors, information, and help in a second version of the file being validated. That second version of the file is created through a very large library of XSLT stylesheets that resolve, normalize, and expand the document, and mark its errors. That extensive library of XSLT we call here the function library (we use both words, to distinguish the collection from individual, generic functions). The function library provides definitive interpretations of the TAN format, marking parts that are in error. The function library is also an important step to creating your own tools or stylesheets, anticipating, as it does, many things you might want to do with a TAN file. Certain considerations that have been put into the design of the function library are worth noting. First, the function library has a structure similar to that of the RELAX-NG schemas. That is, the primary access point is through one of the eight XSLT files named after the primary TAN formats. Access deeper into the function library structure is possible, but you might be missing out on some important features useful to the particular TAN format you are working with.
Before executing any validation, an engine computes all global variables, even those that might, in the end, not be required. Therefore the function library defines only those global variables that are central to the validation process. Functions, templates, and keys, on the other hand, are used by a validation engine only when needed, so some of them provide functionality that looks beyond the validation process. The most complex and important global variables are the two principal transformations to the TAN file itself, $self-resolved and $self-prepped. $self-resolved is the result of changing the TAN file through some key steps, including (1) stamping the original URI of the file as @base-uri in the root element (this attribute is one of a number of new attributes and elements that are introduced in the validation process, and are not defined by the TAN schema), (2) converting all numeration systems to Arabic numerals, (3) replacing all elements that have @include with resolved forms of the element, (4) replacing elements with @which with their resolved IRI + name form, (5) stamping elements with @q and a number representing the nth place of that element relative to its original siblings (included elements are given the @q of their host element). $self-prepped is the result of combing through the file and looking for errors that have been defined in the master list of errors. The process differs from one TAN file type to the next. The next most important global variables have to do with the other TAN files the self refers to:

Global variables for referred files

              Raw (first document available)   Resolved               Prepped
<inclusion>   $inclusions-1st-da               $inclusions-resolved   (none)
<key>         $keys-1st-da                     $keys-resolved         $keys-prepped
<source>      $sources-1st-da                  $sources-resolved      $sources-prepped
<see-also>    $see-alsos-1st-da                $see-alsos-resolved    (none)

The first column of variables (raw) lists those that hold the first documents available, without alteration. Variables in the second column (resolved) hold the resolved form of the -1st-da variables, following the same process described above for $self-resolved. Once $self-resolved has been determined, neither <inclusion> nor <key> is needed for further validation, therefore they do not have prepped versions. Any bearing <see-also> has on validation of the original TAN file can be determined from the resolved form. But it frequently happens, mainly with class 2 files, that the sources need to go through some preparation before determining whether or not the original is valid, so a similar process of preparation is applied. These global variables have been described above very generally. To know more precisely how their values are calculated, please consult the function library. The other components of the function library (the functions, keys, and templates) cannot be described conveniently or succinctly here. But they are critical parts of building successful stylesheets that transform TAN files. The next chapter provides a comprehensive view of how they work.
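One of the steps that produce $self-resolved, the conversion of numeration systems to Arabic numerals, can be illustrated for a single system. The Python sketch below handles Roman numerals only and does not check that the numeral is well formed. It is not the TAN implementation, which is written in XSLT and covers more than this one system; the function name is hypothetical.

```python
# Values of the individual Roman numeral letters (lowercase).
ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(numeral):
    """Convert a Roman numeral (upper or lower case) to an integer."""
    values = [ROMAN_VALUES[ch] for ch in numeral.lower()]
    total = 0
    for i, value in enumerate(values):
        # A smaller value placed before a larger one is subtracted (iv = 4).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


for sample in ["iii", "xiv", "MCMXCIV"]:
    print(sample, roman_to_int(sample))
```

Normalizing numerals in this way lets a reference such as xiv in one file be compared directly with 14 in another.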