The Text Alignment Network: Official Guidelines

2015-present Joel Kalvesmaki (kalvesmaki@gmail.com)

This document and the files it describes are licensed under a Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0/

Latest version: http://textalign.net/release/TAN-2018/guidelines/

Version 2018 (2018-01-09)

Formats: HTML • PDF • Docbook (master)

In case of contradictions, apparent or not, between these guidelines and the core TAN files, priority should be given to the RELAX-NG schemas (compact syntax), then to the functions, and then to these guidelines.

General Overview

Introduction

Definition and purpose

The Text Alignment Network (TAN) is a suite of highly regulated XML formats intended to allow scholars to align and share texts and textual analysis at a maximal level of syntactic and semantic interoperability. TAN is particularly suited to textual works with multiple versions (translations, paraphrases), and to expressing quotations, word-for-word alignments, and grammatical features. TAN files are simple, modular, and networked, allowing users, working independently and collaboratively, to edit, study, and annotate shared files. The extensive validation rules depend upon a library of functions that definitively interpret the format, thereby helping anyone studying or editing the files, and providing a foundation for customized tools and applications. Although expressive of scholarly nuance and complexity, the TAN format has been designed to benefit everyone, scholars and non-scholars alike, and can be used broadly for multilingual publishing, language learning, and machine translation.

Rationale and Purpose

Scholars working with texts frequently need to study numerous versions. Some texts have been lost in their original form and can be studied only through later translations, paraphrases, or fragmentary quotations. Even when an original survives, its later versions are often worth study, revealing as they do something of how words, concepts, and works were preserved, altered, or combined across the generations and cultures who read and circulated the versions. Such textual comparison requires words, sentences, paragraphs, and other text segments to be aligned. Such alignment can be challenging. Some versions might be defective, or follow an idiosyncratic sequence. One editor may have divided the text according to a system not easily applied to other versions. Identifying which words or phrases in a translation correspond to which words or phrases in the original might result in complex, overlapping spans. And even larger segments such as sentences and paragraphs may not line up well. Further, every version of a text is part of a much larger, complex history of text reuse, and a complete study of that context requires engagement with other works and other languages, and thus collaboration across projects and fields of study. The Text Alignment Network (TAN) XML format facilitates the exchange and scholarly analysis of multiple versions of texts. TAN files adopt a syntax suitable for humans to read and edit, expressive enough to allow scholars to register doubt and nuance, and sufficiently structured to permit complex computer-based queries across independent datasets. The format is actually a suite of formats, built modularly, with each format designed to allow an editor to focus exclusively on a single set of tasks.
The format encourages or requires editors to declare their views or assumptions about language and texts in a structured manner, so that other users of the data (both human and computer) can determine whether the data is suitable for their needs. Because nearly all TAN data must be expressed in a way that computers can parse, the information can be used in semantic web applications. TAN has been designed to support two kinds of scholarly activity: creation and research. When we create our primary sources or analyses of them, we normally want what we create to be useful to our colleagues. TAN was designed to augment the utility of such creative scholarly activities as: Creating and sharing a transcription of a particular version of a textual work such that it is most likely to align with any other TAN version of that text created by someone else; Creating an index of quotations that is semantically rich and can be applied to any other version of the quoting or quoted works; Specifying exactly (e.g., word-for-word) where a source and its translation correspond, even when there may be messy overlapping or ambiguous relationships, or where doubt or alternative possibilities of alignment need to be expressed; Listing the lexicomorphological features of each word in a text or a language such that the linguistic data has meaning above and beyond a particular coding scheme, and can be collated with lexicomorphological data for other languages. TAN files that are published and shared produce a decentralized but interoperable corpus of texts. As this TAN-compliant corpus expands across linguistic, chronological, and spatial boundaries, third-party tools and applications can expand the repertoire of research questions beyond any single corpus, to help scholars fruitfully investigate broader, comparative questions such as: For classical Greek texts, how were words with the root -ιστημι ("stand") translated into ancient Latin?
In what specific ways did the vocabulary of technical terms shift from pre-Christian translations into later, Christian ones? How does the reformed Chinese translation technique of Sanskrit Buddhist texts, attested by Dao An (312-385 CE), compare to reforms in the seventh and eighth centuries of Syriac translations of Greek texts? How do Arabic translations of Greek texts from the Abbasid period differ from contemporaneous translations from Sanskrit into Arabic? Can an anonymous English translation of a modern French novel be identified with known translators of French novels from the same period? How do present-day translations of official United Nations documents differ across languages? This is not to say that the TAN format, in itself, answers such questions. It merely lays a framework within which such questions can be investigated. Some other caveats: Although TAN comes with an extensive library of functions and templates, it is not a tool per se. It does not provide software or applications to create, edit, or display TAN-compliant files, nor does it dictate how such tools should behave. Rather, it allows you or a developer (especially an XML developer) to create customized applications and tools. The TAN formats are specialized. They supplement, and do not replace, other common text formats such as TEI, Docbook, and so forth, or other alignment formats such as XLIFF or TMX. Converting from TAN into these formats is usually straightforward, but will normally entail loss. On the other hand, converting from one of these formats into TAN normally cannot be completely automated, because the TAN format has scholarly expectations that are not required in the other formats. Conversion must be given careful thought. TAN has a restricted field of inquiry (defined and explained in these guidelines). The format is not suitable for many lines of inquiry, e.g., representing how a text was displayed in a particular edition.
TAN has been designed to serve those who prioritize legibility and readability over computational efficiency. The extensive TAN validation routines—essential to aiding interoperability—can be taxing to run on numerous or enormous files.

About this version

These guidelines were written for version 2018 of the TAN format. This version is semi-stable. Changes will be made quite reluctantly to the RELAX-NG schema files. Other files may be changed, but most work will be going into the next version of TAN.

Participation

At present, changes are made regularly to the schemas and functions. If you have a TAN library, sharing it with other participants, particularly via Git, will help test any changes that have been made, and allow others to offer updates or corrections to your library. Participants in testing, using, and developing the Text Alignment Network are welcome. Our core purpose is to develop and maintain the schemas, the guidelines, and the functions and templates. Inquiries about participation should be sent to the project manager, Joel Kalvesmaki, by email: kalvesmaki at gmail.com. Official announcements are made by email (Google Group) and by Twitter.

Starting off with the TAN Format

If you are new to markup languages, or if you are unfamiliar with acronyms such as XML, RDF, XPath, or technical terms such as Unicode, you should start with this chapter, which uses a simple example to illustrate the steps typically taken to create and edit TAN files. By the end, you will have a sense of how to create and edit a simple collection of TAN transcriptions and alignments. If you are familiar with basic markup concepts, you may wish to read through the chapter very quickly, or skip it altogether. The discussion touches on a number of general concepts that will be introduced only briefly. If you find a concept new or confusing, follow the prompts for further reading to get better grounded in a particular topic or technology.

Creating TAN Transcription and Alignment Data

Let us take a simple example, that of aligning two English versions of the nursery rhyme Ring-a-ring-a-roses, sometimes known as Ring around the Rosie. Our goal here is to publish two versions of the nursery rhyme in the TAN format so that they are most likely alignable with any other TAN version of the poem that someone might create. We begin by finding previously published versions. In this case we have taken an interest in the versions published in 1881 and 1987 (one published in the UK, the other in the US). Each of these books has other rhymes, but we've already decided to focus upon the one particular nursery rhyme, so we transcribe those parts and nothing else:

Ring around the Rosie

1881 (UK) version
    Ring-a-ring-a-roses,
    A pocket full of posies;
    Hush! Hush! Hush! Hush!
    We're all tumbled down.

1987 (US) version
    Ring-a-round the rosie,
    A pocket full of posies,
    Ashes! Ashes!
    We all fall down.

We must be sure to save each of the two transcriptions as plain text, preferably with .xml at the end of each file name. Do not bother with word processors (Word, OpenOffice, Google Docs, and so forth), because those programs are too sophisticated for our work. They sometimes generate erroneous data, even when you export to plain text. We will not be concerned with italics, colors, fonts, margins, and so forth, so much better for our work is a text editor, which works only on plain text. But even text editors do not check to see if the rules of the format have been followed. So the best tool is an XML editor, which does the same thing a text editor does, but saves much typing and prevents syntax errors. More important, an XML editor will tell us when our TAN file is invalid, and will provide information and help in our TAN files. Software suitable for your needs comes in many styles and prices. In addition to the links in the paragraph above, you may wish to visit the comparative lists for both text editors and XML editors. TAN was developed using oXygen, which is very powerful but possibly confusing to new users. To avoid exasperation or despair, take advantage of tutorials and documentation associated with the XML editor you have chosen. Our first task is to get these two versions into separate files with the appropriate markup. Each TAN transcription file has two major parts: a head and a body. For now, we focus on only the second part, the body, as well as a few of the necessary preliminary lines that stand above both the head and the body. First, the 1881 (UK) version:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
    <?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
    <TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring01">
        <head> . . . . . . . </head>
        <body xml:lang="eng" in-progress="false">
            <div type="line" n="1">Ring-a-ring-a-roses,</div>
            <div type="line" n="2">A pocket full of posies;</div>
            <div type="line" n="3">Hush! Hush! Hush! Hush!</div>
            <div type="line" n="4">We're all tumbled down.</div>
        </body>
    </TAN-T>

And now the 1987 (US) version:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
    <?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
    <TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring02">
        <head> . . . . . . . </head>
        <body xml:lang="eng" in-progress="false">
            <div type="l" n="1">Ring-a-round the rosie,</div>
            <div type="l" n="2">A pocket full of posies,</div>
            <div type="l" n="3">Ashes! Ashes!</div>
            <div type="l" n="4">We all fall down.</div>
        </body>
    </TAN-T>

These are standard eXtensible Markup Language (XML) files. (If you are already familiar with XML you may wish to skip ahead to the next section.) XML lets you take a text or a collection of data and structure it via markup. In the examples above, the markup is everything enclosed in angle brackets (< and >). Each file begins with a prolog, marked by the lines that begin with <?. The first line in the prolog simply states that what follows is an XML document. The next two lines are processing instructions that point to the files that will be used to check whether or not our data is valid. For now we will not explain the details of those first three lines, which will be identical, or nearly so, from one TAN file to the next. We can simply cut and paste those lines when we want to start a new file. The fourth line is the opening tag of what is called the root element, here called <TAN-T>. That opening tag, <TAN-T...>, is answered by a closing tag, </TAN-T>, the last line.
The paired-tag relationship is true for all the other elements in this example. <head> is answered by </head>, <body> by </body> and each <div...> by </div>. These elements nest within or beside each other, but they never overlap. (The prohibition on overlapping elements is one of the cardinal rules of XML.) This relationship means that every XML file can be thought of as a tree, with the root at the trunk and the enveloped elements as branches, terminating in metaphorical leaves. It is helpful to use the tree metaphor when we describe the path we take, toward either the leaves or the root. In this manual, we may use the terms rootward and leafward when we want to trace movement up and down the hierarchy of an XML document. An XML document is also profitably thought of as a family tree, a metaphor that provides commonly used terminology. In our examples above, <TAN-T> is the parent of <body>, and <body> the parent of the four <div> elements. Likewise, each <div> is the child of <body>, and <body> is the child of <TAN-T>. Distant parental relationships can be described with the terms ancestor and descendant. <TAN-T> is the ancestor of every element it encompasses, and every element encompassed by <TAN-T> is its descendant. Paratactic relationships are also important. <head> and <body> are siblings to each other, and every <div> is a sibling to every other <div>. Inside of the opening tags for the <TAN-T>, <body>, and <div> elements are pairs of text joined by an equals sign, collectively called an attribute. The left side of the equals sign is the attribute name, and on the right side, within the quotation marks, is the attribute value. <TAN-T> has two attributes, @xmlns and @id (when we discuss an attribute outside its original context, we often preface the name with @). We will skip @xmlns for now; this attribute (actually, a pseudo-attribute) specifies the namespace of the XML file, an advanced topic that need not be discussed now. 
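The vocabulary introduced above (root, parent, child, sibling, ancestor, descendant, attribute) can be tried out on a small, made-up document. What follows is only an illustrative sketch and not a TAN file: the element names <library>, <shelf>, and <book>, and their attributes, are invented for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical vocabulary, invented for illustration; this is not a TAN format. -->
<library name="demo">                 <!-- root element, with one attribute, @name -->
   <shelf n="1">                      <!-- child of <library>, parent of two <book>s -->
      <book n="a">First text</book>   <!-- leaf element; sibling of the next <book> -->
      <book n="b">Second text</book>
   </shelf>
   <shelf n="2">                      <!-- sibling of the first <shelf> -->
      <book n="c">Third text</book>
   </shelf>
</library>
<!-- Not well-formed, because the two elements overlap instead of nesting:
     <shelf><book>text</shelf></book>
-->
```

Here <library> is the ancestor of every other element, each <book> is a descendant of <library>, and moving from a <book> to its <shelf> is a rootward step. In each opening tag, n is an attribute name and the quoted text after the equals sign is its value.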
The @id attribute of <TAN-T>, however, is quite important. Every TAN file has an @id that uniquely and permanently identifies the file itself. It should not be changed, even as we make edits. The name you save the file as can be changed, but keep in mind that other people may be depending on it, and may be unable to find it. The value of @id is always what is called a tag uniform resource name (tag URN). It always starts with tag:, followed by an email address or domain name that we own or owned. (It is okay to use an obsolete address; this part is only for identification.) After that email address or domain name comes a comma (no spaces) and a date on which we owned it, in the international standard format of year, month, and day, joined by hyphens, e.g., 2014-12-31. If we leave off a day value, it is assumed to be the first of the month; if we leave off the month value it is assumed to be January. In the examples above, parkj@textalign.net,2015 points to the person who owned that particular email address on the stroke of midnight (Coordinated Universal Time) January 1, 2015. (In this example, we are pretending to be that person.) After that comes a colon, and then any name we wish to assign to the file. We have anticipated a simple collection of texts, so we've called the files ring01 and ring02. (If we run out of names, or want to restart, we can simply use a new email-date preface, e.g., parkj@textalign.net,2015-01-02.) The idea here is that hundreds of years from now, even when that email will be defunct or owned by someone else, someone might still be able to identify the person responsible for the file. The element <body> contains our transcription. @xml:lang, required, specifies the principal language of the transcribed text. We use the standard 3-letter abbreviation for English. (See later in the guide for more complex language requirements.) By saying that @in-progress is false, we indicate that we have finished our transcription and have no further plans to develop it.
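Returning briefly to @id, the tag URN pattern described above can be summarized in a short sketch. The wrapper below is hypothetical and exists only to list sample values; the first value comes from the example files, while the names ring03 and ring04 and the domain example.org are made up for illustration.

```xml
<!-- Hypothetical wrapper, used only to list sample @id values; not a TAN format. -->
<!-- General shape: tag: + email address or domain name + comma + date + colon + name -->
<examples>
   <!-- Year only: read as January 1, 2015 (the form used in the example files) -->
   <file id="tag:parkj@textalign.net,2015:ring01"/>
   <!-- Year and month: read as the first day of March 2015 -->
   <file id="tag:parkj@textalign.net,2015-03:ring03"/>
   <!-- Full date, with a domain name in place of an email address -->
   <file id="tag:example.org,2014-12-31:ring04"/>
</examples>
```

Whichever form is chosen, the value should be settled once and then left alone, since it identifies the file permanently.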
Setting @in-progress to false does not mean that the file is free of errors. We can make corrections later. It just means that no more substantive revisions are planned, and any further changes will be restricted to corrections of typographical errors. This attribute is optional. If it is left off, our TAN file is assumed to be a work in progress, and it serves as a kind of warning to anyone who might want to use it. Our transcription has been divided into four <div> elements. How we divide up the work is entirely up to us. But we must make sure that every bit of text is enclosed by a leafmost <div>. That is, every <div> must be the parent of only other <div>s, or of no elements at all. We cannot have a <div> that mixes text with other elements (such as other <div>s). The values of @type and @n indicate, respectively, the type of division and the name of the division. We have used line in the first example, but we could easily have also used l (as we did in the second) or ln or any other phrase that we think will make intuitive sense to other users. The choice is arbitrary (we will see why below). We have used arabic numerals for the values of @n, but the value, once again, could have been anything. We could have used Roman numerals, or some other naming scheme that is standard in the field. Aside from the <head> element (discussed later), that's all we need in the transcription. We can now move to alignment and annotation. There are two different types of alignment, one emphasizing breadth, the other, depth. The broad type of alignment, called TAN-A-div, allows us to specify TAN transcriptions of as many versions of as many works as we wish, and to make claims about those texts. The other type of alignment, emphasizing depth, is called TAN-A-tok and allows us to take any two (and no more) TAN transcriptions, create word-to-word (or better put, token-to-token) relationships, and specify what type of relationship holds between each set of aligned words.
TAN-A-div is suitable for work that focuses on the general alignment of multiple versions of one or more works at a single time. TAN-A-tok is for highly detailed, precise alignment of two text versions. For our example, we start with a TAN-A-div file (once again suppressing <head>):

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-A-div.rnc" type="application/relax-ng-compact-syntax"?>
    <?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-A-div.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
    <TAN-A-div xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring-alignment">
        <head> . . . . . . . </head>
        <body/>
    </TAN-A-div>

In the prolog, the first line is identical to the first line of our transcription files. The second and third lines, the processing instructions, are identical, aside from pointing to the validation files for alignment. Even the fourth line looks like the transcription file, other than the new name for the root element, <TAN-A-div>, and the new value for @id. The penultimate line, <body/>, is what is called an empty element, and is equivalent to <body></body>; it is a shorthand syntax for an element that contains nothing. It will become apparent, when we discuss <head> below, why our <body> can be empty. The other kind of alignment, TAN-A-tok, takes a bit more work, because we must first identify words that correspond with each other. Even before we do that, we need to decide what kind of relationship holds between the two texts. Let us pretend, for the sake of example, that the 1987 version is a direct descendant (and therefore variation) of the 1881 one. So our task is to show exactly what parts of the older version correspond to those of the newer one. We will simplify in this case, and assume an interest only in words, ignoring spaces and punctuation.
We will also adopt the term token instead of word (word is notoriously difficult to define, and has connotations lacking from token). We now create a TAN-A-tok file:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-A-tok.rnc" type="application/relax-ng-compact-syntax"?>
    <?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-A-tok.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
    <TAN-A-tok xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:TAN-A-tok,ring01+ring02">
        <head> . . . . . . . </head>
        <body bitext-relation="B-descends-from-A" reuse-type="adaptation" in-progress="false">

            <align> <tok src="ring1881" ref="1" pos="1"/> <tok src="ring1987" ref="1" pos="1"/> </align>
            <align> <tok src="ring1881" ref="1" pos="2"/> <tok src="ring1987" ref="1" pos="2"/> </align>
            <align> <tok src="ring1881" ref="1" pos="3"/> <tok src="ring1987" ref="1" pos="3"/> </align>
            <align> <tok src="ring1881" ref="1" pos="4"/> <tok src="ring1987" ref="1" pos="4"/> </align>
            <align> <tok src="ring1881" ref="1" pos="5"/> <tok src="ring1987" ref="1" pos="5"/> </align>

            <align> <tok src="ring1881" ref="2" val="A"/> <tok src="ring1987" ref="2" val="A"/> </align>
            <align> <tok src="ring1881" ref="2" val="pocket"/> <tok src="ring1987" ref="2" val="pocket"/> </align>
            <align> <tok src="ring1881" ref="2" val="full"/> <tok src="ring1987" ref="2" val="full"/> </align>
            <align> <tok src="ring1881" ref="2" val="of"/> <tok src="ring1987" ref="2" val="of"/> </align>
            <align> <tok src="ring1881" ref="2" val="posies"/> <tok src="ring1987" ref="2" val="posies"/> </align>

            <align> <tok src="ring1881" ref="3" pos="1, 2"/> <tok src="ring1987" ref="3" pos="1"/> </align>
            <align> <tok src="ring1881" ref="3" pos="3 - 4"/> <tok src="ring1987" ref="3" pos="2"/> </align>
            <align> <tok src="ring1881" ref="4" pos="1"/> <tok src="ring1987" ref="4" pos="1"/> </align>
            <align> <tok src="ring1881" ref="4" pos="2"/> </align>
            <align> <tok src="ring1881" ref="4" pos="3"/> <tok src="ring1987" ref="4" pos="2"/> </align>

            <align> <tok src="ring1881" ref="4" pos="last-1"/> <tok src="ring1987" ref="4" pos="last-1"/> </align>
            <align> <tok src="ring1881" ref="4" pos="last"/> <tok src="ring1987" ref="4" pos="last"/> </align>
        </body>
    </TAN-A-tok>

Once again, the first four lines, the prolog and root element, should look familiar, with the only significant changes being the names of the validation files, the name of the root element (<TAN-A-tok>) and the value of @id. The heart of the data is <body>, which has, in addition to @in-progress, two more attributes: @reuse-type, which specifies the default type of relationship between the two sources, and @bitext-relation, which specifies how the versions relate to each other. Our two values, B-descends-from-A and adaptation, are arbitrary names that we define in the <head> (discussed later). XML files may also contain comments, which begin with <!-- and end with -->. Comments can be placed within or beside any element, and can be any number of lines. If you wish to ignore, say temporarily, some elements, an XML editor can help you toggle them on and off as comments. <body> is the parent of one or more <align> elements, each of which correlates a set of tokens in the two texts through its <tok> children. Each <tok> has, in this example, three attributes. @src takes a nickname (an @id reference) that points to one of the two transcriptions; we have used ring1881 and ring1987 but we could have just as easily used anything else such as uk and us. @ref has a value that points to a specific <div> in the source transcription; and @pos or @val specifies which token is intended, either by word number (@pos) or by the text of the actual word (@val). Either technique is fine, and the two can be mixed, as in the example.
You may also notice that the comma and hyphen can be used in @pos to point to multiple words within the same <div>, and that last and last-X (where X is a digit) can be used to point to a word token relative to the last one in a <div>. Each <align> can establish one-to-one, one-to-many, many-to-one, or many-to-many relationships between words from the two texts. Words may feature in multiple <align> elements (a kind of overlapping that doesn't offend the XML rule against overlapping). And if an <align> has <tok> elements belonging to only one source, such as in the fourth-to-last <align> above, we have what is called, in these guidelines, a half-null alignment. This half-null alignment indicates that the second word of line four of the 1881 version is excluded from the act that we have called adaptation (which is, as we shall see, defined in the <head>). If this were a translation, it would be as if we were saying that this word was excluded from the translation. (A half-null alignment containing only tokens of the later source might point to words that the translator added.) A half-null alignment should not be confused with our own silence. As creators of this file, we are under no obligation to indicate every word-for-word correspondence. If we fail to mention certain words, all that can be implied is that we opted not to say anything about them. We could have aligned the two texts in different ways. Perhaps further study will reveal that we were in error to associate the second "ring" with "round" in line 1. We can make corrections, even after publication, and signal the change to users of our data. There are also ways to express doubt or alternative opinions. We can even correlate fragments of tokens (letters, prefixes, infixes, or suffixes). All these more advanced uses are discussed elsewhere in these guidelines.
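The alignment patterns described in this section can be gathered into one compact sketch. The fragment below reuses the source nicknames, attributes, and attribute values already introduced; it is not a complete or validated TAN-A-tok file, and the pairings were selected only to show one instance of each pattern.

```xml
<!-- Fragment of a TAN-A-tok <body>; the prolog, root element, and <head> are omitted. -->
<body bitext-relation="B-descends-from-A" reuse-type="adaptation">
   <!-- one-to-one: the first token of line 1 in each version -->
   <align>
      <tok src="ring1881" ref="1" pos="1"/>
      <tok src="ring1987" ref="1" pos="1"/>
   </align>
   <!-- many-to-one: two tokens, picked by a comma-separated @pos, against one -->
   <align>
      <tok src="ring1881" ref="3" pos="1, 2"/>
      <tok src="ring1987" ref="3" pos="1"/>
   </align>
   <!-- mixed techniques: one token picked by its text (@val), the other by
        its position counted from the end of the <div> (last) -->
   <align>
      <tok src="ring1881" ref="4" val="down"/>
      <tok src="ring1987" ref="4" pos="last"/>
   </align>
   <!-- half-null: tokens from only one of the two sources -->
   <align>
      <tok src="ring1881" ref="4" pos="2"/>
   </align>
</body>
```

Any token not mentioned in such a fragment is simply left without comment, which is different from the explicit exclusion made by the half-null <align>.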

The Principles of TAN Metadata (<head>)

At this point, we have finished four TAN files: two transcriptions, one TAN-A-div file, and one TAN-A-tok file. But we've suppressed the <head> in all of them, until now. Before getting into details, we need first to explain a few TAN principles. Unlike <body>, which carries the raw data, <head> contains what is oftentimes called metadata. That is, <head> describes the raw data. Because the TAN format is intended primarily to serve scholars, and because the format is heavily regulated (that is, there are numerous validation rules that supplement the basic ones behind XML), the metadata requirements are stricter than they are for Word documents, HTML, TEI, or other formats you might know better. Scholars who find our file really need to know some essential things before they can responsibly use it. For example, what are the sources we have used? Who produced the data? When? What key assumptions have been made in producing the data? What licenses govern the data? The questions are not difficult to answer, but they are critical, and we should take some time to provide accurate answers. Some metadata questions are specific to certain formats. For example, in a TAN-A-tok file, we ask what relationship holds between the two sources. That question makes no sense for a TAN-T file. Other questions, however, apply universally across all TAN files, no matter what kind of data. As we go from one TAN format to the next, we need to deal as much as we can with similar structures and expectations. This reduces any potential confusion in creating and editing a TAN file, and helps other people using our data to find the information they want. More importantly, what we write in one file might save us some work in another.
The rigorous scholarly requirements for TAN metadata are offset somewhat by another principle that was adopted in the design of TAN, namely, that each format's <head> should focus exclusively upon the data in <body> and not other things. That is to say, in a transcription, we should definitely indicate what our source is. But we should not try to write a catalog entry, or even a structured citation, for the book we have used. We are not library catalogers. Our obligation is merely to point somewhere a reader can get more complete information. The <head> is designed to help us to stay focused on the task and data at hand. TAN was also designed with the assumption that all metadata should be useful to both humans and computers. For our example above, we must describe the work we have chosen (Ring around the Rosie) in a way that is comprehensible not just to the reader but to the computer. Take for example the 1881 book we have used for our first transcription. For the human reader we can say simply something like "Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]". But computers need a more controlled, predictable syntax before they can be directed to the correct edition of Mother Goose (or rather to a digital surrogate of the edition). The human-readable string is too complex, and syntactically opaque. A more computer-friendly kind of identifier is the international standard book number (ISBN), which distinguishes the 1984 version of Mother Goose illustrated by Kayoko Okumura from the one of the same year illustrated by William Joyce. The ISBNs for the Okumura version, 0671493159, and for Joyce's, 0394865340, can be converted into machine-actionable strings called uniform resource names (URNs), in this case urn:isbn:0-671493159 and urn:isbn:0-394865340. (Our 1881 version was published before the ISBN program was introduced. We will see below another way to name it.)
URNs are families of formalized naming schemes regulated by a central body (Internet Assigned Numbers Authority, IANA) to ensure permanent, persistent, unique names for various types of things. There are URN schemes for journals (via ISSNs), articles (DOIs), and movies (ISANs), which means that anyone can use them to refer unambiguously to a particular kind of thing. All URNs are simply names. They don't tell you where an object is. To provide a unique location, however, we have uniform resource locators (URLs), which might be much more familiar from daily use of the Internet, e.g., http://academia.edu. Like URNs, URLs are also centrally regulated, with individuals or organizations buying the rights to domain names from a central registry (usually through a third-party vendor). Both URNs and URLs can be thought of as the same type of thing, namely, a uniform resource identifier (URI), sometimes called an internationalized resource identifier (IRI). An IRI is a type of URI that allows characters from any script in Unicode, not just the Latin alphabet. URIs/IRIs are, in essence, nothing more than the set of all URNs and URLs. These four acronyms can be easily confused, and it is best to disambiguate them by thinking of the last letter in each. URIs/IRIs Incorporate both Locators (URL) and Names (URN). If those acronyms are confusing, don't worry. For our purposes here, they are pretty much the same, and from this point onward we'll use merely the term IRI (unless we really mean a location, which we'll call a URL). IRIs are essential to a system frequently called the semantic web or linked (open) data, an agreed way of writing and processing data that relies upon IRIs and a simple data model. The semantic web allows people to make assertions in a way that computers can "understand." If people, working independently, happen to use the same IRIs to describe the same things, then computers can be programmed to make associations between disparate, heterogeneous datasets.
This allows us to find connections across disciplines and projects, to marshall computers to make inferences we might not make on our own, and to create a network of linked data. TAN has been designed to be linked-data friendly, and so requires in its <head> almost all data to be representable not just in a human-readable form but also computer-readable, as an IRI. Our first task, then, in writing the <head> sections of our four TAN files is to look for IRI vocabulary that will be familiar to the people most likely to use our files. In trying to find suitable IRIs, we will find that the persons, things, and concepts we want to describe will range from the highly familiar to the unfamiliar. Highly familiar: The two books that provide the basis of our transcription are well catalogued and generally known. A number of services provided by librarians provide a controlled IRI vocabulary that can be used by anyone to describe uniquely a particular version of a book. WorldCat (run by OCLC) and the Library of Congress are good examples. In our case, we have found accurate Library of Congress IRIs for both editions of Mother Goose: http://lccn.loc.gov/12032709 and http://lccn.loc.gov/87042504. Observe that these two IRIs are also, perhaps confusingly, URLs (locations). If we paste these strings into our browser, we retrieve a record that describes the book. This locator does not lead us to the book itself, only to information about the book. Nevertheless, the Library of Congress has decided to make this URL also a name for the book. Anyone who owns a domain name can designate a URL as a name for an object. And that allows them to set up their server to also return information about the object the IRI names. 
This subtle ambiguity—that the URL both names an entity and is a location for a webpage—can sometimes be confusing to those who are new to the semantic web, because such URLs name in reality two types of things: an entity and a location to find out more information about that entity. We now have IRIs for the sources. Let's now find an IRI to name the work, Ring around the Rosie. The work is widely known, and even has a Wikipedia entry. That Wikipedia entry is a benefit. The Universities of Leipzig and Mannheim and Openlink Software have collaborated on a project called DBPedia, which is committed to providing a unique URN for every Wikipedia entry in the major languages. The DBPedia URN in this case is http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses. Once again, this is both a name and a locator. It names a specific intangible object, namely a nursery rhyme that we've called Ring around the Rosie, no matter what specific version. But if you put that name into your browser, you will get back more information about that named object. Familiar, but only in small circles: We will need to have names for some of the people who edited the file. Here we're not interested in the authors of our books. We are interested in crediting the people who helped make the TAN file. Most people who write and edit our TAN file will not be well-known, public figures. If they are, and if they are famous enough to have a Wikipedia entry, then a DBPedia IRI could be used. Or if some of the contributors are also published authors, there is a good chance that they are listed in the databases of either VIAF or ISNI, both of which publish unique IRIs for persons. Many contributors to TAN files, however, will not be listed in these general databases. In those cases, we can name these participants with an IRI that we "own." We have already done something like this by assigning tag URNs to our four transcriptions (the value of @id in the root element). Our editors can do the same thing. 
If a student Robin Smith has been helping with proofreading, Robin can take an email address (even one that no longer works) and a date when the email address was in use and construct a tag URN such as tag:smith.robin@example.com,2012:self. This has a slight drawback in that we cannot type this string into our browser to find out more about this Robin, but it at least allows us to assign a name that will not be confused with the Robin Smith identified by ISNI as http://isni.org/isni/0000000043306406. (If we want to go a step further, Robin could mint an IRI from a domain name that she owns, and set up a linked data service that offers more information, human- and computer-readable. But this is not required, and it can be a lot of work to maintain.)

Now we come to a more difficult challenge. We have to assign an IRI to the relationship that we claim holds between two text-bearing objects. Making that clear is important, because if we had a different view on how one related to the other, it would probably affect the specifics of our word-for-word alignments. We are assuming for the sake of illustration that the version published in the 1987 Mother Goose is a direct descendant of the 1881 version. Because no suitable IRI vocabulary yet exists for such concepts, TAN has coined an IRI that can be used by anyone wishing to declare that the second of two sources descends from the first through an unknown number of intermediaries: tag:textalign.net,2015:bitext-relation:a/x+/b.

We face a similar issue when thinking about text reuse. We generally consider the 1987 version to be an adaptation of the 1881 version. And there are no stable, well-published IRI vocabularies for text reuse. So we adopt a TAN-coined IRI, tag:textalign.net,2015:reuse-type:adaptation:general. In both cases above, we could have come up with our own vocabulary. But the idea here is that we should be sharing a common vocabulary whenever possible. 
The built-in TAN vocabulary simply gives us a convenient lingua franca for describing some important but abstract concepts. For other examples of IRIs coined by TAN, see the chapter of these guidelines that lists the built-in vocabulary.

Generally unfamiliar: Some things or concepts will be known to very few people, perhaps only to us. If we plan to refer to that thing or concept often, it is preferable to coin a tag URN, as described above. But in some cases, we might find that a tag URN we minted for some concept or thing was, in hindsight, misleading or poorly constructed, because we hadn't thought as thoroughly as we could have about the category. If we wish to avoid these kinds of situations, we can assign a randomly generated IRI called a universally unique identifier (UUID), e.g., urn:uuid:3fd9cece-b246-4556-b229-48f22a5ae2e0. UUID URNs are very useful. The chance that a randomly generated UUID will be identical to any other UUID is astronomically small, making them reliably unique names for anything (barring someone copying and reusing that UUID URN to name some other object or concept). Numerous free UUID generators can be found online. To humans, a UUID on its own is meaningless, and rather ugly. But it is a start. We always have the option, later, of adding another, more meaningful IRI. It's perfectly fine to give one object or concept multiple IRIs. But the reverse is never true. One should never use the same IRI to identify more than one object or concept.
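The two kinds of self-minted names discussed above, tag URNs and UUID URNs, can be produced with a few lines of code. The following is a minimal sketch in Python rather than part of TAN itself; the function names mint_tag_urn and mint_uuid_urn are hypothetical, and the sketch only concatenates the three parts of a tag URN without validating them against the full tag URN syntax.

```python
import uuid


def mint_tag_urn(authority: str, date: str, specific: str) -> str:
    """Build a tag URN of the form tag:<authority>,<date>:<specific>.

    Hypothetical helper: it checks only that no part is empty.
    """
    if not authority or not date or not specific:
        raise ValueError("authority, date, and specific part are all required")
    return f"tag:{authority},{date}:{specific}"


def mint_uuid_urn() -> str:
    """Return a randomly generated (version 4) UUID expressed as a URN."""
    return uuid.uuid4().urn


# The email address and year are the ones used for Robin Smith in the text.
print(mint_tag_urn("smith.robin@example.com", "2012", "self"))
# A fresh urn:uuid:... value is produced on every call.
print(mint_uuid_urn())
```

A tag URN carries some meaning for a human reader, while a UUID URN carries none, which is exactly why the latter is safe to assign before a category has been fully thought through.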

Creating TAN Metadata (<head>)

Now that we have explored various IRI vocabularies for concepts around our versions of Ring-a-ring-a-roses, we can complete the metadata in our four TAN files. Let us start with the TAN-T file of the 1881 version:

<head>
    <name>TAN transcription of Ring a Ring o' Roses</name>
    <master-location href="http://textalign.net/release/TAN-2018/examples/ring-o-roses.eng.1881.xml"/>
    <license>
        <IRI>http://creativecommons.org/licenses/by/4.0/</IRI>
        <name>Attribution 4.0 International</name>
    </license>
    <licensor who="park"/>
    <source>
        <IRI>http://lccn.loc.gov/12032709</IRI>
        <name>Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]</name>
    </source>
    <definitions>
        <work>
            <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI>
            <name>"Ring a Ring o' Roses" or "Ring Around the Rosie"</name>
        </work>
        <div-type xml:id="line">
            <IRI>http://dbpedia.org/resource/Line_(poetry)</IRI>
            <name>line of poetry</name>
        </div-type>
        <person xml:id="park">
            <IRI>tag:parkj@textalign.net,2015:self</IRI>
            <name>Jenny Park</name>
        </person>
        <role xml:id="creator">
            <IRI>http://schema.org/creator</IRI>
            <name xml:lang="eng">creator</name>
        </role>
    </definitions>
    <resp roles="creator" who="park"/>
    <change when="2014-08-13" who="park">Started file</change>
</head>

<name>, the human-readable counterpart to the @id that is inside the root element, can be anything. And we can supply more than one <name>, in case we wish to provide it in different languages or variations.

<master-location> is mandatory only if we have claimed through @in-progress that the file is no longer in progress. One or more of these elements provide URLs where master versions of the file are kept (and updated). We provide this as a courtesy to others who might be using our data. Anyone who validates a local copy of the file will be warned if it does not match the master version, and be told the most recent changes. 
This allows users to find out if changes have been made, and it allows us to make corrections and silently notify other users of our alterations. To communicate this, we do not have to keep track of who is using the file.

<license> specifies the license under which we are releasing our data. This element has nothing to do with the copyright of the source we have used (although, having been published in 1881, the book is clearly in the public domain). That is, we are declaring the rights attached to the data, not its source. This once again gets to the TAN metadata principle of describing our data and not other things. We can if we want describe the license of the source we have used (see the rest of the guidelines for guidance), but we absolutely must declare whether we have placed additional strictures on the dataset we have created. In this example, we have released the data under a Creative Commons license. The child element <IRI> specifies the IRI assigned by Creative Commons, and <name> is the human-readable form. <licensor>, by means of @who, indicates who holds the license. In this case it points to the <person> whose @xml:id is park.

The conjunction of <IRI> and <name>, the IRI + name pattern, is a recurrent feature of TAN files. We may include any number of <IRI> or <name> elements in an IRI + name pattern. But if we do so, we are stating that they all name the same thing, not different things.

<source> points, through its IRI + name pattern, to a computer- and human-readable description of the book we have chosen.

<definitions> contains data that is specific to TAN file types, to define our terminology. <work> uses the IRI + name pattern to name the work we have chosen to transcribe. <div-type> specifies the type of divisions we have chosen to use to segment the transcription. In a more complex text, there would be several <div-type>s. Each one has an @xml:id, which takes as a value some nickname that we wish to use for @type values of <div>s. 
The IRI + name pattern is also used for <person>, which describes who was involved in creating the data, and <role>. We may have as many <person>s and <role>s as we wish. In this case, Jenny Park has been given a tag URN. The <IRI> value of <role> comes from the vocabulary of schema.org, which is maintained by Bing, Google, and Yahoo! in conjunction with the W3C (the nonprofit organization dedicated to universal Internet standards), but we could have used Dublin Core or some other IRI vocabulary describing behaviors, responsibilities, and roles.

Those roles and persons get combined after the <definitions>, in a <resp>, which stipulates who was responsible for what roles. If you decide to modify someone else's TAN file, then you become responsible for changes, not the original person or organization. Your first order of business should be to add a <person> to the head, identifying yourself. You need not change the document's @id, but you should take responsibility for any changes you make; otherwise you are incorrectly attributing your changes to someone else.

Remember that <head> is focused on the data, not its sources, so the claim that Jenny Park is the creator pertains only to the data. No inference should be made about who created the source. If someone wants that information, or anything else about the source, they should pursue the identifier we have provided under <source>.

<change> has attributes @when and @who that specify when the change or comment was made and by whom. The value of @when is always a date formatted as YYYY-MM-DD, plus an optional time. @who always carries a value that refers to the @xml:id of an agent. Neither <change> nor <comment> takes <IRI> or any other children.

So now we have finished one transcription file's metadata. 
The other one will look similar, but we'll also take a couple of shortcuts:

<head>
    <name>TAN transcription of Ring around the Rosie</name>
    <master-location>ring-o-roses.eng.1987.xml</master-location>
    <license>
        <IRI>http://creativecommons.org/licenses/by/4.0/deed.en_US</IRI>
        <name>Creative Commons Attribution 4.0 International License</name>
        <desc>This data file is licensed under a Creative Commons Attribution 4.0 International License. The license is granted independent of rights and licenses associated with the source. </desc>
    </license>
    <licensor who="park"/>
    <source>
        <IRI>http://lccn.loc.gov/87042504</IRI>
        <name>Mother Goose, from nursery to literature / by Gloria T. Delama, 1987.</name>
    </source>
    <definitions>
        <work>
            <IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</IRI>
            <name>Ring around the Rosie</name>
        </work>
        <div-type xml:id="l" which="line (verse)"/>
        <person xml:id="park" roles="creator">
            <IRI>tag:parkj@textalign.net,2015:self</IRI>
            <name xml:lang="eng">Jenny Park</name>
        </person>
        <role xml:id="creator" which="creator"/>
    </definitions>
    <alter>
        <normalization which="no hyphens"/>
    </alter>
    <resp roles="creator" who="park"/>
    <change when="2014-10-24" who="park">Started file</change>
    <comment when="2014-10-24" who="park">See p. 39 of source.</comment>
</head>

One significant difference is that three of the elements that normally take the IRI + name pattern have been replaced with a simpler form that takes merely @which and @xml:id. For a number of elements, TAN has predefined vocabulary that can be invoked by calling it (through @which) and giving it an abbreviation to be used elsewhere in the document (@xml:id).

After <definitions> comes a new element, <alter>, which contains a <normalization> statement that declares, through the name and the IRI in the underlying TAN definition, that we have opted to remove word-break line-end hyphenation. This provides a cautionary note to users of our data who might value line-end hyphenation. 
Any number of <normalization>s can be used to describe any alterations we might have made in our transcription. In other transcriptions we could use this feature to declare other suppressions, such as editorial comments or footnote signals. Note that the value of div-type/@xml:id here, the letter l, differs from our previous transcription file, line. Even though we have adopted a different nickname, they are treated as equivalent because in each file we have defined l or line with the same IRI, http://dbpedia.org/resource/Line_(poetry). A computer that later looks for files with lines of poetry will not care about l and line, but will look at the underlying IRI that defines these terms. This exemplifies how linked data (see above) can support our work. We are free to use abbreviations and terms that make sense to us, yet we tie those abbreviations to IRIs that have valence outside our project. Now that we have created the metadata for our transcriptions, we turn to the alignment files. Those <head>s will look slightly different. 
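The point about l and line can be made concrete with a small sketch. It is written in Python for illustration and is not drawn from the TAN function library; the dictionaries are a hypothetical in-memory view of the two <head>s, and the second one assumes, as the text states, that the built-in vocabulary item invoked through @which resolves to the same IRI.

```python
# Hypothetical view of each file's <div-type> definitions:
# nickname (@xml:id) -> set of IRIs that define it.
DIV_TYPES_1881 = {"line": {"http://dbpedia.org/resource/Line_(poetry)"}}
# Assumption: which="line (verse)" resolves to the same IRI, as the text states.
DIV_TYPES_1987 = {"l": {"http://dbpedia.org/resource/Line_(poetry)"}}


def same_concept(iris_a: set, iris_b: set) -> bool:
    """Two definitions name the same thing if they share at least one IRI."""
    return bool(iris_a & iris_b)


def equivalent_nicknames(defs_a: dict, defs_b: dict) -> list:
    """Pair up nicknames from two files whose definitions share an IRI."""
    return [
        (nick_a, nick_b)
        for nick_a, iris_a in defs_a.items()
        for nick_b, iris_b in defs_b.items()
        if same_concept(iris_a, iris_b)
    ]


print(equivalent_nicknames(DIV_TYPES_1881, DIV_TYPES_1987))
```

Comparing sets of IRIs instead of single strings reflects the rule that one concept may carry several IRIs, all of which name the same thing.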
We start with the TAN-A-div file:

<head>
    <name>div-based alignment of multiple versions of Ring o Roses</name>
    <master-location>ringoroses.div.1.xml</master-location>
    <license which="by_4.0"/>
    <licensor who="park"/>
    <source xml:id="eng-uk">
        <IRI>tag:parkj@textalign.net,2015:ring01</IRI>
        <name>Transcription of ring around the roses in English (UK)</name>
        <location when-accessed="2015-03-10">../TAN-T/ring-o-roses.eng.1881.xml</location>
    </source>
    <source xml:id="eng-us">
        <IRI>tag:parkj@textalign.net,2015:ring02</IRI>
        <name>Transcription of ring around the roses in English (US)</name>
        <location when-accessed="2014-08-13">../TAN-T/ring-o-roses.eng.1987.xml</location>
    </source>
    <definitions>
        <person xml:id="park">
            <IRI>tag:parkj@textalign.net,2015:self</IRI>
            <name xml:lang="eng">Jenny Park</name>
        </person>
        <role xml:id="creator">
            <IRI>http://schema.org/creator</IRI>
            <name xml:lang="eng">creator</name>
        </role>
    </definitions>
    <resp who="park" roles="creator"/>
    <change when="2014-08-14" who="park">Started file</change>
</head>

Much of the code above will look similar to the previous two examples. Every alignment file has only one kind of source, namely TAN transcription files, nothing else. Therefore <source>'s <IRI> always takes the @id value of the corresponding TAN transcription file. <name> is arbitrary. It may replicate exactly the title found in the transcription file, or it may be modified, perhaps to harmonize better with the descriptions of the other texts aligned in the file.

<source> also has a child element not seen in the earlier two examples, <location>, which specifies where the digital file was accessed and when (through @when-accessed). We may include as many of these <location> elements as we wish, with the most preferred or reliable location at the top, since the validation process will use the first document that is available. 
The @when-accessed value is important, because the validator will look for changes in the file, and if there have been changes since we last accessed the file, it will return a warning with a summary of the number and kind of changes. If such a report is returned, it is up to us to determine if the alterations merit any action on our part. Our TAN-A-div file could have any number of <source>s, and not necessarily for the same work. It also does not matter in which order we put the <source>s. <definitions> is empty, mainly because we have, in this case, no working assumptions to declare. In more advanced uses, this element would not be empty. This <head> explains why the <body> of our TAN-A-div file is allowed to be empty. We have already specified which sources are to be aligned and where they are to be found. All TAN-A-div files assume, by default, that every source that is a version of the same work should be aligned upon the basis of the @n value of <div>s. That is, any user or processor of a TAN-A-div file may assume that all implicit alignments should be made unless otherwise specified. For transcriptions that are already similarly structured and labeled, a TAN-A-div file is unnecessary for alignment. But we will see that the options available in a TAN-A-div's <definitions> and <body> will allow us not only to deal with inconsistencies in source transcriptions but to make important claims, e.g., where one work quotes from another. 
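The default behavior described above, aligning every source of the same work on the @n values of its <div>s, can be sketched as a simple grouping operation. This is an illustrative Python sketch under stated assumptions, not the TAN implementation: the SOURCES structure and the function align_by_n are hypothetical, and the line texts are placeholders, not the real transcriptions.

```python
from collections import defaultdict

WORK = "http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses"

# Hypothetical, already-parsed sources: a work IRI plus {@n value: <div> text}.
# The texts are placeholders standing in for the transcribed lines.
SOURCES = {
    "eng-uk": {"work": WORK, "divs": {"1": "1881 line 1", "2": "1881 line 2"}},
    "eng-us": {"work": WORK, "divs": {"1": "1987 line 1", "2": "1987 line 2",
                                      "3": "1987 line 3"}},
}


def align_by_n(sources: dict) -> dict:
    """Group div texts by (work IRI, @n) across every source."""
    table = defaultdict(dict)
    for src_id, src in sources.items():
        for n, text in src["divs"].items():
            table[(src["work"], n)][src_id] = text
    return dict(table)


for (work, n), versions in sorted(align_by_n(SOURCES).items()):
    print(n, versions)
```

A div that exists in only one source (here the placeholder third line) simply appears alone in its group; nothing in the sketch forces every version to have every division.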
Meanwhile we turn to our fourth file, TAN-A-tok, whose <head> looks like this: <head> <name>token-based alignment of two versions of Ring o Roses</name> <master-location>ringoroses.01+02.token.1.xml</master-location> <license which="by-nc-nd_4.0" rights-holder="park"/> <source xml:id="ring1881"> <IRI>tag:parkj@textalign.net,2015:ring01</IRI> <name>Ring o roses 1881</name> <location when-accessed="2015-01-17">../TAN-T/ring-o-roses.eng.1881.xml</location> </source> <source xml:id="ring1987"> <IRI>tag:parkj@textalign.net,2015:ring02</IRI> <name>Ring o roses 1987</name> <location when-accessed="2015-01-17">../TAN-T/ring-o-roses.eng.1987.xml</location> </source> <definitions> <bitext-relation xml:id="B-descends-from-A"> <IRI>tag:textalign.net,2015:bitext-relation:a/x+/b</IRI> <name>B descends directly from A, unknown number of intermediaries</name> <desc>The 1987 versions is hypothesized to descend somehow from the 1881 version, mainly for the sake of illustration.</desc> </bitext-relation> <reuse-type xml:id="adaptationGeneral"> <IRI>tag:textalign.net,2015:reuse-type:adaptation:general</IRI> <name>general adaptation</name> </reuse-type> <token-definition src="ring1881 ring1987" which="letters"/> <person xml:id="park" roles="creator"> <IRI>tag:parkj@textalign.net,2015:self</IRI> <name xml:lang="eng">Jenny Park</name> </person> <role xml:id="creator" which="creator"/> </definitions> <change when="2015-01-20" who="park">Started file</change> </head> The TAN-A-tok <head> looks similar to the previous examples, except that <definitions> has some new content. <bitext-relation> states through an IRI + name pattern the stemmatic relationship we think holds between the two sources. (Stemmatics is the study of the chain of transmission—the relationship of an original text-bearing object to the ones that survive. It frequently involves the creation of genealogical-like trees to illustrate the work's version history.) 
We have used the entire IRI + name pattern, but we could have substituted it with @which and the value a/x+/b.

One or more <reuse-type>s specify how one text has reused another. The IRI we have used shows that we believe that the later text has generally adapted the earlier one. If this were a translation or a quotation or some other kind of text reuse, we might have used a different IRI.

A third declaration, <token-definition>, specifies how we have defined our word tokens. @src has more than one value, specifying that the same tokenization rule should be applied to both sources. This element is optional. If we leave it out, users are to assume that we mean letters. This is because in ordinary conversation, when we refer to the nth word in a sentence, we usually assume people will skip punctuation marks when they count. The value for @which, letters, is a reserved TAN keyword that defines a token as any consecutive string of word characters, ignoring spaces and punctuation. Under this token definition the phrase "Hush!" said he would have three tokens. Had we set the value of @which to the reserved TAN keyword letters and punctuation, we would have six tokens, since each punctuation mark would be defined as a token.
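The difference between the two token definitions can be checked with a short sketch. It is written in Python for illustration; the two regular expressions are assumptions chosen to reproduce the counts given above and are not taken from the TAN function library. Note that Python's \w also matches digits and the underscore.

```python
import re

# Assumed patterns, for illustration only (not the official TAN definitions).
TOKEN_PATTERNS = {
    # any consecutive run of word characters
    "letters": re.compile(r"\w+"),
    # runs of word characters, or any single non-space, non-word character
    "letters and punctuation": re.compile(r"\w+|[^\w\s]"),
}


def tokenize(text: str, which: str = "letters") -> list:
    """Split text into tokens according to the named token definition."""
    return TOKEN_PATTERNS[which].findall(text)


phrase = '"Hush!" said he'
print(tokenize(phrase))                             # ['Hush', 'said', 'he']
print(tokenize(phrase, "letters and punctuation"))  # ['"', 'Hush', '!', '"', 'said', 'he']
```

The default argument mirrors the rule that a missing <token-definition> is to be read as letters.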

Aligning across Projects We now have a small corpus of TAN files. Let us imagine what it might be like to connect our TAN corpus to another. Let us assume that we have found elsewhere, in a German project, a TAN transcription of a work that looks quite similar to our own:<?xml version="1.0" encoding="UTF-8"?> <?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?> <?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?> <TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:hans@beispiel.com,2014:ringel"> <head> <name>TAN Transkription, Ringelreihen mit Riederfallen</name> <master-location>http://beispiel.com/TAN-T/ringel.xml</master-location> <license> <IRI>http://creativecommons.org/licenses/by/4.0/</IRI> <name>Creative Commons Namensnennung 4.0 International Lizenz</name> <desc>Dieses Werk ist lizenziert unter einer Creative Commons Namensnennung 4.0 International Lizenz.</desc> </license> <licensor who="schmidt"/> <source> <IRI>http://www.worldcat.org/oclc/4574384</IRI> <name>Franz Magnus Böhme, Deutsches Kinderlied und Kinderspiel: Volksüberlieferungen aus allen Landen deutscher Zunge, gesammelt, geordnet und mit Angabe der Quellen. 
Leipzig, 1897.</name> </source> <definitions> <work> <IRI>tag:beispiel.com,2014:texte:holderbusch</IRI> <name>"Die Kinder auf dem Holderbusch"</name> </work> <version> <IRI>urn:uuid:31648039-3dbb-49b9-b66e-9bd2cd11630e</IRI> <name>zweite Version</name> </version> <div-type xml:id="Zeile"> <IRI>http://dbpedia.org/resource/Gedichtzeile</IRI> <name>Gedichtzeile</name> </div-type> <person xml:id="schmidt" roles="Produzent"> <IRI>tag:hans@beispiel.com,2014:selbst</IRI> <name xml:lang="eng">Hans Schmidt</name> </person> <role xml:id="Produzent"> <IRI>http://schema.org/producer</IRI> <name xml:lang="eng">Produzent</name> </role> <ambiguous-letter-numerals-are-roman>false</ambiguous-letter-numerals-are-roman> </definitions> <alter> <normalization> <IRI>tag:kalvesmaki@gmail.com,2014:normalization:hyphens-discretionary-off</IRI> <name>Keine Bindestriche</name> </normalization> </alter> <resp who="schmidt" roles="Produzent"/> <change when="2014-08-13" who="schmidt">Anfang</change> <comment when="2014-08-13" who="schmidt">unten auf der Z. 438, recht</comment> </head> <body xml:lang="deu" in-progress="false"> <div type="Zeile" n="a">Ringel, Ringel, Reihe!</div> <div type="Zeile" n="b">Sind der Kinder dreie,</div> <div type="Zeile" n="c">Sitzen auf dem Holderbuch,</div> <div type="Zeile" n="e">Schreien alle: husch, husch, husch!</div> </body> </TAN-T> It seems clear to us that this 19th-century German version is quite similar to our two English versions. We have some alignment options open to us. Two more sets of word-for-word alignments would be interesting, but remember, just because we find a text that nicely aligns with others does not mean that we must align them, or even if we choose to make an alignment that we have to align everything. In this case, we choose not to worry about word-for word alignments, and we focus here only on the TAN-A-div alignment, so that, for example, we can later read the three versions in parallel and study their relationships. 
To that end, we first observe some differences between this transcription and our other two. First, the value of <work> is not the one we have given our two versions. Second, the <div-type> is defined as http://dbpedia.org/resource/Gedichtzeile (Gedichtzeile = line of poetry). Third, the lines have been lettered instead of numbered (and they are stipulated to be letter numerals, not roman, through <ambiguous-letter-numerals-are-roman>). And last, the editor seems to have made a typographical error, making the last line n="e" instead of n="d". These four differences typify some of the inconsistencies that are commonly found in digital texts.

There are a few other differences in this third transcription that do not affect our alignment. <version> is used to distinguish different versions of the same work found on the same text-bearing object. That is, if we are transcribing a bilingual edition, we can use <version> to specify which of the two versions we are encoding. Notice that the <IRI> value is a UUID. In this case the editor was not prepared to deploy a formal IRI naming scheme (perhaps using a tag URN) that would be satisfactory for work-versions.

These are points we can easily reconcile in our TAN-A-div file, which we now expand to include the German version. 
We make the following adjustments (in boldface):<?xml version="1.0" encoding="UTF-8"?> <?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-A-div.rnc" type="application/relax-ng-compact-syntax"?> <?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-A-div.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?> <TAN-A-div xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring-alignment"> <head> <name>div-based alignment of multiple versions of Ring o Roses</name> <master-location>ringoroses.div.1.xml</master-location> <license which="by_4.0"/> <licensor who="park"/> <source xml:id="eng-uk"> <IRI>tag:parkj@textalign.net,2015:ring01</IRI> <name>Transcription of ring around the roses in English (UK)</name> <location when-accessed="2015-03-10">../TAN-T/ring-o-roses.eng.1881.xml</location> </source> <source xml:id="eng-us"> <IRI>tag:parkj@textalign.net,2015:ring02</IRI> <name>Transcription of ring around the roses in English (US)</name> <location when-accessed="2014-08-13">../TAN-T/ring-o-roses.eng.1987.xml</location> </source> <source xml:id="ger"> <IRI>tag:beispiel.com,2014:ringel</IRI> <name>Transcription of an ancestor of Ring around the roses in German</name> <location when-accessed="2014-08-22">http://beispiel.com/TAN-T/ringel.xml</location> <location when-accessed="2014-08-22">../TAN-T/ring-o-roses.deu.1897.xml</location> </source> <definitions> <person xml:id="park"> <IRI>tag:parkj@textalign.net,2015:self</IRI> <name xml:lang="eng">Jenny Park</name> </person> <role xml:id="creator"> <IRI>http://schema.org/creator</IRI> <name xml:lang="eng">creator</name> </role> <alias id="ring" idrefs="ger eng-us"/> </definitions> <alter src="ger"> <rename n="5" by="-1"/> </alter> <resp who="park" roles="creator"/> <change when="2014-08-14" who="park">Started file</change> <change when="2014-08-22" who="park">Added German version.</change> </head> <body/> </TAN-A-div> The first major change is the insertion 
of a third <source>, pointing to the new file and specifying its name and IRI. Note that two locations have been provided, one for the original location and another for the copy saved locally into our project folder. Validation will be made against the first document available. If we wanted to work primarily off our local copy, we would have put that <location> first. By placing it second, we allow the validation engine to look for updates and changes in the master version. If that version is unavailable, validation will be made against the second, local copy.

The second major change, to address the German version's different value of <work>, is the addition of an <alias>. If and when we make claims about a work in general, via @work, the id value ring will mean that we're asserting the claim to be true for any scriptum that shares the IRI values of the <work> in either the German or the US version (which is why we do not need to specifically mention eng-uk in the <alias>, since it already has a work IRI in common with the US version).

A <rename> takes care of the apparent typographical error, this time anchoring the German version to the US one. Note that the German version uses e, whereas we have used 5. We could equally have used e, or even the Roman numeral v, had we wished to. Every TAN file's numeration system is evaluated locally, independent of any companion files. So we need not reconcile the a, b, and c in the @n values in the German version, because these will be automatically treated as equivalent to 1, 2, and 3. The TAN format allows four numeration systems other than Arabic numerals: Roman numerals (uppercase or lowercase), alphabetic numerals (a, b, c, ..., z, aa, bb, ....), and digit-alphabet combinations (e.g., 1a, 1e, 4g) or alphabet-digit combinations (e.g., a4, a5, b5). The last two systems will be treated as numerical pairs (1 and 1, 1 and 5, etc.).

The last major insertion is a new <change>, documenting when we made the alterations. 
The value of @when effectively updates the version of our TAN-A-div file. With these changes, the new version is aligned with the other two.

Our work might have been simpler if we had just modified the German version ourselves. But such changes would have affected only our local copy, not the master one. Changing only our local copy would not allow us to connect our work to other TAN files that may be depending upon the same master file.

But perhaps Hans Schmidt, the producer of the German version, can be contacted. We do so, and we suggest that he modify the version to make it align better. In the case of <div-type>, he need merely add another element: <IRI>http://dbpedia.org/resource/Line_(poetry)</IRI> (or even better, use the built-in TAN vocabulary). Perhaps he has reasons for labeling the lines with letters, and perhaps he is reluctant to explicitly identify this poem with Ring around the Rosie. That is within his rights. But the conversation might lead to our pointing out that n="e" should probably be n="d" and that there is an apparent discrepancy in the last line. (The original, printed book has the poem twice on page 438, once with the spelling "Holderbuch," and once with "Holderbusch"). If Schmidt chooses to correct his master file, he can add a new <change>, and thereby tacitly notify anyone else using the file that corrections have been made.

At this point we have a network of five TAN files, four in our corpus and one from outside. Although simple, the network could be the basis for some creative and complex research questions. Stylesheets could be used to automatically align the versions for reading and study, or to perform statistical analysis. Study of the rest of these guidelines, as well as example TAN libraries, will suggest numerous ways to create, manage, share, and use TAN files.
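The reconciliation of the German version's lettered lines can be traced step by step in a small sketch. It is written in Python for illustration and makes several assumptions: the function names are hypothetical, only the alphabetic system is handled (Roman numerals and the digit-letter combinations are left out), and the reading of aa, bb as 27, 28 is inferred from the listing of alphabetic numerals above, not from the TAN functions.

```python
def alphabetic_to_int(numeral: str) -> int:
    """Convert a, b, ..., z, aa, bb, ... to 1, 2, ..., 26, 27, 28, ...

    Assumed reading of the alphabetic system: a numeral is one letter
    repeated one or more times.
    """
    s = numeral.lower()
    if not s or not s.isascii() or not s.isalpha() or len(set(s)) != 1:
        raise ValueError(f"not an alphabetic numeral: {numeral!r}")
    return (len(s) - 1) * 26 + (ord(s[0]) - ord("a") + 1)


def apply_rename(values: list, n: int, by: int) -> list:
    """Mimic the <rename n="5" by="-1"/> above: shift each value equal to n."""
    return [v + by if v == n else v for v in values]


german_n = ["a", "b", "c", "e"]  # @n values in the German transcription
as_ints = [alphabetic_to_int(n) for n in german_n]
print(as_ints)                            # [1, 2, 3, 5]
print(apply_rename(as_ints, n=5, by=-1))  # [1, 2, 3, 4]
```

After the shift, the German lines carry the same four numbers as a transcription whose lines are numbered 1 through 4, so the implicit @n-based alignment applies without any change to the master file.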

Detailed Description This part of the guidelines provides a detailed description of the formats of the Text Alignment Network. The material is organized according to the structure that governs the schema files, so both can be read in tandem. outlines, in a non-technical way, the principles and technical foundations of the TAN format. , , , and comprehensively describe all the TAN formats. Each chapter starts with theoretical or scholarly background, to provide a contextual explanation for the technical points that follow. , the first of two very long chapters, provides a comprehensive, detailed explanation of the rules for every element and attribute, as well as the patterns into which they fall. This chapter includes a thorough list of relevant validation rules and examples. It has been written using a stylesheet that traverses the official TAN schemas, functions, and examples. lists all the vocabulary items that have already been defined as a core part of the format. This chapter is, essentially, a different way of looking at the TAN-key files that are in the TAN-key folder. The chapters in this part of the guidelines should be read selectively, not consecutively. They have been written with the assumption that you have already read the previous part () and that you have already started to create and edit a TAN collection. Because readers will come from different specialties, all acronyms, abbreviations, and concepts are defined and explained, albeit tersely. Concepts or technologies are discussed only insofar as they affect the use of TAN; suggestions for further reading are provided for those who want a more thorough introduction to a topic. General Underpinnings This chapter retains something of the introductory spirit of the previous one by providing an overview of the fundamental principles and technologies behind TAN. The overall goal of this chapter is to explain design principles of the format. 
Although this chapter assumes on your part no prior knowledge of any particular technology, it is also not meant to be a tutorial. Links to further reading will take you to more adequate introductory material.

Design Principles The TAN formats have been designed around a few basic design principles:
Scholarly habits: Be patient. Simplify. Stay focused. Avoid redundancy. Don't state the obvious. Use familiar conventions.
Scholarly freedom: Express doubt. Offer alternatives. Exercise independence. Invite interdependence.
Scholarly responsibility: Declare your assumptions. Make your work citable. Satisfy scholars' expectations: Who did what when? What are your sources? How do you define your terms? What alterations have you made to your sources? What rights do I have to use your material?
General utility: Use stable technology. Keep design predictable, consistent. Make the data human readable. Make the data computer actionable.

Format Organization The Text Alignment Network is a modular suite of XML encoding formats, each one designed for a specific type of textual data, divided into three classes: transcriptions (class 1), annotations and alignments of transcriptions (class 2), and everything else (class 3). Class 1, representations of textual objects, consists solely of transcription files. Each transcription file contains the text of a single work from a single text-bearing object (which we term a scriptum), whether physical or digital. There are two types of transcription file: a standard generic format and a TEI extension. These two types are differentiated by the root element, <TAN-T> and <TEI> respectively. Class 2, annotations of class 1 files, is used to encode claims about texts, and to align them. There are two types of alignment, one for broad, general alignments and another for granular, word-for-word alignments. The former, with <TAN-A-div> as the root element, aligns any number (one or more) of class 1 files, and permits assorted claims about those files. The latter, <TAN-A-tok>, aligns only pairs of class 1 files. Lexico-morphology files, <TAN-A-lm>, are used to encode the lexical and morphological (or part of speech) forms of individual words in a single class 1 file. Class 3 covers everything else. <TAN-mor> is used to define the grammatical categories or features of a given language and to specify rules for tagging words in a dependent TAN-A-lm file. <TAN-key> collects and defines terms frequently used in other TAN files. <collection> marks TAN catalog files, which provide an index of locally available TAN files. This modular approach supports what is sometimes called stand-off annotation (or stand-off markup), in contrast to in-line annotation, in which a text and its annotations are placed in a single file. (Most TEI and HTML files feature in-line annotation.) In stand-off annotation, the annotations reside in files separate from the text. 
This provides several benefits: An editor can work on a file with minimal distraction, focusing on a limited set of closely related questions. Editors can work off the same master files, even if they have very different research interests. Complementary or competing annotations can be made, even if those annotations overlap (a major problem for in-line annotation, where according to XML rules no element may interlock or overlap with another). TAN files become, collectively, a complex dataset, supporting lines of research that might not have been anticipated by any single project. Editorial labor can be conducted without central coordination, as individuals work at their own pace, independently, on separate files. When errors are found, they can be corrected in master files. Anyone depending upon that master file as a source will be notified of changes that have been made and they can deal with them accordingly. (Editor 1 can post typographical corrections, and if she logs the change with a time-date stamp, anyone using the file, upon validating their files, will be sent information or a warning about the change. Similarly, Editors 2 and 4 can let Editor 1 know about their work, and Editor 1 can update the Old French versions with cross-references.) Any data file can be released, circulated, and used independent of any other that points to it, or to which it points. Connected files can be combined and transformed in any number of ways to produce a wide variety of derivative documents (e.g., collated versions, statistical analysis). A transformation created for one set of TAN documents will work identically on other TAN documents of the same format. (If someone creates a tool to synthesize a transcription and an associated TAN-A-lm file, it can be applied to both Editor 2's and Editor 4's work.) The TAN family of formats can be expanded to allow other types of linguistic data, and therefore other lines of research. Stand-off annotation is not without liabilities. 
Files might be altered or altogether deleted, rendering dependent files meaningless. An editor may find that not having the annotated text in the same place as the annotation is an inconvenience. These are significant challenges, but TAN validation rules have been designed to mitigate these as much as possible.
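Because the three classes are distinguished by root element, a processor can sort files into classes by inspecting the root alone. The following Python sketch illustrates that idea; the function name and the lookup table are illustrative assumptions, the sample strings are bare root elements rather than valid TAN files, and a <TEI> root by itself does not prove that a file conforms to TAN-TEI.

```python
import xml.etree.ElementTree as ET

# Root element names as listed in this section, mapped to their TAN class.
TAN_CLASS_BY_ROOT = {
    "TAN-T": 1, "TEI": 1,
    "TAN-A-div": 2, "TAN-A-tok": 2, "TAN-A-lm": 2,
    "TAN-mor": 3, "TAN-key": 3, "collection": 3,
}


def tan_class(xml_text: str) -> int:
    """Return 1, 2, or 3 according to the root element's local name."""
    root = ET.fromstring(xml_text)
    # ElementTree reports namespaced names as "{namespace}local".
    local_name = root.tag.rsplit("}", 1)[-1]
    return TAN_CLASS_BY_ROOT[local_name]
```

For example, tan_class('<TAN-A-div xmlns="tag:textalign.net,2015:ns"/>') returns 2.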

Assumptions in the Creation of TAN Data All creators and users of TAN files are expected to share a few basic assumptions. First, all TAN-compliant data is to be understood as largely derivative. That is, data files have no originality or creativity independent of their sources (but see below about interpretation). TAN-compliant data is to be created with the intent of adhering as closely as possible to some model or archetype. For example, a transcription should replicate faithfully some earlier digital edition or text-bearing material object (e.g., stone, papyrus, manuscript, printed book for written text; audiovisual media for oral or performative texts). Morphological files and alignment files should describe as clearly and as reliably as possible their source transcriptions. In creating and publishing a TAN file you claim to have offered a good-faith representation or description of something; in using a TAN file, you hold the creator to that expectation. Second, all core TAN files are interpretive. That is, they are permeated by editorial assumptions and opinions that might not be shared by everyone. If there is any originality or creativity in a TAN file, it is in that interpretive outlook. For example, if you edit a transcription file you must decide how to handle unusual letterforms and other visible marks. Your decisions will be informed by how you view the original text and its native writing system, and how you interpret and use Unicode. If you write an alignment file, you must make decisions about what factors caused one text to be transformed into another. Lexicomorphological files require you to commit to one or more grammars and dictionaries, and you must discern how best to handle cases of vagueness and ambiguity. No TAN file ever stands completely outside the interpretive act. 
In creating and publishing a TAN file you claim to have disclosed as best you can the assumptions behind your interpretive outlook; in using a TAN file, you hold the creator to that expectation. Third, all core TAN files are useful. That is, the interpretive impulse is assumed to be coupled with an equally strong desire to make the data as useful to as many users as possible, even those who may not share your assumptions or interpretation. A creator of a transcription file, for example, should normalize and segment texts with a minimum of idiosyncrasies, adopting the most widely used reference systems, so as to optimize the alignment process. Morphological files should depend whenever possible upon commonly accepted grammars and lexica. Alignment files should work with comprehensible categories of text reuse. No TAN file will always be useful to everyone, but it should be as useful to as many as possible, as frequently as possible. In creating a TAN file you claim to use common, shared conventions whenever possible, and to note any departures; in using a TAN file, you hold the creator to that expectation.

Core Technology TAN depends upon a set of relatively stable technologies. Those technologies and the underlying terminology are very briefly defined and explained below, with particular attention to interpretive decisions that have been adopted by TAN validation rules. References to further reading will lead you to better and more thorough introductions.

Unicode

What is it? Unicode is the worldwide standard for the consistent encoding, representation, and exchange of digital texts. Stable but still growing, Unicode is intended to represent all the world's writing systems, living and historical. Maintained by a nonprofit organization, the Unicode standard allows us to encode texts in any writing system and reliably share that data with other people, independent of individual fonts. With more than 128,000 characters, Unicode is almost as complex as human writing itself. The entire sequence of characters is divided into blocks, each one reserved, more or less, for a particular alphabet or a set of characters that share something in common. Within each block, characters may be grouped further. Each character is assigned a single codepoint. Because computers work on the binary system, codepoints have been numbered according to the related hexadecimal system (base 16), which uses the digits 0 through 9 and the letters A through F. (The number 10 in decimal is A in hexadecimal; decimal 11 = hex B; decimal 16 = hex 10; decimal 79 = hex 4F.) It is helpful to think of Unicode as a very long ribbon sixteen squares wide, a glyph in each square. This is illustrated nicely in this article. Each position along the width is labeled with a hexadecimal number (0-9, A-F) that always identifies the last digit of a character's code point value. It is common to refer to Unicode characters by their value or their name. The value customarily starts "U+" and continues with the hexadecimal value, usually at least four digits. The official Unicode name is usually given fully in uppercase. Examples:

Unicode characters
Character      Unicode value   Unicode name
" " (space)    U+0020          SPACE
®              U+00AE          REGISTERED SIGN
ю              U+044E          CYRILLIC SMALL LETTER YU
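The relationship between decimal numbers, hexadecimal codepoints, and official names can be checked with a few lines of Python. This is only an illustrative sketch using the standard unicodedata module; the helper name describe is hypothetical.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return the customary U+ value and the official name of one character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)


# Decimal to hexadecimal and back.
hex_of_79 = format(79, "X")   # "4F"
decimal_of_hex_10 = int("10", 16)  # 16

# The last hexadecimal digit gives the position across the sixteen-wide ribbon.
column_of_yu = ord("ю") % 16  # 14, i.e. hexadecimal E
```

Applied to the three characters in the table above, describe returns the values and names listed there.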

Normalization TAN validation rules require all data to be normalized according to the Unicode NFC algorithm. Any text in a TAN file that is not NFC normalized will be marked as invalid.
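As a rough illustration of what this check involves (not the TAN validation routine itself), the following Python sketch tests whether a string is already in NFC form. The function name is_nfc is hypothetical.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """True if the text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


decomposed = "e\u0301"  # e followed by U+0301 COMBINING ACUTE ACCENT
composed = "\u00e9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
```

The two strings look identical on screen, but only the composed form is NFC; normalizing the decomposed form yields the composed one.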

Unicode characters with special interpretation When the character U+200D ZERO WIDTH JOINER or U+00AD SOFT HYPHEN occurs at the end of a leaf <div>, perhaps followed by white space that will be ignored (see below), processors will assume that the character is to be deleted, and that when the text is combined with that of the next leaf <div>, no intervening space is to be inserted. Furthermore, because these characters are difficult to discern from spaces and hyphens, any output based on the character mapping of the core functions should replace these characters with their XML entities, &#x200D; and &#xAD;.
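The joining behavior described here can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the TAN function library: the function name join_leaf_divs is hypothetical, each leaf <div> is represented simply by its text, and a single space is assumed as the default separator between adjacent leaf divs.

```python
import re

XML_SPACE = " \t\n\r"
JOINING_CHARS = ("\u200d", "\u00ad")  # ZERO WIDTH JOINER, SOFT HYPHEN


def join_leaf_divs(texts) -> str:
    """Concatenate leaf-div texts, honoring a final joiner or soft hyphen."""
    result = ""
    suppress_space = True  # nothing precedes the first div
    for text in texts:
        # Space normalization: collapse runs of space, then trim the ends.
        normalized = re.sub(r"[ \t\n\r]+", " ", text).strip(XML_SPACE)
        if not suppress_space:
            result += " "
        if normalized.endswith(JOINING_CHARS):
            # Delete the special character and join directly to the next div.
            normalized = normalized[:-1]
            suppress_space = True
        else:
            suppress_space = False
        result += normalized
    return result
```

Given the texts "Ring around", "the ro" plus a soft hyphen and trailing white space, and "sie", the function returns "Ring around the rosie".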

Combining characters At the core level of conformance, Unicode does not dictate whether combining characters (accents, modifying symbols) should be counted independently or as part of a base character, nor does the family of XML languages. In most circumstances, this point is negligible. But it affects regular expressions and XPath expressions (see below). Two of the class 2 formats allow the counting of characters. Such counting is assumed to be made exclusively of non-combining characters, defined as the regular expression [^\p{M}]. Any numerical reference made in a TAN file to an individual character will be found by counting only non-combining characters. When the nth character is requested, TAN functions will return the nth base character along with any combining characters that immediately follow. TAN rules stipulate that combining characters must have a preceding base character. Any <div> that starts with a combining character will be marked as invalid. See also .
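A minimal Python sketch of this counting rule follows. It is an illustration, not the TAN functions themselves: the names is_combining and nth_character are hypothetical, and the Unicode general category M is used as the equivalent of \p{M}.

```python
import unicodedata


def is_combining(ch: str) -> bool:
    """True for characters in Unicode general category M (marks)."""
    return unicodedata.category(ch).startswith("M")


def nth_character(text: str, n: int) -> str:
    """Return the nth (1-based) base character with its combining characters."""
    clusters = []
    for ch in text:
        if not is_combining(ch):
            clusters.append(ch)
        elif clusters:
            clusters[-1] += ch
        else:
            # Mirrors the rule that text may not start with a combining character.
            raise ValueError("text starts with a combining character")
    return clusters[n - 1]
```

In "q" + U+0301 + "ue" (a combination with no precomposed form, so it survives NFC normalization), the first character is the q together with its accent, the second is u, and the third is e.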

Deprecated Unicode points Because TAN is not concerned with appearance, the following characters will generate an error if found in a TAN file:
U+00A0 NO-BREAK SPACE
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
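Before submitting a file for validation, an editor might scan for these characters with something like the following Python sketch. The set is built from the list above; the function name find_deprecated is hypothetical, and the sketch reports positions rather than reproducing any actual TAN error message.

```python
# U+00A0 plus the contiguous range U+2000 through U+200A.
DEPRECATED_CODEPOINTS = {0x00A0} | set(range(0x2000, 0x200B))


def find_deprecated(text: str) -> list:
    """Return (position, U+ value) for every deprecated character found."""
    return [
        (position, f"U+{ord(ch):04X}")
        for position, ch in enumerate(text)
        if ord(ch) in DEPRECATED_CODEPOINTS
    ]
```

An empty list means that none of the twelve characters is present.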

Further Reading Unicode Consortium Unicode (Wikipedia)

eXtensible Markup Language (XML)

What is it? Defined by the W3C, the eXtensible Markup Language (XML) is a machine-actionable markup language that facilitates human readability. For a basic introduction to XML see .

Schemas and validation Validation files are found here: http://textalign.net/release/TAN-2018/schemas/. Each TAN file is validated by two types of schema files, one dealing with major rules concerning structure and data type (written in RELAX-NG), the other with very detailed rules (written in Schematron). The RELAX-NG rules are written primarily in compact syntax (.rnc), and then converted to the XML syntax (.rng). For TAN-TEI, the special format One Document Does it all (.odd) is used to alter the rules for TEI All. The Schematron files are generally quite short. The primary work is done by a large function library written in XSLT. For more on this process, see . Some validation engines that process a valid TAN-compliant TEI file may return an error such as the following:

conflicting ID-types for attribute "who" of element "comment" from namespace "tag:textalign.net,2015:ns"

Such a message alerts you to the fact that by mixing TEI and TAN namespaces, you open yourself up to the possibility of conflicting xml:id values. It is your responsibility to ensure that you have not assigned duplicate identifiers. Very often, it is possible for you to configure an XML editor to ignore this discrepancy. (In oXygen XML editor go to Options > Preferences... > XML > XML Parser > RELAX NG and uncheck the box ID/IDREF.)

White space By default in XML, unless otherwise specified, consecutive space characters (space, tab, newline, and carriage return) are considered equivalent to a single space. This gives editors the freedom they need to format XML documents as they like, for either human readability or compactness. All TAN formats assume space normalization, with an extra caveat, namely, that some space is assumed to exist between adjacent leaf <div>s, even if no text node intervenes. (This behavior is overridden if the first leaf <div> ends in the soft hyphen or the zero width joiner; see .) The TAN format does not stipulate how space-only text nodes should be interpreted. It is up to processors to analyze the relevant <div-type> to infer an appropriate type of white-space separator. If retention of multiple spaces is important for your research, then TAN formats may not be appropriate, since TAN is not intended to replicate the appearance of a scriptum. Pure TEI (and not TAN-TEI) might be a practical alternative, since it allows for a literal use of space, and encourages XML files that try to replicate the appearance of a scriptum. For more on white space see the W3C recommendation.

Non-mixed content Many familiar text formats such as TEI, HTML, and Docbook allow what is called mixed content—a mixture of elements and nonspace text as siblings. The TAN formats, aside from TAN-TEI, are committed to a non-mixed content model. Nonspace text nodes and elements are never siblings. The practical effect of this decision is that indentation may be applied to a TAN file as one wishes, and space text nodes may be inserted between any two adjacent elements, without affecting the meaning. To specify in a class 1 file that two adjacent leaf <div>s should have no intervening space, see .

Namespaces

What are they? XML allows users to develop vocabularies of elements as they wish. One person may wish to use the element <bank> to refer to financial institutions, another to rivers. Perhaps someone wishes to mention both rivers and financial institutions in the same document. XML was designed to allow users to mix vocabularies, even when those vocabularies use identical element names. This means that anyone using <bank> must be allowed to specify exactly which vocabulary is being used. Disambiguation is accomplished by associating IRIs (see below) with the element names. The actual full name of an element is the local name plus the IRI that qualifies its meaning, e.g., bank{http://example1.com/terms/} and bank{http://example2.com/terms/}. The relationship between the element name and the IRI is analogous to that between a person's given name and their family name. The IRI (the family name) is called the namespace. The term is not ideal, but it is the one that has been adopted. Think of the namespace as the family name for a group of elements. Namespace declarations look a lot like attributes (they aren't). They take the form xmlns="http://example1.com/terms/" (defining the default namespace) or xmlns:[PREFIX]="http://example2.com/terms/" (defining a namespace that has been assigned a particular prefix) placed inside an opening tag. For example,

<bank xmlns="http://example1.com/terms/">...</bank>

states, in effect, the namespace for <bank> and the default namespace for all descendants (it can be explicitly overridden). Different types of <bank> can be mixed through namespaces:

<bank xmlns="http://example1.com/terms/">
   <bank xmlns="http://example2.com/terms/"> ... </bank>
</bank>

<bank xmlns="http://example1.com/terms/" xmlns:e2="http://example2.com/terms/">
   <e2:bank> ... </e2:bank>
</bank>

<e1:bank xmlns:e1="http://example1.com/terms/" xmlns:e2="http://example2.com/terms/">
   <e2:bank> ... </e2:bank>
</e1:bank>
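The three nested <bank> examples differ only in how they declare their namespaces; to a namespace-aware parser they carry the same element names. The following Python sketch shows this with the standard xml.etree.ElementTree module, which reports each element's full name in the form {IRI}local-name. The helper name expanded_names is hypothetical.

```python
import xml.etree.ElementTree as ET

EXAMPLES = [
    '<bank xmlns="http://example1.com/terms/">'
    '<bank xmlns="http://example2.com/terms/"> ... </bank></bank>',
    '<bank xmlns="http://example1.com/terms/" xmlns:e2="http://example2.com/terms/">'
    '<e2:bank> ... </e2:bank></bank>',
    '<e1:bank xmlns:e1="http://example1.com/terms/" xmlns:e2="http://example2.com/terms/">'
    '<e2:bank> ... </e2:bank></e1:bank>',
]


def expanded_names(xml_text: str) -> list:
    """List the namespace-qualified name of every element, in document order."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]


# All three serializations resolve to the same pair of qualified names.
names_per_example = [expanded_names(example) for example in EXAMPLES]
```

Prefixes such as e1 and e2 are therefore local conveniences; only the IRI and the local name matter.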

TAN namespace and prefix The TAN namespace is tag:textalign.net,2015:ns. The recommended prefix is tan. The TAN-TEI format uses as its default the TEI namespace, , normally given the prefix tei.

The Text Encoding Initiative

What is it? The Text Encoding Initiative (TEI) is a collection of XML rules for the representation of texts in digital form. Developed and maintained by a consortium of scholars and scholarly organizations, TEI includes not only a library of schemas, but guidelines, stylesheets, and more. The TEI Guidelines have been widely used by libraries, museums, publishers, and individual scholars to present texts for online research, teaching, and preservation. In addition to the Guidelines themselves, the Consortium provides a variety of resources and training events for learning TEI, information on projects using the TEI, a bibliography of TEI-related publications, and software. (Taken from the TEI website, accessed 2017-05-21.) Any TAN-T module can be easily cast into a TEI file, although much of the computer-actionable semantics will be lost in the process. Likewise, a TEI file can be converted to TAN-T, but there is a greater risk of loss of content, particularly in the header, since the non-TEI TAN formats are restricted to a small subset of TEI tags. TAN-TEI is TAN's TEI extension, based on an ODD file that is in the same directory as the rest of the schemas. TAN-TEI schemas are generated on the basis of the official TEI All schema that is available at the time of release. For more about the strictures placed upon the TEI All schema see . See also and .

Further reading Text Encoding Initiative

Data types Being written purely in XML technologies, TAN adopts its data types, e.g., strings, booleans, and so forth, from the official specifications made by the W3C. The following data types require some special comments.

Languages TAN adopts for language identification Best Current Practice (BCP) 47, which standardizes identifiers for languages and scripts. For most users of TAN, this will be a simple three-letter abbreviation, sometimes supplemented with a hyphen and an abbreviation designating a script or regional subtag. For example, eng, eng-UK, and eng-UK-Cyrl refer, respectively, to English (in general), English from the United Kingdom, and English from the United Kingdom written in the Cyrillic script. As a general rule, values of this type should begin with a three-letter language code, preferably lowercase. ISO codes for human languages appear in @xml:lang and <for-lang>. The former states what language the enclosed text is in. The latter indicates that some statement or claim is being made about a specific language. For example, <for-lang> in the context of a TAN-mor file indicates which languages the file was written for. For more information, see one of the following: BCP 47 official specifications BCP 47 technical details

Dates and times TAN adopts the standardized ISO form of dates and date-times, as interpreted by XML data types. These begin with years (the largest unit) and end with days, seconds, or fractions of seconds (the smallest). The simplest date takes this form: YYYY-MM-DD. If a time is included, it is specified by continuing the string, first with a T (for time) then the form hh:mm:ss.sss(Z|[-+]hh:mm). For example, 2016-09-20T20:38:27.141-04:00 is an ISO date-time for Tuesday, September 20, 2016 at 8:38 p.m. in the Eastern Time Zone (daylight time, four hours behind UTC). More reading: W3C specification Wikipedia entry on ISO 8601
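The example date-time can be unpacked with Python's standard datetime module, as in the sketch below. It is offered only to make the parts of the string concrete; note that Python's fromisoformat accepts a subset of ISO 8601 and, before Python 3.11, does not accept the Z suffix.

```python
from datetime import datetime, timedelta, timezone

stamp = datetime.fromisoformat("2016-09-20T20:38:27.141-04:00")

weekday_index = stamp.weekday()                    # Monday is 0, so Tuesday is 1
offset = stamp.utcoffset()                         # four hours behind UTC
as_utc = stamp.astimezone(timezone.utc).isoformat()
```

Because the largest unit comes first, two date-times in the same time zone sort chronologically even when compared as plain strings, which is convenient when deciding which of two <change> elements is the more recent.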

Identifiers and Their Use The acronyms for identifiers, and the meanings of those acronyms, can be mystifying. Here is a synopsis:
IRI: Internationalized Resource Identifier, a generalization of the URI system, allowing the use of Unicode; defined by RFC 3987.
URI: Uniform Resource Identifier, a string of characters used to identify a name or a resource; defined by RFC 3986.
URL: Uniform Resource Locator, a URI that identifies a Web resource and the communication protocol for retrieving the resource.
URN: Uniform Resource Name, a term that originally referred to persistent names using the urn: scheme, but is now applied to a variety of systems that have registered with the IANA. URNs are generally best thought of as a subset of URIs.
UUID: Universally Unique Identifier, a computer-generated 128-bit number used to assign identifiers to any entity. UUIDs can be built into a URN by prefixing them with urn:uuid:.
The TAN format generally prefers to refer to IRIs. See also .
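For readers who have not met UUIDs before, the following Python sketch shows one in its URN form, using the standard uuid module. The particular UUID value is an arbitrary placeholder chosen for illustration, not an identifier used anywhere in TAN.

```python
import uuid

# An arbitrary, fixed UUID used only for illustration.
identifier = uuid.UUID("12345678-1234-5678-1234-567812345678")

as_urn = identifier.urn                      # the UUID expressed as a URN
fits_in_128_bits = identifier.int < 2 ** 128

# A freshly generated random UUID has the same shape.
fresh_urn = uuid.uuid4().urn
```

The value of as_urn is urn:uuid:12345678-1234-5678-1234-567812345678.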

Resource Description Framework (RDF) and Linked Open Data

What are they? Identifiers are used in many contexts for many purposes. One such purpose is called Linked Open Data (LOD) or the Semantic Web, which relies upon a very simple data model called Resource Description Framework (RDF), a family of World Wide Web Consortium (W3C) specifications originally designed as a data model for metadata. RDF was designed to be a data model to support general assertions. The model rests upon the concept of a statement, made of three parts: subject, predicate, and object. Subjects and predicates take identifiers that act as names of things. The object may take an identifier or just data. The idea behind LOD is that as we begin to use the same URLs for the same concepts, independently created datasets can be combined and compared. The entire collection of RDF statements on the web allows inferences not possible on the project level. These URL identifiers look like a web page address (e.g., http://...), but are first and foremost names for things (the term "Resource," the R in RDF, is a clumsy way to refer to any person, place, or concept: anything at all). Ideally, those URLs will still name those things after the domain name expires and the web resource cannot be found.

TAN and RDF Much of TAN can be converted to RDF statements. In fact, TAN may be one of the most human-friendly ways to read and write RDF. Compare, for example, this snippet (taken from ), written in Turtle syntax,

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://biglynx.co.uk/people/dave-smith>
   rdf:type foaf:Person ;
   foaf:name "Dave Smith" .

with the TAN equivalent:

<person xml:id="dsmith">
   <IRI>http://biglynx.co.uk/people/dave-smith</IRI>
   <name>Dave Smith</name>
</person>

In this case TAN and RDF are converted losslessly. But in many other cases, TAN statements cannot be reduced to the RDF model. This happens most often in the context of <claim>, which is designed to allow scholarly assertions and claims that are difficult or impossible to express in RDF. For example, RDF does not allow one to say "Person X is not the author of text Y." TAN claims have adapted the core concepts behind RDF to cater to scholarly needs. For more details see .
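The correspondence between the <person> element and the two RDF statements can be made concrete with a small Python sketch. It is an illustration only, not a TAN-to-RDF converter: the function name person_to_triples is hypothetical, triples are represented as plain tuples of strings, and the mapping of <person> to foaf:Person and of <name> to foaf:name is taken from the comparison above rather than from any official TAN specification.

```python
import xml.etree.ElementTree as ET

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"

PERSON = """
<person xml:id="dsmith">
   <IRI>http://biglynx.co.uk/people/dave-smith</IRI>
   <name>Dave Smith</name>
</person>
"""


def person_to_triples(xml_text: str) -> list:
    """Turn a <person> element into (subject, predicate, object) tuples."""
    person = ET.fromstring(xml_text)
    name = person.findtext("name").strip()
    triples = []
    for iri in person.findall("IRI"):
        subject = iri.text.strip()
        triples.append((subject, RDF_TYPE, FOAF + "Person"))
        triples.append((subject, FOAF + "name", name))
    return triples


triples = person_to_triples(PERSON)
```

The result holds the same two statements as the Turtle snippet: the subject is a foaf:Person, and its foaf:name is "Dave Smith".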

Further reading W3C recommendation Linked Data Linked Open Vocabularies

Tag URNs TAN files make extensive use of tag URNs (see ). In fact, TAN's namespace is a tag URN (). A tag URN has two parts:
Namespace. tag: + an e-mail address or domain name owned by the person or organization that has authorized the creation of the TAN file + , + an arbitrary day on which that address or domain name was owned. The day is expressed in the form YYYY-MM-DD, YYYY-MM, or YYYY. A missing MM or DD is implicitly assigned the value of 01.
Name of the TAN file. : + an arbitrary string (unique to the namespace chosen) chosen by the namespace owner as a label for the entire file and related versions. It need not be the same as the filename stored on a local directory. You should pick a name that is at least somewhat intelligible to human readers.
Although you may use any tag URN coined by someone else, you may create a tag URN only if you are the owner of that URN's namespace. Great care must be taken in choosing the IRI name, because you are the sole guarantor of its uniqueness. It is permissible for something to have multiple IRIs, but never acceptable for an IRI to name more than one thing. It is a good practice to keep a master checklist of IRI names you have created. If you find yourself forgetting, or think you run the risk of creating duplicate IRI names, you should start afresh by creating a new namespace for your tag URNs, easily done just by changing the date in the tag URN namespace.
TAN IRI names
tag:jan@example.com,1999-01-31:TAN-T001
tag:example.com,2001-04:hamlet-tan-t
tag:evagriusponticus.net,2014:tan-a-lm:Evagrius_Praktikos_grc_Guillaumonts
tag:bbrb@example.org,1995-04-01:pos-grc
The first example comes from someone who owned the email address jan@example.com on January 31, 1999 (at the stroke of midnight, Coordinated Universal Time). The other examples follow a similar logic. The namespaces of the second and third examples are tied to the owners of specific domain names, not those of email addresses. 
The 2014 in the third example is shorthand for the first second of January 1, 2014. The TAN encoding format has chosen tag URNs over URLs for several reasons:
Permanence. Authors of TAN data are creating files that are meant to be relevant for decades and centuries from now, well after specific domain names have changed ownership or fallen into obsolescence, and well after the creators are dead. URLs are not built for such permanence.
Responsibility. The TAN format requires every piece of data to be attributable to someone (a person, organization, or some other agent). A tag URN implies who was responsible for creating the URN.
Accessibility. Tag URNs can be made by anyone who has an email address. No one has to register with any central authority. You can begin naming anything you want, any time you want, without seeking anyone's approval.
Ease. Tag URNs are easier to use than, say, http-form URLs, as recommended by RDF (see ). Many potential TAN authors never have owned a domain name, and never will. Further, many of those who do own domain names cannot or do not wish to configure and maintain servers to administer the referral mechanisms upon which the semantic web depends.
Disambiguation of name and location. In the semantic web, conflation of name with a location to resolve it is considered a virtue because the single string does double duty, both naming the resource and pointing to a location where more can be learned. But this conflation is unhelpful in the TAN context. TAN files are meant to be distributed widely, and not rely upon a single location. And URLs are in common parlance interpreted as locations for data, not as names for things. Tag URNs don't confuse users by looking like locations. This upholds a principle that is common in scholarly citation, namely, that one should always distinguish the name of a resource from where it might be found.
Further reading: RFC 4151, the official definition of tag URNs
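The two-part anatomy of a tag URN can be sketched in Python as follows. This is a simplified illustration of the structure described in this section, not a full implementation of RFC 4151: the pattern and the function name parse_tag_urn are hypothetical, and the check on the e-mail address or domain name is deliberately loose.

```python
import re

# tag: + authority + , + date + : + name (the name may itself contain colons)
TAG_URN = re.compile(r"tag:([^,:]+),(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?:(.+)")


def parse_tag_urn(urn: str) -> tuple:
    """Split a tag URN into (authority, full date, name).

    A missing month or day is filled in as 01, as described above.
    """
    match = TAG_URN.fullmatch(urn)
    if match is None:
        raise ValueError(f"not a tag URN: {urn!r}")
    authority, year, month, day, name = match.groups()
    return authority, f"{year}-{month or '01'}-{day or '01'}", name
```

Applied to the third example, the function returns the domain evagriusponticus.net, the date 2014-01-01, and the name tan-a-lm:Evagrius_Praktikos_grc_Guillaumonts.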

Regular Expressions Regular expressions are patterns for searching text. The term regular here does not mean ordinary. Rather, it derives from Latin regula, and points to a rule-based syntax that provides patterns for finding and replacing text. Regular expressions come in different flavors, and have several layers of complexity. TAN regular expressions adhere closely to the recommendation of XSLT 3.0 (XML Schema Datatypes plus some extensions), as outlined in XPath Functions 3.0. XML Schema Datatypes defines regular expressions differently than does Perl, one of the most common flavors of regular expression. For example, the pipe symbol, |, is treated as a word character in XML regular expressions (\w), but the opposite is true for Perl. For convenience, here is how codepoints U+0020..U+00FF are categorized according to XML (and therefore TAN): Word characters (\w):

$ + 0 1 2 3 4 5 6 7 8 9 < = > A B C D E F G H I J K L M N O P Q
                           R S T U V W X Y Z ^ ` a b c d e f g h i j k l m n o p q r s t u v w x y z
                           | ~ ¢ £ ¤ ¥ ¦ ¨ © ª ¬ ® ¯ ° ± ² ³ ´ µ ¸ ¹ º ¼ ½ ¾ À Á Â Ã Ä Å Æ Ç È É Ê Ë
                           Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð
                           ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

Non-word characters (\W):

! " # % & ' ( ) * , - . / : ; ? @ [ \ ] _ { } ¡ § «  ¶ · »
                           ¿
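The two lists follow from the way XML Schema defines \w: every character except those in the Unicode general categories P (punctuation), Z (separators), and C (other). The Python sketch below reproduces that rule with the standard unicodedata module. The function name is hypothetical, and because Unicode occasionally reclassifies characters, a given Python version may disagree with the lists above on an isolated character.

```python
import unicodedata


def is_xml_word_char(ch: str) -> bool:
    """Approximate XML Schema \\w: anything outside categories P, Z, and C."""
    return unicodedata.category(ch)[0] not in "PZC"


# Symbols count as word characters; punctuation, including "_", does not.
word_like = [ch for ch in "$+<=>^`|~" if is_xml_word_char(ch)]
non_word = [ch for ch in "!_-*@" if not is_xml_word_char(ch)]
```

This is why the pipe and the dollar sign appear among the word characters while the underscore, which Perl-style flavors count as a word character, does not.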

Some of these choices may seem counterintuitive or wrong. But at this point it does not matter. The distinction is a legacy that will remain in place. It is advisable to familiarize yourself with decisions that, in some respect, are arbitrary. A regular expression search pattern is treated just like a conventional search pattern until the computer reaches one of the following special characters:

. [ ] \ | - ^ $ ? * + { } ( )

Here is a brief key to how characters behave in regular expressions, provided they are not in square brackets (on which see the recommended reading below):

Special characters in regular expressions
Symbol           Meaning
$                end of line
.                any character
|                or (union)
^                start of line
?                zero or one
*                zero or more
+                one or more
[ ]              a class of characters
( )              a group
\w               any word character
\W               any nonword character
\s               any of the four standard spacing characters: space (U+0020), tab (U+0009), newline (U+000A), carriage return (U+000D)
\S               anything not a spacing character
\d               any digit (0-9)
\D               anything not a digit
\p{IsGujarati}   any character from the Unicode block named Gujarati
\\               backslash (the backslash alone signals that the next character is to be treated specially)
\$               dollar sign
\(               opening parenthesis
\[               opening square bracket

Some examples:

Examples of Regular Expressions (each expression is applied to "Wi-fi, good. A_hem* isn't!")

Expression: ^.+$
Meaning: one whole line of characters
Matches: "Wi-fi, good. A_hem* isn't!"

Expression: [ae]
Meaning: a or e
Matches: "e"

Expression: [a-e]
Meaning: a, b, c, d, or e
Matches: "d", "e"

Expression: [^ae]+
Meaning: one or more characters that are anything except a or e
Matches: "Wi-fi, good. A_h", "m* isn't!"

Expression: .i
Meaning: any character followed by i
Matches: "Wi", "fi", " i"

Expression: (.i)
Meaning: when a character followed by an i is found, treat it as a capture group (used only in a search pattern)
Matches: "Wi", "fi", " i"

Expression: $1
Meaning: first capture group (used only in a replacement pattern, and corresponds to the sequence of capture groups in the search pattern)
Matches: in the example above, each match corresponds to $1

Expression: [aeiou]\w*
Meaning: any lowercase vowel along with every word character that follows
Matches: "i", "i", "ood", "em", "isn"

Expression: [t*].
Meaning: any t or * and the following character. Note that the asterisk, if inside a character class, acts as itself.
Matches: "* ", "t!"

Expression: \s+
Meaning: one or more space characters
Matches: " ", " ", " "

Expression: \w+
Meaning: one or more word characters (the underscore is a nonword character, so it splits "A_hem")
Matches: "Wi", "fi", "good", "A", "hem", "isn", "t"

Expression: \W+
Meaning: one or more nonword characters
Matches: "-", ", ", ". ", "_", "* ", "'", "!"

Expression: [^q]+
Meaning: one or more characters that are not a q
Matches: "Wi-fi, good. A_hem* isn't!"

The examples above provide a taste of how regular expressions are constructed and read. Regular Expressions and Combining Characters Regular expressions come in many different flavors, and each one deals with some of the more complex issues in Unicode in its own manner. This ambiguity will be felt most keenly in the use of combining characters. Suppose we have a string of three characters, áb (i.e., an a, a combining acute accent, and a b). The regular expression a. will in some search engines include the b and in others not. Unicode has differentiated three levels of support for regular expressions (see official report). TAN guarantees only level one conformance. Combining characters fall in level two. In TAN, character counts depend exclusively upon base characters, not combining ones (see ). TAN includes several functions that usefully extend XML regular expressions. See tan:regex, tan:matches(), tan:replace(), tan:tokenize(). Further reading: Various tutorials on Regular Expressions; Wikipedia, Regular Expressions; Regular Expressions in XSLT 3.0; Unicode and Regular Expressions; XML Schema Datatypes

Interpretation of multiple values Many TAN elements contain multiple values or have attributes that allow multiple values. Do those multiple values represent intersection, union, or distribution? For example, attribute="A B" could be interpreted to mean, using the diagram below, anywhere in y (intersection); anywhere in x, y, or z (union); or somewhere in x or y and somewhere in y or z (distribution).

[Figure: Venn diagram with regions x, y, and z]

Multiple values in TAN are defined according to perceived common usage in ordinary English:
Union (= x, y, or z; default). Examples: anything that takes the , <equate>, <period>, <where>.
Intersection (= y only). Examples: @adverb and other qualifications of claims. For example, "...probably not..." does not mean "...probably..." and "...not..."
Distribution (= x or y and y or z). Examples: @affects-element, @claimant, @object, <object>, @src, @subject, <subject>, @verb. For example, "[Source A], [source B], are Z" means "Source A is Z" and "Source B is Z."
The discussion above does not treat the important question of range. If an assertion is made about A, is it true for one point in x or y, or is it true for any and all points in x and y? At present, TAN does not address this ambiguity, and leaves the interpretation open.
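Distribution can be pictured as expanding one statement into every combination of its values. The Python sketch below is only an illustration: the function distribute is an assumption, not part of the TAN function library, and the sample values are invented.

```python
from itertools import product


def distribute(subjects: str, verbs: str, objects: str) -> list[tuple[str, str, str]]:
    """Expand space-separated attribute values into one claim per combination."""
    return list(product(subjects.split(), verbs.split(), objects.split()))


# Hypothetical values: two subjects, one verb, one object.
claims = distribute("source-a source-b", "is", "z")
for claim in claims:
    print(" ".join(claim))
```

Here the single statement with two subjects yields two separate claims, one for each source.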

Patterns and Structures Common to All TAN Encoding Formats This chapter provides general background to the elements and attributes that are common to all TAN files. For detailed discussion of individual elements and attributes, see . This chapter has no relevance for TAN catalog files. For an explanation of that format, see .

Common Patterns

IRI + name Pattern Both humans and computers need to read and write TAN metadata. Very often what is readable to humans is unreadable to computers, and vice versa. So the TAN format requires that all metadata be provided whenever possible in both forms. Although this rule may appear to introduce redundancy and therefore opportunities for error, the clarity is critical. It is the only way at present to ensure that anyone who approaches the data, computer or human, can parse and use it. In addition, doubly expressed metadata provides a safeguard much like a checksum: human- and computer-readable descriptions should correspond. Any discrepancy is a signal that an error should be diagnosed and fixed. Some metadata, such as comments, are neither easily nor profitably translated into a computer-actionable string. In such cases only the human-readable form is required. Other metadata involve regular expressions or ISO-compliant dates, both of which are well formed and are usually human-legible. Such data are not repeated. In cases where a datum is not understandable to humans, such as a complex regular expression, a <comment> may be provided. Those exceptions aside, all other metadata take what is called the IRI + name pattern: one or more <IRI>s and <name>s and zero or more <desc>s. If the thing being described is a digital file, then the IRI + name pattern is part of a larger pattern, the Digital Entity Metadata Pattern.

Digital Entity Metadata Pattern Some entities identified by the IRI + name Pattern will be digital resources. In those cases, the IRI + name Pattern is extended in two different ways, according to whether the entity is a TAN file or not. If the entity is a TAN file, then <IRI> (one and only one) must be a valid tag URN that matches the @id value of the TAN file being referred to. This may seem excessive, since in other contexts (HTML, TEI), one needs only the @href or @src. This extra measure has been introduced because TAN files are meant to be valid long after their creation, when they may be separated from their original context, or when a server no longer has the files referred to. Without the @id value, recovering the referred-to file would be difficult or impossible; with it, easier, and perhaps possible. If the entity is not a TAN file, then any IRI may be used. If you choose to use the digital resource's URL as its name (and as its location; see below), then it will be inferred that you mean to identify the digital resource that appeared at that URL at the date or time you accessed it. In either case, the pattern adds to the IRI + name pattern one or more <location>s and an optional <checksum>.

Edit Stamp Most TAN elements allow for an optional edit stamp, an @ed-who and an @ed-when, stating who created or edited the enclosed data and when. Neither attribute is allowed without the other. @ed-when, along with @when and @when-accessed, are the attributes through which a TAN file's version is calculated. The latest date serves as the version number. An edit stamp performs the same function as <change>, except that no description can be provided, and it points precisely to the element where a change has been made. If a description of the alteration is necessary, <change> should be used.
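The version rule described above (the latest date among @when, @ed-when, and @when-accessed) can be sketched in a few lines of Python. This is not the TAN function library. The function name latest_date and the sample fragment are invented for illustration, the fragment is not a schema-valid TAN file, and the sketch assumes that all dates are ISO dates or date-times in the same time zone.

```python
import xml.etree.ElementTree as ET

DATE_ATTRIBUTES = ("when", "ed-when", "when-accessed")


def latest_date(xml_text: str) -> str:
    """Return the most recent value found in any date attribute."""
    root = ET.fromstring(xml_text)
    dates = [
        element.attrib[name]
        for element in root.iter()
        for name in DATE_ATTRIBUTES
        if name in element.attrib
    ]
    # Assumption: all values are ISO dates or date-times in the same time
    # zone, so plain string comparison puts them in chronological order.
    return max(dates)


# Invented fragment for illustration only, not a schema-valid TAN file.
SAMPLE = """
<TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:example.com,2014:sample" TAN-version="2018">
  <head>
    <change when="2016-05-01" who="ed1">Started file</change>
  </head>
  <body>
    <div n="1" ed-when="2017-11-20T09:30:00" ed-who="ed1">Some text</div>
  </body>
</TAN-T>
"""

print(latest_date(SAMPLE))
```

In this fragment the edit stamp on the <div> is later than the <change>, so the edit stamp supplies the version.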

Overall Structure All TAN-compliant files, no matter the type or class, follow a common basic structure: (1) a prolog with at least two processing instruction nodes; (2) a root element; and (3) a head, a body, and an optional teiHeader and tail. Prolog and processing instruction nodes: The standard prolog of every XML file must begin the file:

<?xml version="1.0" encoding="UTF-8"?>

After that come two processing instructions specifying the two schema files required for validation:

<?xml-model href="[PATH]/[ROOT-ELEMENT-NAME].rn[g OR c]"?>

<?xml-model href="[PATH]/[ROOT-ELEMENT-NAME].sch"?>

The first processing instruction node points to the RELAX-NG schema that declares the major, structural rules. The second points to the finely tuned rules, written in Schematron. Both processing instructions are required. [PATH] represents the pathname to the schema file, whether local or on a server, and [ROOT-ELEMENT-NAME] stands for the name of the root element (the element that is the ancestor of all other elements in the document and the descendant of none). It is your choice whether you use .rnc or .rng as the extension for the RELAX-NG schema. The former is the compact syntax and the latter, the XML format. They are equivalent. The schemas are written primarily in the compact syntax, then converted to the XML format. TAN files admit three different levels of validation: terse, normal, and verbose. A phase may be specified with a pseudoattribute phase in the prolog, e.g.,

<?xml-model href="TAN-A-div.sch" phase="verbose"?>

But it is customary not to specify the phase, since users will oftentimes wish to change the level of validation. Verbose takes the longest, and terse the shortest. Verbose provides the most feedback, terse the least. Root element: The name of the root element identifies the type of TAN file:

Root TAN elements

Root element name    Type of data                            TAN class
<TAN-T>              plain text transcriptions               1
<TEI>                TEI transcriptions                      1
<TAN-A-tok>          token-based alignments                  2
<TAN-A-div>          division-based alignments               2
<TAN-A-lm>           lexico-morphological analysis           2
<TAN-mor>            part of speech / morphology patterns    3
<TAN-key>            glossaries                              3
<collection>         catalog of TAN files                    3

<collection> is provided here only to complete the table. None of the material in this chapter applies to this special class 3 format. See . Each root element takes a mandatory @id and @TAN-version. All TAN elements take the namespace tag:textalign.net,2015:ns. In most cases, this value is placed in the root element. (The only exceptions are TAN-TEI transcription files, which take as a default namespace http://www.tei-c.org/ns/1.0 everywhere but in /TEI/head, which takes the TAN namespace.) For more about namespaces, see . Root element children: Most root elements take two mandatory children: <head> and <body>, the latter containing data and the former, metadata (data about the data). TAN-TEI files take three children: <teiHeader>, <head>, and <text>, because the TEI header does not satisfy TAN expectations. See . All TAN files may take one final optional child, <tail>, a private use element that allows any well-formed XML. It was introduced to facilitate more efficient validation. Nothing in a TAN file should be dependent upon the <tail>. That is, if you are editing a TAN file and you add a <tail>, assume that it will be disregarded by other users. Similarly, you may delete any TAN file's <tail> without consequence.

@id and a TAN file's IRI Name Every TAN file requires in its root element an @id. Its value, termed the TAN file's IRI name, must take the form of a tag URN (see for syntax). The file's IRI name is the primary way other TAN files will refer to it. The namespace of the current file's IRI name must match at least one namespace in one <person>'s <IRI> value. This helps tie the responsibility for the TAN file to at least one person. The first such <person> is called the primary agent, and is bound to the global variable $primary-agent. In choosing a value for @id you might borrow the filename, but you do not have to. Indeed, it is probably not a good idea, since files are frequently renamed, often with good reason. A TAN file's IRI name should not be changed, especially after publication, because the name is supposed to be permanent and stable. On occasion during editing, it will become clear that revisions are so deep that the file is altogether a different kind of thing. If a previous version has been published, then coining a new IRI name is advised, to dissociate the file from its ancestry. You may always document the connection by supplying a <see-also> element in the <head>, specifying the <relationship> between the two. If you take someone else's data and alter it, then you should not change the IRI name, not even the namespace. To avoid suggesting that the owner of that namespace is responsible for any revisions you make to the file (if you are allowed; see <license>), you should add yourself as a <person> and then document your alterations through <change> or @ed-when and @ed-who. You should also probably add a <see-also> element, pointing to a version of the file that predates your intervention. The name of the version of a TAN file is identified by the most recent date in a file's @when, @ed-when, or @when-accessed. It is important, therefore, whenever you change a TAN file that has already been published, to provide at least an edit stamp in the part of the file you changed or in a <comment> or <change>, so that anyone validating a TAN file dependent upon yours will be warned that changes have been made. The user may then either continue to process the file (the changes may be minor or inconsequential) or investigate the changes before deciding what to do. Because the IRI name is stable, it is suitable for use outside of TAN, in, for example, RDFa, JSON-LD, and linked open data (see ). The IRI name kept at @id is the only metadatum positioned outside <head>. It is placed as rootward in the document as possible to emphasize that it names the entire document. @TAN-version must be 2018, indicating that the files have been made in light of the development files of version one.
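The rule that the namespace of @id must match the namespace of at least one <person>'s <IRI> can be sketched as follows. This is an illustration in Python, not the TAN validation code. It assumes that "namespace" means the tag URN up to, but not including, the colon that follows the date (the tag:authority,date shape of RFC 4151). The function names and all sample IRIs are invented.

```python
import re

# Shape of a tag URN per RFC 4151: tag:<authority>,<date>:<specific part>
TAG_URN = re.compile(r"^(tag:[^,:/]+,\d{4}(?:-\d{2}(?:-\d{2})?)?):(.*)$")


def tag_namespace(iri: str):
    """Return the 'tag:authority,date' prefix, or None if not a tag URN."""
    match = TAG_URN.match(iri)
    return match.group(1) if match else None


def id_matches_a_person(file_id: str, person_iris: list[str]) -> bool:
    """True if the file's namespace equals that of at least one person IRI."""
    namespace = tag_namespace(file_id)
    return namespace is not None and any(
        tag_namespace(iri) == namespace for iri in person_iris
    )


# Hypothetical values for illustration.
file_id = "tag:example.com,2014:tan-t/sample"
people = ["http://example.org/people/1", "tag:example.com,2014:self"]
print(id_matches_a_person(file_id, people))
```

An @id whose authority or date differs from every person's tag URN would fail this check, which is the situation the rule is meant to catch.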

Metadata (<head>) No matter how much one TAN format differs from another, the metadata are quite similar. Anyone getting a TAN file, no matter its class or type, is assumed to want to know, and therefore find easily and predictably, the following: the stable name of the file; its version; its sources; other files upon which it depends or with which it otherwise has an important relationship; the most significant parts of the editorial history; the linguistic or scholarly conventions that have been adopted in creating and editing the data; the license, i.e., who holds what rights to the data, and what kind of reuse is allowed; and the persons, organizations, or entities that helped create the data, and the roles played by each. To answer these questions completely, consistently, and predictably, the <head>, a mandatory child of the root element, takes a common pattern across all TAN formats, thus allowing anyone to work easily across large numbers and types of TAN files. The TAN <head>, intended to be concise and focused, compels you to provide metadata for the data that is governed by <body>, but it does not accommodate metadata for the metadata. That is, your metadata should focus on the data itself and not other things. For example, <head> requires you to name the people who helped create or edit the data, but you are not expected to tell us about them. Merely give good <IRI>s that point to authoritative sources that provide background information. The principles above explain why the TEI extension of TAN requires two heads, one for TEI and the other for TAN. <teiHeader> is impossible to map onto a TAN <head>. But that <teiHeader> has valuable, sometimes critically important, information, and should be retained, or replaced with a valid but empty skeleton. Detailed descriptions of <head> and its components are in . Here we provide a summary, general description of TAN metadata. To describe the current file, <head> takes one or more <name>s, zero or more <desc>s and <master-location>s, and one <license>. Next comes a list of files upon which the file depends: zero or more <inclusion>s, zero or more <key>s, zero or more <source>s, and zero or more <see-also>s. All editorial assumptions are placed in <definitions>, whose contents differ from one TAN format to the next. Finally comes the responsibility section stating who did what when: one or more <person>s, <role>s, and <change>s, and zero or more <resp>s.

Rights and Licenses Two TAN elements cover rights and licenses: <license> (mandatory in every TAN file) and <licensor>. The first element defines the license under which you are releasing your data; the second specifies who has licensed the data. The license applies only to the file itself, not to its sources. The distinction is important, and helpful. It is much easier for you to decide and state the rights and license behind your own work than to do so for that of others. Declaring who holds what rights over your source(s) may be not only difficult but risky, and is therefore optional (see below). When using a TAN file, you should investigate the entire chain of rights. If you find a discrepancy between the license of a TAN file and that of its sources you should respect the more restrictive one. If a TAN file has a very liberal, open license for the data, this does not necessarily mean that the material upon which it depends is in the public domain. The TAN file's source may be under tight restrictions. If you wish to indicate what license governs a source, use <desc> in <source>. TAN adopts the Creative Commons licenses as its default key vocabulary. See .

Keys and Inclusions Many if not most TAN files are created alongside or in the context of a project, where certain elements will be repeated. Explicit repetition from one file to the next makes them prone to error. Changes might be made in one file but not in another. TAN has two features—keys and inclusions—that help avoid duplication, reduce the likelihood of incomplete editing, and lead to cleaner, smaller files. In general, you should first work with keys. If they are not doing the job you need, then try inclusions.

Keys Most often, an editor wants a simple, shorthand reference to an entity commonly referred to from one file to the next in a single project, e.g., the person who is the principal editor, roles, and division types. Projects are advised to create their own <TAN-key> files populated with commonly used vocabulary. Using those files is a two-step process. First, the TAN-key file is declared via <key>. Second, elements (normally in <definitions>) can take @which instead of the customary IRI + name pattern. @which points to a <name> in the TAN-key file. TAN includes a number of standard TAN-key files located at and documented in . Any element that takes @which can take full advantage of those files, without <key>. It is strongly recommended that you depend upon only TAN-key files you have written, and not those of a different project.

Inclusions More powerful than TAN-keys are inclusions. Unlike other forms of inclusion you may be familiar with, TAN inclusion involves only select elements, never an entire file. As with keys, TAN inclusion is a two-step process. First, a TAN file is made available for inclusion via <inclusion>s (inside <head>). Like <key>, an <inclusion> does nothing on its own. It merely indicates a file that may be used for inclusions. Second, elements that allow it may take @include, which points to the @xml:id reference of the <inclusion>. In the validation process, those elements will be replaced with every element of that name found in the inclusion file, checked recursively (see below), and ignoring duplicated elements. <inclusion>s are critically important to the content of the TAN file, so any file with <inclusion>s that cannot be located will be regarded as being in fatal error. Because of the importance of access to included files, it is strongly recommended that inclusions be limited to files locally available, in the same project. Inclusions are recursive. If a TAN file A has <x include='B'> and file B has <x include='C D E'>, then file A will be given all <x>s found in B, C, D, and E. In any recursive activity, circularity is fatal. That is true for TAN inclusion as well, but only within a given element name. It is perfectly legal for two files to include each other, as long as they do not try to include the same elements. TAN inclusion removes elements from their original context, which means that values that must be interpreted locally are converted before the elements are included. For example, @which must be interpreted in light of the included document's keys, not those of the including document. Similarly, different numeration systems, e.g., Roman numerals, must be interpreted locally and converted before inclusion (see ).
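The recursive behavior described above, including the rule that circularity is fatal only within a given element name, can be modeled with a small Python sketch. It is a simplified illustration, not the TAN implementation: files are represented as plain dictionaries, and the function resolve, the key names "own" and "include", and the sample data are all assumptions made for the example.

```python
def resolve(files, file_id, element, _stack=()):
    """Collect every value of `element` available to `file_id`,
    following inclusions recursively and ignoring duplicates."""
    if file_id in _stack:
        chain = " -> ".join(_stack + (file_id,))
        raise ValueError(f"circular inclusion of <{element}>: {chain}")
    entry = files[file_id].get(element, {})
    result = []
    for item in entry.get("own", []):
        if item not in result:
            result.append(item)
    for target in entry.get("include", []):
        for item in resolve(files, target, element, _stack + (file_id,)):
            if item not in result:  # ignore duplicated elements
                result.append(item)
    return result


# Hypothetical project: A includes B's <x>; B includes <x> from C, D, and E.
# B also includes A's <y>, which is legal because the element name differs.
FILES = {
    "A": {"x": {"include": ["B"]}, "y": {"own": ["y from A"]}},
    "B": {"x": {"own": ["x from B"], "include": ["C", "D", "E"]},
          "y": {"include": ["A"]}},
    "C": {"x": {"own": ["x from C"]}},
    "D": {"x": {"own": ["x from D"]}},
    "E": {"x": {"own": ["x from E"]}},
}

print(resolve(FILES, "A", "x"))
print(resolve(FILES, "B", "y"))
```

The stack of files already visited is tracked separately for each call, so two files may include each other for different element names, while a loop within one element name raises an error.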

Distinguishing <source>s and <see-also>s Creating and editing a class 1 TAN file frequently involves working with non-TAN digital files. In the course of editing, and making the material TAN-compatible, you will likely start to correct errors, to normalize conventions, or to bring the transcription closer to an earlier version. At such times it may be unclear how to credit the digital files. To answer this, first determine a class 1 file's <source>. Everything else is then a <see-also>. If you find in the course of editing that you are starting to depend upon the source of your source, then that earlier version should be credited as the <source> and the file you were using should be moved to <see-also>.

Attribute inheritability and priority Many attributes are not inheritable, e.g., @xml:id. Others are inheritable, indicating something about the host element and all its descendants. When a descendant has the same attribute, the default behavior is for the new attribute to cancel any inherited ones, e.g., @xml:lang, @affects-element, @claimant. In other cases, the inherited effect is additive, e.g., @cert. Consult individual attribute entries to understand an attribute's behavior. Some attributes in an element have priority for interpretation. @claimant, for example, has priority over @cert. That is, the two attributes in the same element are to be interpreted to mean: "@claimant has @cert confidence about the following claim:...."
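The default behavior, in which a descendant's attribute cancels an inherited one, can be sketched in Python for @xml:lang. The function name effective_languages and the sample fragment are invented for illustration, and the fragment is not a schema-valid TAN file. Additive attributes such as @cert are not modeled, because how their values combine is described in the individual attribute entries, not here.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(element, inherited=None, path=""):
    """Yield (path, language) pairs, letting a descendant's own xml:lang
    cancel whatever value it would otherwise inherit."""
    language = element.get(XML_LANG, inherited)
    here = f"{path}/{element.tag}"
    yield here, language
    for child in element:
        yield from effective_languages(child, language, here)


# Invented fragment for illustration only.
SAMPLE = """
<body xml:lang="grc">
  <div n="1">
    <div n="1" xml:lang="lat"/>
    <div n="2"/>
  </div>
</body>
"""

for path, language in effective_languages(ET.fromstring(SAMPLE)):
    print(path, language)
```

Only the innermost <div> that declares its own @xml:lang departs from the value set on <body>; its sibling still inherits the original value.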

Defining Words and Tokens At the heart of interaction between class 1 and class 2 files is a reference system that counts or names words. This poses a problem at the outset. The term word is notoriously difficult to define, no matter the language. In different contexts, for example, "New York" and "didn't" can each be justifiably taken to be one or two words. Furthermore, some scholars consider punctuation to be words (e.g., commas in modern prose, representing "and"), whereas others ignore them as being anachronistic or capricious (e.g., ancient Greek and Latin). In the end, the number of meanings for "word" reflects the rich variety of scholarly disciplines. TAN adopts the proximate term token, a word that is defined not according to grammar but according to a regular expression (see ). A TAN token is a reference pointer, not a linguistic marker. To define a token in TAN does not entail any linguistic commitments. Neither editors nor users of TAN data should infer that a <tok> points to a morpheme, a lexeme, or any other linguistic entity. There will frequently be a fortuitous correlation between the two, but it is not guaranteed. In TAN, a token is purely a method of reference. TAN was developed in service of ancient literature, where punctuation is generally ignored as being late or not central to the text. Even in contemporary use, most people ignore punctuation when they count words. Therefore the default <token-definition> defines a token as being any continuous string of word characters, the soft hyphen, the zero-width space, or the zero-width joiner, formally defined: <token-definition regex="[\w&#xad;&#x200b;&#x200d;]+"/> This pattern will result in a close resemblance to what is ordinarily thought of as words, but perhaps with some surprises (see above, ). If no <token-definition> is explicitly given, the pattern above will be assumed. If you are working with modern texts, where punctuation might be important to name and number, try the built-in keyword

letters and punctuation: <token-definition regex="\w+|[^\w\s]"/> This expression defines a token as a sequence of word characters or any single character that is neither a word character nor a space. The string "(I go!)" (the text inside the quotation marks) would have five tokens: ( I go ! ). Above are two built-in, TAN-defined <token-definition>s. You may customize your own <token-definition> to suit your needs. But keep in mind that TAN files were meant to be shared across fields and disciplines. You are encouraged to define tokens in a manner customary to users of the text. Specialized definitions make it less likely that your TAN file will be able to mesh well with other TAN files. Two class-2 files annotating the same class-1 file cannot be easily compared or synthesized if they use different definitions of token. Given those caveats, consider a specialized case, where you wish to prepare your transcriptions such that certain Unicode characters precisely delimit tokens that are synonymous with a particular linguistic category, say lexeme. Say, for example, you use specialized control characters (e.g., U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER) to mark word boundaries within the text of your class 1 file. You might then create a <token-definition> like this: <token-definition regex="[^\p{Cf}\s]+"/> The statement defines a token as any consecutive sequence of characters that are neither spacing characters nor format control characters. Such customized approaches may make the technique unwieldy or impossible to use, thereby limiting your TAN file's interoperability and utility. It is recommended that, if you use control formatting characters or other special characters that are invisible, you use the XML entity, e.g., &#x200d;, so they can be seen in your file.
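The two built-in definitions can be tried out with a short Python sketch. It is only an approximation: Python's \w is close to, but not identical with, the XML \w that TAN uses (Python counts the underscore as a word character, for example), and the names DEFAULT, LETTERS_AND_PUNCTUATION, and tokenize are assumptions made for illustration. The default pattern is written here as word characters plus the soft hyphen, the zero-width space, and the zero-width joiner, following the prose description above.

```python
import re

# Python renderings of the two built-in token definitions.
# Assumption: Python's \w stands in for XML's \w, which differs slightly.
DEFAULT = re.compile("[\\w\u00ad\u200b\u200d]+")
LETTERS_AND_PUNCTUATION = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str, definition: re.Pattern) -> list[str]:
    """Return every non-overlapping match of the token definition."""
    return definition.findall(text)


print(tokenize("(I go!)", DEFAULT))
print(tokenize("(I go!)", LETTERS_AND_PUNCTUATION))
```

Under the default definition the parentheses and the exclamation mark are not tokens at all, which is why two class 2 files that use different definitions will number the same text differently.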

Class-1 TAN Files, Representations of Textual Objects (Scripta) This chapter provides general background to class 1 TAN files. For detailed discussion of specific elements or attributes see . Class 1 TAN files preserve segmented transcriptions of books, manuscripts, papyri, stones, or any other objects with writing on them—collectively termed here scripta (sg. scriptum). Files of this class are the foundation of any project. No class 2 files (e.g., alignment, morphology) can be created without class 1 files. Transcriptions come in two different formats, identified by the root element. <TAN-T> is a simple, generic format, as close as one can get to plain text. <TEI> (also referred to in this manual as TAN-TEI), on the other hand, can be complex and highly expressive. Because the two types function almost identically, the generic TAN-T format is described first, followed by supplemental comments on TAN-TEI.

Principles and Assumptions

General (For more general principles and assumptions applying to all TAN files, not just class 1, see .) Class 1 formats are designed for faithful but judiciously normalized digital transcriptions. Each TAN-T(EI) file is devoted exclusively to a single version of a single work found in a single scriptum (text-bearing object), segmented and uniquely labeled with a common reference system. Editors of TAN-T(EI) files should be able to read, write, and proofread texts in the languages of the transcriptions. They should understand the texts well enough to segment them and label them according to the conventions used for those works. They should be able to distinguish the text of a primary source from its editorial apparatus. They should be familiar with normalizing conventions for texts from the period, language, and culture. They should know how the transcription might be used in other contexts, especially translation studies or a study of quotations. Editors need not understand everything about their texts, and they need not have any specialized skill in grammar or lexicography. They need not know the morphology of individual words, or how individual parts of the text have been translated. Those skills should be used in other TAN formats. TAN-T(EI) editors stand at the beginning of a larger workflow for text alignment. It is critical that work not be published hastily, and only after careful proofreading. Many transcriptions, especially those of long texts, have typographical errors. Eliminating as many as possible before publication will maximize the utility of a TAN-T(EI) file. On the other hand, TAN has been designed with the assumption that all our files have typographical errors that can and should be corrected as they are found. If you are creating a TAN-T(EI) file, you are doing so primarily to facilitate alignment and annotation, which depends critically upon a stable, familiar reference system. 
Transcription files should be segmented and labeled according to a reference system that can be easily applied to other versions of the same text in other languages. If possible, semantic mileposts (clauses, sentences, paragraphs, chapters) should be prioritized over visual (lines, columns, pages, volumes). See below on reference systems.

Domain model Contributors and users of TAN files should strongly distinguish between a scriptum (text-bearing object) and a conceptual work, e.g., a specific printed copy of the Iliad versus the Iliad conceived generally. The former has materiality (digital files are treated as being material) and the latter does not. Even though both are constitutively necessary for any transcription, the two are sharply differentiated in the TAN format: <source> and @src point to physical exemplars; <work> and @work to the conceptual. The distinction may remind some readers of the domain model defined by the Functional Requirements for Bibliographical Records (FRBR), which identifies four types of entities for what they call Group 1 (Products of intellectual & artistic endeavor): Work, Expression, Manifestation, and Item, the first pair being conceptual, non-material entities and the latter pair material ones. TAN has been designed with a slightly different domain model in mind. FRBR Items are equivalent to what TAN calls scripta. Multiple scripta that for all intents and purposes are indistinguishable (i.e., items reproduced mechanically) are equivalent to FRBR Manifestations, but in TAN no corresponding entity has been defined. It is best to think of TAN scripta as being equivalent to FRBR Items, with FRBR Manifestations being sets of indistinguishable TAN scripta. As for conceptual entities, TAN has been designed with the assumption that most users will find the distinction between Works and Expressions to be unhelpful or misleading. What one person calls a FRBR Expression another may legitimately call a Work. TAN assumes that any derivation of a Work (or Works) is itself a Work, which is really shorthand for work-version. Thus, in this manual the term version indicates merely a type of work that is known either to derive from another work or to be the basis for other versions of a work. TAN avoids altogether the term Expression. Aside from the issues mentioned above, the term implies a medium (without which nothing can be expressed) and therefore materiality.

One version, one work, one object, one reference system Every TAN-T(EI) file must be restricted to a transcription of a single version of a single conceptual work found on a single scriptum, segmented and labeled according to a single reference system. This restrictive principle is critical to the success of the network. It reduces the risk of confusion, simplifies the files, and shifts markup complexity from an individual transcription file to the network in which that file participates.

One scriptum Each TAN-T(EI) file transcribes one and only one text-bearing object or scriptum. It may be a digital file, a book, a manuscript, a stone, a sign, or a bottlecap. If the object you've chosen has been made mechanically and is virtually indistinguishable from other objects created by the same process (e.g., copies of a printed book or copies of a digital file), then the entire set of copies is to be treated as a single object (an entity some librarians call a manifestation). The definition of some scripta requires an editor's discernment and judgment. For example, some manuscripts have been split up, their parts now residing in multiple libraries around the world; others may be a composite of older manuscripts. In such cases, you may need to define your scriptum in a way that might not match the way others define it. But the decision is your prerogative, not theirs. You have both the right and responsibility to define your object in the way that you think will most benefit users of your files. It is a good idea to name your scriptum in <source> with an <IRI> value in the form of an http URL provided by a library catalogue. This way you provide a way for others, perhaps through an algorithm, to retrieve extensive, structured bibliographical information. You also save yourself the hassle of writing a detailed bibliographical description that your users would probably not be able to import into their reference management software. If a URL cannot be found for <IRI>, you may simply coin a tag URN or a UUID. Alternatively, if you find another TAN file that uses the same source, it would be a good idea to adopt that name.

One work The transcription must be restricted to a single creative work, identified by <work>. Many scripta have more than one work. Identifying and defining the creative work you transcribe is, once again, your prerogative. Suppose the scriptum you have is a Bible. The work you choose from that object can take whatever contours you wish. Perhaps you wish to encode the entire Bible and treat it as a single work. Or maybe you wish to treat only the New Testament as the work, or the Tetraevangelion, or the Gospel of Matthew, or a specific episode in that gospel, or simply the Beatitudes. Any definition of a work is permitted, but a TAN-T(EI) file should contain nothing but the work you have defined. It should be a complete representation of what is found on the object, even if only partially preserved, and respect as far as is practical the order of the text in the scriptum. Well-known works may have a suitable IRI name already assigned to them, say by means of a DBPedia entry. Most works have not been assigned IRIs or are named in IRI vocabularies that are not well known. You may assign any work your own URN, through a UUID or a tag URN.

One version The transcription must be restricted to a single version of the creative work, identified by <version> (optional). In most cases, <version> is unnecessary, because <work> in conjunction with <source> is sufficient to identify a particular work-version. But if the source carries multiple versions (e.g., a bilingual edition of a text), then <version> should be included. Each version from a scriptum should have its own separate TAN-T(EI) file. Notes should be included only if they are an integral part of the primary work (i.e., by the same author, not by a later editor). If you think the notes to a work are important, consider putting them in their own TAN-T(EI) file, or converting them to claims in a TAN-A-div file. If you need to specify exactly where on a scriptum a version appears, <desc> or <comment> should be used. Very few work-versions have their own URN names. It is advisable to assign a tag URN or a UUID. If the IRI you have used for <work> is in a namespace that you own or control, then you are entitled to modify it, and you may wish merely to add a suffix to the work IRI to name the version.

One reference system Every TAN transcription must be segmented into a hierarchy of uniquely labeled divisions, defined in the <body> through <div>s and their @type and @n values. Those divisions, whenever possible, should align with the reference system that prevails for the work across versions or translations, what is sometimes called a canonical reference system. Because even the most familiar reference system admits degrees and dispute, the term canonical is problematic, so reference system is preferred in these guidelines. If you have your choice, preference should be given to systems that follow the semantic contours of the work, not the physical features of a particular object. Chapter, paragraph, and sentence numbers are preferable to volume, page, and line numbers, because other derivative versions of a work (e.g., translations, paraphrases) will only roughly, if at all, follow an object-oriented reference system. Sometimes an object-based reference system is inescapable, or is the most common reference system for a work (e.g., Porphyry's commentary on the Categories). It is perfectly acceptable to adopt that scheme, but it may eventually entail more labor for the alignment process. If a given work has multiple systems (e.g., the works of Plato and Aristotle, which have two reference systems—semantic- and object-oriented—both of which are standard and important), then the recommended practice is to encode the same text twice, placing in each file a <see-also> pointing to the other and a <relationship> with the keyword
alternatively divided edition
as the value of @which. A pair of alternatively divided editions can usefully serve as the basis for concordances. In fact, the pair can be used as the first step in converting other versions of the work from one reference system to the other. If there is a good reference system, but the divisions are overly lengthy, you may introduce subdivisions. Such subdivided texts are compatible with references to the older system. But there is no guarantee that the provisional subdivisions you introduce will be adopted by other editors who create or edit TAN versions of the same work, and in the end editors working independently upon the same text may produce discordant schemes. The TAN-A-div format was designed to reconcile such differences. If there is no reference system, or if you think that the ones that exist are inadequate or misguided, create one of your own. If you develop your own reference system, be sure to optimize for all versions of the work, whether known or not. In the <definitions>, at least one <div-type> must be supplied, declaring the types of divisions into which the text has been segmented, to be referred to by @type in each <div>. To declare a <div-type> does not require you to use it in the transcription. It is advisable to keep the abbreviation you adopt in @xml:id brief but meaningful. Well-known division types already have suitable IRI names. See for a list of core TAN vocabulary for division types, both common and uncommon. If you encounter a rare division type, or one that needs custom specificity, you should mint your own, either in the declarations or in a separate TAN-key file. Reference systems have as a central component numbering systems. TAN supports five major numeration systems: Arabic numerals. 1, 2, 3, etc. Roman numerals. 
Values up to 5000, utilizing i, v, x, l, c, d, and m, uppercase or lowercase, with liberal syntactic rules (within a Roman numeral, any digit preceding one of a higher value is assumed to be a subtraction from the total value; all others are positive values). Alphabetic sequences. The 26-letter Roman alphabet, with numbers higher than 26 (or any multiple of 26) beginning with the letter a incrementally repeated, e.g., y (25), z (26), aa (27), bb (28), … aaa (53). Uppercase or lowercase allowed. Arabic numerals + alphabetic sequences. Arabic numerals followed immediately by an alphabetic sequence. The second item is to be calculated as a subsequence of the first item, with the lack of a second item taking highest priority. E.g., 4, 4a, 4b, 4c.... Alphabetic sequences + Arabic numerals. As above, but with the alphabetic sequence preceding the Arabic numerals. TAN file processors will attempt to convert all values of @n to Arabic numerals. Some values are ambiguously Roman numerals or alphabetic sequences, e.g., c (= 3 or 100). Such numerals are assumed to be Roman, unless you supply an <ambiguous-letter-numerals-are-roman> and define it as false. There are also tools for other numeration systems, but they have not been implemented in the validation process. See tan:letter-to-number() and dependencies.
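The numeration rules above can be illustrated with a short Python sketch. It is not the TAN function library: the function names are hypothetical, the 5000 ceiling is not enforced, and "preceding one of a higher value" is read here as "immediately preceding", which is an assumption about the liberal Roman numeral rule.

```python
import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(text):
    """Liberal reading: a digit directly before a larger digit is subtracted (assumption)."""
    digits = [ROMAN_VALUES[ch] for ch in text.lower()]
    total = 0
    for i, value in enumerate(digits):
        following = digits[i + 1] if i + 1 < len(digits) else 0
        total += -value if value < following else value
    return total


def alpha_to_int(text):
    """Alphabetic sequence: a=1 ... z=26, aa=27, bb=28 ... aaa=53."""
    letters = text.lower()
    if not re.fullmatch(r"([a-z])\1*", letters):
        raise ValueError(f"not a repeated-letter sequence: {text!r}")
    return (len(letters) - 1) * 26 + (ord(letters[0]) - ord("a") + 1)


def n_to_numbers(n, ambiguous_letters_are_roman=True):
    """Convert one @n value to a tuple of integers (hypothetical helper)."""
    n = n.strip()
    if re.fullmatch(r"\d+", n):
        return (int(n),)
    m = re.fullmatch(r"(\d+)([A-Za-z]+)", n)
    if m:  # Arabic numeral + alphabetic subsequence, e.g. 4b
        return (int(m.group(1)), alpha_to_int(m.group(2)))
    m = re.fullmatch(r"([A-Za-z]+)(\d+)", n)
    if m:  # alphabetic sequence + Arabic numeral, e.g. b4
        return (alpha_to_int(m.group(1)), int(m.group(2)))
    is_roman = re.fullmatch(r"[ivxlcdm]+", n.lower()) is not None
    is_alpha = re.fullmatch(r"([a-z])\1*", n.lower()) is not None
    if is_roman and (ambiguous_letters_are_roman or not is_alpha):
        return (roman_to_int(n),)
    if is_alpha:
        return (alpha_to_int(n),)
    raise ValueError(f"unsupported numeral: {n!r}")
```

Because a bare number becomes a one-item tuple, ordinary tuple comparison sorts 4 before 4a and 4b, which matches the rule that the lack of a second item takes highest priority.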

Normalizing transcriptions You should declare how you have normalized the transcription via <alter> and its children, e.g., <normalization>. (For suggestions on values of <IRI> for <normalization> see .) Generally speaking, normalization entails the suppression of things extraneous to or separable from the work you have chosen. You are encouraged to omit parenthetical editorial insertions (especially quotation references), stray handwritten remarks, discretionary word-breaking hyphens, editorial comments, inserted cross-references, and reference numerals (page numbers, section numbers, etc.). If chapter 4 begins "4." or "IV" then leave out the prefatory numeral, since you've already indicated it in @n. In addition, you should resolve ligatures and correct unintended typographical errors. (Such orthographic corrections are useful to those users who want to generate lexico-morphological data automatically or semiautomatically.) The goal is a transcription whose text is free of the interpretive voice of later editors. You should remove from the text anything that is not part of the work proper and would interfere with detailed word-for-word alignment, or would require extra preprocessing or postprocessing work for later users. If you are segmenting a source into line breaks, and you are required to break a word between divisions, you should use either the soft hyphen (U+00AD) or the zero-width joiner (U+200D) at the end of the first leaf <div>. TAN processors that handle a leaf <div> will automatically normalize the space in the element, then place a space between that leaf <div> and the next, unless one of those two characters is found at the end of the first, in which case the character will be deleted and the two <div>s will be joined with no intervening space. For more on issues regarding whitespace, see . 
In a digital source, variable lengths of spacing marks (e.g., General Punctuation U+2000..U+200B) should be converted to ordinary spaces, and superscript combining Roman letters (U+0363..U+036F) should probably be converted to their non-combining counterparts. All Unicode must be normalized to NFC forms (see ). If you are working with a text with notes, distinguish those written by the same person who wrote the work you're transcribing from those that aren't. Treat the former as part of the work proper and give each note a <div> with a suitable @type and place it after the <div> it annotates. It will be assumed by processors of the data that, absent more specific information, any <div> of an annotating @type is an annotation of the last <div> that is not an annotation. (Alternatively, you may use the <note> feature of TAN-TEI, but bear in mind that this element will be treated by users as part of the leaf div to which it belongs, not separate from it.) If the notes are not part of the work per se (for example, translator's notes in a translation of a primary source), you should treat them as a separate work altogether, and put them in a separate TAN-T(EI) file, perhaps linking the two through <see-also>. You may wish to structure that file so that it mirrors the reference system of the primary source, to facilitate automatic alignment between the two. Remember that the note signals in the main text and in the footnote area are metadata meant to help readers link corresponding passages of texts, and should be deleted. If the connective function served by the note signal is important, create <claim>s in a TAN-A-div file, which supports correlating comments to specific ranges of text. This principle holds true for variants in the scriptum. For example, a manuscript may have correctors' marks. Or a set of footnotes (or apparatus criticus) might comment on how and why the main text differs from previous readings. 
In those cases, each set of corrections might be wholly incorporated into the <claim>s of a TAN-A-div file, perhaps also with a separate TAN-T file. Overall, normalization is a difficult topic, and it is not well studied. Not all decisions will be clear-cut. You may justly hesitate before normalizing orthography, punctuation, accentuation, or capitalization. Some aspects of Unicode that lend themselves to varying conventions may need special consideration. You may need to consider whether an unusual or rarely used Unicode character might be misinterpreted or hinder other users. Document any decisions in the <alter>. In some ambiguous areas, you can use TAN-TEI to your advantage. Suppose, for example, a manuscript has reference numerals that are sui generis. That is, these reference numbers do not correspond to the "canonical" reference scheme. On the one hand, they are metadata, and should arguably be deleted; on the other, they are part of the text, and witness to how a text was read and changed over time. A middle-ground approach would move these references to TAN-TEI's <milestone rend="">. In that way, the numerals are removed from the main text; on the other hand, the information is retained. Generally speaking TEI's @rend is an excellent way to remove something from the main text, without removing it from the file altogether.
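The whitespace and Unicode rules described in this section (space normalization within a leaf <div>, joining across leaf <div>s, and NFC normalization) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the behavior of any official TAN processor; the function names are hypothetical.

```python
import re
import unicodedata

SOFT_HYPHEN = "\u00ad"
ZERO_WIDTH_JOINER = "\u200d"


def normalize_leaf(text):
    """NFC-normalize the text and collapse runs of whitespace to single spaces."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def join_leaves(leaves):
    """Join the text of consecutive leaf divs into one string.

    A space is placed between leaves unless the earlier leaf ends with a
    soft hyphen or zero-width joiner, in which case that character is
    dropped and the two leaves are joined directly.
    """
    parts = []
    for raw in leaves:
        text = normalize_leaf(raw)
        if text.endswith((SOFT_HYPHEN, ZERO_WIDTH_JOINER)):
            parts.append(text[:-1])
        else:
            parts.append(text + " ")
    return "".join(parts).rstrip()
```

For example, two line-based leaves whose first ends in "bro" plus a soft hyphen and whose second begins "wn fox" are rejoined as "brown fox".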

Transcriptions The sole purpose of the <body> of a class 1 file is to contain a segmented transcription of a single version of a single work from a scriptum. <body> may take @in-progress and must take an @xml:lang value naming the language that the majority of the text is in. If a change in language occurs in a descendant <div>, ensure that its @xml:lang value (explicitly or by inheritance) indicates the language that is used. <body> takes one or more <div> elements, each of which governs either other <div> elements, or text (or TEI elements). The term leaf div refers to those <div>s that contain text and therefore no other <div>s. Within this treelike structure of <div>s, the concatenation of @n values, starting from the most ancestral <div>, provides the flat ref, the reference system used by class 2 files to refer to parts of TAN-T(EI) files.

Flattened References, and the Leaf Div Uniqueness Rule One of the most important validation rules is the Leaf Div Uniqueness Rule, which states that the flat ref for each leaf <div> must be unique. This rule applies only to leaf <div>s and not to <div>s in general, since on occasion a major textual unit will be broken by another. For example, chapters 24 and 30 in the book of Proverbs of the Septuagint are split and interleaved (24.1–22e [22a–e are verses not extant in the Hebrew]; 30.1–14; 24.23–34; and 30.15–33).
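A minimal Python sketch of flat refs and the Leaf Div Uniqueness Rule follows. The sample body is hypothetical and omits the <head> that a real TAN-T file requires; the period used as separator and the function names are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TAN_NS = "tag:textalign.net,2015:ns"
DIV = f"{{{TAN_NS}}}div"

# Hypothetical fragment: chapter 24 is interrupted by chapter 30, as in the Proverbs example.
SAMPLE = """<TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:example.com,2018:sample">
  <body xml:lang="eng">
    <div type="ch" n="24"><div type="v" n="1">First part of chapter 24.</div></div>
    <div type="ch" n="30"><div type="v" n="1">First part of chapter 30.</div></div>
    <div type="ch" n="24"><div type="v" n="23">Rest of chapter 24.</div></div>
  </body>
</TAN-T>"""


def leaf_flat_refs(body, sep="."):
    """Return the flat ref of every leaf div, in document order."""
    refs = []

    def walk(div, prefix):
        ref = prefix + [div.get("n", "")]
        children = div.findall(DIV)
        if children:
            for child in children:
                walk(child, ref)
        else:
            refs.append(sep.join(ref))

    for top in body.findall(DIV):
        walk(top, [])
    return refs


def duplicate_leaf_refs(body):
    """Return flat refs shared by more than one leaf div (rule violations)."""
    counts = Counter(leaf_flat_refs(body))
    return sorted(ref for ref, count in counts.items() if count > 1)


body = ET.fromstring(SAMPLE).find(f"{{{TAN_NS}}}body")
print(leaf_flat_refs(body))
print(duplicate_leaf_refs(body))
```

The non-leaf <div> with n="24" appears twice, which is allowed; only the leaf flat refs need to be unique.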

Transcriptions Using the Text Encoding Initiative (<TEI>) This section is to be read in conjunction with and , which address related technical issues. Some creators and editors of transcriptions will find the rather stripped-down TAN-T format inadequate. Some may wish to mark up the text further. Some may already have a library of transcriptions whose annotations are desirable to keep, even if uninteresting to most users. In these cases, you should use TAN-TEI, an extension to the Text Encoding Initiative (TEI) format, which is well known for its expressiveness, its stability, its flexibility, and its widespread use in scholarship. TEI was designed to be maximally expressive and flexible, to serve the detailed needs of humanities scholars. In serving this mission, TEI has come to define more than five hundred different element names, and more than two hundred attributes (roughly six times more than are defined in TAN). Of course, any given TEI file uses only a small subset of those elements and attributes, and TEI itself comes in different flavors, from TEI Lite, which uses only 75 attributes and 140 elements, to TEI All, which opens up almost the entire library. Although the TEI format is oftentimes seen as a standard, it lacks some of the characteristics one normally expects in a standard. It is very flexible, admits flavors and interpretation, and has been designed to encourage customization. Individuals and projects may define their own subset of TEI elements, to constrict or expand the allowable rules as they see fit. TAN-TEI is one of those customizations. The major difference is that TAN-TEI attempts to impose extra strictures not defined in TEI, to ensure that transcriptions are maximally likely to be interchangeable with other TAN-TEI files. 
TAN's customization of the TEI can be summarized as follows (the default namespace in this section is the TEI namespace, http://www.tei-c.org/ns/1.0):

Synopsis of TAN-TEI customization

TEI element: summary of alteration
<TEI>: must have @id with IRI name; should take new namespace declaration, xmlns:tan="tag:textalign.net,2015:ns"; takes a new child element, <head>, placed between <teiHeader> and <text>
<text>: only the child <body> will be considered; <front> and <back> will be ignored
<body>: must take @xml:lang; may take @in-progress; must take exclusively one or more <div>s; any elements or text between <div>s will be ignored; contents must be restricted to a single work; any and all text nodes will be treated as part of the transcription
<div>: must take either only <div>s or no <div>s at all; must take @type and @n (or @include)
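Some of the structural rules in the synopsis can be checked with a few lines of Python. The sketch below is not a substitute for the official schemas and validation files: the function name, the sample documents, and the example IRI are hypothetical, and only a handful of the rules are tested.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tei(name):
    """Return an element name qualified with the TEI namespace."""
    return f"{{{TEI_NS}}}{name}"


def check_tan_tei(root):
    """Return a list of problems found against a few TAN-TEI structural rules."""
    problems = []
    if root.tag != tei("TEI"):
        problems.append("root element is not <TEI>")
    if not root.get("id"):
        problems.append("<TEI> lacks @id")
    body = root.find(f"{tei('text')}/{tei('body')}")
    if body is None:
        return problems + ["no <text>/<body> found"]
    if not body.get(XML_LANG):
        problems.append("<body> lacks @xml:lang")
    if not body.findall(tei("div")):
        problems.append("<body> has no <div>")
    for div in body.iter(tei("div")):
        if not div.get("include") and not (div.get("type") and div.get("n")):
            problems.append("<div> lacks @type and @n (or @include)")
        child_divs = div.findall(tei("div"))
        if child_divs:
            stray = (div.text or "").strip() or any((c.tail or "").strip() for c in div)
            if len(child_divs) != len(list(div)) or stray:
                problems.append("<div> mixes <div>s with other content")
    return problems


GOOD = ET.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:tan="tag:textalign.net,2015:ns"'
    ' id="tag:example.com,2018:demo"><teiHeader/><tan:head/><text><body xml:lang="eng">'
    '<div type="ch" n="1"><div type="p" n="1"><p>Some <hi>text</hi>.</p></div></div>'
    "</body></text></TEI>"
)
BAD = ET.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<div type="ch">text</div></body></text></TEI>'
)
print(check_tan_tei(GOOD))
print(check_tan_tei(BAD))
```

TEI markup inside a leaf <div> (here <p> and <hi>) is left alone, in keeping with the rule that only the <div> hierarchy is constrained.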

Like all other TAN files, the root elements of TAN-TEI files must take an @id, the IRI name. See above, . TAN-TEI files have two heads, which may strike you as odd. The TEI head and the TAN head were designed for different purposes. Whereas the TAN <head> is meant to be brief and keyed to both IRIs and human-readable data, the <teiHeader> permits quite an expansive range of metadata, and about matters that bear only indirectly on the transcription (e.g., manuscript descriptions). Further, <teiHeader> was designed to be read principally by humans. Processors of TAN-TEI files will in general ignore the contents of <teiHeader>, since the contents are unpredictable. If your <teiHeader> has any kind of metadata relevant to TAN users, you will need first to create a standard TAN <head> (see and ). This conversion needs to be performed manually, since the two headers are incommensurate, and writing each one requires a different kind of mentality. In a TAN-TEI file, the TAN <head> must take the TAN namespace, i.e.,

<head xmlns="tag:textalign.net,2015:ns">

or <tan:head> if the prefix tan: has been defined in the root element. Within any leaf <div>, you may use whatever TEI markup you wish, to whatever level of depth or complexity. All users of your TAN-TEI file will be interested in the text; only a subset will care about any markup within leaf <div>s. For this reason, even if you change the value of @xml:lang within a leaf <div>, there is no guarantee that readers or processors of your data will take it into account. TAN-TEI should not be used to try to represent the physical appearance of the text on the object. You may need to prepare a TEI file to be TAN compliant. As a matter of practicality, it is helpful to envision the conversion process as falling into three steps:
1. Structure: insert new processing instructions (TAN-TEI validation files); adjust the root element by supplying the IRI name to @id and the TAN namespace to @xmlns:tan.
2. Metadata: create a new <head> and populate it.
3. Data: edit <body> to restrict the content to a single work; restructure <body> content into nesting <div>s with correct @type and @n values.
It has been the experience of those who have made TEI to TAN-TEI conversions that step 2 is the most time-consuming. The TAN <head> requires one to curate the metadata more carefully than does <teiHeader>. But step 3 should not be underestimated, either. Many people write TEI files with a focus on the original textual object, and they do not normalize to the level expected in a TAN file. In general, the simpler the TEI file the better.

Class-2 TAN Files, Annotations of Texts This chapter provides general background to class 2 TAN files. For detailed discussion of individual elements and attributes see . TAN-A-div files provide broad, macroscopic alignment of multiple versions of any number of works. They also provide a place for annotating the texts through general claims. TAN-A-tok files provide narrow, microscopic alignment of any two class 1 files, identifying word-for-word or character-for-character correspondence. TAN-A-lm files support lexico-morphology (part-of-speech) for either a single class 1 file or a language. In translation studies, it is common to use the term source (or sources) to refer to a translated text and the term target to refer to the translation. TAN, however, has been designed for cases where it may not be clear which is the target and which is the source. Further, there is a more generic use of source that takes precedence. In these guidelines, therefore, we avoid the term target altogether, and when we use the word source, we are referring only to one of the class 1 files upon which a class 2 alignment depends.

Common Elements The class 2 formats have been designed to be human readable, particularly references to class 1 files. In ordinary conversation, when referring to specific parts of a work, we like to cite pages, paragraphs, sentences, lines, words, letters, and so forth. We use relational words (e.g., "first"), and the very text itself. We might say, for example, "See page 4, second paragraph, the last four words." Or, "See page 4, second paragraph, first sentence, second occurrence of 'pull'." Those familiar conventions are the basis for the TAN pointer syntax, and so it differs from other pointer systems (e.g., URLs, XPath, and XPointer). TAN pointers depend upon a fourfold hierarchy of: works, divisions, word tokens, and characters. Works, defined above (see ), are defined by the source (which may not have more than one work). Divisions are defined by the <div> structure of each source. Tokens are words of those divisions, defined according to one or more tokenization rules. And characters are defined as non-modifying codepoints in a word token. (A modifying character is always included with the base character it modifies.) Parts of this fourfold hierarchy (works, divisions, tokens, and characters) normally have familiar names. Sources can be given a meaningful abbreviated name (e.g., xml:id = "hamlet-1741"); divisions are named according to @n; tokens are referred to by position, by their actual values, or both (e.g., pos = "1 - 5", pos = "last-1 - last", val = "hath"; see ). Characters are always identified by number (e.g., chars = "2, 7"). This approach not only makes the syntax human readable, it also mitigates disruptions from corrections to the dependencies. For example, if an incorrectly duplicated <div> is deleted, disruption to the reference system is isolated and does not affect the rest of the document.

Class 2 Metadata (<head>) Class 2 files share a few common features in their metadata, mostly to facilitate the human-friendly reference system outlined above. All class 2 files have as their sources nothing other than class 1 files. Therefore each <source> must take the . Editors of class 2 files must be able to name or number word-tokens in a transcription, via an optional <token-definition>. See . Inevitably, some class 1 sources will have differences. Perhaps works or div types were not defined with the same IRIs, or perhaps one version follows an idiosyncratic reference system. If sources need to be reconciled, alterations are specified in <alter>, which stipulates a set of actions that should be applied to the sources that have been named. Alteration actions include: <skip> allows you to ignore specific <div>s, deeply or shallowly. <rename> allows you to rename specific <div>s. <equate> allows you to provide synonyms for @n values. <reassign> allows you to move parts of leaf <div>s elsewhere. These actions allow you to reconcile sources that are somewhat at odds. Actions are applied first hierarchically and then in the sequence stated above. That is, the validation routine will go level by level through a given source. Any rules that are found in one level will be applied (skips taking top precedence, reassigns the lowest) before moving to the next level of the source. So if you wish in a given source to change chapter 1 to chapter 2, any subdivisions will be collated. If you wanted to do further things with (original) 1.5, you would need to refer to it as 2.5, and you would also need to realize that if original 2.5 exists, the action will be applied to both. Each action adds time to the validation routines. On lengthy texts these can become quite time-consuming. You are advised to keep <alter>s to a minimum. 
If a source has numerous alterations, you may find it less time-consuming to create a new version of a source.

Class 2 Data Patterns (<body>) The three types of class 2 files treat different kinds of phenomena, so their data structures look quite different. Nevertheless, a few elements and attributes are shared by at least two class 2 formats. Many class 2 elements take @src and @ref. @src points via ID reference to one or more <source>s and @ref points to one or more <div>s through their flat ref, perhaps substituted with their new values if <alter>s have been invoked (see ). In the example

ref = "1.2-4, 1.5"

, the periods are arbitrary (but the hyphen and comma, which have special meanings here, are not). You may use any separating punctuation or space you wish, except for hyphens and commas, which are reserved to create ranges and joins. You may also use other numeral systems.
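The punctuation rules for @ref can be illustrated with a small Python parser. It is a sketch under two labeled assumptions that these guidelines do not confirm: that each @n value consists only of word characters, and that a shortened range end such as the 4 in 1.2-4 inherits its missing leading components from the range start. The function name is hypothetical.

```python
import re


def parse_ref(ref):
    """Split a @ref value into (start, end) pairs of n-value tuples.

    Commas separate joined items and a hyphen marks a range; any other
    punctuation or space merely separates the @n values of one reference.
    """
    items = []
    for chunk in ref.split(","):
        parts = [tuple(re.findall(r"\w+", part)) for part in chunk.split("-")]
        if not 1 <= len(parts) <= 2 or not all(parts):
            raise ValueError(f"cannot parse {chunk!r}")
        start, end = parts[0], parts[-1]
        if len(end) < len(start):
            # Assumption: a short end borrows the start's leading components.
            end = start[: len(start) - len(end)] + end
        items.append((start, end))
    return items
```

Under these assumptions "1.2-4, 1.5" and "1 2-4, 1:5" parse to the same two items, a range from 1.2 to 1.4 and the single division 1.5.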

@pos and @val To point to a token, one of three methods may be used. @pos alone. Under this method, one or more digits, or the phrase last or last- plus a digit, joined by hyphens or commas indicate one or more token numbers. For example, 2, 4-6, last-2 - last refers to the second, fourth, fifth, sixth, antepenult, penult, and final tokens in a passage. The numerical value to which the keyword last resolves depends upon the length of each <div>. @val alone. Under this method, a single token is picked by means of a string value equivalent to the token. For example,

@val = "bird"

, points to the first occurrence of the token bird. @pos and @val together. Under this method, specific occurrences of a token are picked. For example,

@val="bird" @pos="2, 4"

picks the second and fourth occurrences of the token bird. During validation, if @pos or @val are missing, they are supplied with their default values, 1 and .+ respectively. That is, @pos by default points to the first instance and @val by default points to any string. @pos and @val must be used carefully. For example, the attribute combination val="bird" pos="last-5" will produce an error if the word token bird does not occur at least six times. It is advisable to use @val, and not merely @pos. If the editor makes corrections to your source texts, references are more likely to become corrupt, and less likely to be traceable, if there is no @val.
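The interplay of @pos and @val can be sketched in Python. This is an illustration, not the TAN validation routine: the function names are hypothetical, tokenization is assumed to have happened already, and @val is assumed to be a regular expression that must match a whole token (an inference from its default value of .+).

```python
import re

ITEM = r"(?:last(?:\s*-\s*\d+)?|\d+)"
RANGE = re.compile(rf"\s*({ITEM})(?:\s*-\s*({ITEM}))?\s*")


def _resolve(item, count):
    """Turn '3', 'last' or 'last-2' into a 1-based position among count matches."""
    item = re.sub(r"\s+", "", item)
    if item == "last":
        n = count
    elif item.startswith("last-"):
        n = count - int(item[5:])
    else:
        n = int(item)
    if not 1 <= n <= count:
        raise ValueError(f"position {item!r} is out of range for {count} match(es)")
    return n


def pick_tokens(tokens, pos="1", val=".+"):
    """Return the 0-based indexes of the tokens selected by @pos and @val."""
    matches = [i for i, tok in enumerate(tokens) if re.fullmatch(val, tok)]
    picked = []
    for chunk in pos.split(","):
        m = RANGE.fullmatch(chunk)
        if not m:
            raise ValueError(f"cannot parse @pos item {chunk!r}")
        first = _resolve(m.group(1), len(matches))
        last = _resolve(m.group(2), len(matches)) if m.group(2) else first
        picked.extend(matches[k - 1] for k in range(first, last + 1))
    return picked


tokens = "the bird saw a bird and another bird flew".split()
print(pick_tokens(tokens, val="bird"))
print(pick_tokens(tokens, pos="2, 4-6, last-2 - last"))
```

Positions are counted among the tokens that match @val, so a request such as val="bird" pos="last-5" raises an error here because bird occurs only three times in the sample.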

Division-Based Annotations and Alignments (<TAN-A-div>) TAN-A-div is the format for macroscopic, division-based alignment, and is dedicated to aligning any number of versions of any number of works on the basis of <div>s, or even smaller, ad hoc segments in the sources invoked. A TAN-A-div file allows you to make general claims about a work, or a particular version of a work.

Root Element and Header The root element of a TAN division-based alignment file is <TAN-A-div>. TAN-A-div's <head> has one or more <source>s. Any concepts that will be mentioned in the <claim>s need to be supplied in <definitions>.

Data (<body>) The <body> of a TAN-A-div file takes, in addition to the customary optional attributes (see @in-progress and ), @claimant, @object, @subject, or @verb, stipulating the default values for any claims to come. The rest of the body consists of <claim>s whose model is inspired by the Resource Description Framework (RDF; see ). RDF depends upon a simple data model, where each datum consists of three items termed a subject, a predicate, and an object. The first and third are thought of as nodes, and the second as a connector between the nodes. A connector, our preferred term, is frequently elsewhere called an edge, but that term elicits a metaphor that is confusing and misleading. A cylinder, for example, has two edges, but they don't connect anything. Furthermore, "edge" implies that what's really of interest is the void beyond the surface of a three-dimensional object. TAN was designed to serve scholars, who normally find RDF-like sentences unsatisfactory. They lack context or qualifiers. It is unclear who made them, or when, or if they were uttered with any doubt or nuance. Sometimes we wish to claim a bare negation, e.g., "Aristotle was not the author of De mundo" (an assertion not possible to express in RDF). A TAN <claim> adds some of this nuance and complexity to RDF. Every claim must be assigned to a claimant (and claims can be recursive, e.g., X claims that Y claims that Z claims that...). The RDF terminology subject + predicate + object is adjusted by TAN RDF to subject + verb + object. A <claim> may be restricted to a particular date or place, or it may be tempered by certainty and modified with adverbs. If the object is data, the data type can be restricted to a specific type and lexical form. Despite being somewhat more complex than RDF, TAN-c syntax is more human readable. 
<claim> may be used for a variety of things, e.g.: to list quotations and allusions; to indicate which passages deal with what general subjects and topics; to connect commentary or notes from one source with another; to indicate where other scripta have different readings (apparatus criticus). These assertions are made in <claim>s whose <subject> or <object> points to passages of text. Any textual <subject> or <object> may take @work or @src. The former takes a single reference to a <source>, but adopts the reference as a proxy to make a claim applicable to all versions of the same work. @src restricts the claim to specific versions, not to the work as a whole.

Token-Based Annotations and Alignments (<TAN-A-tok>) TAN-A-tok files provide a microscopic view of how two sources relate to each other. The format is intended to allow you to specify exactly where, how, and why two transcriptions align, and to do so on the most granular level possible. TAN-A-tok files also allow you to express levels of confidence or alternative opinions. Creators and editors of TAN-A-tok files should be able to read the languages of their sources and to explain as precisely as possible the relationship between the two sources. They should be prepared to think about and specify types of textual reuse. TAN-A-tok files tend to be more demanding to create and edit than TAN-A-div files are because they reflect work that is more detailed, and therefore more time-consuming, than simple en masse alignment of sources. Because of the detailed nature of the inquiry, token alignment is restricted to two texts, referred to jointly as a bitext. Each half of the bitext must be a TAN-T(EI) file. It is assumed that those two sources share some special relationship, direct or indirect, and relate through one or more types of textual reuse: translation, paraphrase, commentary, and so forth. Some of these bitexts, such as literal translations, may line up quite nicely word for word. Others, such as paraphrases, may line up sporadically, vaguely, ambiguously, or, in places, not at all. So annotating a bitext is oftentimes not easy, and requires you to think hard about assumptions you have made in two key areas: the relationship that holds between two scripta and the types of reuse that were involved in turning one version into the other (or a common ancestor into both). Relationship of sources' scripta. What is the physical relationship or history that connects the two sources' scripta? Is one a direct descendant (copy) of the other? If not, what common ancestor do they share? 
Here you consider the material aspect of the bitext, because you are trying to answer how object A's text relates to object B's. Types of reuse. What categories of text reuse do you consider operative? Such a declaration tells users of your data what paradigm you bring to your analysis. You may wish to keep your categories nondescript and somewhat vague, using loosely defined concepts such as translation, paraphrase, quotation, and so forth without much specificity. On the other hand, you may subscribe to a detailed view of text reuse. Perhaps you have adopted field-specific categories such as obligatory explicitation, optional explicitation, pragmatic explicitation, or translation-inherent explicitation. You may also wish to declare secondary types of reuse, such as scribal omission or dittography, to declare secondary types of reuse that may have intervened. You must declare at least one type of reuse. Or you may use those that are built into the TAN format. See .

Root Element and Header The root element of a token-based alignment file is <TAN-A-tok>. The TAN-A-tok header builds upon the core and class 2 headers (see and ). TAN-A-tok files take exactly two <source>s. The sequence is arbitrary. Each <source> must take an @xml:id. <definitions> takes, in addition to all the elements allowed in class 2 files (see ), two elements unique to TAN-A-tok: <bitext-relation> and <reuse-type>. The former describes the genealogical relationship between each source's scripta. The second attends to the qualitative aspect of the bitext relationship.

Data (<code><link linkend="element-body"><body></link></code>) The <body> of a TAN-A-tok file takes, in addition to the customary optional attributes (see @in-progress and ), required @bitext-relation and @reuse-type, which take one or more id references from <bitext-relation> and <reuse-type>, indicating the default values that govern the alignment. <body> has only one type of child: one or more <align>s, each of which collects sets of <tok>s from one or both sources, known collectively as a token cluster. Clusters may overlap, to handle translations in which words fall in one-to-one, one-to-many, many-to-one, and many-to-many relationships. The independence of token clusters allows you to register differences of opinion about the same set of tokens. An <align> may take an @xml:id, to facilitate external discussions about an assertion. Nothing should be inferred from silence in a TAN-A-tok file. Unmentioned tokens in either source do not represent gaps in a translation. All that can be inferred is that the creators and editors of the TAN-A-tok file have said nothing about the tokens. If you wish to declare that one or more words in one source were left out of a translation or inserted into one (that is, words in one source have no match in the other), you must do so through a half-null alignment, i.e., a token cluster that has tokens from only one source. A half-null alignment implies insertions or omissions. A fully aligned bitext may result in a TAN-A-tok file with a very long <body> (in contrast to the typical TAN-A-div file). That does not mean, however, that everything in a source must be encoded or described. In writing and editing a TAN-A-tok file you do not commit yourself to saying everything possible about the bitext. You might choose to encode only a few token clusters. If there are multiple IDs in @reuse-type or @bitext-relation, the intersection, not the union, of those values is to be understood. 
For example, reuse-type="trans para" would indicate that the token cluster results from a combination of translation and paraphrase. If you wish to claim that the token cluster might be a translation or it might be a paraphrase, then you should create two separate <align>s, and add @cert.
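The rules above about token clusters, silence, and half-null alignments can be modeled in a few lines. The following Python sketch is illustrative only: the class names, field names, and sample ids (`grc`, `eng`, `trans`, `para`) are assumptions made for the example, not part of the TAN schema or function library.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Tok:
    src: str  # id of the source the token comes from (hypothetical field name)
    ref: str  # reference of the division holding the token
    pos: int  # position of the token within that division


@dataclass
class Align:
    toks: Tuple[Tok, ...]               # the token cluster
    reuse_type: Tuple[str, ...] = ()    # several ids mean "all of these at once"
    cert: Optional[float] = None        # optional level of confidence


def is_half_null(align: Align) -> bool:
    """A half-null alignment has tokens from only one source."""
    return len({t.src for t in align.toks}) == 1


def unmentioned(all_tokens, aligns):
    """Tokens about which the file says nothing (silence, not a claimed gap)."""
    mentioned = {t for a in aligns for t in a.toks}
    return [t for t in all_tokens if t not in mentioned]


grc = [Tok("grc", "1", i) for i in (1, 2, 3)]
eng = [Tok("eng", "1", i) for i in (1, 2)]
aligns = [
    # Two competing opinions about the same cluster, each with its own certainty.
    Align((grc[0], eng[0]), ("trans",), cert=0.5),
    Align((grc[0], eng[0]), ("para",), cert=0.5),
    # Explicit claim that the second Greek token has no English counterpart.
    Align((grc[1],), ("trans",)),
]
silent = unmentioned(grc + eng, aligns)
```

Here `silent` holds the third Greek token and the second English token: nothing has been asserted about them, which is different from the explicit half-null claim made about the second Greek token.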

Lexico-Morphology TAN-A-lm files are used to associate words or word fragments with lexemes and morphological categories. These files have two kinds of dependencies: a class 1 source (optional) and the grammatical rules defined in one or more TAN-mor files. Therefore this section should be read in close conjunction with its companion: .

Principles and Assumptions Editors of TAN-A-lm files should understand the vocabulary and grammar of the chosen languages. They should have a good sense of the rules established by the lexical and grammatical authorities adopted. They should be familiar with the conventions and assumptions of the TAN-mor files they have adopted. Although you must assume the point of view of a particular grammar and lexicon, you need not hold to a single one. In addition, you may bring to lexical analysis your own expertise and supply lexical headwords unattested in printed authorities. Although TAN-A-lm files are simple, they can be laborious to write and edit, more so than other types of TAN files. They can also be hard to read if the underlying TAN-mor files use cryptic codes. It is customary for an editor of a TAN-A-lm file to use tools to help create and edit the data.

Root Element and Header The root element of a lexico-morphological file is TAN-A-lm. TAN-A-lm files are either source-specific or language-specific. In the case of the former, <source> points to the one and only TAN-T(EI) file that is the object of analysis. In the case of the latter, <for-lang> is used to indicate the languages that are covered. <definitions> takes the elements common to class 2 files (see ). It takes two other elements unique to TAN-A-lm: <lexicon> (optional) and <morphology> (mandatory). Any number of lexica and morphologies may be declared; the order is inconsequential. There is, at present, no TAN format for lexica and dictionaries, although this may change in the future. So even if a digital form of a dictionary is identified through the , validation tests do not take this element into account. Because you or other TAN-A-lm editors are likely to be authorities in your own right, <person> can be treated as if a <lexicon>, and be referred to by @lexicon in the <body>.

Data (<code><link linkend="element-body"><body></link></code>) The <body> of a TAN-A-lm file takes, in addition to the customary optional attributes found in other TAN files (see @in-progress and ), @lexicon and @morphology, to specify the default lexicon and grammar. <body> has only one type of child: one or more <ana>s (short for analysis), each of which matches one or more tokens (<tok>) to one or more lexemes or morphological assertions (<lm>, which takes <l>s and <m>s). If due to tokenization a linguistic token must occupy more than one <tok>, you may use <group> to group <tok>s together. Elements within an <ana> are distributed. That is, every combination of <l> and <m> (governed by <lm>) is asserted to be true for every <tok>. Many TAN-A-lm files will be populated by a stylesheet or other algorithm that automatically lists all possible morphological values of each token. It is advised that such automatically calculated results always include @cert with weighted values.
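The distribution rule for <ana> can be pictured as a Cartesian product. The Python sketch below is a simplified model, not the function library's implementation: it assumes tokens, lexemes, and morphological codes have already been read into plain strings, and the sample values are invented for illustration.

```python
from itertools import product


def distribute(toks, lms):
    """Expand one <ana> into individual (token, lexeme, morphology) claims.

    toks: list of token strings
    lms:  list of (lexemes, morph_codes) pairs, one pair per <lm>
    """
    claims = []
    for lexemes, morph_codes in lms:
        # Within one <lm>, every <l> combines with every <m>,
        # and each combination is asserted of every <tok>.
        for tok, lexeme, morph in product(toks, lexemes, morph_codes):
            claims.append((tok, lexeme, morph))
    return claims


# One token, two <lm>s: the verb reading and the noun/verb readings of "saw".
claims = distribute(
    ["saw"],
    [(["see"], ["VBD"]), (["saw"], ["NN", "VB"])],
)
```

With one token and these two <lm>s, `claims` holds three assertions. An automatically populated file would, per the advice above, attach a weighted @cert to each of them.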

Class-3 TAN Files, Varia This chapter provides general background to the elements and attributes that are unique to all class 3 TAN files, which are devoted to formats that do not fit the other two classes. For detailed discussion of specific elements and attributes, see .

Keyword Vocabulary (<code>TAN-key</code>) All too often, a project has a set of vocabulary it draws from time and again. To repeat that vocabulary can be both tedious and treacherous. If a project with hundreds of TAN files decides to change or augment its vocabulary, it could take a long time to find and make all the changes. The TAN-key format is intended to allow a project to define the IRI + name patterns for things that it regularly names, to be applied to any element that takes @which. For example, it is a suitable way to gather the IRI + name patterns for the people who worked on a project, or to define special kinds of div types. TAN-key files are a core part of the TAN schema, defining commonly used concepts in <token-definition>, <div-type>s, and so forth. For a complete list of predefined TAN keywords, see . For more details on how this format relates to other TAN formats, see .

Root Element and Head A TAN-key file has <TAN-key> as the root element. The <definitions> of a TAN-key file will be empty, or have <group-type>s.

Data (<code><link linkend="element-body"><body></link></code>) The <body> of a TAN-key file consists simply of <item>s, perhaps gathered into groups via <group> or @group. These groups have, at present, no effect upon other TAN files that import them. They have been useful, however, in more advanced uses of the format, particularly in the case of the standard TAN-key file for <div-type> (), where common types of divisions have been given a rudimentary typology suitable for transformations into other formats. Most frequently, a TAN-key file will contain items that have the IRI + name pattern. The only exception is when it contains <token-definition>s.

Morphological Concepts and Patterns (<code>TAN-mor</code>) TAN-mor files are used to describe the grammatical morphological features of a given language, to assign codes to those features, and to define rules governing the application of those codes. The format allows specificity, flexibility, and responsiveness. Assertions in the format may be doubted, rules may be expressed as contingent upon other conditions, and warnings and error messages may be sent to users who have used a pattern incorrectly, or not in accordance with best practices. The TAN-mor format is a kind of Schematron for the grammar of human languages. You specify the categories and codes for a given language, then you may create tests to define invalid uses of those codes. Those tests are attached to reports and assertions allowing editors of TAN-A-lm files to see not only if the rules have been violated, but why, and exactly where. This chapter should be read in close conjunction with .

Principles and Assumptions Certain assumptions and recommendations are made regarding morphology files, complementing the more general ones; see . TAN-mor files are restricted exclusively to describing the categories and rules for the grammar of a natural language. Editors of these files should be well versed in the grammar of the languages they are describing. The TAN-mor format has been designed with the assumption that patterns of word inflection and formation can be categorized, classified, named, and described. It has also been assumed that scholars may reasonably differ, perhaps radically, on how categories are defined and applied. TAN-mor is meant to allow those differences to be declared. It is up to other users to decide whether or not to adopt them. The TAN-mor format has also been designed to cater to two different approaches to morphological codes: categorized or uncategorized. Codes that are categorized are interpreted according not only to code but to position. For example, the categorized codes adopted by Perseus for morphological analysis of Greek, Latin, and other highly inflected languages stipulate ten categories, with the first two being the major and minor parts of speech, and the subsequent categories devoted to person, number, tense, and so forth. Each word that is analyzed must have a value, even if null, and the position of the code is important. Uncategorized codes simply give each grammatical feature a unique code, to be applied in any permitted sequence and combination. This approach is viable for any language (including highly inflected ones such as Greek or Latin), but it is most often found in tagging sets for languages that are not highly inflected, e.g., the Brown and Penn sets for English.

Root Element and Header The root element of a morphological rule file is <TAN-mor>. Zero or more <source> elements describe the grammars or related works that account for the rules declared in the TAN file. If the rules are not based upon any published work, then <source> may be omitted. Any TAN-mor file without a source will be assumed to be based upon the personal knowledge of the <person>s who edited the file. <definitions> is populated with the grammatical <feature>s that are considered operative. If a particular discipline customarily uses codes that are not allowed in @xml:id, you may wish to create an <alias>.

Data (<code><link linkend="element-body"><body></link></code>) The <body> of a TAN-mor file takes the customary optional attributes found in other TAN files (see @in-progress and ). The children of <body> begin with one or more <for-lang>s, followed by any number of <where>s (containing <assert>s or <report>s) or <category>s (if relying upon structured codes). <category>, used for structured codes, sorts <feature>s into groups, assigning them @code values that are unique within the <category>. <assert>s and <report>s are used to declare rules that must be followed, or must never be followed, by any dependent TAN-A-lm file. An <assert> or <report> will be checked only if the conditions in the enclosing <where> are met in the context of a given <m> in a dependent TAN-A-lm file: @m-matches: <m> matches the pattern (regular expression). @tok-matches: one of the values of <tok> in the given <ana> matches the pattern (regular expression). @m-has-features: <m> has the specified features. @m-has-how-many-features: <m> has the given number of features. An <assert> also has one or more of the truth conditions above. If the test proves false in a given <m> then the <m> will be marked as erroneous and the message included by the <assert> should be returned. <report> has the same effect, but the role of the test is the opposite: the error and message will be returned only if the test proves true.
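The interplay of <where>, <assert>, and <report> can be sketched as follows. This Python model rests on several assumptions that the text above does not settle: that the codes in <m> are uncategorized and space-separated, that a regular expression may match anywhere in the string, and that multiple conditions must all hold. The class and function names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class Condition:
    m_matches: Optional[str] = None
    tok_matches: Optional[str] = None
    m_has_features: Optional[FrozenSet[str]] = None
    m_has_how_many_features: Optional[int] = None

    def holds(self, m, toks):
        features = m.split()  # assumption: space-separated feature codes
        checks = []
        if self.m_matches is not None:
            checks.append(re.search(self.m_matches, m) is not None)
        if self.tok_matches is not None:
            checks.append(any(re.search(self.tok_matches, t) for t in toks))
        if self.m_has_features is not None:
            checks.append(set(self.m_has_features) <= set(features))
        if self.m_has_how_many_features is not None:
            checks.append(len(features) == self.m_has_how_many_features)
        return all(checks)  # assumption: every stated condition must hold


def check(m, toks, where, test, kind, message):
    """Return the message if the rule is violated, otherwise None."""
    if not where.holds(m, toks):
        return None  # the rule does not apply to this <m>
    outcome = test.holds(m, toks)
    # An assert complains when its test is false; a report when it is true.
    violated = (not outcome) if kind == "assert" else outcome
    return message if violated else None


verbs = Condition(m_has_features=frozenset({"verb"}))
three_features = Condition(m_has_how_many_features=3)
result = check("verb past", ["ran"], verbs, three_features,
               "assert", "verbs need three features")
```

In this invented rule, a verb with only two features fails the assertion, so `result` carries the message. The same <m> with a third feature, or any non-verb, would return `None`.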

TAN Catalog Files (<code>collection</code>) TAN catalog files are intended to facilitate the discovery of relevant TAN files and to support the XSLT function collection(). They catalog or index any or all TAN files within a local directory and perhaps its subdirectories. These catalog files must always be named catalog.tan.xml. They depart from all other TAN files in their structure. They have no namespace. They have neither body nor head. Rather, they are patterned off the catalog.xml description provided by Saxonica (), to . Any XML file passed to the stylesheet

/do things/populate/populate TAN catalog file.xsl

will automatically generate one of these files. The root element of a catalog file is <collection>, with children <doc>s that hold simple metadata about the TAN files that are in a directory and its subdirectories. Only TAN files may be registered in a <doc>.
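To make the shape of a catalog file concrete, here is a rough Python sketch that builds a <collection> of <doc>s for a directory tree. It is not the stylesheet named above, and it makes loud assumptions: a file is treated as a TAN file if its root element name begins with `TAN-` (which would miss TAN-TEI files), `href` follows the Saxonica catalog convention, and the `root` attribute is invented for illustration.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def local_name(tag):
    """Strip any {namespace} prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def build_catalog(directory):
    directory = Path(directory)
    collection = ET.Element("collection")  # no namespace, no head, no body
    for path in sorted(directory.rglob("*.xml")):
        if path.name == "catalog.tan.xml":
            continue  # do not catalog the catalog itself
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        name = local_name(root.tag)
        if name.startswith("TAN-"):  # assumption: crude test for a TAN file
            ET.SubElement(collection, "doc", {
                "href": path.relative_to(directory).as_posix(),
                "root": name,  # hypothetical metadata attribute
            })
    return collection


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "sub").mkdir()
    (base / "a.xml").write_text("<TAN-T/>", encoding="utf-8")
    (base / "sub" / "b.xml").write_text("<TAN-A-div/>", encoding="utf-8")
    (base / "other.xml").write_text("<html/>", encoding="utf-8")
    catalog = build_catalog(base)
    hrefs = [doc.get("href") for doc in catalog]
```

The non-TAN file is left out, and files in subdirectories are registered with paths relative to the directory that holds the catalog.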

Working with the Text Alignment Network Best Practices in Working with TAN Files In this chapter we discuss ways to manage, create, edit, and share TAN files. The material discussed here is non-normative. That is, these are suggestions based upon the experience of TAN users.

Local Setup TAN files may be set up in any kind of structure one wishes, but because those files are meant to be shared and interlinked, it is beneficial to use similar local conventions, so that relative URLs remain intact from one person's system to another. It is especially important that collections be able to "talk" to each other via local URLs in @href, so it is a good idea to name collection subdirectories as predictably as possible. Below is one way to organize the subdirectories of a typical setup for local TAN work:
library-[abbreviated name of creator 1]
    [abbreviated name of collection 1] (TAN-T(EI) files here)
        TAN-A-div (for TAN-A-div files)
        TAN-A-tok (for TAN-A-tok files)
        [etc.]
    [abbreviated name of collection 2]
        [etc.]
library-[abbreviated name of creator 2]
output (saved results from transformations, tests)
pre-TAN (third-party files to be used to populate TAN files, or to be converted into them)
TAN-2018 (the core TAN files, downloaded from the website or the Git repository)
stylesheets (stylesheets you have created)
tools (third-party tools)
Under this approach, you create a library subdirectory for each provider or creator (including one for yourself). For any TAN corpus you publish, you should advise what name should be used for the library subdirectory. Likewise, for any TAN corpus you download, you should use the library name suggested by the provider. Any time you create or download a collection of TAN files, you save them in a subdirectory within the creator's library subdirectory. Once again, you should advise on the name to be used, and use the names that are advised. If you use Git, it is advisable to make each collection its own Git repository. If you use GitHub, it is advisable to use your username for the library subdirectory. 
This two-step approach to subdirectories anticipates cases where different people will want to encode the same body of texts, particularly heavily quoted collections that will commonly be given very brief, descriptive names, e.g., bible, quran. When you name class 1 files (the filename, not the IRI name; see ), it is a good idea to start with an acronym or abbreviation for the work, followed by the language code, the editor's last name, and the date when the source scriptum was created or published. If a work lends itself to multiple reference schemes, you may need to include that in the filename. Some examples: ar.cat.grc.1949.minio-paluello-sem.xml (Aristotle's Categories, in Greek, 1949, edition by Minio Paluello, following a reference system based on semantic units [paragraphs, sentences, independent clauses]). apocr.eng.kjv.1760.xml (apocrypha, English, King James Version, 1760 edition) tlg0059.tlg031.perseus-grc1-Pl.Ti.xml (Plato's Timaeus in Greek) Class 2 files are tougher. Because they bring two or more files or concepts together, filenames could become very long or unpredictably structured. At this time, the best recommendation is to make sure that each class 2 file is put into a subdirectory, separate from class 1 files, and given a brief but meaningful name that points to the research question that motivated its creation. Some examples: ar.cat.grc.1949.minio-paluello-sem-TAN-LM-sample.xml (lexico-morphology for Aristotle's Categories, in Greek) nt.grc-syr.selections.TAN-A-tok.xml (word-for-word correspondences between the Syriac and Greek New Testaments) plato.TAN-A-div.xml Class 3 files are a bit easier. It is recommended that TAN-mor files begin with the language code, then an acronym for the person or group responsible for creating the features. TAN-key files are written generally to serve a specific project or collection, so the collection name and the TAN type should suffice. 
Examples: ar.cat.TAN-key.xml eng.kalvesmaki.com,2014.1.xml (tagging scheme #1 for English) If you have a local copy of someone else's TAN collection, and you wish to create TAN files that depend on them, you are in all likelihood going to use relative URLs to copies of the files stored on your local drive. It is recommended that you also include an absolute URL through a secondary <location>. The validation routine checks only the first document available. From time to time, you might comment out the first <location> and run the validation process again. If you share your dependent TAN file with someone else who does not have a local copy of the collection, the second <location>, with the absolute URL, will point to the original copy of the document. In a given project, you are likely to repeat basic information, particularly <person>, <role>, and <work>. Consider moving such repeated elements to a TAN-key file. It is almost always preferable to develop TAN-keys before resorting to <inclusion>s. Sorting out lines of inclusion can be confusing.
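The "first document available" behavior behind the <location> advice can be sketched as a simple fallback loop. The Python below is a simplification under stated assumptions: local locations are resolved against a base directory, and the availability of remote URLs is delegated to a caller-supplied predicate (`remote_ok`, a hypothetical name) so the example makes no network calls. The file name and URL are invented.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def is_remote(location):
    return urlparse(location).scheme in ("http", "https")


def first_available(locations, base_dir, remote_ok=lambda url: False):
    """Return the first location that can be retrieved, in document order."""
    for location in locations:
        if is_remote(location):
            if remote_ok(location):
                return location
        elif (Path(base_dir) / location).is_file():
            return location
    return None


# A relative local copy first, the absolute URL second.
locations = ["local-copy.xml", "http://example.org/collection/local-copy.xml"]

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "local-copy.xml").write_text("<TAN-T/>", encoding="utf-8")
    chosen_by_owner = first_available(locations, tmp)

# Someone without the local copy falls through to the absolute URL.
with tempfile.TemporaryDirectory() as empty:
    chosen_by_recipient = first_available(locations, empty,
                                          remote_ok=lambda url: True)
```

Because only the first hit is used, the remote copy is never consulted while the local one exists, which is why the text suggests occasionally commenting out the first <location> and validating again.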

Creating and populating TAN files TAN is a representational format. Every TAN file models some source. If those sources are non-digital, it is a relatively straightforward task to create and populate a TAN file. You just start editing everything by hand. In some cases, you might get a head start through a rough computer algorithm. For example, optical character recognition (OCR) on an edition might give you a dirty but useful start for a TAN-T file. Or OCR on an index might get you the outlines of a TAN-A-div file that indexes all quotations. Despite the computer's assistance, the majority of the task is converting non-digital claims into digital ones, and the manual effort is central. In many other cases, you are trying to take something that already exists digitally and convert it into the TAN format. In these cases, it is advised to think of the problem computationally, and do your best to resist the urge to manually edit anything. Suppose you find a Word file, a web page, or plain text that can serve as the basis for a TAN file. A common first impulse is to copy the desired content, paste it into the body of your TAN file, and then begin to manually correct and change things. You may find that you made a major mistake that cannot, at that point, be undone. Perhaps you have accidentally deleted all punctuation when you didn't mean to. Or you eliminated line breaks that were useful signals about where <div>s should be separated. Even if all goes well, after all that hard work you might find out that the pre-TAN data source has been updated, with errors corrected. If any significant time has elapsed, you may have forgotten what procedure you followed to convert the data. And even if you remember, you have to repeat the steps again, and plan for the next time when the pre-TAN source is updated. Or you find yourself making piecemeal corrections. For all these reasons, it is recommended that you set up an XSLT-based workflow to convert the data to TAN. 
When you find mistakes such as those described above, no harm is done. You can adjust your algorithm and re-run the process as often as you need, each time getting better and better results. This approach requires extra initial work. That is, you will need to get to know XSLT (or an alternative) well. Establishing a good transformation process can be time consuming. But the investment pays off in the long run. The routines you write for one set of files might save you some work for the next. Under this method, you should begin the process by creating a template TAN file that resembles, even if skeletally, your desired output. You then write XSLT-based rules that (1) make alterations to the input, (2) infuse the altered input into the template, then (3) save the new file. This method has been used successfully to handle several different kinds of conversion, including ones where the source files are updated very frequently. In such cases, the traditional cut-paste-and-edit method is not only unproductive; it is foolish. Writing transformations may seem laborious at first, because of how difficult it is to think about how best to handle and manipulate a TAN file. But there is a good chance that the labor you have in mind has already been done for you in the built-in TAN functions (see ). See also the files provided under the subdirectory /do things.
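The three-step method (alter the input, infuse it into a template, save) is recommended above in XSLT. For readers who think more readily in a scripting language, the same shape can be sketched in Python, which is a swapped-in alternative rather than the workflow the guidelines describe. The template is a bare skeleton and not a valid TAN-T file, and the `type` and `n` values are placeholders.

```python
import re
import xml.etree.ElementTree as ET

# A skeletal stand-in for a template TAN file (not schema-valid).
TEMPLATE = "<TAN-T><head><name>Template</name></head><body/></TAN-T>"


def alter(raw):
    """Step 1: split plain text on blank lines and normalize white space."""
    paragraphs = re.split(r"\n\s*\n", raw.strip())
    return [" ".join(p.split()) for p in paragraphs]


def infuse(template_xml, paragraphs):
    """Step 2: place each paragraph in a numbered <div> inside <body>."""
    root = ET.fromstring(template_xml)
    body = root.find("body")
    for n, text in enumerate(paragraphs, start=1):
        div = ET.SubElement(body, "div", {"type": "p", "n": str(n)})
        div.text = text
    return root


def save(root, path):
    """Step 3: write the populated file to disk."""
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


pre_tan = "One  line\ncontinues here.\n\nSecond paragraph."
populated = infuse(TEMPLATE, alter(pre_tan))
divs = populated.find("body").findall("div")
```

Because every step is a function of the raw input, an updated pre-TAN source only requires re-running the script. A mistake in `alter` is fixed once and applied everywhere.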

Sharing TAN files TAN files have been designed to be shared. Although individual TAN files are likely to be valuable on their own, even when removed from their context (e.g., via an email attachment), they may be critically crippled without their dependencies. As a result, TAN files are most likely to be distributed or published in groups, as collections. One way to distribute a collection is by making it available as a repository via Git or some other version control software (VCS). This approach has many advantages. The files become available to whoever wants them, and the editorial history is preserved. VCS features and tools are extremely fast and useful, and they allow users to modify TAN collections without impacting the original source. Collections may also be distributed through shared syncing services (e.g., Drive, Box, or Dropbox), or put on a server. In the latter case, it may be difficult for users to browse a collection, so you may wish to expose the collection as a compressed ZIP archive. This saves on your own bandwidth, and it still exposes the files for XML processing. But a ZIP archive is not suitable for linking from one TAN file to another, nor is it appropriate as a <master-location>. Unpacking a compressed file requires writing to the disk, which is treated as a security risk during validation, and so is disallowed. Such zipped archives are good ways to distribute a collection, but they should not be treated as a primary repository.

Doing things with TAN files The TAN format is not an end in itself. Indeed, there is no point to any file format, unless you can do things with it. TAN was designed to allow users to do unusual and interesting things. /do things, a major subdirectory in the project file, is populated with folders named with actions you might want to perform on a TAN file, and they contain XSLT stylesheets that fall into that area of activity. Those stylesheets are the front end of a long process that begins with TAN validation. Whenever you validate a TAN file, the Schematron validation file (the companion to the RELAX-NG validation file) is invoked. But that Schematron file is small, and the majority of the work is done by a very large library of XSLT stylesheets that resolve and expand the document, marking its errors along the way. That extensive library of XSLT we call here the function library (we use both words, to distinguish the collection from individual, generic functions). The function library provides definitive interpretations of the TAN format, marking parts that are in error. The function library is also an important step to creating your own tools or stylesheets, anticipating, as it does, many things you might want to do with a TAN file. Certain considerations that have been put into the design of the function library are worth noting. First, the function library has a structure similar to that of the RELAX-NG schemas. That is, the primary access point is through one of the XSLT files named after a primary TAN format. You may also wish to include (or import) the extra functions, . During Schematron validation, it is quite common for the computer to calculate all global variables, even those that are unused. Therefore the function library defines only those global variables that are central to the validation process. The most complex and important global variables are the two principal transformations to the TAN file itself, $self-resolved and $self-expanded. 
$self-resolved is the result of changing the TAN file through some key steps, including (1) stamping the original URI of the file as @base-uri in the root element, (2) converting all numeration systems to Arabic numerals, (3) replacing all elements that have @include with resolved forms of the element, (4) replacing elements with @which with their resolved IRI + name form, (5) stamping elements with @q and a number representing the nth place of that element relative to its original siblings (included elements are given the @q of their host element). If any errors arise, the relevant information is placed in the resolved file as an <error> or <warning>, based upon the master list of errors. @q, @base-uri, and other newly introduced attributes and elements are not defined by the TAN schema. $self-expanded is the result of putting the file through a series of expansions. As noted earlier, there are three levels of Schematron validation (terse, normal, and verbose), and there are three corresponding levels of $self-expanded. Expansion is intended chiefly to support validation, and so checks for errors. It does so by normalizing the text, converting each attribute to one or more elements (one per value), checking id references, and doing a number of other activities. For a class 2 file, $self-expanded includes not only an expansion of itself, but an expansion of its dependencies (TAN-T or TAN-mor). When taken to the verbose level, a TAN-A-div file will include in its $self-expanded special documents with a root element <TAN-T-merge>. Each work has one TAN-T-merge file, a collation into a single reference structure of all the relevant sources. All these expansions provide an excellent starting point for conversion into other formats. 
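A toy version of three of the resolution steps (1, 2, and 5) may help fix ideas. The real work is done by the XSLT function library. This Python sketch only imitates the visible effect, handles Roman numerals alone among numeration systems, and reads @q as a plain sibling position, which is one interpretation of the description above rather than a confirmed detail.

```python
import re
import xml.etree.ElementTree as ET

ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(numeral):
    values = [ROMAN[ch] for ch in numeral.lower()]
    total = 0
    for i, value in enumerate(values):
        # A smaller value before a larger one is subtracted (iv = 4).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def resolve_sketch(root, base_uri):
    # Step 1: record where the file came from.
    root.set("base-uri", base_uri)
    # Step 5 (as interpreted here): stamp each element with its position
    # among its original siblings.
    for parent in root.iter():
        for position, child in enumerate(list(parent), start=1):
            child.set("q", str(position))
    # Step 2: convert Roman numerals in @n to Arabic numerals. Single
    # letters such as "c" are ambiguous with alphabetic numbering, and
    # this sketch does not try to settle that.
    for element in root.iter():
        n = element.get("n")
        if n and re.fullmatch(r"[ivxlcdm]+", n, flags=re.IGNORECASE):
            element.set("n", str(roman_to_int(n)))
    return root


doc = ET.fromstring('<body><div n="iv"><div n="xii"/><div n="2"/></div></body>')
resolved = resolve_sketch(doc, "file:/example/sample.xml")
outer = resolved[0]
```

After the call, the outer division has `n="4"` and `q="1"`, and its first child has `n="12"`. As the text notes, @q and @base-uri exist only in the resolved form and are not defined by the TAN schema.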
The next most important global variables deal with referred files:

Global variables for referred files
Element       Raw (first document available)   Resolved               Expanded
<inclusion>   $inclusions-1st-da               $inclusions-resolved   (none)
<key>         $keys-1st-da                     $keys-resolved         $keys-expanded
<source>      $sources-1st-da                  $sources-resolved      $self-expanded[position() gt 1]
<see-also>    $see-alsos-1st-da                $see-alsos-resolved    (none)

The column labeled "raw" lists variables that hold the first documents available, without alteration. Variables in the next column hold the resolved form, following the same process described above for $self-resolved. The resolved forms of <inclusion> and <see-also> are sufficient for validation; therefore they do not have expanded versions. Expanded sources are always found after the first document in $self-expanded. These global variables have been described above very generally. To understand better how their values are calculated, please consult the function library. The other components of the function library (the functions, keys, and templates) cannot be described conveniently or succinctly here. But they are critical parts of building successful stylesheets that transform TAN files. The next chapter provides a comprehensive, detailed view of how they work.