<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://docbook.org/xml/5.0/rng/docbook.rng" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="http://docbook.org/xml/5.0/rng/docbook.rng" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<book xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink"
   xmlns:xi="http://www.w3.org/2001/XInclude" version="5.0">
   <info>
      <title>The Text Alignment Network: Official Guidelines</title>
      <legalnotice>
         <info>
            <title>Text Alignment Network: Official Guidelines</title>
            <copyright>
               <year>2015-present</year>
               <holder>Joel Kalvesmaki</holder>
            </copyright>
            <author>
               <personname>Joel Kalvesmaki</personname>
               <email>kalvesmaki@gmail.com</email>
            </author>
         </info>
         <remark> This document and the files it describes are licensed under a Creative Commons
            Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0/
         </remark>
      </legalnotice>
      <revhistory>
         <info>
            <releaseinfo>Latest version: <link
                  xlink:href="http://textalign.net/release/TAN-2018/guidelines/"
                  >http://textalign.net/release/TAN-2018/guidelines/</link>.</releaseinfo>
         </info>
         <revision>
            <revnumber>2018</revnumber>
            <date>2018-01-09</date>
            <revdescription>
               <para>Formats: <link
                     xlink:href="http://textalign.net/release/TAN-2018/guidelines/xhtml/index.xhtml"
                     >HTML</link> • <link
                     xlink:href="http://textalign.net/release/TAN-2018/guidelines/pdf/TAN-2018-guidelines.pdf"
                     >PDF</link> • <link
                     xlink:href="http://textalign.net/release/TAN-2018/guidelines/main.xml"
                     >Docbook</link> (master)</para>
               <warning>
                  <para>In case of contradictions, apparent or not, between these guidelines and the
                     core TAN files, priority should be given to the RELAX-NG schemas (compact
                     syntax), then to the functions, and then to these guidelines.</para>
               </warning>
            </revdescription>
         </revision>
      </revhistory>
   </info>
   <part xml:id="general_overview">
      <title>General Overview</title>
      <chapter>
         <title>Introduction</title>
         <section xml:id="tan_definition">
            <title>Definition and purpose </title>
            <para>The Text Alignment Network (TAN) is a suite of highly regulated XML formats
               intended to allow scholars to align and share texts and textual analysis at a maximal
               level of syntactic and semantic interoperability. TAN is particularly suited to
               textual works with multiple versions (translations, paraphrases), and to expressing
               quotations, word-for-word alignments, and grammatical features.</para>
            <para>TAN files are simple, modular, and networked, allowing users, working
               independently and collaboratively, to edit, study, and annotate shared files. The
               extensive validation rules depend upon a library of functions that definitively
               interpret the format, thereby helping anyone studying or editing the files, and
               providing a foundation for customized tools and applications.</para>
            <para>Although expressive of scholarly nuance and complexity, the TAN format has been
               designed to benefit everyone, scholars and non-scholars alike, and can be used
               broadly for multilingual publishing, language learning, and machine translation. </para>
         </section>
         <section>
            <title>Rationale and Purpose</title>
            <para>Scholars working with texts frequently need to study numerous versions. Some texts
               have been lost in their original form and can be studied only through later
               translations, paraphrases, or fragmentary quotations. Even when an original survives,
               its later versions are often worth study, revealing as they do something of how
               words, concepts, and works were preserved, altered, or combined across the
               generations and cultures who read and circulated the versions.</para>
            <para>Such textual comparison requires words, sentences, paragraphs, and other text
               segments to be aligned. Such alignment can be challenging. Some versions might be
               defective, or follow an idiosyncratic sequence. One editor may have divided the text
               according to a system not easily applied to other versions. Identifying which words
               or phrases in a translation correspond to which words or phrases in the original
               might result in complex, overlapping spans. And even larger segments such as
               sentences and paragraphs may not line up well. Further, every version of a text is
               part of a much larger, complex history of text reuse, and a complete study of that
               context requires engagement with other works and other languages, and therefore
               collaboration across projects and fields of study.</para>
            <para>The Text Alignment Network (TAN) XML format facilitates the exchange and scholarly
               analysis of multiple versions of texts. TAN files adopt a syntax suitable for humans
               to read and edit, expressive enough to allow scholars to register doubt and nuance,
               and sufficiently structured to permit complex computer-based queries across
               independent datasets. The format is actually a suite of formats, built modularly,
               with each format designed to allow an editor to focus exclusively on a single set of
               tasks. The format encourages or requires editors to declare their views or
               assumptions about language and texts in a structured manner, so that other users of
               the data (both human and computer) can determine whether the data is suitable for
               their needs. Because nearly all TAN data must be expressed in a way that computers can
               parse, the information can be used in semantic web applications.</para>
            <para>TAN has been designed to support two kinds of scholarly activity: <emphasis
                  role="bold">creation</emphasis> and <emphasis role="bold"
               >research</emphasis>.</para>
            <para>When we <emphasis role="bold">create</emphasis> our primary sources or analyses of
               them, we normally want what we create to be useful to our colleagues. TAN was
               designed to augment the utility of such creative scholarly activities as:</para>
            <para>
               <itemizedlist>
                  <listitem>
                     <para>Creating and sharing a transcription of a particular version of a textual
                        work such that it is most likely to align with any other TAN version of that
                        text created by someone else;</para>
                  </listitem>
                  <listitem>
                     <para>Creating an index of quotations that is semantically rich and can be
                        applied to any other version of the quoting or quoted works;</para>
                  </listitem>
                  <listitem>
                     <para>Specifying exactly (e.g., word-for-word) where a source and its
                        translation correspond, even when there may be messy overlapping or
                        ambiguous relationships, or where doubt or alternative possibilities of
                        alignment need to be expressed;</para>
                  </listitem>
                  <listitem>
                     <para>Listing the lexicomorphological features of each word in a text or a
                        language such that the linguistic data has meaning above and beyond a
                        particular coding scheme, and can be collated with lexicomorphological data
                        for other languages.</para>
                  </listitem>
               </itemizedlist>
            </para>
            <para>TAN files that are published and shared produce a decentralized but interoperable
               corpus of texts. As this TAN-compliant corpus expands across linguistic,
               chronological, and spatial boundaries, third-party tools and applications can expand
               the repertoire of <emphasis role="bold">research</emphasis> questions beyond any
               single corpus, to help scholars fruitfully investigate broader, comparative questions
               such as:<itemizedlist>
                  <listitem>
                     <para>For classical Greek texts, how were words with the root -ιστημι ("stand")
                        translated into ancient Latin? In what specific ways did the vocabulary of
                        technical terms shift from pre-Christian translations into later, Christian
                        ones?</para>
                  </listitem>
                  <listitem>
                     <para>How does the reformed Chinese translation technique of Sanskrit Buddhist
                        texts, attested by Dao An (312-385 CE), compare to reforms in the seventh
                        and eighth centuries of Syriac translations of Greek texts?</para>
                  </listitem>
                  <listitem>
                     <para>How do Arabic translations of Greek texts from the Abbasid period differ
                        from contemporaneous translations from Sanskrit into Arabic?</para>
                  </listitem>
                  <listitem>
                     <para>Can an anonymous English translation of a modern French novel be
                        identified with known translators of French novels from the same
                        period?</para>
                  </listitem>
                  <listitem>
                     <para>How do present-day translations of official United Nations documents
                        differ across languages?</para>
                  </listitem>
               </itemizedlist></para>
            <para>This is not to say that the TAN format, in itself, answers such questions. It
               merely lays a framework within which such questions can be investigated. Some other caveats:<itemizedlist>
                  <listitem>
                     <para>Although TAN comes with an extensive library of functions and templates,
                        it is not a tool per se. It does not provide software or applications to
                        create, edit, or display TAN-compliant files, nor does it dictate how such
                        tools should behave. Rather, it allows you or a developer (especially an XML
                        developer) to create customized applications and tools.</para>
                  </listitem>
                  <listitem>
                     <para>The TAN formats are specialized. They supplement, and do not replace,
                        other common text formats such as TEI, Docbook, and so forth, or other
                        alignment formats such as XLIFF or TMX. Converting from TAN into these
                        formats is usually straightforward, but will normally entail loss. On the
                        other hand, converting from one of these formats into TAN normally cannot be
                        completely automated, because the TAN format has scholarly expectations that
                        are not required in the other formats. Conversion must be given careful
                        thought.</para>
                  </listitem>
                  <listitem>
                     <para>TAN has a restricted field of inquiry (defined and explained in these
                        guidelines). The format is not suitable for many lines of inquiry, e.g.,
                        representing how a text was displayed in a particular edition.</para>
                  </listitem>
                  <listitem>
                     <para>TAN has been designed to serve those who prioritize legibility and
                        readability over computational efficiency. The extensive TAN validation
                        routines—essential to aiding interoperability—can be taxing to run on
                        numerous or enormous files.</para>
                  </listitem>
               </itemizedlist></para>
         </section>
         <section>
            <title>About this version</title>
            <para>These guidelines were written for version 2018 of the TAN format. This version is
               semi-stable. Changes will be made quite reluctantly to the RELAX-NG schema files.
               Other files may be changed, but most work will go into the next version of
               TAN.</para>
         </section>
         <section xml:id="tan_participation">
            <title>Participation</title>
            <para>At present, changes are made regularly to the schemas and functions. If you
               have a TAN library, sharing it with other participants, particularly via Git, will
               help test any changes that have been made, and allow others to offer updates or
               corrections to your library.</para>
            <para>Participants in testing, using, and developing the Text Alignment Network are
               welcome. Our core purpose is to develop and maintain the schemas, the guidelines, and
               the functions and templates. Inquiries about participation should be sent to the
               project manager, <link xlink:href="http://kalvesmaki.com/">Joel Kalvesmaki</link>, by
               email: kalvesmaki at gmail.com.</para>
            <para>Official announcements are made by <link
                  xlink:href="http://groups.google.com/group/textalign?hl=en">email (Google
                  Group)</link> and by <link xlink:href="https://twitter.com/textalign"
                  >Twitter</link>.</para>
         </section>
      </chapter>
      <chapter xml:id="gentle_guide">
         <title>Starting off with the TAN Format</title>
         <para>If you are new to markup languages, or if you are unfamiliar with acronyms such as
               <emphasis role="italic">XML</emphasis>, <emphasis role="italic">RDF</emphasis>,
               <emphasis role="italic">XPath</emphasis>, or technical terms such as
               <emphasis>Unicode</emphasis>, you should start with this chapter, which uses a simple
            example to illustrate the steps typically taken to create and edit TAN files. By the
            end, you will have a sense of how to create and edit a simple collection of TAN
            transcriptions and alignments. If you are familiar with basic markup concepts, you may
            wish to read through the chapter very quickly, or skip it altogether.</para>
         <para>The discussion touches on a number of general concepts that will be introduced only
            briefly. If you find the concept new or confusing, follow the prompts for further
            reading to get better grounded in a particular topic or technology. </para>
         <section>
            <title>Creating TAN Transcription and Alignment Data</title>
            <para>Let us take a simple example, that of aligning two English versions of the nursery
               rhyme <emphasis role="italic">Ring-a-ring-a-roses</emphasis>, sometimes known as
                  <emphasis role="italic">Ring around the Rosie</emphasis>. Our goal here is to
               publish two versions of the nursery rhyme in the TAN format so that they are most
               likely alignable with any other TAN version of the poem that someone might
               create.</para>
            <para>We begin by finding previously published versions. In this case we have taken an
               interest in the versions published in <link xlink:href="http://lccn.loc.gov/12032709"
                  >1881</link> and <link xlink:href="http://lccn.loc.gov/87042504">1987</link> (one
               published in the UK and the other, the US). Each of these books has other rhymes,
               but we've already decided to focus upon the one particular nursery rhyme, so we
               transcribe those parts and nothing else:<table frame="all">
                  <title>Ring around the Rosie</title>
                  <tgroup cols="2">
                     <colspec colname="c1" colnum="1" colwidth="1.0*"/>
                     <colspec colname="c2" colnum="2" colwidth="1.0*"/>
                     <thead>
                        <row>
                           <entry>1881 (UK) version</entry>
                           <entry>1987 (US) version</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry>
                              <para>Ring-a-ring-a-roses,</para>
                              <para>A pocket full of posies;</para>
                              <para>Hush! Hush! Hush! Hush!</para>
                              <para>We're all tumbled down.</para>
                           </entry>
                           <entry>
                              <para>Ring-a-round the rosie,</para>
                              <para>A pocket full of posies,</para>
                              <para>Ashes! Ashes!</para>
                              <para>We all fall down.</para>
                           </entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table></para>
            <para>We must be sure to save each of the two transcriptions as plain text, preferably
               with <code>.xml</code> at the end of each file name. Do not bother with a word
               processor (Word, OpenOffice, Google Docs, and so forth), because those programs are
               too sophisticated for our work. They sometimes generate erroneous data, even when you
               export to plain text. We will not be concerned with italics, colors, fonts, margins,
               and so forth, so much better for our work is a <link
                  xlink:href="http://en.wikipedia.org/wiki/Text_editor">text editor</link>, which
               works only on plain text. But even those do not check to see if the rules of the
               format have been followed. So the best tool is an <link
                  xlink:href="http://en.wikipedia.org/wiki/XML_editor">XML editor</link>, which does
               the same thing a text editor does, but saves much typing and prevents syntax errors.
               More important, an XML editor will tell us when our TAN file is invalid, and will
               provide information and help in our TAN files.<note>
                  <para>Software suitable for your needs comes in many styles and prices. In
                     addition to the links in the paragraph above, you may wish to visit the
                     comparative lists for both <link
                        xlink:href="http://en.wikipedia.org/wiki/Comparison_of_text_editors">text
                        editors</link> and <link
                        xlink:href="http://en.wikipedia.org/wiki/Comparison_of_XML_editors">XML
                        editors</link>. TAN was developed using <link
                        xlink:href="https://www.oxygenxml.com">oXygen</link>, which is very powerful
                     but possibly confusing to new users. To avoid exasperation or despair, take
                     advantage of tutorials and documentation associated with the XML editor you
                     have chosen. </para>
               </note></para>
            <para>Our first task is to get these two versions into separate files with the
               appropriate markup. Each TAN transcription file has two major parts: a head and a
               body. For now, we focus on only the second part, the body, as well as the few
               necessary preliminary lines that stand above both the head and the body. First, the
               1881 (UK) version:
               <programlisting><emphasis role="bold">&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring01">
    &lt;head>
    . . . . . . .
    &lt;/head>
    &lt;body xml:lang="eng" in-progress="false">
        &lt;div type="line" n="1"></emphasis>Ring-a-ring-a-roses,<emphasis role="bold">&lt;/div>
        &lt;div type="line" n="2"></emphasis>A pocket full of posies;<emphasis role="bold">&lt;/div>
        &lt;div type="line" n="3"></emphasis>Hush! Hush! Hush! Hush!<emphasis role="bold">&lt;/div>
        &lt;div type="line" n="4"></emphasis>We're all tumbled down.<emphasis role="bold">&lt;/div>
    &lt;/body>
&lt;/TAN-T></emphasis></programlisting>
               And now the 1987 (US) version:
               <programlisting><emphasis role="bold">&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring02">
   &lt;head>
   . . . . . . .
   &lt;/head>
   &lt;body xml:lang="eng" in-progress="false">
      &lt;div type="l" n="1"></emphasis>Ring-a-round the rosie,<emphasis role="bold">&lt;/div>
      &lt;div type="l" n="2"></emphasis>A pocket full of posies,<emphasis role="bold">&lt;/div>
      &lt;div type="l" n="3"></emphasis>Ashes! Ashes!<emphasis role="bold">&lt;/div>
      &lt;div type="l" n="4"></emphasis>We all fall down.<emphasis role="bold">&lt;/div>
   &lt;/body>
&lt;/TAN-T></emphasis></programlisting>
            </para>
            <para>These are standard eXtensible Markup Language (XML) files. (If you are already
               familiar with XML you may wish to skip ahead to the next section.) XML lets you take
               a text or a collection of data and structure it via markup. In the examples above,
               the markup is in boldface.</para>
            <para>Each file begins with a <emphasis role="bold">prolog</emphasis>, marked by the
               lines that begin with <code>&lt;?</code>. The first line in the prolog simply states
               that what follows is an XML document. The next two lines are <emphasis role="bold"
                  >processing instructions</emphasis> that point to the files that will be used to
               check to see whether or not our data is valid. For now we will not explain the
               details of those first three lines, which will be identical, or nearly so, from one
               TAN file to the next. We can simply cut and paste those lines when we want to start a
               new one.</para>
            <para>The fourth line is the opening tag of what is called the root <emphasis
                  role="bold">element</emphasis>, here called <code><link linkend="element-TAN-T"
                     >&lt;TAN-T></link></code>. That opening tag, <code>&lt;TAN-T...></code>, is
               answered by a closing tag, <code>&lt;/TAN-T></code>, the last line. The paired-tag
               relationship is true for all the other elements in this example. <code><link
                     linkend="element-head">&lt;head></link></code> is answered by
                  <code>&lt;/head></code>, <code><link linkend="element-body"
                  >&lt;body></link></code> by <code>&lt;/body></code> and each
                  <code>&lt;div...></code> by <code>&lt;/div></code>. These elements nest within or
               beside each other, but they never overlap. (The prohibition on overlapping elements
               is one of the cardinal rules of XML.) This relationship means that every XML file can
               be thought of as a tree, with the root at the trunk and the enveloped elements as
               branches, terminating in metaphorical leaves. It is helpful to use the tree metaphor
               when we describe the path we take, toward either the leaves or the root. In this
               manual, we may use the terms <emphasis role="italic">rootward</emphasis> and
                  <emphasis role="italic">leafward</emphasis> when we want to trace movement up and
               down the hierarchy of an XML document.</para>
            <para>An XML document is also profitably thought of as a family tree, a metaphor that
               provides commonly used terminology. In our examples above, <code><link
                     linkend="element-TAN-T">&lt;TAN-T></link></code> is the <emphasis role="italic"
                  >parent</emphasis> of <code><link linkend="element-body">&lt;body></link></code>,
               and <code><link linkend="element-body">&lt;body></link></code> the parent of the four
                     <code><link linkend="element-div">&lt;div></link></code> elements. Likewise,
               each <code><link linkend="element-div">&lt;div></link></code> is the <emphasis
                  role="italic">child</emphasis> of <code><link linkend="element-body"
                     >&lt;body></link></code>, and <code><link linkend="element-body"
                     >&lt;body></link></code> is the child of <code><link linkend="element-TAN-T"
                     >&lt;TAN-T></link></code>. Distant parental relationships can be described with
               the terms <emphasis role="italic">ancestor</emphasis> and <emphasis role="italic"
                  >descendant</emphasis>. <code><link linkend="element-TAN-T"
                  >&lt;TAN-T></link></code> is the ancestor of every element it encompasses, and
               every element encompassed by <code><link linkend="element-TAN-T"
                  >&lt;TAN-T></link></code> is its descendant. Paratactic relationships are also
               important. <code><link linkend="element-head">&lt;head></link></code> and <code><link
                     linkend="element-body">&lt;body></link></code> are <emphasis role="italic"
                  >siblings</emphasis> to each other, and every <code><link linkend="element-div"
                     >&lt;div></link></code> is a sibling to every other <code><link
                     linkend="element-div">&lt;div></link></code>.</para>
            <para>Inside of the opening tags for the <code><link linkend="element-TAN-T"
                     >&lt;TAN-T></link></code>, <code><link linkend="element-body"
                  >&lt;body></link></code>, and <code><link linkend="element-div"
                  >&lt;div></link></code> elements are pairs of text joined by an equals sign,
               collectively called an <emphasis role="bold">attribute</emphasis>. The left side of
               the equals sign is the attribute name, and on the right side, within the quotation
               marks, is the attribute value. <code><link linkend="element-TAN-T"
                  >&lt;TAN-T></link></code> has two attributes, <code>@xmlns</code> and <code><link
                     linkend="attribute-id">@id</link></code> (when we discuss an attribute outside
               its original context, we often preface the name with @). We will skip
                  <code>@xmlns</code> for now; this attribute (actually, a pseudo-attribute)
               specifies the <emphasis role="bold">namespace</emphasis> of the XML file, an advanced
               topic that need not be discussed now. </para>
            <para>The <code><link linkend="attribute-id">@id</link></code>, however, is quite
               important. Every TAN file has an <code><link linkend="attribute-id">@id</link></code>
               that uniquely and permanently identifies the file itself. It should not be changed,
               even as we make edits. The name under which we save the file can be changed, but keep
               in mind that other people may be depending on that file name, and may be unable to
               find the file if it changes. </para>
            <para>The value of <code><link linkend="attribute-id">@id</link></code> is always what
               is called a tag uniform resource name (tag URN). It always starts with
                  <code>tag:</code>, followed by an email address or domain name that we own or
               owned. (It is okay to use an obsolete address; this part is only for identification.)
               After that email address or domain name comes a comma (no spaces) and a date on which
               we owned it, in the international standard format of year, month, and day, joined by
               hyphens, e.g., 2014-12-31. If we leave off a day value, it is assumed to be the first
               of the month; if we leave off the month value it is assumed to be January. In the
               examples above, <code>parkj@textalign.net,2015</code> points to the person who owned
               that particular email address on the stroke of midnight (Coordinated Universal Time)
               January 1, 2015. (In this example, we are pretending to be that person.) After that
               comes a colon, and then any name we wish to assign to the file. </para>
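            <para>The parts of a tag URN can be seen by setting a few variations side by side. This
               is a sketch only: the three opening tags are alternatives (a file has just one root
               element), the comments are our own annotations, and the month value, the full date,
               and the domain name <code>example.org</code> are placeholders invented for
               illustration.
               <programlisting>&lt;!-- email address + year: ownership is asserted for January 1, 2015 -->
&lt;TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring01">

&lt;!-- email address + year and month: the first of the month (March 1, 2015) is assumed -->
&lt;TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015-03:ring01">

&lt;!-- domain name + full date: ownership is asserted for December 31, 2014 -->
&lt;TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:example.org,2014-12-31:ring01">
</programlisting>
               In every case the pattern is the same: <code>tag:</code>, then the email address or
               domain name, a comma, the date, a colon, and finally the name we have chosen for the
               file.</para>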
            <para>We have anticipated a simple collection of texts, so we've called the files
                  <code>ring01</code> and <code>ring02</code>. (If we run out of names, or want to
               restart, we can simply use a new email-date preface, e.g.,
                  <code>parkj@textalign.net,2015-01-02</code>.)</para>
            <para>The idea here is that hundreds of years from now, even when that email address is
               defunct or owned by someone else, someone might still be able to identify the person
               responsible for the file.</para>
            <para>The element <code><link linkend="element-body">&lt;body></link></code> contains
               our transcription. <code><link linkend="attribute-xmllang">@xml:lang</link></code>,
               required, specifies the principal language of the transcribed text. We use the
               standard 3-letter abbreviation for English. (See later in the guide for more complex
               language requirements.) By saying that <code><link linkend="attribute-in-progress"
                     >@in-progress</link></code> is <code>false</code>, we indicate that we have
               finished our transcription and have no further plans to develop it. It doesn't mean
               that the file is free of errors. We can make corrections later. It just means that no
               more substantive revisions are planned, and any further changes will be
               restricted to corrections of typographical errors. This attribute is optional. If it
               is left off, our TAN file is assumed to be a work in progress, and it serves as a
               kind of warning to anyone who might want to use it.</para>
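            <para>The two situations just described might look as follows. Only the opening tag of
                  <code>&lt;body></code> is shown, and the comments are our own annotations, offered
               as an illustration rather than as required markup.
               <programlisting>&lt;!-- finished: only corrections of typographical errors are expected -->
&lt;body xml:lang="eng" in-progress="false">

&lt;!-- attribute left off: the file is assumed to be a work in progress -->
&lt;body xml:lang="eng">
</programlisting></para>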
            <para>Our transcription has been divided into four <code><link linkend="element-div"
                     >&lt;div></link></code> elements. How we divide up the work is entirely up to
               us. But we must make sure that every bit of text is enclosed by a leafmost
                     <code><link linkend="element-div">&lt;div></link></code>. That is, every
                     <code><link linkend="element-div">&lt;div></link></code> must be the parent of
               only other <code><link linkend="element-div">&lt;div></link></code>s, or none at all.
               We cannot have a <code><link linkend="element-div">&lt;div></link></code> that mixes
               text with other elements (such as other <code><link linkend="element-div"
                     >&lt;div></link></code>s). The values of <code><link linkend="attribute-type"
                     >@type</link></code> and <code><link linkend="attribute-n">@n</link></code>
               indicate, respectively, the type of division and the name of the division. We have
               used <code>line</code> in the first example, but we could easily have also used
                  <code>l</code> (as we did in the second) or <code>ln</code> or any other phrase
               that we think will make intuitive sense to other users. The choice is arbitrary (we
               will see why below). We have used Arabic numerals for the values of <code><link
                     linkend="attribute-n">@n</link></code>, but the value, once again, could have
               been anything. We could have used Roman numerals, or some other naming scheme that is
               standard in the field.</para>
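            <para>The leaf-div rule can be sketched as follows. This is an illustration only, not
               part of our transcription: the <code>stanza</code> wrapper and its
               <code>@n</code> value are hypothetical, and the wording of line 2 follows the
               tokens named in the alignment file below, with punctuation
               omitted.<programlisting>&lt;!-- Valid: each div holds either only other divs or only text -->
&lt;div type="stanza" n="1">
    &lt;div type="line" n="2">A pocket full of posies&lt;/div>
&lt;/div>

&lt;!-- Invalid: the outer div mixes text with a child div -->
&lt;div type="stanza" n="1">
    A pocket full
    &lt;div type="line" n="2">of posies&lt;/div>
&lt;/div></programlisting></para>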
            <para>Aside from the <code><link linkend="element-head">&lt;head></link></code> element
               (discussed later), that's all we need in the transcription. We can now move to
               alignment and annotation.</para>
            <para>There are two different types of alignment, one emphasizing breadth, the other,
               depth. The broad type of alignment, called TAN-A-div, allows us to specify TAN
               transcriptions of as many versions of as many works as we wish, and to make claims
               about those texts. The other type of alignment, emphasizing depth, is called
               TAN-A-tok and allows us to take any two (and no more) TAN transcriptions, create
               word-to-word (or better put, token-to-token) relationships, and specify what type of
               relationship holds between each set of aligned words. TAN-A-div is suitable for work
               that focuses on the general alignment of multiple versions of one or more works at a
               single time. TAN-A-tok is for highly detailed, precise alignment of two text
               versions.</para>
            <para>For our example, we start with a TAN-A-div file (once again suppressing
                     <code><link linkend="element-head"
               >&lt;head></link></code>):<programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-A-div.rnc" type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-A-div.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;TAN-A-div xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring-alignment">
    &lt;head>
    . . . . . . .
    &lt;/head>
    &lt;body/>
&lt;/TAN-A-div></programlisting></para>
            <para>In the prolog, the first line is identical to the first line of our transcription
               files. The second and third lines, the processing instructions, are identical, aside
               from pointing to the validation files for alignment. Even the fourth line looks like
               the transcription file, other than the new name for the root element, <code><link
                     linkend="element-TAN-A-div">&lt;TAN-A-div></link></code>, and the new value for
                     <code><link linkend="attribute-id">@id</link></code>.</para>
            <para>The penultimate line, <code>&lt;body/></code>, is what is called an empty element,
               and is equivalent to <code><link linkend="element-body"
                  >&lt;body></link>&lt;/body></code>, a shorthand syntax for elements that contain
               nothing. It will become apparent, when we discuss <code><link linkend="element-head"
                     >&lt;head></link></code> below, why our <code><link linkend="element-body"
                     >&lt;body></link></code> can be empty.</para>
            <para>The other kind of alignment, TAN-A-tok, takes a bit more work, because we must
               first identify words that correspond with each other. Even before we do that, we need
               to decide what kind of relationship holds between the two texts. Let us pretend, for
               the sake of example, that the 1987 version is a direct descendant (and therefore
               variation) of the 1881 one. So our task is to show exactly what parts of the
               older version correspond to those of the newer one. We will simplify in this case,
               and assume an interest only in words, ignoring spaces and punctuation. We will
               also adopt the term <emphasis>tokens</emphasis> instead of <emphasis>words</emphasis>
                  (<emphasis role="italic">word</emphasis> is notoriously difficult to define, and
               has connotations lacking from <emphasis>token</emphasis>).</para>
            <para>We now create a TAN-A-tok
               file:<programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-A-tok.rnc" type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-A-tok.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;TAN-A-tok xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:TAN-A-tok,ring01+ring02">
    &lt;head>
    . . . . . . .
    &lt;/head>
    &lt;body bitext-relation="B-descends-from-A" reuse-type="adaptation" in-progress="false">
        &lt;!-- Examples of picking tokens by number -->
        &lt;align>
            &lt;tok src="ring1881" ref="1" ord="1"/>
            &lt;tok src="ring1987" ref="1" ord="1"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="1" ord="2"/>
            &lt;tok src="ring1987" ref="1" ord="2"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="1" ord="3"/>
            &lt;tok src="ring1987" ref="1" ord="3"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="1" ord="4"/>
            &lt;tok src="ring1987" ref="1" ord="4"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="1" ord="5"/>
            &lt;tok src="ring1987" ref="1" ord="5"/>
        &lt;/align>
        &lt;!-- Examples of picking tokens by value -->
        &lt;align>
            &lt;tok src="ring1881" ref="2" val="A"/>
            &lt;tok src="ring1987" ref="2" val="A"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="2" val="pocket"/>
            &lt;tok src="ring1987" ref="2" val="pocket"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="2" val="full"/>
            &lt;tok src="ring1987" ref="2" val="full"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="2" val="of"/>
            &lt;tok src="ring1987" ref="2" val="of"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="2" val="posies"/>
            &lt;tok src="ring1987" ref="2" val="posies"/>
        &lt;/align>
        &lt;!-- Examples of picking ranges of tokens -->
        &lt;align>
            &lt;tok src="ring1881" ref="3" ord="1, 2"/>
            &lt;tok src="ring1987" ref="3" ord="1"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="3" ord="3 - 4"/>
            &lt;tok src="ring1987" ref="3" ord="2"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="4" ord="1"/>
            &lt;tok src="ring1987" ref="4" ord="1"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="4" ord="2"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="4" ord="3"/>
            &lt;tok src="ring1987" ref="4" ord="2"/>
        &lt;/align>
        &lt;!-- examples of using "last" -->
        &lt;align>
            &lt;tok src="ring1881" ref="4" ord="last-1"/>
            &lt;tok src="ring1987" ref="4" ord="last-1"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="4" ord="last"/>
            &lt;tok src="ring1987" ref="4" ord="last"/>
        &lt;/align>
    &lt;/body>
&lt;/TAN-A-tok></programlisting></para>
            <para>Once again, the first four lines, the prolog and root element, should look
               familiar, with the only significant changes being the names of the validation files,
               the name of the root element (<code><link linkend="element-TAN-A-tok"
                     >&lt;TAN-A-tok></link></code>) and the value of <code><link
                     linkend="attribute-id">@id</link></code>.</para>
            <para>The heart of the data is <code><link linkend="element-body"
                  >&lt;body></link></code>, which has, in addition to <code><link
                     linkend="attribute-in-progress">@in-progress</link></code>, two more
               attributes, <code><link linkend="attribute-reuse-type">@reuse-type</link></code>,
               which specifies the default type of text reuse between the two sources, and
                     <code><link linkend="attribute-bitext-relation">@bitext-relation</link></code>,
               which specifies how the versions relate to each other. Our two values,
                  <code>B-descends-from-A</code> and <code>adaptation</code>, are arbitrary names
               that we define in the <code><link linkend="element-head">&lt;head></link></code>
               (discussed later). </para>
            <para>You will also notice some lines that begin <code>&lt;!--</code> and end
                  <code>--></code>. These are <emphasis role="bold">comments</emphasis>, and can be
               placed within or beside any element, and can be any number of lines. If you wish to
               ignore, say temporarily, some elements, an XML editor can help you toggle them on and
               off as comments.</para>
            <para><code><link linkend="element-body">&lt;body></link></code> is the parent of one or
               more <code><link linkend="element-align">&lt;align></link></code> elements, each of
               which correlates a set of tokens in the two texts through its <code><link
                     linkend="element-tok">&lt;tok></link></code> children. Each <code><link
                     linkend="element-tok">&lt;tok></link></code> has, in this example, three
               attributes. <code><link linkend="attribute-src">@src</link></code> takes a nickname
               (an <code><link linkend="attribute-id">@id</link></code> reference) that points to
               one of the two transcriptions; we have used <code>ring1881</code> and
                  <code>ring1987</code> but we could have just as easily used anything else such as
                  <code>uk</code> and <code>us</code>. <code><link linkend="attribute-ref"
                     >@ref</link></code> has a value that points to a specific <code><link
                     linkend="element-div">&lt;div></link></code> in the source transcription; and
                     <code><link linkend="attribute-pos">@pos</link></code> or <code><link
                     linkend="attribute-val">@val</link></code> specify which token is intended,
               either by word number (<code><link linkend="attribute-pos">@pos</link></code>) or
               text of the actual word (<code><link linkend="attribute-val">@val</link></code>).
               Either technique is fine, and can be mixed, as in the example. You may also notice
               that the comma and hyphen can be used in <code><link linkend="attribute-pos"
                     >@pos</link></code> to point to multiple words within the same <code><link
                     linkend="element-div">&lt;div></link></code>, and that <code>last</code> and
                  <code>last-X</code> (where <code>X</code> is a digit) can be used to point to a
               word token relative to the last one in a <code><link linkend="element-div"
                     >&lt;div></link></code>.</para>
            <para>Each <code><link linkend="element-align">&lt;align></link></code> can establish
               one-to-one, one-to-many, many-to-one, or many-to-many relationships between words
               from the two texts. Words may feature in multiple <code><link linkend="element-align"
                     >&lt;align></link></code> elements (a kind of overlapping that doesn't offend
               the XML rule against overlapping). And if an <code><link linkend="element-align"
                     >&lt;align></link></code> has <code><link linkend="element-tok"
                  >&lt;tok></link></code> elements belonging to only one source, such as in the
               fourth-to-last <code><link linkend="element-align">&lt;align></link></code> above, we
               have what is called, in these guidelines, a <emphasis>half-null alignment</emphasis>.
               This half-null alignment indicates that the second word of line four of the 1881
               version is excluded from the act that we have called <code>adaptation</code> (which
               is, as we shall see, defined in the <code><link linkend="element-head"
                     >&lt;head></link></code>). If this were a translation, it would be as if we
               were saying that this word was excluded from the translation. (A half-null alignment
               containing only tokens of the later source might point to words that the translator
               added.) </para>
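            <para>As a sketch of the reverse case, a half-null alignment that lists only tokens of
               the later source would look like the following. The token chosen is hypothetical
               and is not a claim about our two versions; the attribute names are simply copied
               from the listing
               above.<programlisting>&lt;!-- Hypothetical: a token of the 1987 version with no counterpart in the 1881 version -->
&lt;align>
    &lt;tok src="ring1987" ref="4" ord="2"/>
&lt;/align></programlisting></para>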
            <para>A half-null alignment should not be confused with our own silence. As creators of
               this file, we are under no obligation to indicate every word-for-word correspondence.
               If we fail to mention certain words, all that can be implied is that we opted not to
               say anything about them.</para>
            <para>We could have aligned the two texts in different ways. Perhaps further study will
               reveal that we were in error to associate the second "ring" with "round" in line 1.
               We can make corrections, even after publication, and signal the change to users of
               our data. There are also ways to express doubt or alternative opinions. We can even
               correlate fragments of tokens (letters, prefixes, infixes, or suffixes). All these
               more advanced uses are discussed elsewhere in these guidelines.</para>
         </section>
         <section>
            <title>The Principles of TAN Metadata (<code><link linkend="element-head"
                     >&lt;head></link></code>)</title>
            <para>At this point, we have finished four TAN files: two transcriptions, one TAN-A-div
               file, and one TAN-A-tok file. But we've suppressed the <code><link
                     linkend="element-head">&lt;head></link></code> in all of them, until now.
               Before getting into details, we need first to explain a few TAN principles.</para>
            <para>Unlike <code><link linkend="element-body">&lt;body></link></code>, which carries
               the raw data, <code><link linkend="element-head">&lt;head></link></code> contains
               what is oftentimes called metadata. That is, <code><link linkend="element-head"
                     >&lt;head></link></code> describes the raw data. Because the TAN format is
               intended primarily to serve scholars, and because the format is heavily regulated
               (that is, there are numerous validation rules that supplement the basic ones behind
               XML), the metadata requirements are stricter than they are for Word documents, HTML,
               TEI, or other formats you might know better. Scholars who find our file really need
               to know some essential things before they can responsibly use it. For example, what
               are the sources we have used? Who produced the data? When? What key assumptions have
               been made in producing the data? What licenses govern the data? The questions are not
               difficult to answer, but they are critical, and we should take some time to provide
               accurate answers.</para>
            <para>Some metadata questions are specific to certain formats. For example, in a
               TAN-A-tok file, we ask what relationship holds between the two sources, a question that
               makes no sense for a TAN-T file. But other questions apply universally across all TAN
               files, no matter what kind of data. As we go from one TAN format to the next, we need
               to deal as much as we can with similar structures and expectations. This reduces any
               potential confusion in creating and editing a TAN file, and helps other people using
               our data to find the information they want. More importantly, what we write in one
               file might save us some work in another.</para>
            <para>The rigorous scholarly requirements for TAN metadata are offset somewhat by
               another principle that was adopted in the design of TAN, namely, that each format's
                     <code><link linkend="element-head">&lt;head></link></code> should focus
               exclusively upon the data in <code><link linkend="element-body"
                  >&lt;body></link></code> and not other things. That is to say, in a transcription,
               we should definitely indicate what our source is. But we should not try to write a
               catalog entry, or even a structured citation, for the book we have used. We are not
               library catalogers. Our obligation is merely to point somewhere a reader can get more
               complete information. The <code><link linkend="element-head">&lt;head></link></code>
               is designed to help us to stay focused on the task and data at hand.</para>
            <para>TAN was also designed with the assumption that all metadata should be useful to
               both humans and computers. For our example above, we must describe the work we have
               chosen (<emphasis role="italic">Ring around the Rosie</emphasis>) in a way that is
               comprehensible not just to the reader but to the computer.</para>
            <para>Take for example the 1881 book we have used for our first transcription. For the
               human reader we can say simply something like "Kate Greenaway, <emphasis>Mother
                  Goose</emphasis>, New York, G. Routledge and sons [1881]". But computers need a
               more controlled, predictable syntax before they can be directed to the correct
               edition of <emphasis>Mother Goose</emphasis> (or rather to a digital surrogate of the
               edition). The human-readable string is too complex, and syntactically opaque. A more
               computer-friendly identifier would be international standard book numbers (ISBNs),
               which distinguish the 1984 version of <emphasis>Mother Goose</emphasis> illustrated
               by Kayoko Okumura from the one of the same year illustrated by William Joyce. The
               ISBNs for the Okumura version, 0671493159, and for Joyce's, 0394865340, can be
               converted into machine-actionable strings called uniform resource names (URNs), in
               this case <code>urn:isbn:0-671493159</code> and <code>urn:isbn:0-394865340</code>.
               (Our 1881 version was published before the ISBN program was introduced. We will see
               below another way to name it.)</para>
            <para>URNs are families of formalized naming schemes regulated by a central body
               (Internet Assigned Numbers Authority, IANA) to ensure permanent, persistent, unique
               names for various types of things. There are URN schemes for journals (via ISSNs),
               articles (DOIs), and movies (ISANs), which means that anyone can use them to refer
               unambiguously to a particular kind of thing.</para>
            <para>All URNs are simply names. They don't tell you where an object is. To provide a
               unique <emphasis role="italic">location</emphasis>, however, we have uniform
               resource locators (URLs), which might be much more familiar from daily use of the
               Internet, e.g., <code>http://academia.edu</code>. Like URNs, URLs are also centrally
               regulated, with individuals or organizations buying the rights to domain names from a
               central registry (usually through a third-party vendor).</para>
            <para>Both URNs and URLs can be thought of as the same type of thing, namely, a
               uniform resource identifier (URI), sometimes called an internationalized resource
               identifier (IRI). An IRI is a type of URI that allows any alphabet in Unicode, not
               just Latin. URIs/IRIs are, in essence, nothing more than the set of all URNs and
               URLs. These four acronyms can be easily confused, and it is best to disambiguate them
               by thinking of the last letter in each. UR<emphasis role="bold"
                  >I</emphasis>s/IR<emphasis role="bold">I</emphasis>s <emphasis role="bold"
                  >I</emphasis>ncorporate both <emphasis role="bold">L</emphasis>ocators
                  (UR<emphasis role="bold">L</emphasis>) and <emphasis role="bold">N</emphasis>ames
                  (UR<emphasis role="bold">N</emphasis>).</para>
            <para>If those acronyms are confusing, don't worry. For our purposes here, they are
               pretty much the same, and from this point onward we'll use merely the term IRI
               (unless we really mean a location, which we'll call a URL).</para>
            <para>IRIs are essential to a system frequently called the semantic web or linked (open)
               data, an agreed way of writing and processing data that relies upon IRIs and a simple
               data model. The semantic web allows people to make assertions in a way that computers
               can "understand." If people, working independently, happen to use the same IRIs to
               describe the same things, then computers can be programmed to make associations
               between disparate, heterogeneous datasets. This allows us to find connections across
               disciplines and projects, to marshal computers to make inferences we might not make
               on our own, and to create a network of linked data.</para>
            <para>TAN has been designed to be linked-data friendly, and so requires in its
                     <code><link linkend="element-head">&lt;head></link></code> almost all data to
               be representable not just in a human-readable form but also computer-readable, as an
               IRI. </para>
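            <para>In practice this pairing usually takes the form of an IRI alongside a name. The
               following sketch is illustrative only: it borrows the <code>&lt;IRI></code> and
               <code>&lt;name></code> children of <code>&lt;source></code> from the listing in
               the next section, and applies them to the 1984 Okumura edition mentioned above,
               which is not one of our sources. The wording of the name is our
               own.<programlisting>&lt;source>
    &lt;!-- computer-readable: the ISBN expressed as a URN -->
    &lt;IRI>urn:isbn:0-671493159&lt;/IRI>
    &lt;!-- human-readable: a free-text description -->
    &lt;name>Mother Goose, illustrated by Kayoko Okumura, 1984&lt;/name>
&lt;/source></programlisting></para>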
            <para>Our first task, then, in writing the <code><link linkend="element-head"
                     >&lt;head></link></code> sections of our four TAN files is to look for IRI
               vocabulary that will be familiar to the people most likely to use our files. In
               trying to find suitable IRIs, we will find that the persons, things, and concepts we
               want to describe will range from the highly familiar to the unfamiliar.</para>
            <para><emphasis role="italic">Highly familiar</emphasis>: The two books that provide the
               basis of our transcription are well catalogued and generally known. A number of
               services provided by librarians provide a controlled IRI vocabulary that can be used
               by anyone to describe uniquely a particular version of a book. <link
                  xlink:href="http://www.worldcat.org">WorldCat</link> (run by OCLC) and the <link
                  xlink:href="http://catalog.loc.gov">Library of Congress</link> are good examples.
               In our case, we have found accurate Library of Congress IRIs for both editions of
                  <emphasis>Mother Goose</emphasis>: <code>http://lccn.loc.gov/12032709</code> and
                  <code>http://lccn.loc.gov/87042504</code>. Observe that these two IRIs are also,
               perhaps confusingly, URLs (locations). If we paste these strings into our browser, we
               retrieve a record that describes the book. This locator does not lead us to the book
               itself, only to information <emphasis role="italic">about</emphasis> the book.
               Nevertheless, the Library of Congress has decided to make this URL also a name for
               the book. Anyone who owns a domain name can designate a URL as a name for an object.
               And that allows them to set up their server to also return information about the
               object the IRI names. This subtle ambiguity—that the URL both names an entity and is
               a location for a webpage—can sometimes be confusing to those who are new to the
               semantic web, because such URLs name in reality two types of things: an entity and a
               location to find out more information about that entity. </para>
            <para>We now have IRIs for the sources. Let's now find an IRI to name the work,
                  <emphasis role="italic">Ring around the Rosie</emphasis>. The work is widely
               known, and even has a <link
                  xlink:href="http://en.wikipedia.org/wiki/Ring_a_Ring_o%27_Roses">Wikipedia
                  entry</link>. That Wikipedia entry is a benefit. The Universities of Leipzig and
               Mannheim and Openlink Software have collaborated on a project called <link
                  xlink:href="http://wiki.dbpedia.org/About">DBPedia</link>, which is committed to
               providing a unique IRI for every Wikipedia entry in the major languages. The DBPedia
               IRI in this case is <code>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</code>.
               Once again, this is both a name and a locator. It names a specific intangible object,
               namely a nursery rhyme that we've called <emphasis>Ring around the Rosie</emphasis>,
               no matter what specific version. But if you put that name into your browser, you will
               get back more information about that named object.</para>
            <para><emphasis role="italic">Familiar, but only in small circles</emphasis>: We will
               need to have names for some of the people who edited the file. Here we're not
               interested in the authors of our books. We are interested in crediting the people who
               helped make the TAN file. Most people who write and edit our TAN file will not be
               well-known, public figures. If they are, and if they are famous enough to have a
               Wikipedia entry, then a DBPedia IRI could be used. Or if some of the contributors are
               also published authors, there is a good chance that they are listed in the databases
               of either <link xlink:href="http://viaf.org">VIAF</link> or <link
                  xlink:href="http://isni.org">ISNI</link>, both of which publish unique IRIs for
               persons. </para>
            <para>Many contributors to TAN files, however, will not be listed in these general
               databases. In those cases, we can name these participants with an IRI that we "own."
               We have already done something like this by assigning tag URNs to our four
               TAN files (the value of <code><link linkend="attribute-id">@id</link></code> in
               the root element). Our editors can do the same thing. If a student Robin Smith has
               been helping with proofreading, Robin can take an email address (even one that
               doesn't work any more) and a date when the email address was used and construct a tag
               URN such as <code>tag:smith.robin@example.com,2012:self</code>. This has a slight
               drawback in that we cannot type this string into our browser to find out more about
               Robin, but it at least allows us to assign a name that will not be confused with
               the Robin Smith identified by ISNI as
                  <code>http://isni.org/isni/0000000043306406</code>. (If we want to go a step
               further, Robin could mint an IRI from a domain name that she owns, and set up a linked
               data service that offers more information, human- and computer-readable. But this is
               not required, and it can be a lot of work to maintain.)</para>
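            <para>The parts of such a tag URN, and one way it might be attached to a person, can be
               sketched as follows. The <code>&lt;person></code> element and its
               <code>@xml:id</code> value are assumptions made for illustration (they do not
               appear in the listings so far); the <code>&lt;IRI></code> and
               <code>&lt;name></code> children follow the pattern used elsewhere in the
               <code>&lt;head></code>.<programlisting>&lt;!-- tag: + email address + comma + date of use + colon + any name we choose -->
&lt;person xml:id="smith">
    &lt;IRI>tag:smith.robin@example.com,2012:self&lt;/IRI>
    &lt;name>Robin Smith&lt;/name>
&lt;/person></programlisting></para>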
            <para>Now we come to a more difficult challenge. We have to assign an IRI to the
               relationship that we claim holds between two text-bearing objects. Making that clear
               is important, because if we had a different view on how one related to the other, it
               would probably affect the specifics of our word-for-word alignments. </para>
            <para>We are assuming for the sake of illustration that the version published in the
               1987 <emphasis>Mother Goose</emphasis> is a direct descendant of the 1881 version.
               Because no suitable IRI vocabulary yet exists for such concepts, TAN has coined an
               IRI that can be used by anyone wishing to declare that the second of two sources
               descends from the first through an unknown number of intermediaries:
                  <code>tag:textalign.net,2015:bitext-relation:a/x+/b</code>.</para>
            <para>We face a similar issue when thinking about text reuse. We generally consider the
               1987 version to be an adaptation of the 1881 version. But there are no stable,
               well-published IRI vocabularies for text reuse. So we adopt a TAN-coined IRI,
                  <code>tag:textalign.net,2015:reuse-type:adaptation:general</code>.</para>
            <para>In both cases above, we could have come up with our own vocabulary. But the idea
               here is that we should be sharing a common vocabulary whenever possible. The built-in
               TAN vocabulary simply gives us a convenient lingua franca for describing some
               important but abstract concepts. For other examples of IRIs coined by TAN, see <xref
                  linkend="keywords-master-list"/>.</para>
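            <para>To connect these two IRIs with the short names used in the
               <code>&lt;body></code> of our TAN-A-tok file, the <code>&lt;head></code> needs
               definitions along the following lines. This is a sketch under assumptions: the
               element names <code>&lt;bitext-relation></code> and <code>&lt;reuse-type></code>
               and the use of <code>@xml:id</code> are guesses modeled on the attribute names,
               and the wording of each <code>&lt;name></code> is ours. Only the two IRIs and the
               two short names come from the discussion
               above.<programlisting>&lt;!-- Assumed structure; consult the element reference for the actual syntax -->
&lt;bitext-relation xml:id="B-descends-from-A">
    &lt;IRI>tag:textalign.net,2015:bitext-relation:a/x+/b&lt;/IRI>
    &lt;name>B descends from A through an unknown number of intermediaries&lt;/name>
&lt;/bitext-relation>
&lt;reuse-type xml:id="adaptation">
    &lt;IRI>tag:textalign.net,2015:reuse-type:adaptation:general&lt;/IRI>
    &lt;name>general adaptation&lt;/name>
&lt;/reuse-type></programlisting></para>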
            <para><emphasis role="italic">Generally unfamiliar</emphasis>: Some things or concepts
               will be known to very few people, perhaps only us. If we plan to refer to that
               thing or concept often, it is preferable to coin a tag URN, as described above. But
               in some cases, we might find that a tag URN we minted for some concept or thing was,
               in hindsight, misleading or poorly constructed, because we hadn't thought as
               thoroughly as we could have about the category. If we wish to avoid these kinds of
               situations, we can assign a randomly generated IRI called a universally unique
               identifier (UUID), e.g., <code>urn:uuid:3fd9cece-b246-4556-b229-48f22a5ae2e0</code>.
               UUID URNs are very useful. The likelihood that a randomly generated UUID will be
               identical to any other UUID is astronomically small, making them reliably unique
               names for anything (barring someone copying and reusing that UUID URN to name some
               other object or concept). Numerous free UUID generators can be found online.</para>
            <para>To humans, a UUID on its own is meaningless, and rather ugly. But it is a start.
               We always have the option, later, of adding an IRI. It's perfectly fine to give one
               object or concept multiple IRIs. But the reverse is never true. One should never use
               the same IRI to identify more than one object or concept.</para>
         </section>
         <section>
            <title>Creating TAN Metadata (<code><link linkend="element-head"
               >&lt;head></link></code>)</title>
            <para>Now that we have explored various IRI vocabularies for concepts around our
               versions of <emphasis>Ring-a-ring-a-roses</emphasis>, we can complete the
               metadata in our four TAN files. Let us start with the TAN-T file of the 1881
               version:<programlisting>    &lt;head>
        &lt;name>TAN transcription of Ring a Ring o' Roses&lt;/name>
        &lt;master-location href="http://textalign.net/release/TAN-2018/examples/ring-o-roses.eng.1881.xml"/>
        &lt;license>
            &lt;IRI>http://creativecommons.org/licenses/by/4.0/&lt;/IRI>
            &lt;name>Attribution 4.0 International&lt;/name>
        &lt;/license>
        &lt;licensor who="park"/>
        &lt;source>
            &lt;IRI>http://lccn.loc.gov/12032709&lt;/IRI>
            &lt;name>Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]&lt;/name>
        &lt;/source>
        &lt;definitions>
            &lt;work>
                &lt;IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses&lt;/IRI>
                &lt;name>"Ring a Ring o' Roses" or "Ring Around the Rosie"&lt;/name>
            &lt;/work>
            &lt;div-type xml:id="line">
                &lt;IRI>http://dbpedia.org/resource/Line_(poetry)&lt;/IRI>
                &lt;name>line of poetry&lt;/name>
            &lt;/div-type>
            &lt;person xml:id="park">
                &lt;IRI>tag:parkj@textalign.net,2015:self&lt;/IRI>
                &lt;name>Jenny Park&lt;/name>
            &lt;/person>
            &lt;role xml:id="creator">
                &lt;IRI>http://schema.org/creator&lt;/IRI>
                &lt;name xml:lang="eng">creator&lt;/name>
            &lt;/role>
        &lt;/definitions>
        &lt;resp roles="creator" who="park"/>
        &lt;change when="2014-08-13" who="park">Started file&lt;/change>
    &lt;/head></programlisting></para>
            <para><code><link linkend="element-name">&lt;name></link></code>, the human readable
               counterpart to the <code><link linkend="attribute-id">@id</link></code> that is
               inside the root element, can be anything. And we can supply more than one <code><link
                     linkend="element-name">&lt;name></link></code>, in case we wish to provide it
               in different languages or variations.</para>
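            <para>As a minimal sketch of that last point, a head might carry the same name in two
               languages. The German wording is a hypothetical translation, and the use of
                  <code>@xml:lang</code> to tell the two names apart is an assumption made for
               illustration; neither appears in the example
               file:<programlisting>    &lt;name xml:lang="eng">TAN transcription of Ring a Ring o' Roses&lt;/name>
    &lt;!-- hypothetical second name; the German wording is illustrative only -->
    &lt;name xml:lang="deu">TAN-Transkription von Ring a Ring o' Roses&lt;/name></programlisting></para>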
            <para><code><link linkend="element-master-location">&lt;master-location></link></code>
               is mandatory only if we have claimed through <code><link
                     linkend="attribute-in-progress">@in-progress</link></code> that the file is no
               longer in progress. One or more of these elements provide URLs where master versions
               of the file are kept (and updated). We provide this as a courtesy to others who might
               be using our data. Anyone who validates a local copy of the file will be warned if it
               does not match the master version, and be told of the most recent changes. This allows
               users to find out if changes have been made, and it allows us to make corrections
               and quietly notify other users of our alterations, without having to keep track of
               who is using the file.</para>
            <para><code><link linkend="element-license">&lt;license></link></code> specifies the
               license under which we are releasing our data. This element has nothing to do with
               the copyright of the source we have used (although, having been published in 1881,
               the book is clearly in the public domain). That is, we are declaring the rights
               attached to the data, not its source. This once again gets to the TAN metadata
               principle of describing our data and not other things. We can if we want describe the
               license of the source we have used (see the rest of the guidelines for guidance), but
               we absolutely must declare whether we have placed additional strictures on the
               dataset we have created. In this example, we have released the data under a Creative
               Commons license. The child element <code><link linkend="element-IRI"
                  >&lt;IRI></link></code> specifies the IRI assigned by Creative Commons, and
                     <code><link linkend="element-name">&lt;name></link></code> is the
               human-readable form.</para>
            <para><code><link linkend="element-licensor">&lt;licensor></link></code>, by means of
                     <code><link linkend="attribute-who">@who</link></code>, indicates who holds the
               license. In this case it points to a person, Jenny Park.</para>
            <para>The conjunction of <code><link linkend="element-IRI">&lt;IRI></link></code> and
                     <code><link linkend="element-name">&lt;name></link></code>, the <emphasis>IRI +
                  name pattern</emphasis>, is a recurrent feature of TAN files. We may include any
               number of <code><link linkend="element-IRI">&lt;IRI></link></code> or <code><link
                     linkend="element-name">&lt;name></link></code> elements in an IRI + name
               pattern. But if we do so, we are stating that they all name the same thing, not
               different things.</para>
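            <para>For instance, the work in our example could in principle be given a second
               identifier and a second name, as in the following sketch. The tag URN is
               hypothetical, minted only for this illustration. Because both IRIs and both names
               sit in one IRI + name pattern, they are all asserted to name one and the same
               work:<programlisting>    &lt;work>
        &lt;IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses&lt;/IRI>
        &lt;!-- hypothetical tag URN, coined for this illustration only -->
        &lt;IRI>tag:parkj@textalign.net,2015:work:ring-o-roses&lt;/IRI>
        &lt;name>Ring a Ring o' Roses&lt;/name>
        &lt;name>Ring Around the Rosie&lt;/name>
    &lt;/work></programlisting></para>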
            <para><code><link linkend="element-source">&lt;source></link></code> points, through its
               IRI + name pattern, to a computer- and human-readable description of the book we have
               chosen. </para>
            <para><code><link linkend="element-definitions">&lt;definitions></link></code> contains
               data that is specific to TAN file types, to define our terminology. </para>
            <para><code><link linkend="element-work">&lt;work></link></code> uses the IRI + name
               pattern to name the work we have chosen to transcribe. <code><link
                     linkend="element-div-type">&lt;div-type></link></code> specifies the type of
               divisions we have chosen to use to segment the transcription. In a more complex text,
               there would be several <code><link linkend="element-div-type"
                  >&lt;div-type></link></code>s. Each one has an <code><link
                     linkend="attribute-xmlid">@xml:id</link></code>, which takes as a value some
               nickname that we wish to use for <code><link linkend="attribute-type"
                  >@type</link></code> values of <code><link linkend="element-div"
                  >&lt;div></link></code>s.</para>
            <para>The IRI + name pattern is also used for <code><link linkend="element-person"
                     >&lt;person></link></code>, which describes who was involved in creating the
               data, and <code><link linkend="element-role">&lt;role></link></code>. We may have as
               many <code><link linkend="element-person">&lt;person></link></code>s and <code><link
                     linkend="element-role">&lt;role></link></code>s as we wish. In this case, Jenny
               Park has been given a tag URN. The <code><link linkend="element-IRI"
                  >&lt;IRI></link></code> value of <code><link linkend="element-role"
                     >&lt;role></link></code> comes from the vocabulary of <link
                  xlink:href="http://schema.org">schema.org</link>, which is maintained by Bing,
               Google, and Yahoo! in conjunction with the W3C (the nonprofit organization dedicated
               to universal Internet standards), but we could have used Dublin Core or some other
               IRI vocabulary describing behaviors, responsibilities, and roles.</para>
            <para>Those roles and persons get combined after the <code><link
               linkend="element-definitions">&lt;definitions></link></code> , in a <code><link
                  linkend="element-resp">&lt;resp></link></code>, which stipulates who was responsible for what roles.<note>
                  <para>If you decide to modify someone else's TAN file, then you become responsible
                     for changes, not the original person or organization. Your first order of business
                     should be to add a <code><link linkend="element-person">&lt;person></link></code>
                     to the head, identifying yourself. You need not change the document's
                           <code><link linkend="attribute-id">@id</link></code>, but you should take
                     responsibility for any changes you make, otherwise you are incorrectly
                     attributing your changes to someone else.</para>
               </note></para>
            <para>Remember that <code><link linkend="element-head">&lt;head></link></code> is
               focused on the data, not its sources, so the claim that Jenny Park is the creator
               pertains only to the data. No inference should be made about who created the source.
               If someone wants that information, or anything else about the source, they should
               pursue the identifier we have provided under <code><link linkend="element-source"
                     >&lt;source></link></code>.</para>
            <para><code><link linkend="element-change">&lt;change></link></code> has attributes
                     <code><link linkend="attribute-when">@when</link></code> and <code><link
                     linkend="attribute-who">@who</link></code> that specify who made the
               change/comment and when. The value of <code><link linkend="attribute-when"
                     >@when</link></code> is always a date plus optional time formatted according to
               the standard <code>YYYY-MM-DD</code> + time (optional). <code><link
                     linkend="attribute-who">@who</link></code> always carries a value that refers
               to an <code>agent/<link linkend="attribute-xmlid">@xml:id</link></code>. Neither
                     <code><link linkend="element-change">&lt;change></link></code> nor <code><link
                     linkend="element-comment">&lt;comment></link></code> take <code><link
                     linkend="element-IRI">&lt;IRI></link></code> or any other children.</para>
            <para>So now we have finished one transcription file's metadata. The other one will look
               similar, but we'll also take a couple of
               shortcuts:<programlisting>    &lt;head>
      &lt;name>TAN transcription of Ring around the Rosie&lt;/name>
      &lt;master-location>ring-o-roses.eng.1987.xml&lt;/master-location>
      &lt;license>
         &lt;IRI>http://creativecommons.org/licenses/by/4.0/deed.en_US&lt;/IRI>
         &lt;name>Creative Commons Attribution 4.0 International License&lt;/name>
         &lt;desc>This data file is licensed under a Creative Commons Attribution 4.0 International
            License. The license is granted independent of rights and licenses associated with the
            source. &lt;/desc>
      &lt;/license>
      &lt;licensor who="park"/>
      &lt;source>
         &lt;IRI>http://lccn.loc.gov/87042504&lt;/IRI>
         &lt;name>Mother Goose, from nursery to literature / by Gloria T. Delamar, 1987.&lt;/name>
      &lt;/source>
      &lt;definitions>
         &lt;work>
            &lt;IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses&lt;/IRI>
            &lt;name>Ring around the Rosie&lt;/name>
         &lt;/work>
         &lt;div-type xml:id="l" which="line (verse)"/>
         &lt;person xml:id="park" roles="creator">
            &lt;IRI>tag:parkj@textalign.net,2015:self&lt;/IRI>
            &lt;name xml:lang="eng">Jenny Park&lt;/name>
         &lt;/person>
         &lt;role xml:id="creator" which="creator"/>
      &lt;/definitions>
      &lt;alter>
         &lt;normalization which="no hyphens"/>
      &lt;/alter>
      &lt;resp roles="creator" who="park"/>
      &lt;change when="2014-10-24" who="park">Started file&lt;/change>
      &lt;comment when="2014-10-24" who="park">See p. 39 of source.&lt;/comment>
   &lt;/head></programlisting></para>
            <para>One significant difference is that three of the elements that normally take the
                  <xref linkend="pattern-iri_and_name"/> have been replaced with a simpler form that
               takes merely <link linkend="attribute-which"><code>@which</code></link> and
                     <code><link linkend="attribute-xmlid">@xml:id</link></code>. For a number of
               elements, TAN has predefined vocabulary that can be invoked by calling it (through
                  <link linkend="attribute-which"><code>@which</code></link>) and giving it an
               abbreviation to be used elsewhere in the document (<code><link
                     linkend="attribute-xmlid">@xml:id</link></code>).</para>
            <para>After <code><link linkend="element-definitions">&lt;definitions></link></code>
               comes a new element, <code><link linkend="element-alter">&lt;alter></link></code>,
               which contains a <code><link linkend="element-normalization"
                     >&lt;normalization></link></code> statement that declares, through the name and
               the IRI in the underlying TAN definition, that we have opted to remove word-break
               line-end hyphenation. This provides a cautionary note to users of our data who might
               value line-end hyphenation. Any number of <code><link linkend="element-normalization"
                     >&lt;normalization></link></code>s can be used to describe any alterations we
               might have made in our transcription. In other transcriptions we could use this
               feature to declare other suppressions, such as editorial comments or footnote
               signals.</para>
            <para>Note that the value of <code>div-type/<link linkend="attribute-xmlid"
                     >@xml:id</link></code> here, the letter <code>l</code>, differs from our
               previous transcription file, <code>line</code>. Even though we have adopted a
               different nickname, they are treated as equivalent because in each file we have
               defined <code>l</code> or <code>line</code> with the same IRI,
                  <code>http://dbpedia.org/resource/Line_(poetry)</code>. A computer that later
               looks for files with lines of poetry will not care about <code>l</code> and
                  <code>line</code>, but will look at the underlying IRI that defines these terms.
               This exemplifies how linked data (see above) can support our work. We are free to use
               abbreviations and terms that make sense to us, yet we tie those abbreviations to IRIs
               that have valence outside our project.</para>
            <para>Now that we have created the metadata for our transcriptions, we turn to the
               alignment files. Those <code><link linkend="element-head">&lt;head></link></code>s
               will look slightly different. We start with the TAN-A-div
               file:<programlisting>    &lt;head>
       &lt;name>div-based alignment of multiple versions of Ring o Roses&lt;/name>
       &lt;master-location>ringoroses.div.1.xml&lt;/master-location>
       &lt;license which="by_4.0"/>
       &lt;licensor who="park"/>
       &lt;source xml:id="eng-uk">
          &lt;IRI>tag:parkj@textalign.net,2015:ring01&lt;/IRI>
          &lt;name>Transcription of ring around the roses in English (UK)&lt;/name>
          &lt;location when-accessed="2015-03-10">../TAN-T/ring-o-roses.eng.1881.xml&lt;/location>
       &lt;/source>
       &lt;source xml:id="eng-us">
          &lt;IRI>tag:parkj@textalign.net,2015:ring02&lt;/IRI>
          &lt;name>Transcription of ring around the roses in English (US)&lt;/name>
          &lt;location when-accessed="2014-08-13">../TAN-T/ring-o-roses.eng.1987.xml&lt;/location>
       &lt;/source>
       &lt;definitions>
           &lt;person xml:id="park">
               &lt;IRI>tag:parkj@textalign.net,2015:self&lt;/IRI>
               &lt;name xml:lang="eng">Jenny Park&lt;/name>
           &lt;/person>
           &lt;role xml:id="creator">
               &lt;IRI>http://schema.org/creator&lt;/IRI>
               &lt;name xml:lang="eng">creator&lt;/name>
           &lt;/role>
       &lt;/definitions>
       &lt;resp who="park" roles="creator"/>
       &lt;change when="2014-08-14" who="park">Started file&lt;/change>
    &lt;/head></programlisting></para>
            <para>Much of the code above will look similar to the previous two examples. Every
               alignment file has only one kind of source, namely TAN transcription files, nothing
               else. Therefore <code><link linkend="element-source">&lt;source></link></code>'s
                     <code><link linkend="element-IRI">&lt;IRI></link></code> always takes the
                     <code><link linkend="attribute-id">@id</link></code> value of the corresponding
               TAN transcription file. <code><link linkend="element-name">&lt;name></link></code> is
               arbitrary. It may replicate exactly the title found in the transcription file, or it
               may be modified, perhaps to harmonize better with the descriptions of the other texts
               aligned in the file. <code><link linkend="element-source">&lt;source></link></code>
               also has a child element not seen in the earlier two examples, <code><link
                     linkend="element-location">&lt;location></link></code>, which specifies where
               the digital file was accessed and when (through <code><link
                     linkend="attribute-when-accessed">@when-accessed</link></code>). We may include
               as many of these <code><link linkend="element-location">&lt;location></link></code>
               elements as we wish, with the most preferred or reliable location at the top, since
               the validation process will use the first document that is available. The <code><link
                     linkend="attribute-when-accessed">@when-accessed</link></code> value is
               important, because the validator will look for changes in the file, and if there have
               been changes since we last accessed the file, it will return a warning with a summary
               of the number and kind of changes. If such a report is returned, it is up to us to
               determine if the alterations merit any action on our part.</para>
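            <para>A sketch of a source with two locations follows. The local copy is listed first
               because it is the one we would prefer the validator to try. The second location
               reuses the master location given earlier for the 1881 transcription; its
                  <code>@when-accessed</code> date is assumed for
               illustration:<programlisting>    &lt;source xml:id="eng-uk">
       &lt;IRI>tag:parkj@textalign.net,2015:ring01&lt;/IRI>
       &lt;name>Transcription of ring around the roses in English (UK)&lt;/name>
       &lt;location when-accessed="2015-03-10">../TAN-T/ring-o-roses.eng.1881.xml&lt;/location>
       &lt;!-- fallback location; the access date here is an assumption -->
       &lt;location when-accessed="2015-03-10">http://textalign.net/release/TAN-2018/examples/ring-o-roses.eng.1881.xml&lt;/location>
    &lt;/source></programlisting></para>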
            <para>Our TAN-A-div file could have any number of <code><link linkend="element-source"
                     >&lt;source></link></code>s, and not necessarily for the same work. It also
               does not matter in which order we put the <code><link linkend="element-source"
                     >&lt;source></link></code>s. <code><link linkend="element-definitions"
                     >&lt;definitions></link></code> holds only a person and a role, because we
               have, in this case, no working assumptions to declare. In more advanced uses, this
               element would contain more.</para>
            <para>This <code><link linkend="element-head">&lt;head></link></code> explains why the
                     <code><link linkend="element-body">&lt;body></link></code> of our TAN-A-div
               file is allowed to be empty. We have already specified which sources are to be
               aligned and where they are to be found. All TAN-A-div files assume, by default, that
               every source that is a version of the same work should be aligned upon the basis of
               the <code><link linkend="attribute-n">@n</link></code> value of <code><link
                     linkend="element-div">&lt;div></link></code>s. That is, any user or processor
               of a TAN-A-div file may assume that all implicit alignments should be made unless
               otherwise specified. </para>
            <para>For transcriptions that are already similarly structured and labeled, a TAN-A-div
               file is unnecessary for alignment. But we will see that the options available in a
               TAN-A-div's <code><link linkend="element-definitions">&lt;definitions></link></code>
               and <code><link linkend="element-body">&lt;body></link></code> will allow us not only
               to deal with inconsistencies in source transcriptions but to make important claims,
               e.g., where one work quotes from another.</para>
            <para>Meanwhile we turn to our fourth file, TAN-A-tok, whose <code><link
                     linkend="element-head">&lt;head></link></code> looks like
               this:<programlisting>    &lt;head>
        &lt;name>token-based alignment of two versions of Ring o Roses&lt;/name>
        &lt;master-location>ringoroses.01+02.token.1.xml&lt;/master-location>
        &lt;license which="by-nc-nd_4.0" rights-holder="park"/>
        &lt;source xml:id="ring1881">
            &lt;IRI>tag:parkj@textalign.net,2015:ring01&lt;/IRI>
            &lt;name>Ring o roses 1881&lt;/name>
            &lt;location when-accessed="2015-01-17">../TAN-T/ring-o-roses.eng.1881.xml&lt;/location>
        &lt;/source>
        &lt;source xml:id="ring1987">
            &lt;IRI>tag:parkj@textalign.net,2015:ring02&lt;/IRI>
            &lt;name>Ring o roses 1987&lt;/name>
            &lt;location when-accessed="2015-01-17">../TAN-T/ring-o-roses.eng.1987.xml&lt;/location>
        &lt;/source>
        &lt;definitions>
            &lt;bitext-relation xml:id="B-descends-from-A">
                &lt;IRI>tag:textalign.net,2015:bitext-relation:a/x+/b&lt;/IRI>
                &lt;name>B descends directly from A, unknown number of intermediaries&lt;/name>
                &lt;desc>The 1987 version is hypothesized to descend somehow from the
                    1881 version, mainly for the sake of illustration.&lt;/desc>
            &lt;/bitext-relation>
            &lt;reuse-type xml:id="adaptationGeneral">
                &lt;IRI>tag:textalign.net,2015:reuse-type:adaptation:general&lt;/IRI>
                &lt;name>general adaptation&lt;/name>
            &lt;/reuse-type>
            &lt;token-definition src="ring1881 ring1987" which="letters"/>
            &lt;person xml:id="park" roles="creator">
                &lt;IRI>tag:parkj@textalign.net,2015:self&lt;/IRI>
                &lt;name xml:lang="eng">Jenny Park&lt;/name>
            &lt;/person>
            &lt;role xml:id="creator" which="creator"/>
        &lt;/definitions>
        &lt;change when="2015-01-20" who="park">Started file&lt;/change>
    &lt;/head></programlisting></para>
            <para>The TAN-A-tok <code><link linkend="element-head">&lt;head></link></code> looks
               similar to the previous examples, except that <code><link
                     linkend="element-definitions">&lt;definitions></link></code> has some new
               content.</para>
            <para><code><link linkend="element-bitext-relation">&lt;bitext-relation></link></code>
               states through an IRI + name pattern the stemmatic relationship we think holds
               between the two sources. (Stemmatics is the study of the chain of transmission—the
               relationship of an original text-bearing object to the ones that survive. It
               frequently involves the creation of genealogy-like trees to illustrate the work's
               version history.) We have used the entire IRI + name pattern, but we could have
               substituted it with <link linkend="attribute-which"><code>@which</code></link> and
               the value <code>a/x+/b</code>.</para>
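            <para>Under that shorthand, the declaration might shrink to a single line. This is only
               a sketch, and it assumes the keyword is accepted exactly as written above; the
                  <code>@xml:id</code> is the one already used in our
               example:<programlisting>    &lt;!-- shorthand form; assumes the keyword a/x+/b is spelled exactly so -->
    &lt;bitext-relation xml:id="B-descends-from-A" which="a/x+/b"/></programlisting></para>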
            <para>One or more <code><link linkend="element-reuse-type"
               >&lt;reuse-type></link></code>s specify how one text has reused another. The IRI we
               have used shows that we believe that the later text has generally adapted the earlier
               one. If this were a translation or a quotation or some other kind of text reuse, we
               might have used a different IRI.</para>
            <para>A third declaration, <code><link linkend="element-token-definition"
                     >&lt;token-definition></link></code>, specifies how we have defined our word
               tokens. <code><link linkend="attribute-src">@src</link></code> has more than one
               value, specifying that the same tokenization rule should be applied to both sources.
               This element is optional. If we leave it out, users are to assume that we mean
                  <code>letters</code>. This is because most often, whenever in ordinary
               conversation we refer to the nth word in a sentence we assume people will skip
               punctuation marks when they count.</para>
            <para>The value for <link linkend="attribute-which"><code>@which</code></link>,
                  <code>letters</code>, is a reserved TAN keyword that defines a token as any
               consecutive string of word characters, ignoring spaces and punctuation. Under this
               token definition the phrase <code>"Hush!" said he</code> would have three tokens. Had
               we set the value of <link linkend="attribute-which"><code>@which</code></link> to the
               reserved TAN keyword <code>letters and punctuation</code>, we would have six tokens,
               since each punctuation mark would be defined as a token.</para>
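            <para>The two counts can be laid out side by side, as in the sketch below. The two
               declarations are alternatives for illustration, not meant to appear together in one
               file, and the token boundaries in the comments are our own reading of the two
               keywords as described
               above:<programlisting>    &lt;!-- "Hush!" said he, under which="letters": 3 tokens -->
    &lt;!--    Hush | said | he -->
    &lt;token-definition src="ring1881 ring1987" which="letters"/>

    &lt;!-- the same phrase under which="letters and punctuation": 6 tokens -->
    &lt;!--    " | Hush | ! | " | said | he -->
    &lt;token-definition src="ring1881 ring1987" which="letters and punctuation"/></programlisting></para>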
         </section>
         <section>
            <title>Aligning across Projects</title>
            <para>We now have a small corpus of TAN files. Let us imagine what it might be like to
               connect our TAN corpus to another. Let us assume that we have found elsewhere, in a
               German project, a TAN transcription of a work that looks quite similar to our
               own:<programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:hans@beispiel.com,2014:ringel">
   &lt;head>
      &lt;name>TAN Transkription, Ringelreihen mit Niederfallen&lt;/name>
      &lt;master-location>http://beispiel.com/TAN-T/ringel.xml&lt;/master-location>
      &lt;license>
         &lt;IRI>http://creativecommons.org/licenses/by/4.0/&lt;/IRI>
         &lt;name>Creative Commons Namensnennung 4.0 International Lizenz&lt;/name>
         &lt;desc>Dieses Werk ist lizenziert unter einer Creative Commons Namensnennung 4.0
            International Lizenz.&lt;/desc>
      &lt;/license>
      &lt;licensor who="schmidt"/>
      &lt;source>
         &lt;IRI>http://www.worldcat.org/oclc/4574384&lt;/IRI>
         &lt;name>Franz Magnus Böhme, Deutsches Kinderlied und Kinderspiel: Volksüberlieferungen aus
            allen Landen deutscher Zunge, gesammelt, geordnet und mit Angabe der Quellen. Leipzig,
            1897.&lt;/name>
      &lt;/source>
      &lt;definitions>
         &lt;work>
            &lt;IRI>tag:beispiel.com,2014:texte:holderbusch&lt;/IRI>
            &lt;name>"Die Kinder auf dem Holderbusch"&lt;/name>
         &lt;/work>
         &lt;version>
            &lt;IRI>urn:uuid:31648039-3dbb-49b9-b66e-9bd2cd11630e&lt;/IRI>
            &lt;name>zweite Version&lt;/name>
         &lt;/version>
         &lt;div-type xml:id="Zeile">
            &lt;IRI>http://dbpedia.org/resource/Gedichtzeile&lt;/IRI>
            &lt;name>Gedichtzeile&lt;/name>
         &lt;/div-type>
         &lt;person xml:id="schmidt" roles="Produzent">
            &lt;IRI>tag:hans@beispiel.com,2014:selbst&lt;/IRI>
            &lt;name xml:lang="eng">Hans Schmidt&lt;/name>
         &lt;/person>
         &lt;role xml:id="Produzent">
            &lt;IRI>http://schema.org/producer&lt;/IRI>
            &lt;name xml:lang="eng">Produzent&lt;/name>
         &lt;/role>
         &lt;ambiguous-letter-numerals-are-roman>false&lt;/ambiguous-letter-numerals-are-roman>
      &lt;/definitions>
      &lt;alter>
         &lt;normalization>
            &lt;IRI>tag:kalvesmaki@gmail.com,2014:normalization:hyphens-discretionary-off&lt;/IRI>
            &lt;name>Keine Bindestriche&lt;/name>
         &lt;/normalization>
      &lt;/alter>
      &lt;resp who="schmidt" roles="Produzent"/>
      &lt;change when="2014-08-13" who="schmidt">Anfang&lt;/change>
      &lt;comment when="2014-08-13" who="schmidt">unten auf der Z. 438, recht&lt;/comment>
   &lt;/head>
   &lt;body xml:lang="deu" in-progress="false">
      &lt;div type="Zeile" n="a">Ringel, Ringel, Reihe!&lt;/div>
      &lt;div type="Zeile" n="b">Sind der Kinder dreie,&lt;/div>
      &lt;div type="Zeile" n="c">Sitzen auf dem Holderbuch,&lt;/div>
      &lt;div type="Zeile" n="e">Schreien alle: husch, husch, husch!&lt;/div>
   &lt;/body>
&lt;/TAN-T></programlisting></para>
            <para>It seems clear to us that this 19th-century German version is quite similar to our
               two English versions. We have some alignment options open to us. Two more sets of
               word-for-word alignments would be interesting, but remember, just because we find a
               text that nicely aligns with others does not mean that we <emphasis role="italic"
                  >must</emphasis> align them, or, even if we choose to make an alignment, that we
               have to align <emphasis>everything</emphasis>. In this case, we choose not to worry
               about word-for-word alignments, and we focus here only on the TAN-A-div alignment, so
               that, for example, we can later read the three versions in parallel and study their
               relationships.</para>
            <para>To that end, we first observe some differences between this transcription and our
               other two. First, the value of <code><link linkend="element-work"
                  >&lt;work></link></code> is not the one we have given our two versions. Second,
               the <code><link linkend="element-div-type">&lt;div-type></link></code> is defined as
                  <code>http://dbpedia.org/resource/Gedichtzeile</code> (Gedichtzeile = line of
               poetry). Third, the lines have been lettered instead of numbered (and they are
               stipulated to be letter numerals, not roman, through <code><link
                     linkend="element-ambiguous-letter-numerals-are-roman"
                     >&lt;ambiguous-letter-numerals-are-roman></link></code>). And last, the editor
               seems to have made a typographical error, making the last line <code>n="e"</code>
               instead of <code>n="d"</code>. These four differences typify some of the
               inconsistencies that are commonly found in digital texts.<note>
                  <para>There are a few other differences in this third transcription that do not
                     affect our alignment. <code><link linkend="element-version"
                        >&lt;version></link></code> is used to distinguish different versions of the
                     same work found on the same text-bearing object. That is, if we are
                     transcribing a bilingual edition, we can use <code><link
                           linkend="element-version">&lt;version></link></code> to specify which of
                     the two versions we are encoding. Notice that the <code><link
                           linkend="element-IRI">&lt;IRI></link></code> value is a uuid. In this
                     case the editor was not prepared to deploy a formal IRI naming scheme (perhaps
                     using a tag URN) that would be satisfactory for work-versions.</para>
               </note></para>
            <para>These are points we can easily reconcile in our TAN-A-div file, which we now
               expand to include the German version. We make the following adjustments (in
               boldface):<programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-A-div.rnc" type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-A-div.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;TAN-A-div xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring-alignment">
    &lt;head>
       &lt;name>div-based alignment of multiple versions of Ring o Roses&lt;/name>
       &lt;master-location>ringoroses.div.1.xml&lt;/master-location>
       &lt;license which="by_4.0"/>
       &lt;licensor who="park"/>
       &lt;source xml:id="eng-uk">
          &lt;IRI>tag:parkj@textalign.net,2015:ring01&lt;/IRI>
          &lt;name>Transcription of ring around the roses in English (UK)&lt;/name>
          &lt;location when-accessed="2015-03-10">../TAN-T/ring-o-roses.eng.1881.xml&lt;/location>
       &lt;/source>
       &lt;source xml:id="eng-us">
          &lt;IRI>tag:parkj@textalign.net,2015:ring02&lt;/IRI>
          &lt;name>Transcription of ring around the roses in English (US)&lt;/name>
          &lt;location when-accessed="2014-08-13">../TAN-T/ring-o-roses.eng.1987.xml&lt;/location>
       &lt;/source>
       <emphasis role="bold">&lt;source xml:id="ger">
          &lt;IRI>tag:beispiel.com,2014:ringel&lt;/IRI>
          &lt;name>Transcription of an ancestor of Ring around the roses in German&lt;/name>
          &lt;location when-accessed="2014-08-22">http://beispiel.com/TAN-T/ringel.xml&lt;/location>
          &lt;location when-accessed="2014-08-22">../TAN-T/ring-o-roses.deu.1897.xml&lt;/location>
       &lt;/source></emphasis>
       &lt;definitions>
           &lt;person xml:id="park">
               &lt;IRI>tag:parkj@textalign.net,2015:self&lt;/IRI>
               &lt;name xml:lang="eng">Jenny Park&lt;/name>
           &lt;/person>
           &lt;role xml:id="creator">
               &lt;IRI>http://schema.org/creator&lt;/IRI>
               &lt;name xml:lang="eng">creator&lt;/name>
           &lt;/role>
           <emphasis role="bold">&lt;alias id="ring" idrefs="ger eng-us"/></emphasis>
       &lt;/definitions>
<emphasis role="bold">      &lt;alter src="ger">
           &lt;rename n="5" by="-1"/>
       &lt;/alter></emphasis>
       &lt;resp who="park" roles="creator"/>
       &lt;change when="2014-08-14" who="park">Started file&lt;/change>
       <emphasis role="bold">&lt;change when="2014-08-22" who="park">Added German version.&lt;/change></emphasis>
    &lt;/head>
    &lt;body/>
&lt;/TAN-A-div></programlisting></para>
            <para>The first major change is the insertion of a third <code><link
                     linkend="element-source">&lt;source></link></code>, pointing to the new file
               and specifying its name and IRI. Note that two locations have been provided, one for
               the original location and another for the copy saved locally into our project folder.
               Validation uses the first document that is available. If we wanted to work primarily
               off our local copy, we would have put that <code><link linkend="element-location"
                     >&lt;location></link></code> first. By placing it second, we allow the
               validation engine to look for updates and changes in the master version. If that
               version is unavailable, validation will be made against the second, local copy.</para>
            <para>The second major change, to address the German version's different value of
                     <code><link linkend="element-work">&lt;work></link></code>, is the addition of
               an <code><link linkend="element-alias">&lt;alias></link></code>. If and when we make
               claims about a work in general, via <link linkend="attribute-work">@work</link>, the
               id value <code>ring</code> will mean that we're asserting the claim to be true for
               any scriptum that shares the IRI values of the <code><link linkend="element-work"
                     >&lt;work></link></code> in either the German or the US version. We do not
               need to mention <code>eng-uk</code> specifically in the <code><link
                     linkend="element-alias">&lt;alias></link></code>, because it already has a work
               IRI in common with the US version.</para>
            <para>A <code><link linkend="element-rename">&lt;rename></link></code> takes care of the
               apparent typographical error, this time anchoring the German version to the US one.
               Note that the German version uses <code>e</code>, whereas we have used
                  <code>5</code>. We could just as well have used <code>e</code>, or even the Roman
               numeral <code>v</code>. Every TAN file's numeration system is evaluated locally,
               independent of any companion files. So we need not reconcile the <code>a</code>,
                  <code>b</code>, and <code>c</code> in the <code><link linkend="attribute-n"
                     >@n</link></code> values in the German version, because these will be
               automatically treated as equivalent to <code>1</code>, <code>2</code>, and
                  <code>3</code>. The TAN format allows four numeration systems other than Arabic
               numerals: Roman numerals (uppercase or lowercase), alphabetic numerals (a, b, c, ...,
               z, aa, bb, ...), digit-alphabet combinations (e.g., 1a, 1e, 4g), and alphabet-digit
               combinations (e.g., a4, a5, b5). The last two systems are treated as numerical pairs
               (e.g., 1a as 1 and 1, 1e as 1 and 5).</para>
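            <para>As a brief illustration of locally evaluated numerals, the following sketch shows
               three ways the correction might be written. It assumes, following the paragraph
               above, that <code>5</code>, <code>e</code>, and <code>v</code> all resolve to the same
               number in our TAN-A-div file. Only the first form is taken from the listing above,
               and only one form would be used in practice:<programlisting>&lt;!-- Form used in the listing above: Arabic numeral -->
&lt;alter src="ger">
   &lt;rename n="5" by="-1"/>
&lt;/alter>

&lt;!-- Assumed equivalent: alphabetic numeral, as in the German file itself -->
&lt;alter src="ger">
   &lt;rename n="e" by="-1"/>
&lt;/alter>

&lt;!-- Assumed equivalent: Roman numeral -->
&lt;alter src="ger">
   &lt;rename n="v" by="-1"/>
&lt;/alter></programlisting>In each case the fifth item is shifted back by one, so that the German
               line labeled <code>e</code> is treated as the fourth line.</para>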
            <para>The last major insertion is a new <code><link linkend="element-change"
                     >&lt;change></link></code>, documenting when we made the alterations. The value
               of <code><link linkend="attribute-when">@when</link></code> effectively updates the
               version of our TAN-A-div file.</para>
            <para>With these changes, the new version is aligned with the other two. Our work might
               have been simpler if we had just modified the German version ourselves. But such
               changes would have affected only our local copy, not the master one. Changing only
               our local copy would not allow us to connect our work to other TAN files that may be
               depending upon the same master file.</para>
            <para>But perhaps Hans Schmidt, the producer of the German version, can be contacted. We
               do so, and we suggest that he modify the version to make it align better. In the case
               of <code><link linkend="element-div-type">&lt;div-type></link></code>, he need merely
               add another element: <code><link linkend="element-IRI"
                  >&lt;IRI></link>http://dbpedia.org/resource/Line_(poetry)&lt;/IRI></code> (or even
               better, use the built-in TAN vocabulary). Perhaps he has reasons for labeling the
               lines with letters, and perhaps he is reluctant to explicitly identify this poem with
                  <emphasis role="italic">Ring around the Rosie</emphasis>. That is within his
               rights. But the conversation might lead to our pointing out that <code>n="e"</code>
               should probably be <code>n="d"</code> and that there is an apparent discrepancy in
               the last line. (The original printed book has the poem twice on page 438, once with
               the spelling "Holderbuch," the other time with "Holderbusch.") If Schmidt chooses to correct
               his master file, he can add a new <code><link linkend="element-change"
                     >&lt;change></link></code>, and thereby tacitly notify anyone else using the
               file that corrections have been made.</para>
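            <para>The suggested change to the German file might look something like the following
               sketch. Only the two <code>&lt;IRI></code> values come from the discussion above; the
                  <code>xml:id</code> value and the <code>&lt;name></code> are hypothetical
               placeholders, because the German file's actual declaration is not reproduced
               here:<programlisting>&lt;div-type xml:id="zeile"> &lt;!-- hypothetical id -->
   &lt;IRI>http://dbpedia.org/resource/Gedichtzeile&lt;/IRI>
   &lt;!-- added IRI, suggested above so that the type can be matched with other versions -->
   &lt;IRI>http://dbpedia.org/resource/Line_(poetry)&lt;/IRI>
   &lt;name>Gedichtzeile&lt;/name> &lt;!-- hypothetical name -->
&lt;/div-type></programlisting></para>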
            <para>At this point we have a network of five TAN files, four in our corpus and one from
               outside. Although simple, the network could be the basis for some creative and
               complex research questions. Stylesheets could be used to automatically align the
               versions for reading and study, or to perform statistical analysis. Study of the rest
               of these guidelines, as well as example TAN libraries, will suggest numerous ways to
               create, manage, share, and use TAN files.</para>
         </section>
      </chapter>
   </part>
   <part xml:id="detailed_description">
      <title>Detailed Description</title>
      <partintro>
         <para>This part of the guidelines provides a detailed description of the formats of the
            Text Alignment Network. The material is organized according to the structure that
            governs the schema files, so both can be read in tandem.</para>
         <para><xref linkend="concepts_common"/> outlines, in a non-technical way, the principles
            and technical foundations of the TAN format.</para>
         <para><xref linkend="class_common"/>, <xref linkend="class_1"/>, <xref linkend="class_2"/>,
            and <xref linkend="class_3"/> comprehensively describe all the TAN formats. Each chapter
            starts with theoretical or scholarly background, to provide a contextual explanation for
            the technical points that follow. </para>
         <para><xref linkend="elements-attributes-and-patterns"/>, the first of two very long
            chapters, provides a comprehensive, detailed explanation of the rules for every element
            and attribute, as well as the patterns into which they fall. This chapter includes a
            thorough list of relevant validation rules and examples. It has been written using a
            stylesheet that traverses the official TAN schemas, functions, and examples.</para>
         <para><xref linkend="keywords-master-list"/> lists all the vocabulary items that have
            already been defined as a core part of the format. This chapter is, essentially, a
            different way of looking at the TAN-key files that are in the <code>TAN-key</code>
            folder.</para>
         <para>The chapters in this part of the guidelines should be read selectively, not
            consecutively. They have been written with the assumption that you have already read the
            previous part (<xref linkend="general_overview"/>) and that you have already started to
            create and edit a TAN collection.</para>
         <para>Because readers will come from different specialties, all acronyms, abbreviations,
            and concepts are defined and explained, albeit tersely. Concepts or technologies are
            discussed only insofar as they affect the use of TAN; suggestions for further reading
            are provided for those who want a more thorough introduction to a topic. </para>
      </partintro>
      <chapter xml:id="concepts_common">
         <title>General Underpinnings</title>
         <para>This chapter retains something of the introductory spirit of the previous one by
            providing an overview of the fundamental principles and technologies behind TAN. The
            overall goal of this chapter is to explain design principles of the format. Although
            this chapter assumes on your part no prior knowledge of any particular technology, it is
            also not meant to be a tutorial. Links to further reading will take you to more adequate
            introductory material.</para>
         <section xml:id="design_principles">
            <title>Design Principles</title>
            <para>The TAN formats have been designed around a few basic principles:</para>
            <para><emphasis role="bold">Scholarly habits</emphasis>
               <itemizedlist>
                  <listitem>
                     <para>Be patient.</para>
                  </listitem>
                  <listitem>
                     <para>Simplify.</para>
                  </listitem>
                  <listitem>
                     <para>Stay focused.</para>
                  </listitem>
                  <listitem>
                     <para>Avoid redundancy.</para>
                  </listitem>
                  <listitem>
                     <para>Don't state the obvious.</para>
                  </listitem>
                  <listitem>
                     <para>Use familiar conventions.</para>
                  </listitem>
               </itemizedlist></para>
            <para>
               <emphasis role="bold">Scholarly freedom</emphasis>
               <itemizedlist>
                  <listitem>
                     <para>Express doubt.</para>
                  </listitem>
                  <listitem>
                     <para>Offer alternatives.</para>
                  </listitem>
                  <listitem>
                     <para>Exercise independence.</para>
                  </listitem>
                  <listitem>
                     <para>Invite interdependence.</para>
                  </listitem>
               </itemizedlist></para>
            <para>
               <emphasis role="bold">Scholarly responsibility</emphasis>
               <itemizedlist>
                  <listitem>
                     <para>Declare your assumptions.</para>
                  </listitem>
                  <listitem>
                     <para>Make your work citable.</para>
                  </listitem>
                  <listitem>
                     <para>Satisfy scholars' expectations:</para>
                     <itemizedlist>
                        <listitem>
                           <para>Who did what when?</para>
                        </listitem>
                        <listitem>
                           <para>What are your sources?</para>
                        </listitem>
                        <listitem>
                           <para>How do you define your terms?</para>
                        </listitem>
                        <listitem>
                           <para>What alterations have you made to your sources?</para>
                        </listitem>
                        <listitem>
                           <para>What rights do I have to use your material?</para>
                        </listitem>
                     </itemizedlist>
                  </listitem>
               </itemizedlist></para>
            <para>
               <emphasis role="bold">General utility</emphasis>
               <itemizedlist>
                  <listitem>
                     <para>Use stable technology.</para>
                  </listitem>
                  <listitem>
                     <para>Keep design predictable, consistent.</para>
                  </listitem>
                  <listitem>
                     <para>Make the data human readable.</para>
                  </listitem>
                  <listitem>
                     <para>Make the data computer actionable.</para>
                  </listitem>
               </itemizedlist>
            </para>
         </section>
         <section>
            <title>Format Organization</title>
            <para>The Text Alignment Network is a modular suite of XML encoding formats, each one
               designed for a specific type of textual data, divided into three classes:
               transcriptions (class 1), annotations and alignments of transcriptions (class 2), and
               everything else (class 3). </para>
            <para><emphasis role="bold">Class 1</emphasis>, representations of textual objects,
               consists solely of transcription files. Each transcription file contains the text of
               a single work from a single text-bearing object (which we term
                  <emphasis>scriptum</emphasis>), whether physical or digital. There are two types
               of transcription file: a standard generic format and a TEI extension. These two types
               are differentiated by the root element, <code><link linkend="element-TAN-T"
                     >&lt;TAN-T></link></code> and <code>&lt;TEI></code> respectively. </para>
            <para><emphasis role="bold">Class 2</emphasis>, annotations of class 1 files, are used
               to encode claims about texts, and to align them. There are two types of alignment,
               one for broad, general alignments and another for granular, word-for-word alignments.
               The former, with <code><link linkend="element-TAN-A-div">&lt;TAN-A-div></link></code>
               as the root element, aligns any number (one or more) of class 1 files, and permits
               assorted claims about those files. The latter, <code><link
                     linkend="element-TAN-A-tok">&lt;TAN-A-tok></link></code>, aligns only pairs of
               class 1 files. Lexico-morphology files, <code><link linkend="element-TAN-A-lm"
                     >&lt;TAN-A-lm></link></code>, are used to encode the lexical and morphological
               (or part of speech) forms of individual words in a single class 1 file.</para>
            <para><emphasis role="bold">Class 3</emphasis> covers everything else. <code><link
                     linkend="element-TAN-mor">&lt;TAN-mor></link></code> is used to define the
               grammatical categories or features of a given language and to specify rules for
               tagging words in a dependent TAN-A-lm file. <code><link linkend="element-TAN-key"
                     >&lt;TAN-key></link></code> collects and defines terms frequently used in other
               TAN files. <code><link linkend="element-collection">&lt;collection></link></code>
               marks TAN catalog files, which provide an index of locally available TAN
               files.</para>
            <para>This modular approach supports what is sometimes called <emphasis role="italic"
                  >stand-off annotation</emphasis> (or <emphasis role="italic">stand-off
                  markup</emphasis>), in contrast to <emphasis role="italic">in-line
                  annotation</emphasis>, in which a text and its annotations are placed in a single
               file. (Most TEI and HTML files feature in-line annotation.) In stand-off annotation,
               the annotations reside in files separate from the text. This provides several
               benefits: <itemizedlist>
                  <listitem>
                     <para>An editor can work on a file with minimal distraction, focusing on a
                        limited set of closely related questions. </para>
                  </listitem>
                  <listitem>
                     <para>Editors can work off the same master files, even if they have very
                        different research interests.</para>
                  </listitem>
                  <listitem>
                     <para>Complementary or competing annotations can be made, even if those
                        annotations overlap (a major problem for in-line annotation, where according
                        to XML rules no element may interlock or overlap with another).</para>
                  </listitem>
                  <listitem>
                     <para>TAN files become, collectively, a complex dataset, supporting lines of
                        research that might not have been anticipated by any single project.</para>
                  </listitem>
                  <listitem>
                     <para>Editorial labor can be conducted without central coordination, as
                        individuals work at their own pace, independently, on separate files.</para>
                  </listitem>
                  <listitem>
                     <para>When errors are found, they can be corrected in master files. Anyone
                        depending upon that master file as a source will be notified of changes that
                        have been made and they can deal with them accordingly. (Editor 1 can post
                        typographical corrections, and if she logs the change with a time-date
                        stamp, anyone using the file, upon validating their files, will be sent
                        information or a warning about the change. Similarly, Editors 2 and 4 can
                        let Editor 1 know about their work, and Editor 1 can update the Old French
                        versions with cross-references.)</para>
                  </listitem>
                  <listitem>
                     <para>Any data file can be released, circulated, and used independent of any
                        other that points to it, or to which it points.</para>
                  </listitem>
                  <listitem>
                     <para>Connected files can be combined and transformed in any number of ways to
                        produce a wide variety of derivative documents (e.g., collated versions,
                        statistical analysis). A transformation created for one set of TAN documents
                        will work identically on other TAN documents of the same format. (If someone
                        creates a tool to synthesize a transcription and an associated TAN-A-lm
                        file, it can be applied to both Editor 2's and Editor 4's work.)</para>
                  </listitem>
                  <listitem>
                     <para>The TAN family of formats can be expanded to allow other types of
                        linguistic data, and therefore other lines of research.</para>
                  </listitem>
               </itemizedlist></para>
            <para>Stand-off annotation is not without liabilities. Files might be altered or
               altogether deleted, rendering dependent files meaningless. An editor may find that
               not having the annotated text in the same place as the annotation is an
               inconvenience. These are significant challenges, but TAN validation rules have been
               designed to mitigate these as much as possible. </para>
         </section>
         <section>
            <title>Assumptions in the Creation of TAN Data</title>
            <para>All creators and users of TAN files are expected to share a few basic
               assumptions.</para>
            <para>First, all TAN-compliant data is to be understood as largely
                  <emphasis>derivative</emphasis>. That is, data files have no originality or
               creativity independent of their sources (but see below about interpretation).
               TAN-compliant data is to be created with the intent of adhering as closely as possible to
               some model or archetype. For example, a transcription should replicate faithfully
               some earlier digital edition or text-bearing material object (e.g., stone, papyrus,
               manuscript, printed book for written text; audiovisual media for oral or performative
               texts). Morphological files and alignment files should describe as clearly and as
               reliably as possible their source transcriptions. <emphasis>In creating and
                  publishing a TAN file you claim to have offered a good-faith representation or
                  description of something; in using a TAN file, you hold the creator to that
                  expectation.</emphasis></para>
            <para>Second, all core TAN files are <emphasis>interpretive</emphasis>. That is, they
               are permeated by editorial assumptions and opinions that might not be shared by
               everyone. If there is any originality or creativity in a TAN file, it is in that
               interpretive outlook. For example, if you edit a transcription file you must decide
               how to handle unusual letterforms and other visible marks. Your decisions will be
               informed by how you view the original text and its native writing system, and how you
               interpret and use Unicode. If you write an alignment file, you must make decisions
               about what factors caused one text to be transformed into another.
               Lexicomorphological files require you to commit to one or more grammars and
               dictionaries, and you must discern how best to handle cases of vagueness and
               ambiguity. No TAN file ever stands completely outside the interpretive act.
                  <emphasis>In creating and publishing a TAN file you claim to have disclosed as
                  best you can the assumptions behind your interpretive outlook; in using a TAN
                  file, you hold the creator to that expectation.</emphasis></para>
            <para>Third, all core TAN files are <emphasis>useful</emphasis>. That is, the
               interpretive impulse is assumed to be coupled with an equally strong desire to make
               the data as useful to as many users as possible, even those who may not share your
               assumptions or interpretation. A creator of a transcription file, for example, should
               normalize and segment texts with a minimum of idiosyncrasies, adopting the most
               widely used reference systems, so as to optimize the alignment process. Morphological
               files should depend whenever possible upon commonly accepted grammars and lexica.
               Alignment files should work with comprehensible categories of text reuse. No TAN file
               will always be useful to everyone, but it should be as useful to as many as possible,
               as frequently as possible. <emphasis>In creating a TAN file you claim to use common,
                  shared conventions whenever possible, and to note any departures; in using a TAN
                  file, you hold the creator to that expectation.</emphasis></para>
         </section>
         <section>
            <title>Core Technology</title>
            <para>TAN depends upon a set of relatively stable technologies. Those technologies and
               the underlying terminology are very briefly defined and explained below, with
               particular attention to interpretive decisions that have been adopted by TAN
               validation rules. References to further reading will lead you to better and more
               thorough introductions. </para>
            <section xml:id="unicode">
               <title>Unicode</title>
               <section>
                  <title>What is it?</title>
                  <para>Unicode is the worldwide standard for the consistent encoding,
                     representation, and exchange of digital texts. Stable but still growing,
                     Unicode is intended to represent all the world's writing systems, living and
                     historical. Maintained by a nonprofit organization, the Unicode standard allows
                     us to encode texts in any writing system and reliably share that data with other
                     people, independent of individual fonts.</para>
                  <para>With more than 128,000 characters, Unicode is almost as complex as human
                     writing itself. The entire sequence of characters is divided into blocks, each
                     one reserved, more or less, for a particular alphabet or a set of characters
                     that share something in common. Within each block, characters may be grouped
                     further. Each character is assigned a single code point.</para>
                  <para>Because computers work on the binary system, code points have been numbered
                     according to the related hexadecimal system (base 16), which uses the digits 0
                     through 9 and the letters A through F. (The number 10 in decimal is A in
                     hexadecimal; decimal 11 = hex B; decimal 16 = hex 10; decimal 79 = hex 4F.) It
                     is helpful to think of Unicode as a very long ribbon sixteen squares wide, a
                     glyph in each square. This is illustrated nicely <link
                        xlink:href="http://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF"
                        >in this article</link>. Each position along the width is labeled with a
                     hexadecimal number (0-9, A-F) that always identifies the last digit of a
                     character's code point value.</para>
                  <para>It is common to refer to Unicode characters by their value or their name.
                     The value customarily starts "U+" and continues with the hexadecimal value,
                     usually at least four digits. The official Unicode name is usually given fully
                     in uppercase. Examples:</para>
                  <para>
                     <table frame="all">
                        <title>Unicode characters</title>
                        <tgroup cols="3">
                           <colspec colname="c1" colnum="1" colwidth="1.0*"/>
                           <colspec colname="c2" colnum="2" colwidth="1.0*"/>
                           <colspec colname="c3" colnum="3" colwidth="1.0*"/>
                           <thead>
                              <row>
                                 <entry>Character</entry>
                                 <entry>Unicode value</entry>
                                 <entry>Unicode name</entry>
                              </row>
                           </thead>
                           <tbody>
                              <row>
                                 <entry>" " (space)</entry>
                                 <entry>U+0020</entry>
                                 <entry>SPACE</entry>
                              </row>
                              <row>
                                 <entry>®</entry>
                                 <entry>U+00AE</entry>
                                 <entry>REGISTERED SIGN</entry>
                              </row>
                              <row>
                                 <entry>ю</entry>
                                 <entry>U+044E</entry>
                                 <entry>CYRILLIC SMALL LETTER YU</entry>
                              </row>
                           </tbody>
                        </tgroup>
                     </table>
                  </para>
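                  <para>To connect the hexadecimal values in the table with XML practice, here is a
                     small sketch (the wrapping <code>&lt;para></code> elements are arbitrary and
                     chosen only for illustration). Each line refers to the same character twice,
                     first by a hexadecimal and then by a decimal numeric character
                     reference:<programlisting>&lt;!-- U+0020 SPACE: hex 20 = (2 × 16) + 0 = decimal 32 -->
&lt;para>&amp;#x20; = &amp;#32;&lt;/para>
&lt;!-- U+00AE REGISTERED SIGN: hex AE = (10 × 16) + 14 = decimal 174 -->
&lt;para>&amp;#xAE; = &amp;#174;&lt;/para>
&lt;!-- U+044E CYRILLIC SMALL LETTER YU: hex 44E = (4 × 256) + (4 × 16) + 14 = decimal 1102 -->
&lt;para>&amp;#x44E; = &amp;#1102;&lt;/para></programlisting></para>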
               </section>
               <section xml:id="normalization">
                  <title>Normalization</title>
                  <para>TAN validation rules require all data to be normalized according to the
                     Unicode NFC algorithm. Any text in a TAN file that is not NFC normalized will
                     be marked as invalid. </para>
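                  <para>One way to check a file before validation is sketched below. It is not the
                     TAN validation routine itself, merely an illustrative XSLT 2.0 stylesheet that
                     relies on the standard XPath function <code>normalize-unicode()</code> to
                     report any text node that would change under NFC normalization. Attribute
                     values are not checked in this
                     sketch:<programlisting>&lt;xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
   &lt;xsl:output method="text"/>
   &lt;!-- Report each text node that differs from its NFC-normalized form -->
   &lt;xsl:template match="text()[. ne normalize-unicode(., 'NFC')]">
      &lt;xsl:text>Not NFC normalized: &lt;/xsl:text>
      &lt;xsl:value-of select="."/>
      &lt;xsl:text>&amp;#xa;&lt;/xsl:text>
   &lt;/xsl:template>
   &lt;!-- Suppress text that is already normalized -->
   &lt;xsl:template match="text()"/>
&lt;/xsl:stylesheet></programlisting>An empty result would suggest that the text nodes of the
                     file are already in NFC.</para>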
               </section>
               <section xml:id="unicode-characters-with-special-interpretation">
                  <title>Unicode characters with special interpretation</title>
                  <para>When either U+200D ZERO WIDTH JOINER or U+00AD SOFT HYPHEN occurs at
                     the end of a leaf <link linkend="element-div"><code>&lt;div></code></link>,
                     perhaps followed by white space that will be ignored (see below), processors
                     will assume that the character is to be deleted and that, when the text is
                     combined with that of the next leaf div, no intervening space is to be
                     inserted. Furthermore, because these characters are difficult to distinguish
                     from spaces and hyphens, any output based on the character mapping of the
                     core functions should replace them with their XML character references,
                        <code>&amp;#x200d;</code> and <code>&amp;#xad;</code>.</para>
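                  <para>As an illustration, consider a word broken across two lines of a poem. The
                     fragment below is a sketch only: the attribute values <code>line</code>,
                     <code>1</code>, and <code>2</code> are hypothetical, and the soft hyphen is
                     written as a character reference so that it is visible.</para>
                  <programlisting>&lt;!-- The first leaf div ends in U+00AD SOFT HYPHEN -->
&lt;div type="line" n="1">Sing, god&amp;#xad;&lt;/div>
&lt;div type="line" n="2">dess, of the wrath&lt;/div></programlisting>
                  <para>Under the rule described above, a processor joining these two leaf divs
                     would be expected to drop the soft hyphen and insert no space, producing
                        <code>Sing, goddess, of the wrath</code>. Without the soft hyphen, a space
                     would be assumed between <code>god</code> and <code>dess</code>.</para>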
               </section>
               <section xml:id="combining_characters">
                  <title>Combining characters</title>
                  <para>At the core level of conformance, Unicode does not dictate whether combining
                     characters (accents, modifying symbols) should be counted independently or as
                     part of a base character, nor does the family of XML languages. In most
                     circumstances, this point is negligible. But it affects regular expressions and
                     XPath expressions (see below). </para>
                  <para>Two of the class 2 formats allow the counting of characters. Such counting
                     is assumed to be made exclusively of non-combining characters, defined as the
                     regular expression <code>[^\p{M}]</code>. Any numerical reference made in a TAN
                     file to an individual character will be found by counting only non-combining
                     characters. When the nth character is requested, TAN functions will return the
                     nth base character along with any combining characters that immediately follow. </para>
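                  <para>The counting rule can be sketched in XSLT 2.0. The function name
                        <code>ex:nth-char</code> and its namespace are hypothetical, invented for
                     this illustration; they are not part of the TAN function library.</para>
                  <programlisting>&lt;xsl:stylesheet version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema"
   xmlns:ex="http://example.com/ns/functions">

   &lt;!-- Hypothetical helper: returns the nth base character of $text together
        with any combining characters that immediately follow it -->
   &lt;xsl:function name="ex:nth-char" as="xs:string?">
      &lt;xsl:param name="text" as="xs:string"/>
      &lt;xsl:param name="n" as="xs:integer"/>
      &lt;xsl:variable name="clusters" as="xs:string*">
         &lt;!-- Braces are doubled because @regex is an attribute value template -->
         &lt;xsl:analyze-string select="$text" regex="[^\p{{M}}]\p{{M}}*">
            &lt;xsl:matching-substring>
               &lt;xsl:sequence select="."/>
            &lt;/xsl:matching-substring>
         &lt;/xsl:analyze-string>
      &lt;/xsl:variable>
      &lt;!-- An empty sequence is returned if $n exceeds the number of base characters -->
      &lt;xsl:sequence select="$clusters[$n]"/>
   &lt;/xsl:function>

&lt;/xsl:stylesheet></programlisting>
                  <para>Given the string <code>a</code>, <code>q</code>, U+0323 COMBINING DOT
                     BELOW, <code>b</code>, a request for the second character would return
                        <code>q</code> followed by U+0323, and a request for the third would return
                        <code>b</code>.</para>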
                  <para>TAN rules stipulate that combining characters must have a preceding base
                     character. Any <link linkend="element-div"><code>&lt;div></code></link> that
                     starts with a combining character will be marked as invalid. See also <xref
                        linkend="reg_exp_and_comb_chars"/>.</para>
               </section>
               <section>
                  <title>Deprecated Unicode points</title>
                  <para>Because TAN is not concerned with appearance, the following characters
                     will generate an error if found in a TAN file:</para>
                  <para>
                     <itemizedlist>
                        <listitem>
                           <para>U+00A0 NO-BREAK SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2000 EN QUAD</para>
                        </listitem>
                        <listitem>
                           <para>U+2001 EM QUAD</para>
                        </listitem>
                        <listitem>
                           <para>U+2002 EN SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2003 EM SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2004 THREE-PER-EM SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2005 FOUR-PER-EM SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2006 SIX-PER-EM SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2007 FIGURE SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2008 PUNCTUATION SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2009 THIN SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+200A HAIR SPACE</para>
                        </listitem>
                     </itemizedlist>
                  </para>
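                  <para>The characters listed above are U+00A0 and the contiguous range U+2000
                     through U+200A, so a test for them can be written as a single regular
                     expression. The Schematron schema below is a sketch under that observation;
                     the rule context and message are hypothetical, not TAN's own
                     implementation.</para>
                  <programlisting>&lt;sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
   &lt;sch:pattern>
      &lt;sch:rule context="*[text()]">
         &lt;!-- The character class holds U+00A0 and the range U+2000 to U+200A,
              written as character references so that they remain visible -->
         &lt;sch:report
            test="some $t in text() satisfies matches($t, '[&amp;#xA0;&amp;#x2000;-&amp;#x200A;]')"
            >Layout-oriented space characters are not allowed.&lt;/sch:report>
      &lt;/sch:rule>
   &lt;/sch:pattern>
&lt;/sch:schema></programlisting>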
               </section>
               <section>
                  <title>Further Reading</title>
                  <para>
                     <itemizedlist>
                        <listitem>
                           <para><link xlink:href="http://unicode.org">Unicode
                              Consortium</link></para>
                        </listitem>
                        <listitem>
                           <para><link xlink:href="http://en.wikipedia.org/wiki/Unicode"
                                 >Unicode</link> (Wikipedia)</para>
                        </listitem>
                     </itemizedlist>
                  </para>
               </section>
            </section>
            <section xml:id="xml">
               <title>eXtensible Markup Language (XML)</title>
               <section>
                  <title>What is it?</title>
                  <para>Defined by the W3C, the eXtensible Markup Language (XML) is a
                     machine-actionable markup language that facilitates human readability.</para>
                  <para>For a basic introduction to XML see <xref linkend="gentle_guide"/>.</para>
               </section>
               <section>
                  <title>Schemas and validation</title>
                  <para>Validation files are found here: <code><link
                           xlink:href="http://textalign.net/release/TAN-2018/schemas/"
                           >http://textalign.net/release/TAN-2018/schemas/</link></code>.</para>
                  <para>Each TAN file is validated by two types of schema files: one deals with
                     major rules concerning structure and data type (written in RELAX-NG); the
                     other, with very detailed rules (written in Schematron). </para>
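                  <para>One common way to associate a file with both kinds of schema is a pair of
                        <code>xml-model</code> processing instructions in the prolog. The sketch
                     below assumes, for illustration only, schema files named
                        <code>TAN-T.rnc</code> and <code>TAN-T.sch</code> in the directory cited
                     above; check that directory for the actual file names.</para>
                  <programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;!-- Structure and data types (RELAX-NG, compact syntax); file name is an assumption -->
&lt;?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-T.rnc"
   type="application/relax-ng-compact-syntax"?>
&lt;!-- Detailed rules (Schematron); file name is an assumption -->
&lt;?xml-model href="http://textalign.net/release/TAN-2018/schemas/TAN-T.sch"
   type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;TAN-T xmlns="tag:textalign.net,2015:ns">
   ...
&lt;/TAN-T></programlisting>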
                  <para>The RELAX-NG rules are written primarily in compact syntax
                        (<code>.rnc</code>), and then converted to the XML syntax
                     (<code>.rng</code>). For TAN-TEI, the special format One Document Does it all
                        (<code>.odd</code>) is used to alter the rules for TEI All.</para>
                  <para>The Schematron files are generally quite short. The primary work is done by
                     a large function library written in XSLT. For more on this process, see <xref
                        linkend="tan-stylesheets-and-function-library"/>.</para>
                  <para>Some validation engines that process a valid TAN-compliant TEI file may
                     return an error such as <code>conflicting ID-types for attribute "who"
                        of element "comment" from namespace "tag:textalign.net,2015:ns"</code>. Such
                     a message alerts you to the fact that by mixing TEI and TAN namespaces, you
                     open yourself up to the possibility of conflicting <code>xml:id</code> values.
                     It is your responsibility to ensure that you have not assigned duplicate
                     identifiers. Very often, you can configure an XML editor to ignore this
                     discrepancy. (In the oXygen XML Editor, go to Options > Preferences... >
                     XML > XML Parser > RELAX NG and uncheck the ID/IDREF box.)</para>
               </section>
               <section xml:id="whitespace">
                  <title>White space</title>
                  <para>By default in XML, unless otherwise specified, consecutive space characters
                     (space, tab, newline, and carriage return) are considered equivalent to a
                     single space. This gives editors the freedom they need to format XML documents
                     as they like, for either human readability or compactness. </para>
                  <para>All TAN formats assume space normalization, with one caveat: some space
                     is assumed to exist between adjacent leaf <code><link
                           linkend="element-div">&lt;div></link></code>s, even if no text node
                     intervenes. This behavior is overridden if the first leaf <code><link
                           linkend="element-div">&lt;div></link></code> ends in the soft hyphen or
                     the zero width joiner (see <xref
                        linkend="unicode-characters-with-special-interpretation"/>). </para>
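                  <para>To illustrate, the two fragments below differ only in their white space.
                     The attribute values are hypothetical. Under the assumptions just described,
                     both would be read as the same text, with some space assumed between the two
                     verses.</para>
                  <programlisting>&lt;!-- Indented for human readability -->
&lt;div type="chapter" n="1">
   &lt;div type="verse" n="1">In the beginning&lt;/div>
   &lt;div type="verse" n="2">was the Word&lt;/div>
&lt;/div>

&lt;!-- Compact, with irregular internal spacing and no text node between the leaf divs -->
&lt;div type="chapter" n="1">&lt;div type="verse" n="1">In  the
   beginning&lt;/div>&lt;div type="verse" n="2">was   the Word&lt;/div>&lt;/div></programlisting>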
                  <para>The TAN format does not stipulate how space-only text nodes should be
                     interpreted. It is up to processors to analyze the relevant <code><link
                           linkend="element-div-type">&lt;div-type></link></code> to infer an
                     appropriate type of white-space separator. </para>
                  <para>If retention of multiple spaces is important for your research, then TAN
                     formats may not be appropriate, since TAN is not intended to replicate the
                     appearance of a <emphasis>scriptum</emphasis>. Pure TEI (and not TAN-TEI) might
                     be a practical alternative, since it allows for a literal use of space, and
                     encourages XML files that try to replicate the appearance of a
                        <emphasis>scriptum</emphasis>.</para>
                  <para>For more on white space see <link
                        xlink:href="https://www.w3.org/TR/REC-xml/#sec-white-space">the W3C
                        recommendation</link>.</para>
               </section>
               <section>
                  <title>Non-mixed content</title>
                  <para>Many familiar text formats such as TEI, HTML, and Docbook allow what is
                     called mixed content—a mixture of elements and nonspace text as siblings. The
                     TAN formats, aside from TAN-TEI, are committed to a non-mixed content model.
                     Nonspace text nodes and elements are never siblings. The practical effect of
                     this decision is that indentation may be applied to a TAN file as one wishes,
                     and space text nodes may be inserted between any two adjacent elements, without
                     affecting the meaning. </para>
                  <para>To specify in a class 1 file that two adjacent leaf <code><link
                           linkend="element-div">&lt;div></link></code>s should have no intervening
                     space, see <xref linkend="unicode-characters-with-special-interpretation"
                     />.</para>
               </section>
            </section>
            <section xml:id="namespace">
               <title>Namespaces</title>
               <section>
                  <title>What are they?</title>
                  <para>XML allows users to develop vocabularies of elements as they wish. One person
                     may wish to use the element <code>&lt;bank></code> to refer to financial
                     institutions, another to rivers. Perhaps someone wishes to mention both rivers
                     and financial institutions in the same document. XML was designed to allow
                     users to mix vocabularies, even when those vocabularies use identical element
                     names. This means that anyone using <code>&lt;bank></code> must be allowed to
                     specify exactly which vocabulary is being used. Disambiguation is accomplished
                     by associating IRIs (see <xref linkend="IRIs_and_linked_data"/> below) with the
                     element names. The actual full name of an element is the local name plus the
                     IRI that qualifies its meaning, e.g.,
                        <code>bank{http://example1.com/terms/}</code> and
                        <code>bank{http://example2.com/terms/}</code>. </para>
                  <para>The relationship between the element name and the IRI is analogous to that
                     between a person's given name and their family name. The IRI—the family name—is
                     called the <emphasis>namespace</emphasis>—not an ideal term, but the one that
                     has been adopted. Think of the namespace as the family name for a group of
                     elements. </para>
                  <para>Namespace declarations look a lot like attributes (they are not). They take
                     the form <code>xmlns="http://example1.com/terms/"</code> (defining the default
                     namespace) or <code>xmlns:[PREFIX]="http://example2.com/terms/"</code>
                     (defining a namespace bound to a particular prefix), placed inside
                     an opening tag. For example, <code>&lt;bank
                        xmlns="http://example1.com/terms/">...&lt;/bank></code> declares both
                     the namespace for <code>&lt;bank></code> and the default namespace for all
                     its descendants (which any descendant can explicitly override). </para>
                  <para>Different types of <code>&lt;bank></code> can be mixed through
                     namespaces:</para>
                  <programlisting>&lt;bank xmlns="http://example1.com/terms/">
    &lt;bank xmlns="http://example2.com/terms/">
        ...
    &lt;/bank>
&lt;/bank>

&lt;bank xmlns="http://example1.com/terms/" xmlns:e2="http://example2.com/terms/">
    &lt;e2:bank >
        ...
    &lt;/e2:bank>
&lt;/bank>

&lt;e1:bank xmlns:e1="http://example1.com/terms/" xmlns:e2="http://example2.com/terms/">
    &lt;e2:bank >
        ...
    &lt;/e2:bank>
&lt;/e1:bank></programlisting>
               </section>
               <section>
                  <title>TAN namespace and prefix</title>
                  <para>The TAN namespace is <emphasis role="bold"
                           ><code>tag:textalign.net,2015:ns</code></emphasis>. The recommended
                     prefix is <emphasis role="bold"><emphasis>tan</emphasis></emphasis>.</para>
                  <para>The TAN-TEI format uses as its default the TEI namespace, <link
                        xlink:href="http://www.tei-c.org/ns/1.0"/>, normally given the prefix
                           <emphasis><emphasis role="bold">tei</emphasis></emphasis>.</para>
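                  <para>In practice, these declarations usually appear on the root element. The
                     fragments below are a minimal sketch that shows only the namespace
                     declarations; all other content and attributes are elided.</para>
                  <programlisting>&lt;!-- A TAN file: the TAN namespace is the default -->
&lt;TAN-T xmlns="tag:textalign.net,2015:ns">
   ...
&lt;/TAN-T>

&lt;!-- A TAN-TEI file: TEI is the default namespace, and TAN elements take the tan prefix -->
&lt;TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:tan="tag:textalign.net,2015:ns">
   ...
&lt;/TEI></programlisting>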
               </section>
            </section>
            <section xml:id="TEI">
               <title>The Text Encoding Initiative</title>
               <section>
                  <title>What is it?</title>
                  <para>The Text Encoding Initiative (TEI) is a collection of XML rules for the
                     representation of texts in digital form. Developed and maintained by a
                     consortium of scholars and scholarly organizations, TEI includes not only a
                     library of schemas but also guidelines, stylesheets, and more. The TEI Guidelines have
                     been widely used by libraries, museums, publishers, and individual scholars to
                     present texts for online research, teaching, and preservation. In addition to
                     the Guidelines themselves, the Consortium provides a variety of <link
                        xlink:href="http://www.tei-c.org/Support/Learn/">resources</link> and <link
                        xlink:href="http://members.tei-c.org/Events">training events</link> for
                     learning TEI, information on <link
                        xlink:href="http://www.tei-c.org/Activities/Projects/">projects using the
                        TEI</link>, a <link
                        xlink:href="http://www.tei-c.org/Activities/SIG/Education/tei_bibliography.xml"
                        >bibliography of TEI-related publications</link>, and <link
                        xlink:href="http://www.tei-c.org/Tools/">software</link>.<note>
                        <para>Taken from the TEI website <link
                              xlink:href="http://www.tei-c.org/index.xml"/>, accessed
                           2017-05-21.</para>
                     </note></para>
                  <para>Any TAN-T module can be easily cast into a TEI file, although much of the
                     computer-actionable semantics will be lost in the process. Likewise, a TEI file
                     can be converted to TAN-T, but there is a greater risk of loss of content,
                     particularly in the header, since the non-TEI TAN formats are restricted to a
                     small subset of TEI tags. </para>
                  <para>TAN-TEI is TAN's TEI extension, based on an ODD file that is in the same
                     directory as the rest of the schemas. TAN-TEI schemas are generated on the
                     basis of the official TEI All schema that is available at the time of release. </para>
                  <para>For more about the strictures placed upon the TEI All schema see <xref
                        linkend="tan-tei"/>. See also <xref linkend="class_common"/> and <xref
                        linkend="class_1"/>.</para>
               </section>
               <section>
                  <title>Further reading</title>
                  <para>
                     <itemizedlist>
                        <listitem>
                           <para><link xlink:href="http://www.tei-c.org/">Text Encoding
                                 Initiative</link></para>
                        </listitem>
                     </itemizedlist>
                  </para>
               </section>
            </section>
            <section xml:id="data_types">
               <title>Data types</title>
               <para>Being written purely in XML technologies, TAN adopts its data types, e.g.,
                  strings, booleans, and so forth, from the <link
                     xlink:href="https://www.w3.org/TR/xmlschema-2/">official specifications</link>
                  made by the W3C. The following data types require some special comments.</para>
               <section xml:id="language">
                  <title>Languages</title>
                  <para>TAN adopts for language identification Best Current Practice (BCP) 47, which
                     standardizes identifiers for languages and scripts. For most users of TAN, this
                     will be a simple three-letter abbreviation, sometimes supplemented with
                     hyphen-separated subtags designating a script or region. For
                     example, <code>eng</code>, <code>eng-GB</code>, and <code>eng-Cyrl-GB</code>
                     refer, respectively, to English (in general), English from the United Kingdom,
                     and English from the United Kingdom written in the Cyrillic script. As a
                     general rule, values of this type should begin with a three-letter language
                     code, preferably lowercase.</para>
                  <para>ISO codes for human languages appear in <code><link
                           linkend="attribute-xmllang">@xml:lang</link></code> and <code><link
                           linkend="element-for-lang">&lt;for-lang></link></code>. The former states
                     what language the enclosed text is in. The latter indicates that some statement
                     or claim is being made about a specific language. For example,
                           <code><link linkend="element-for-lang">&lt;for-lang></link></code> in the
                     context of a TAN-mor file indicates which languages the file was written
                     for.</para>
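                  <para>The difference between the two can be sketched as follows. The fragments
                     are illustrative only: the attribute values and the placement of each
                     element are assumptions, and the text content is elided.</para>
                  <programlisting>&lt;!-- @xml:lang: the enclosed text is written in ancient Greek -->
&lt;div type="verse" n="1" xml:lang="grc">...&lt;/div>

&lt;!-- for-lang: a statement about ancient Greek, e.g., in a TAN-mor file -->
&lt;for-lang>grc&lt;/for-lang></programlisting>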
                  <para>For more information, see one of the following:<itemizedlist>
                        <listitem>
                           <para>BCP 47 <link xlink:href="http://tools.ietf.org/rfc/bcp/bcp47"
                                 >official specifications</link></para>
                        </listitem>
                        <listitem>
                           <para>BCP 47 <link
                                 xlink:href="http://www.w3.org/TR/xmlschema11-2/#language">technical
                                 details</link></para>
                        </listitem>
                     </itemizedlist></para>
               </section>
               <section xml:id="date_and_datetime">
                  <title>Dates and times</title>

                  <para>TAN adopts the standardized ISO form of dates and date-times, as interpreted
                     by XML data types. These begin with years (the largest unit) and end with
                     days, seconds, or fractions of seconds (the smallest).</para>
                  <para>The simplest date takes this form: <code>YYYY-MM-DD</code>. If a time is
                     included, it is specified by continuing the string, first with a <code>T</code>
                     (for time), then with the form <code>hh:mm:ss.sss(Z|[-+]hh:mm)</code>. For
                     example, <code>2016-09-20T20:38:27.141-04:00</code> is an ISO date-time
                     for Tuesday, September 20, 2016 at 8:38 p.m. in the Eastern Time Zone.</para>
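                  <para>Because these forms are standard XML data types, they can be tested and
                     manipulated with ordinary XPath 2.0 functions. The stylesheet below is a
                     small sketch, not part of TAN; the variable name <code>stamp</code> and the
                     template name <code>main</code> are arbitrary.</para>
                  <programlisting>&lt;xsl:stylesheet version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema">
   &lt;xsl:output method="text"/>

   &lt;xsl:variable name="stamp" select="xs:dateTime('2016-09-20T20:38:27.141-04:00')"/>

   &lt;xsl:template name="main">
      &lt;!-- The same instant expressed in UTC: 2016-09-21T00:38:27.141Z -->
      &lt;xsl:value-of
         select="adjust-dateTime-to-timezone($stamp, xs:dayTimeDuration('PT0S'))"/>
      &lt;xsl:text>&amp;#xA;&lt;/xsl:text>
      &lt;!-- true: the string follows the date-time form described above -->
      &lt;xsl:value-of select="'2016-09-20T20:38:27.141-04:00' castable as xs:dateTime"/>
      &lt;xsl:text>&amp;#xA;&lt;/xsl:text>
      &lt;!-- false: months and days must each have two digits -->
      &lt;xsl:value-of select="'2016-9-20' castable as xs:date"/>
   &lt;/xsl:template>
&lt;/xsl:stylesheet></programlisting>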
                  <para>More reading:<itemizedlist>
                        <listitem>
                           <para><link xlink:href="https://www.w3.org/TR/xmlschema-2/#dateTime">W3C
                                 specification</link></para>
                        </listitem>
                        <listitem>
                           <para><link xlink:href="https://en.wikipedia.org/wiki/ISO_8601">Wikipedia
                                 entry on ISO 8601</link></para>
                        </listitem>
                     </itemizedlist></para>

               </section>
            </section>
            <section xml:id="IRIs_and_linked_data">
               <title>Identifiers and Their Use</title>
               <para>The acronyms for identifiers, and the meanings of those acronyms, can be
                  mystifying. Here is a synopsis:</para>
               <para>
                  <itemizedlist>
                     <listitem>
                        <para><emphasis>IRI</emphasis>: Internationalized Resource Identifier, a
                           generalization of the URI system, allowing the use of Unicode; <link
                              xlink:href="http://www.ietf.org/rfc/rfc3987.txt">defined by RFC
                              3987</link></para>
                     </listitem>
                     <listitem>
                        <para><emphasis>URI</emphasis>: Uniform Resource Identifier, a string of
                           characters used to identify a name or a resource; <link
                              xlink:href="https://tools.ietf.org/html/rfc3986">defined by RFC
                              3986</link></para>
                     </listitem>
                     <listitem>
                        <para><emphasis>URL</emphasis>: Uniform Resource Locator, a URI that
                           identifies a Web resource and the communication protocol for retrieving
                           the resource.</para>
                     </listitem>
                     <listitem>
                        <para><emphasis>URN</emphasis>: Uniform Resource Name, a term that
                           originally referred to persistent names using the <code>urn:</code>
                           scheme, but is now applied to a variety of systems that have registered
                           with the IANA. URNs are generally best thought of as a subset of
                           URIs.</para>
                     </listitem>
                     <listitem>
                        <para><emphasis>UUID</emphasis>: Universally Unique Identifier, a
                           computer-generated 128-bit number used to assign identifiers to any
                           entity. UUIDs can be built into a URN by prefixing them with
                              <code>urn:uuid:</code>.</para>
                     </listitem>
                  </itemizedlist>
               </para>
               <para>The TAN format generally prefers to refer to IRIs.</para>
               <para>See also <xref xlink:href="#tag_urn"/>.</para>
               <section xml:id="rdf_and_lod">
                  <title>Resource Description Framework (RDF) and Linked Open Data</title>
                  <section>
                     <title>What are they?</title>
                     <para>Identifiers are used in many contexts for many purposes. One such purpose
                        is called Linked Open Data (LOD) or the Semantic Web, which relies upon a
                        very simple data model called Resource Description Framework (RDF), a family
                        of World Wide Web Consortium (W3C) specifications originally designed as a
                        data model for metadata.</para>
                     <para>RDF was designed as a data model to support general assertions. The
                        model rests upon the concept of a statement, made of three parts: subject,
                        predicate, and object. Subjects and predicates take identifiers that act as
                        names of things. The object may take an identifier or just data. The idea
                        behind LOD is that as we begin to use the same URLs for the same concepts,
                        independently created datasets can be combined and compared. The entire
                        collection of RDF statements on the web allows inferences not possible at
                        the project level.</para>
                     <para>These URL identifiers look like a web page address (e.g.,
                           <code>http://...</code>), but are first and foremost names for things
                        (the term "Resource"—the R in RDF—is a clumsy way to refer to any person,
                        place, concept—anything at all). Ideally, those URLs will still name those
                        things after the domain name expires and the web resource cannot be found. </para>
                  </section>
                  <section>
                     <title>TAN and RDF</title>
                     <para>Much of TAN can be converted to RDF statements. In fact, TAN may be one
                        of the most human-friendly ways to read and write RDF. Compare, for example,
                        this snippet (taken from <link
                           xlink:href="http://linkeddatabook.com/editions/1.0/"/>), written in
                        Turtle syntax,
                        ...<programlisting>@prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: &lt;http://xmlns.com/foaf/0.1/> .

&lt;http://biglynx.co.uk/people/dave-smith>
    rdf:type foaf:Person ;
    foaf:name "Dave Smith" .</programlisting></para>
                     <para>...with the TAN
                        equivalent:<programlisting>&lt;person xml:id="dsmith">
   &lt;IRI>http://biglynx.co.uk/people/dave-smith&lt;/IRI>
   &lt;name>Dave Smith&lt;/name>
&lt;/person></programlisting></para>
                     <para>In this case TAN and RDF can be converted into each other without loss. But in many other
                        cases, TAN statements cannot be reduced to the RDF model. This happens most
                        often in the context of <code><link linkend="element-claim"
                              >&lt;claim></link></code>, which is designed to allow scholarly
                        assertions and claims that are difficult or impossible to express in RDF.
                        For example, RDF does not allow one to say "Person X is not the author of
                        text Y." </para>
                     <para>TAN claims have adapted the core concepts behind RDF to cater to
                        scholarly needs. For more details see <xref linkend="tan-a-div"/>.</para>
                  </section>
                  <section>
                     <title>Further reading</title>
                     <para>
                        <itemizedlist>
                           <listitem>
                              <para><link xlink:href="https://www.w3.org/RDF/">W3C
                                    recommendation</link></para>
                           </listitem>
                           <listitem>
                              <para><link xlink:href="http://linkeddata.org/">Linked
                                 Data</link></para>
                           </listitem>
                           <listitem>
                              <para><link xlink:href="http://lov.okfn.org/dataset/lov/">Linked Open
                                    Vocabularies</link></para>
                           </listitem>
                        </itemizedlist>
                     </para>
                  </section>
               </section>
               <section xml:id="tag_urn">
                  <title>Tag URNs</title>


                  <para>TAN files make extensive use of tag URNs (see <xref
                        xlink:href="#IRIs_and_linked_data"/>). In fact, TAN's namespace is a tag URN
                        (<xref linkend="namespace"/>). A <link xlink:href="http://www.taguri.org"
                        >tag URN</link> has two parts:</para>
                  <para>
                     <orderedlist>
                        <listitem>
                           <para><emphasis role="bold">Namespace.</emphasis>
                              <code>tag:</code> + an e-mail address or domain name owned by the
                              person or organization that has authorized the creation of the TAN
                              file + <code>,</code> + an arbitrary day on which that address or
                              domain name was owned. The day is expressed in the form
                                 <code>YYYY-MM-DD</code>, <code>YYYY-MM</code>, or
                              <code>YYYY</code>. A missing <code>MM</code> or <code>DD</code> is
                              implicitly assigned the value of <code>01</code>.</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Name of the TAN file.</emphasis>
                              <code>:</code> + an arbitrary string (unique to the namespace chosen)
                              chosen by the namespace owner as a label for the entire file and
                              related versions. It need not be the same as the filename stored on a
                              local directory. You should pick a name that is at least somewhat
                              intelligible to human readers.</para>
                        </listitem>
                     </orderedlist>
                  </para>
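                  <para>The two-part structure just described can be approximated by a regular
                     expression. The Schematron schema below is a rough sketch, not the full
                     grammar of the tag scheme and not TAN's own validation rule; it assumes,
                     hypothetically, that the identifier is stored in an <code>id</code> attribute
                     on the root element.</para>
                  <programlisting>&lt;sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
   &lt;sch:pattern>
      &lt;!-- Hypothetical context: a root element that carries the identifier in @id -->
      &lt;sch:rule context="/*[@id]">
         &lt;!-- tag: + optional mailbox@ + domain + , + YYYY[-MM[-DD]] + : + name -->
         &lt;sch:assert
            test="matches(@id, '^tag:([^@,:\s]+@)?[^@,:\s]+,\d{4}(-\d{2}(-\d{2})?)?:\S+$')"
            >The identifier should take the form of a tag URN.&lt;/sch:assert>
      &lt;/sch:rule>
   &lt;/sch:pattern>
&lt;/sch:schema></programlisting>
                  <para>The pattern checks only the shape of the string. It cannot confirm that
                     the e-mail address or domain name was actually owned on the date given, which
                     remains the responsibility of whoever coins the name.</para>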
                  <para>Although you may use any tag URN coined by someone else, you may create a
                     tag URN only if you are the owner of that URN's namespace.</para>
                  <para>Great care must be taken in choosing the IRI name, because you are the sole
                     guarantor of its uniqueness. <emphasis role="italic">It is permissible for
                        something to have multiple IRIs, but never acceptable for an IRI to name
                        more than one thing.</emphasis> It is a good practice to keep a master
                     checklist of IRI names you have created. If you find yourself forgetting, or
                     think you run the risk of creating duplicate IRI names, you should start afresh
                     by creating a new namespace for your tag URNs, easily done just by changing the
                     date in the tag URN namespace.</para>
                  <para>
                     <example>
                        <title>TAN IRI names</title>
                        <programlisting>tag:jan@example.com,1999-01-31:TAN-T001
tag:example.com,2001-04:hamlet-tan-t
tag:evagriusponticus.net,2014:tan-a-lm:Evagrius_Praktikos_grc_Guillaumonts
tag:bbrb@example.org,1995-04-01:pos-grc</programlisting>
                        <para>The first example comes from someone who owned the email address
                              <code>jan@example.com</code> on January 31, 1999 (at the stroke of
                           midnight, Coordinated Universal Time). The other examples follow a
                           similar logic. The namespaces of the second and third examples are tied to
                           the owners of specific domain names, not those of email addresses. The
                              <code>2014</code> in the third example is shorthand for the first
                           second of January 1, 2014.</para>
                     </example>
                  </para>
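                  <example>
                     <title>A tag URN used as the name of a TAN file (sketch)</title>
                     <para>The following fragment is a minimal sketch of how one of the IRI names
                        above might appear as the <code>@id</code> of a TAN file. Namespace
                        declarations, other attributes, and the contents of the root element are
                        omitted, and the choice of root element is only an assumption made for
                        illustration.</para>
                     <programlisting><![CDATA[<!-- namespace: tag:example.com,2001-04 (domain name + date of ownership) -->
<!-- name of the TAN file: hamlet-tan-t (chosen by the namespace owner) -->
<TAN-T id="tag:example.com,2001-04:hamlet-tan-t">
   <!-- head and body omitted -->
</TAN-T>]]></programlisting>
                  </example>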
                  <para>The TAN encoding format has chosen tag URNs over URLs for several
                     reasons:</para>
                  <para>
                     <itemizedlist>
                        <listitem>
                           <para><emphasis role="bold">Permanence.</emphasis> Authors of TAN data
                              are creating files that are meant to be relevant for decades and
                              centuries from now, well after specific domain names have changed
                              ownership or fallen into obsolescence, and well after the creators are
                              dead. URLs are not built for such permanence. </para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Responsibility.</emphasis> The TAN format
                              requires every piece of data to be attributable to someone (a person,
                              organization, or some other agent). A tag URN implies who was
                              responsible for creating the URN. </para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Accessibility.</emphasis> Tag URNs can be
                              made by anyone who has an email address. No one has to register with
                              any central authority. You can begin naming anything you want, any
                              time you want, without seeking anyone's approval.</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Ease</emphasis>. Tag URNs are easier to use
                              than, say, http-form URLs, as recommended by RDF (see <xref
                                 xlink:href="#rdf_and_lod"/>). Many potential TAN authors never have
                              owned a domain name, and never will. Further, many of those who do own
                              domain names cannot or do not wish to configure and maintain servers
                              to administer the referral mechanisms upon which the semantic web
                              depends.</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Disambiguation of name and
                                 location</emphasis>. In the semantic web, conflation of name with a
                              location to resolve it is considered a virtue because the single
                              string does double duty, both naming the resource and pointing to a
                              location where more can be learned. But this conflation is unhelpful
                              in the TAN context. TAN files are meant to be distributed widely, and
                              not rely upon a single location. And URLs are in common parlance
                              interpreted as locations for data, not as names for things. Tag URNs
                              don't confuse users by looking like locations. This upholds a
                              principle that is common in scholarly citation, namely, that one
                              should always distinguish the name of a resource from where it might
                              be found.</para>
                        </listitem>
                     </itemizedlist>
                  </para>
                  <para>Further reading:<itemizedlist>
                        <listitem>
                           <para><link xlink:href="https://tools.ietf.org/html/rfc4151">RFC
                                 4151</link>, the official definition of tag URNs</para>
                        </listitem>
                     </itemizedlist></para>

               </section>
            </section>
            <section xml:id="regular_expressions">
               <title>Regular Expressions</title>
               <para>Regular expressions are patterns for searching text. The term <emphasis
                     role="italic">regular</emphasis> here does not mean ordinary. Rather, it
                  derives from Latin <emphasis role="italic">regula</emphasis>, and points to a
                  rule-based syntax that provides patterns for finding and replacing text. Regular
                  expressions come in different flavors, and have several layers of complexity. TAN
                  regular expressions adhere closely to the <link
                     xlink:href="http://www.w3.org/TR/xslt-30/#regular-expressions">recommendation
                     of XSLT 3.0</link> (XML Schema Datatypes plus some extensions), as outlined in
                     <link xlink:href="http://www.w3.org/TR/xpath-functions-30/#regex-syntax">XPath
                     Functions 3.0</link>. <caution>
                     <para>XML Schema Datatypes define regular expressions differently than does Perl,
                        one of the most common flavors of regular expression. For example, the pipe
                        symbol, |, is treated as a word character in XML regular expressions
                           (<code>\w</code>), but the opposite is true for Perl. For convenience,
                        here is how codepoints U+0020..U+00FF are categorized according to XML
                        (and therefore TAN):</para>
                     <para><emphasis role="bold">Word characters </emphasis>(<code>\w</code>):
                           <code>$ + 0 1 2 3 4 5 6 7 8 9 &lt; = > A B C D E F G H I J K L M N O P Q
                           R S T U V W X Y Z ^ ` a b c d e f g h i j k l m n o p q r s t u v w x y z
                           | ~ ¢ £ ¤ ¥ ¦ ¨ © ª ¬ ® ¯ ° ± ² ³ ´ µ ¸ ¹ º ¼ ½ ¾ À Á Â Ã Ä Å Æ Ç È É Ê Ë
                           Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð
                           ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ</code>
                     </para>
                     <para><emphasis role="bold">Non-word characters </emphasis>(<code>\W</code>):
                           <code>! " # % &amp; ' ( ) * , - . / : ; ? @ [ \ ] _ { } ¡ § « ­ ¶ · »
                           ¿</code></para>
                     <para>Some of these choices may seem counterintuitive or wrong. But at this
                        point it does not matter. The distinction is a legacy that will remain in
                        place. It is advisable to familiarize yourself with decisions that, in some
                        respect, are arbitrary.</para>
                  </caution></para>
               <para>A regular expression search pattern is treated just like a conventional search
                  pattern until the computer reaches one of the special characters: <code>. [ ] \ | - ^
                     $ ? * + { } ( )</code>. Here is a brief key to how characters behave in regular
                  expressions, provided they are not in square brackets (on which see the
                  recommended reading below):</para>
               <para>
                  <table frame="all">
                     <title>Special characters in regular expressions</title>
                     <tgroup cols="2">
                        <colspec colname="c1" colnum="1" colwidth="1*"/>
                        <colspec colname="c2" colnum="2" colwidth="12.33*"/>
                        <thead>
                           <row>
                              <entry>Symbol</entry>
                              <entry>Meaning</entry>
                           </row>
                        </thead>
                        <tbody>
                           <row>
                              <entry>$</entry>
                              <entry>end of line</entry>
                           </row>
                           <row>
                              <entry>.</entry>
                              <entry>any character</entry>
                           </row>
                           <row>
                              <entry>|</entry>
                              <entry>or (union)</entry>
                           </row>
                           <row>
                              <entry>^</entry>
                              <entry>start of line</entry>
                           </row>
                           <row>
                              <entry>?</entry>
                              <entry>zero or one</entry>
                           </row>
                           <row>
                              <entry>*</entry>
                              <entry>zero or more</entry>
                           </row>
                           <row>
                              <entry>+</entry>
                              <entry>one or more</entry>
                           </row>
                           <row>
                              <entry>[ ]</entry>
                              <entry>a class of characters</entry>
                           </row>
                           <row>
                              <entry>( )</entry>
                              <entry>a group</entry>
                           </row>
                           <row>
                              <entry>\w</entry>
                              <entry>any word character</entry>
                           </row>
                           <row>
                              <entry>\W</entry>
                              <entry>any nonword character</entry>
                           </row>
                           <row>
                              <entry>\s</entry>
                              <entry>any of the four standard spacing characters: space (U+0020),
                                 tab (U+0009), newline (U+000A), carriage return (U+000D)</entry>
                           </row>
                           <row>
                              <entry>\S</entry>
                              <entry>anything not a spacing character</entry>
                           </row>
                           <row>
                              <entry>\d</entry>
                              <entry>any digit (0-9)</entry>
                           </row>
                           <row>
                              <entry>\D</entry>
                              <entry>anything not a digit</entry>
                           </row>
                           <row>
                              <entry>\p{IsGujarati}</entry>
                              <entry>any character from the Unicode block named Gujarati</entry>
                           </row>
                           <row>
                              <entry>\\</entry>
                              <entry>backslash (the backslash alone suggests that the next character
                                 is a special character)</entry>
                           </row>
                           <row>
                              <entry>\$</entry>
                              <entry>dollar sign</entry>
                           </row>
                           <row>
                              <entry>\(</entry>
                              <entry>opening parenthesis</entry>
                           </row>
                           <row>
                              <entry>\[</entry>
                              <entry>opening square bracket</entry>
                           </row>
                        </tbody>
                     </tgroup>
                  </table>
               </para>
               <para>Some examples:</para>
               <table frame="all">
                  <title>Examples of Regular Expressions</title>
                  <tgroup cols="3">
                     <colspec colname="newCol1" colnum="1" colwidth="1*"/>
                     <colspec colname="c1" colnum="2" colwidth="1.48*"/>
                     <colspec colname="c2" colnum="3" colwidth="6.59*"/>
                     <thead>
                        <row>
                           <entry>Expression</entry>
                           <entry>Meaning</entry>
                           <entry>What the expression matches when applied to "Wi-fi, good. A_hem*
                              isn't!"</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry><code>^.+$</code></entry>
                           <entry>one whole line of characters</entry>
                           <entry>"Wi-fi, good. A_hem* isn't!"</entry>
                        </row>
                        <row>
                           <entry><code>[ae]</code></entry>
                           <entry>a or e</entry>
                           <entry>"e"</entry>
                        </row>
                        <row>
                           <entry><code>[a-e]</code></entry>
                           <entry>a, b, c, d, or e</entry>
                           <entry>"d", "e"</entry>
                        </row>
                        <row>
                           <entry><code>[^ae]+</code></entry>
                           <entry>one or more characters that are anything except a or e</entry>
                           <entry>"Wi-fi, good. A_h", "m* isn't!"</entry>
                        </row>
                        <row>
                           <entry><code>.i</code></entry>
                           <entry>any character followed by i.</entry>
                           <entry>"Wi", "fi", " i"</entry>
                        </row>
                        <row>
                           <entry><code>(.i)</code></entry>
                           <entry>when a character followed by an i is found treat it as a capture
                              group (used only in a search pattern)</entry>
                           <entry>"Wi", "fi", " i"</entry>
                        </row>
                        <row>
                           <entry><code>$1</code></entry>
                           <entry>first capture group (used only in a replacement pattern, and
                              corresponds to the sequence of capture groups in the search
                              pattern)</entry>
                           <entry>In the example above, each match corresponds to $1</entry>
                        </row>
                        <row>
                           <entry><code>[aeiou]\w*</code></entry>
                           <entry>any lowercase vowel along with every word character that
                              follows</entry>
                           <entry>"i", "i", "ood", "em", "isn"</entry>
                        </row>
                        <row>
                           <entry><code>[t*].</code></entry>
                           <entry>any t or * and the following character</entry>
                           <entry>"* ", "t!" Note that the asterisk, if inside a character class,
                              acts as itself.</entry>
                        </row>
                        <row>
                           <entry><code>\s+</code></entry>
                           <entry>match one or more space characters</entry>
                           <entry>" ", " ", " "</entry>
                        </row>
                        <row>
                           <entry><code>\w+</code></entry>
                           <entry>match one or more word characters</entry>
                           <entry>"Wi", "fi", "good", "A", "hem", "isn", "t" Note that the
                              underscore is a nonword character in XML regular expressions.</entry>
                        </row>
                        <row>
                           <entry><code>\W+</code></entry>
                           <entry>match one or more nonword characters</entry>
                           <entry>"-", ", ", ". ", "_", "* ", "'", "!"</entry>
                        </row>
                        <row>
                           <entry><code>[^q]+</code></entry>
                           <entry>one or more characters that are not a q</entry>
                           <entry>"Wi-fi, good. A_hem* isn't!"</entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table>
               <para>The examples above provide a taste of how regular expressions are constructed
                  and read.</para>
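               <example>
                  <title>Trying a search and replacement pattern in XSLT (sketch)</title>
                  <para>As a rough illustration of how a search pattern and a replacement pattern
                     work together, the stylesheet below applies the standard XPath functions
                        <code>replace()</code> and <code>matches()</code> to the sample string from
                     the table above. It is only a sketch: the template name <code>main</code> and
                     the variable name <code>sample</code> are arbitrary, it assumes an XSLT 2.0 or
                     later processor that can start at a named template, and it does not use any
                     of the TAN functions.</para>
                  <programlisting><![CDATA[<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
   <xsl:output method="text"/>
   <!-- the sample string; the apostrophe is doubled to escape it in XPath -->
   <xsl:variable name="sample" select="'Wi-fi, good. A_hem* isn''t!'"/>
   <xsl:template name="main">
      <!-- search pattern (.i) captures each match as group 1;
           replacement pattern [$1] wraps that group in square brackets.
           Result: [Wi]-[fi], good. A_hem*[ i]sn't! -->
      <xsl:value-of select="replace($sample, '(.i)', '[$1]')"/>
      <xsl:text>&#xA;</xsl:text>
      <!-- ^.+$ matches the whole one-line string, so this returns true -->
      <xsl:value-of select="matches($sample, '^.+$')"/>
   </xsl:template>
</xsl:stylesheet>]]></programlisting>
               </example>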
               <warning xml:id="reg_exp_and_comb_chars">
                  <title>Regular Expressions and Combining Characters</title>
                  <para>Regular expressions come in many different flavors, and each one deals with
                     some of the more complex issues in Unicode in its own manner. This ambiguity
                     will most keenly be felt in the use of combining characters. Suppose we have a
                     string of three characters, áb (i.e., an acute accent over the a,
                        <code>&amp;#x61;&amp;#x301;&amp;#x62;</code>). The regular expression
                        <code>a.</code> will match the b in some search engines but not in
                     others.</para>
                  <para>Unicode has differentiated three levels of support for regular expressions
                     (see <link xlink:href="http://www.unicode.org/reports/tr18/">official
                        report</link>). Only level one conformance in TAN is guaranteed. Combining
                     characters fall in level two. In TAN, character counts depend exclusively upon
                     base characters, not combining ones (see <xref linkend="combining_characters"
                     />).</para>
               </warning>
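               <example>
                  <title>Counting characters with and without combining marks (sketch)</title>
                  <para>The difference between counting every codepoint and counting only base
                     characters can be sketched in plain XSLT. The stylesheet below approximates
                     base characters by deleting everything in the Unicode category
                        <code>M</code> (marks); this is an assumption made for illustration, not a
                     statement of how TAN itself counts. The names <code>main</code> and
                        <code>s</code> are arbitrary.</para>
                  <programlisting><![CDATA[<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
   <xsl:output method="text"/>
   <!-- a + combining acute accent + b -->
   <xsl:variable name="s" select="'a&#x301;b'"/>
   <xsl:template name="main">
      <!-- every codepoint counted: 3 -->
      <xsl:value-of select="string-length($s)"/>
      <xsl:text>&#xA;</xsl:text>
      <!-- combining marks removed before counting: 2 -->
      <xsl:value-of select="string-length(replace($s, '\p{M}', ''))"/>
   </xsl:template>
</xsl:stylesheet>]]></programlisting>
               </example>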
               <para>TAN includes several functions that usefully extend XML regular expressions.
                  See <code><link linkend="function-regex">tan:regex</link></code>, <code><link
                        linkend="function-matches">tan:matches</link></code>(), <code><link
                        linkend="function-replace">tan:replace</link></code>(), <code><link
                        linkend="function-tokenize">tan:tokenize</link></code>().</para>
               <para>Further reading:<itemizedlist>
                     <listitem>
                        <para>Various <link
                              xlink:href="http://www.google.com/search?q=tutorial+regular+expressions"
                              >tutorials on Regular Expressions</link></para>
                     </listitem>
                     <listitem>
                        <para>Wikipedia, <link
                              xlink:href="http://en.wikipedia.org/wiki/Regular_expression">Regular
                              Expressions</link></para>
                     </listitem>
                     <listitem>
                        <para><link xlink:href="http://www.w3.org/TR/xslt-30/#regular-expressions"
                              >Regular Expressions in XSLT 3.0</link></para>
                     </listitem>
                     <listitem>
                        <para><link xlink:href="http://www.unicode.org/reports/tr18/">Unicode and
                              Regular Expressions</link></para>
                     </listitem>
                     <listitem>
                        <para><link xlink:href="http://www.w3.org/TR/xmlschema-2/#regexs">XML Schema
                              Datatypes</link></para>
                     </listitem>
                  </itemizedlist></para>
            </section>
         </section>
         <section xml:id="multiple-values">
            <title>Interpretation of multiple values</title>
            <para>Many TAN elements contain multiple values or have attributes that allow multiple
               values. Do those multiple values represent intersection, union, or distribution? For
               example, <code>attribute="A B"</code> could be interpreted to mean, using the diagram
               below, anywhere in y (intersection); anywhere in x, y, or z (union); or somewhere in
               x or y <emphasis>and</emphasis> somewhere in y or z (distribution).</para>
            <figure>
               <title>Venn diagram</title>
               <mediaobject>
                  <imageobject>
                     <imagedata fileref="img/Venn%20diagram.jpeg"/>
                  </imageobject>
               </mediaobject>
            </figure>
            <para>Multiple values in TAN are defined according to perceived common usage in ordinary
               English:</para>
            <para><emphasis role="bold">Union (= x, y, or z; default).</emphasis> Examples: anything
               that takes the <xref linkend="pattern-iri_and_name"/>, <code><link
                     linkend="element-equate">&lt;equate></link></code>, <code><link
                     linkend="element-period">&lt;period></link></code>, <code><link
                     linkend="element-where">&lt;where></link></code>. </para>
            <para><emphasis role="bold">Intersection (= y only).</emphasis> Examples: <code><link
                     linkend="attribute-adverb">@adverb</link></code> and other qualifications of
               claims. For example, "...probably not..." does not mean "...probably..." and
               "...not..."</para>
            <para><emphasis role="bold">Distribution (= x or y and y or z).</emphasis>
               <code><link linkend="attribute-affects-element">@affects-element</link></code>,
                     <code><link linkend="attribute-claimant">@claimant</link></code>, <code><link
                     linkend="attribute-object">@object</link></code>, <code><link
                     linkend="element-object">&lt;object&gt;</link></code>, <code><link
                     linkend="attribute-src">@src</link></code>, <code><link
                     linkend="attribute-subject">@subject</link></code>, <code><link
                     linkend="element-subject">&lt;subject&gt;</link></code>, <code><link
                     linkend="attribute-verb">@verb</link></code>. For example, "[Source A], [source
               B], are Z" means "Source A is Z" and "Source B is Z."</para>
            <para>The discussion above does not treat the important question of range. If an
               assertion is made about A, is it true for one point in x or y, or is it true for any
               and all points in x and y? At present, TAN does not address this ambiguity, and
               leaves the interpretation open. </para>
         </section>
      </chapter>
      <chapter xml:id="class_common">
         <title>Patterns and Structures Common to All TAN Encoding Formats</title>
         <para>This chapter provides general background to the elements and attributes that are
            common to all TAN files. For detailed discussion of individual elements and attributes,
            see <xref linkend="elements-attributes-and-patterns"/>.</para>
         <para>This chapter has no relevance for TAN catalog files. For an explanation of that
            format, see <xref linkend="catalog-files"/>.</para>
         <section xml:id="patterns">
            <title>Common Patterns</title>
            <section xml:id="pattern-iri_and_name">
               <title>IRI + name Pattern</title>
               <para>Both humans and computers need to read and write TAN metadata. Very often what
                  is readable to humans is unreadable to computers, and vice versa. So the TAN
                  format requires that all metadata be provided whenever possible in both forms.
                  Although this rule may appear to introduce redundancy and therefore opportunities
                  for error, the clarity is critical. It is the only way at present to ensure that
                  anyone who approaches the data—computer or human—can parse and use it. In
                  addition, doubly expressed metadata provides a safeguard much like a checksum:
                  human- and computer-readable descriptions should correspond. Any discrepancy is a
                  signal that an error should be diagnosed and fixed.</para>
               <para>Some metadata, such as comments, are neither easily nor profitably translated
                  into a computer-actionable string. In such cases only the human-readable form is
                  required. Other metadata involve regular expressions or ISO-compliant dates, both
                  of which are well formed and are usually human-legible. Such data are not
                  repeated. In cases where a datum is not understandable to humans, such as a
                  complex regular expression, a <code><link linkend="element-comment"
                        >&lt;comment></link></code> may be provided.</para>
               <para>Those exceptions aside, all other metadata takes what is called the <emphasis
                     role="italic">IRI + name</emphasis> pattern: one or more <code><link
                        linkend="namespace">&lt;IRI></link></code>s and <code><link
                        linkend="element-name">&lt;name></link></code>s and zero or more <code><link
                        linkend="element-desc">&lt;desc></link></code>s. If the thing being
                  described is a digital file, then the IRI + name pattern is part of a larger
                  pattern, the <xref linkend="digital_entity_metadata"/>.</para>
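               <example>
                  <title>The IRI + name pattern (sketch)</title>
                  <para>A minimal sketch of the pattern follows. The wrapping element, the
                        <code>@xml:id</code> value, and the person described are hypothetical and
                     chosen only for illustration; what matters is the pairing of a
                     computer-readable IRI with a human-readable name and an optional
                     description.</para>
                  <programlisting><![CDATA[<person xml:id="jan">
   <!-- computer-readable: one or more IRIs -->
   <IRI>tag:jan@example.com,1999-01-31:self</IRI>
   <!-- human-readable: one or more names -->
   <name>Jan Example</name>
   <!-- optional: zero or more descriptions -->
   <desc>Hypothetical editor, used here only as an illustration</desc>
</person>]]></programlisting>
               </example>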
            </section>
            <section xml:id="digital_entity_metadata">
               <title>Digital Entity Metadata Pattern</title>
               <para>Some entities identified by the <xref linkend="pattern-iri_and_name"/> will be
                  digital resources. In those cases, the IRI + name Pattern is extended in two
                  different ways, according to whether the entity is a TAN file or not. </para>
               <para>If the entity is a TAN file, then <code><link linkend="namespace"
                        >&lt;IRI></link></code> (one and only one) must be a valid tag URN that
                  matches the <code><link linkend="attribute-id">@id</link></code> value of the TAN
                  file being referred to. This may seem excessive, since in other contexts (HTML,
                  TEI), one needs only the <code>@href</code> or <code>@src</code>. This extra
                  measure has been introduced because TAN files are meant to be valid long after
                  their creation, when they may be separated from their original context, or when a
                  server no longer has the files referred to. Without the <code><link
                        linkend="attribute-id">@id</link></code> value, recovering the referred-to
                  file would be difficult or impossible; with it, recovery becomes easier, and
                  perhaps possible.</para>
               <para>If the entity is not a TAN file, then any IRI may be used. If you choose to use
                  the digital resource's URL as its name (and as its location; see below), then it
                  will be inferred that you mean to identify the digital resource that appeared at
                  that URL at the date or time you accessed it.</para>
               <para>In either case, the pattern adds to the IRI + name pattern one or more
                        <code><link linkend="element-location">&lt;location></link></code>s and an
                  optional <code><link linkend="element-checksum"
                  >&lt;checksum></link></code>.</para>
            </section>
            <section xml:id="edit_stamp">
               <title>Edit Stamp</title>
               <para>Most TAN elements allow for an optional edit stamp, an <code><link
                        linkend="attribute-ed-who">@ed-who</link></code> and an <code><link
                        linkend="attribute-ed-when">@ed-when</link></code>, stating who created or
                  edited the enclosed data and when. Neither attribute is allowed without the other. </para>
               <para><code><link linkend="attribute-ed-when">@ed-when</link></code>, along with
                        <code><link linkend="attribute-when">@when</link></code> and <code><link
                        linkend="attribute-when-accessed">@when-accessed</link></code>, are the
                  attributes through which a TAN file's version is calculated. The latest date
                  serves as the version number.</para>
               <para>An edit stamp performs the same function as <code><link
                        linkend="element-change">&lt;change></link></code>, except that no
                  description can be provided, and it points precisely to the element where a change
                  has been made. If a description of the alteration is necessary, <code><link
                        linkend="element-change">&lt;change></link></code> should be used.</para>
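               <example>
                  <title>An edit stamp (sketch)</title>
                  <para>A minimal sketch of an edit stamp on a single element. The value
                        <code>jan</code> is hypothetical and is assumed to point to a person
                     identified elsewhere in the file; the date is likewise invented.</para>
                  <programlisting><![CDATA[<!-- both attributes must appear together -->
<name ed-who="jan" ed-when="2018-01-09">Hamlet, TAN-T transcription</name>]]></programlisting>
               </example>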
            </section>
         </section>
         <section xml:id="structure">
            <title>Overall Structure</title>
            <para>All TAN-compliant files, no matter the type or class, follow a common basic
               structure: (1) a prolog with at least two processing instruction nodes; (2) a root
               element; and (3) a head, a body, and an optional teiHeader and tail.</para>
            <para><emphasis role="italic">Prolog and processing instruction nodes</emphasis>: The
               standard prolog of every XML file must begin the file: <code>&lt;?xml version="1.0"
                  encoding="UTF-8"?></code>. After that come two processing instructions specifying
               the two schema files required for validation:<itemizedlist>
                  <listitem>
                     <para><code>&lt;?xml-model href="[PATH]/[ROOT-ELEMENT-NAME].rn[g OR
                           c]"?></code></para>
                  </listitem>
                  <listitem>
                     <para><code>&lt;?xml-model
                        href="[PATH]/[ROOT-ELEMENT-NAME].sch"?></code></para>
                  </listitem>
               </itemizedlist></para>
            <para>The first processing instruction node points to the RELAX-NG schema that declares
               the major, structural rules. The second points to the finely tuned rules, written in
               Schematron. Both processing instructions are required. <code>[PATH]</code> represents
               the pathname to the schema file, whether local or on a server and
                  <code>[ROOT-ELEMENT-NAME]</code> stands for the name of the root element (the
               element that is the ancestor of all other elements in the document and the descendant
               of none). It is your choice whether you use <code>.rnc</code> or <code>.rng</code> as
               the extension for the RELAX-NG schema. The former is the compact syntax and the
               latter, the XML format. They are equivalent. The schemas are written primarily in the
               compact syntax, then converted to the XML format.</para>
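            <para>As a minimal sketch, the prolog of a TAN-T file might look like the following.
               The relative path <code>../schemas/</code> is a hypothetical location chosen for
               illustration. The <code>type</code> and <code>schematypens</code> pseudo-attributes
               come from the W3C xml-model convention rather than from the rules above, and are
               included on the assumption that your validator needs them to tell the two schema
               languages apart:</para>
            <para>
               <programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="../schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="../schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?></programlisting>
            </para>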
            <para>TAN files admit three different levels of validation: <code>terse</code>,
                  <code>normal</code>, and <code>verbose</code>. A phase may be specified with a
               pseudoattribute <code>phase</code> in the prolog, e.g., <code>&lt;?xml-model
                  href="TAN-A-div.sch" phase="verbose"?></code>. But it is customary not to specify
               the phase, since users will oftentimes wish to change the level of validation.
               Verbose takes the longest, and terse the shortest. Verbose provides the most
               feedback, terse the least. </para>
            <para><emphasis role="italic">Root element</emphasis>: The name of the root element
               identifies the type of TAN file:<table frame="all">
                  <title>Root TAN elements</title>
                  <tgroup cols="3">
                     <colspec colname="c1" colnum="1" colwidth="1.19*"/>
                     <colspec colname="c2" colnum="2" colwidth="1.19*"/>
                     <colspec colname="newCol3" colnum="3" colwidth="1*"/>
                     <thead>
                        <row>
                           <entry>Root element name</entry>
                           <entry>Type of data</entry>
                           <entry>TAN class</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry><code><link linkend="element-TAN-T"
                              >&lt;TAN-T></link></code></entry>
                           <entry>plain text transcriptions</entry>
                           <entry><link linkend="class_1">1</link></entry>
                        </row>
                        <row>
                           <entry><code>&lt;TEI></code></entry>
                           <entry>TEI transcriptions</entry>
                           <entry><link linkend="class_1">1</link></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-TAN-A-tok"
                                 >&lt;TAN-A-tok></link></code></entry>
                           <entry>token-based alignments</entry>
                           <entry><link linkend="class_2">2</link></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-TAN-A-div"
                                 >&lt;TAN-A-div></link></code></entry>
                           <entry>division-based alignments</entry>
                           <entry><link linkend="class_2">2</link></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-TAN-A-lm"
                              >&lt;TAN-A-lm></link></code></entry>
                           <entry>lexico-morphological analysis</entry>
                           <entry><link linkend="class_2">2</link></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-TAN-mor"
                              >&lt;TAN-mor></link></code></entry>
                           <entry>part of speech / morphology patterns</entry>
                           <entry><link linkend="class_3">3</link></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-TAN-key"
                              >&lt;TAN-key></link></code></entry>
                           <entry>glossaries</entry>
                           <entry><link linkend="class_3">3</link></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-collection"
                                 >&lt;collection></link></code></entry>
                           <entry>catalog of TAN files</entry>
                           <entry><link linkend="class_3">3</link></entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table><note>
                  <para><code><link linkend="element-collection">&lt;collection></link></code> is
                     provided here only to complete the table. None of the material in this chapter
                     applies to this special class 3 format. See <xref linkend="catalog-files"
                     />.</para>
               </note></para>
            <para>Each root element takes a mandatory <code><link linkend="attribute-id"
                  >@id</link></code> and <code><link linkend="attribute-TAN-version"
                     >@TAN-version</link></code>.  All TAN elements take the namespace
                  <code>tag:textalign.net,2015:ns</code>. In most cases, this value is placed in the
               root element. (The only exceptions are TAN-TEI transcription files, which take as a
               default namespace <code>http://www.tei-c.org/ns/1.0</code> everywhere but in
                  <code>/TEI/head</code>, which takes the TAN namespace.) For more about namespaces,
               see <xref linkend="namespace"/>.</para>
            <para><emphasis>Root element children:</emphasis> Most root elements take two mandatory
               children: <code><link linkend="element-head">&lt;head></link></code> and <code><link
                     linkend="element-body">&lt;body></link></code>, the latter containing data and
               the former, metadata (data about the data). TAN-TEI files take three children:
                  <code>&lt;teiHeader></code>, <code><link linkend="element-head"
                  >&lt;head></link></code>, and <code>&lt;text></code>, because the TEI header does
               not satisfy TAN expectations. See <xref linkend="tan-tei"/>. </para>
            <para>All TAN files may take one final optional child, <link linkend="element-tail"
                     ><code>&lt;tail></code></link>, a private use element that allows any
               well-formed XML. It was introduced to facilitate more efficient validation. Nothing
               in a TAN file should be dependent upon the <link linkend="element-tail"
                     ><code>&lt;tail></code></link>. That is, if you are editing a TAN file and you
               add a <link linkend="element-tail"><code>&lt;tail></code></link>, assume that it will
               be disregarded by other users. Similarly, you may delete any TAN file's <link
                  linkend="element-tail"><code>&lt;tail></code></link> without consequence.</para>
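            <para>Taken together, the rules above suggest a skeleton such as the following for a
               plain text transcription. This is only a sketch: the prolog is omitted, the
                  <code>@id</code> value is a hypothetical tag URN, and the contents of each child
               are elided as comments.</para>
            <para>
               <programlisting>&lt;TAN-T xmlns="tag:textalign.net,2015:ns"
   id="tag:example.com,2018:sample-transcription" TAN-version="2018">
   &lt;head>
      &lt;!-- metadata (data about the data) -->
   &lt;/head>
   &lt;body>
      &lt;!-- the data itself -->
   &lt;/body>
   &lt;tail>
      &lt;!-- optional, private use: any well-formed XML; others may ignore or delete it -->
   &lt;/tail>
&lt;/TAN-T></programlisting>
            </para>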
            <section xml:id="iri_name">
               <title><code><link linkend="attribute-id">@id</link></code> and a TAN file's IRI
                  Name</title>
               <para>Every TAN file requires in its root element an <code><link
                        linkend="attribute-id">@id</link></code>. Its value, termed the TAN file's
                     <emphasis>IRI name</emphasis>, must take the form of a tag URN (see <xref
                     linkend="tag_urn"/> for syntax). The file's IRI name is the primary way other
                  TAN files will refer to it. </para>
               <para>The namespace of the current file's IRI name must match at least one namespace
                  in one <code><link linkend="element-person">&lt;person></link></code>'s
                        <code><link linkend="element-IRI">&lt;IRI></link></code> value. This helps
                  tie the responsibility for the TAN file to at least one person. The first such
                        <code><link linkend="element-person">&lt;person></link></code> is called the
                  primary agent, and is bound to the global variable <code><link
                        linkend="variable-primary-agent">$primary-agent</link></code>.</para>
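               <para>For illustration only, suppose a hypothetical editor controls the tag
                  namespace <code>tag:example.com,2018</code>. In the fragment below, the root
                  element's IRI name and the person's IRI share that namespace, so the requirement
                  is met. The <code>@xml:id</code> value, the personal name, and both IRIs are
                  assumptions made for the example, and the surrounding elements are omitted.</para>
               <para>
                  <programlisting>&lt;!-- root element: IRI name in the namespace tag:example.com,2018 -->
&lt;TAN-T xmlns="tag:textalign.net,2015:ns"
   id="tag:example.com,2018:sample-transcription" TAN-version="2018">

&lt;!-- elsewhere in the file: a person whose IRI shares that namespace -->
&lt;person xml:id="jdoe">
   &lt;IRI>tag:example.com,2018:agent:jdoe&lt;/IRI>
   &lt;name>Jane Doe&lt;/name>
&lt;/person></programlisting>
               </para>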
               <para>In choosing a value for <code><link linkend="attribute-id">@id</link></code>
                  you might borrow the filename, but you do not have to. Indeed, it is probably not
                  a good idea, since files are frequently renamed, often with good reason. A TAN
                  file's IRI name should not be changed, especially after publication, because the
                  name is supposed to be permanent and stable. </para>
               <para>On occasion during editing, it will become clear that revisions are so deep
                  that the file is altogether a different kind of thing. If a previous version has
                  been published, then coining a new IRI name <emphasis>is</emphasis> advised, to
                  dissociate the file from its ancestry. You may always document the connection by
                  supplying a <code><link linkend="element-see-also">&lt;see-also></link></code>
                  element in the <code><link linkend="element-head">&lt;head></link></code>,
                  specifying the <code><link linkend="element-relationship"
                     >&lt;relationship></link></code> between the two.</para>
               <para>If you take someone else's data and alter it, then you should <emphasis
                     role="italic">not</emphasis> change the IRI name, not even its namespace. To avoid
                  suggesting that the owner of that namespace is responsible for any revisions you
                  make to the file (if you are allowed—see <code><link linkend="element-license"
                        >&lt;license&gt;</link></code>), you should add yourself as an <link
                     linkend="element-person"><code>&lt;person></code></link> and then document your
                  alterations through <link linkend="element-change"><code>&lt;change></code></link>
                  or <link linkend="attribute-ed-when"><code>@ed-when</code></link> and <link
                     linkend="attribute-ed-who"><code>@ed-who</code></link>. You should also
                  probably add a <code><link linkend="element-see-also">&lt;see-also></link></code>
                  element, pointing to a version of the file that predates your intervention.</para>
               <para>The name of the version of a TAN file is identified by the most recent date in
                  a file's <code><link linkend="attribute-when">@when</link></code>, <code><link
                        linkend="attribute-ed-when">@ed-when</link></code>, or <code><link
                        linkend="attribute-when-accessed">@when-accessed</link></code>. It is
                  important, therefore, whenever you change a TAN file that has already been
                  published to provide at least an edit stamp (<xref linkend="edit_stamp"/>) in the
                  part of the file you changed or in a <code><link linkend="element-comment"
                        >&lt;comment></link></code> or <code><link linkend="element-change"
                        >&lt;change></link></code>, so that anyone validating a TAN file dependent
                  upon yours will be warned that changes have been made. The user may then either
                  continue to process the file (the changes may be minor or inconsequential) or
                  investigate the changes before deciding what to do. </para>
               <para>Because the IRI name is stable, it is suitable for use outside of TAN, in, for
                  example, RDFa, JSON-LD, and linked open data (see <xref
                     linkend="IRIs_and_linked_data"/>).</para>
               <para>The IRI name kept at <code><link linkend="attribute-id">@id</link></code> is
                  the only metadatum positioned outside <code><link linkend="element-head"
                        >&lt;head></link></code>. It is placed as rootward in the document as
                  possible to emphasize that it names the entire document.</para>
               <para><code><link linkend="attribute-TAN-version">@TAN-version</link></code> must be
                     <code>2018</code>, indicating that the files have been made in light of the
                  development files of version one.</para>
            </section>
         </section>
         <section xml:id="metadata_head">
            <title>Metadata (<code><link linkend="element-head">&lt;head></link></code>)</title>
            <para>No matter how much one TAN format differs from another, the metadata are quite
               similar. Anyone getting a TAN file, no matter its class or type, is assumed to want
               to know, and therefore find easily and predictably, the following:<orderedlist>
                  <listitem>
                     <para>the stable name of the file;</para>
                  </listitem>
                  <listitem>
                     <para>its version;</para>
                  </listitem>
                  <listitem>
                     <para>its sources;</para>
                  </listitem>
                  <listitem>
                     <para>other files upon which it depends or with which it otherwise has an
                        important relationship;</para>
                  </listitem>
                  <listitem>
                     <para>the most significant parts of the editorial history;</para>
                  </listitem>
                  <listitem>
                     <para>the linguistic or scholarly conventions that have been adopted in
                        creating and editing the data;</para>
                  </listitem>
                  <listitem>
                     <para>the license, i.e., who holds what rights to the data, and what kind of
                        reuse is allowed;</para>
                  </listitem>
                  <listitem>
                     <para>the persons, organizations, or entities that helped create the data, and
                        the roles played by each.</para>
                  </listitem>
               </orderedlist></para>
            <para>To answer these questions completely, consistently, and predictably the
                     <code><link linkend="element-head">&lt;head></link></code>, a mandatory child
               of the root element, takes a common pattern across <emphasis>all</emphasis> TAN
               formats, thus allowing anyone to easily and predictably work across large numbers and
               types of TAN files. The TAN <code><link linkend="element-head"
                  >&lt;head></link></code>, intended to be concise and focused, compels you to
               provide metadata for the data that is governed by <code><link linkend="element-body"
                     >&lt;body></link></code>, but it does not accommodate metadata for the
               metadata. That is, your metadata should focus on the data itself and not other
               things. For example, <code><link linkend="element-head">&lt;head></link></code>
               requires you to name the people who helped create or edit the data, but you are not
               expected to tell us about them. Merely give good <code><link linkend="element-IRI"
                     >&lt;IRI></link></code>s that point to authoritative sources that provide
               background information.<note>
                  <para>The principles above explain why the TEI extension of TAN requires two
                     heads, one for TEI and the other for TAN. <code>&lt;teiHeader></code> is
                     impossible to map onto a TAN <code><link linkend="element-head"
                           >&lt;head></link></code>. But that <code>&lt;teiHeader></code> has
                     valuable, sometimes critically important, information, and should be retained,
                     or replaced with a valid but empty skeleton.</para>
               </note></para>
            <para>Detailed descriptions of <code><link linkend="element-head"
                  >&lt;head></link></code> and its components are in <xref
                  linkend="elements-attributes-and-patterns"/>. Here we provide a summary, general
               description of TAN metadata. </para>
            <para>To <emphasis role="bold">describe the current file</emphasis>, <code><link
                     linkend="element-head">&lt;head></link></code> takes one or more <code><link
                     linkend="element-name">&lt;name></link></code>s, zero or more <code><link
                     linkend="element-desc">&lt;desc></link></code>s and <code><link
                     linkend="element-master-location">&lt;master-location></link></code>s, and one
                     <code><link linkend="element-license"
                     >&lt;license></link></code>.</para>
            <para>Next comes a list of <emphasis role="bold">files upon which the file
                  depends</emphasis>: zero or more <code><link linkend="element-inclusion"
                     >&lt;inclusion></link></code>s, zero or more <code><link linkend="element-key"
                     >&lt;key></link></code>s, zero or more <code><link linkend="element-source"
                     >&lt;source></link></code>s, and zero or more <code><link
                     linkend="element-see-also">&lt;see-also></link></code>s.</para>
            <para>All <emphasis role="bold">editorial assumptions</emphasis> are placed in
                     <code><link linkend="element-definitions">&lt;definitions></link></code>,
               whose contents differ from one TAN format to the next.</para>
            <para>Finally comes the <emphasis role="bold">responsibility</emphasis> section stating
               who did what when: one or more <code><link linkend="element-person"
                  >&lt;person></link></code>s, <code><link linkend="element-role"
                  >&lt;role></link></code>s, and <code><link linkend="element-change"
                     >&lt;change></link></code>s, and zero or more <code><link
                     linkend="element-resp">&lt;resp></link></code>s.</para>
            <section xml:id="license">
               <title>Rights and Licenses</title>
               <para>Two TAN elements cover rights and licenses: <code><link
                        linkend="element-license">&lt;license></link></code> (mandatory in every TAN
                  file) and <code><link linkend="element-licensor">&lt;licensor&gt;</link></code>.
                  The first element defines the license under which you are releasing your data; the
                  second specifies who has licensed the data. </para>
               <para><emphasis role="bold">The license applies only to the file itself, not to its
                     sources.</emphasis> The distinction is important, and helpful. It is much
                  easier for you to decide and state the rights and license behind your own work
                  than to do so for that of others. Declaring who holds what rights over your
                  source(s) may be not only difficult but risky, and is therefore optional (see
                  below).</para>
               <para>When using a TAN file, you should investigate the entire chain of rights. If
                  you find a discrepancy between the license of a TAN file and that of its sources
                  you should respect the more restrictive one. If a TAN file has a very liberal,
                  open license for the data, this does not necessarily mean that the material upon
                  which it depends is in the public domain. The TAN file's source may be under tight
                  restrictions.</para>
               <para>If you wish to indicate what license governs a source, use <code><link
                        linkend="element-desc">&lt;desc&gt;</link></code> in <code><link
                        linkend="element-source">&lt;source&gt;</link></code>. </para>
               <para>TAN adopts the Creative Commons licenses as its default key vocabulary. See
                     <xref linkend="keywords-license"/>.</para>
            </section>
            <section xml:id="inclusions-and-keys">
               <title>Keys and Inclusions</title>
               <para>Many if not most TAN files are created alongside or in the context of a
                  project, where certain elements will be repeated. Explicit repetition from one
                  file to the next makes them prone to error. Changes might be made in one file but
                  not in another. TAN has two features—keys and inclusions—that help avoid
                  duplication, reduce the likelihood of incomplete editing, and lead to cleaner,
                  smaller files.</para>
               <para>In general, you should first work with keys. If they are not doing the job you
                  need, then try inclusions.</para>
               <section>
                  <title>Keys</title>
                  <para>Most often, an editor wants a simple, shorthand reference to an entity
                     commonly referred to from one file to the next in a single project, e.g., the
                     person who is the principal editor, roles, and division types. </para>
                  <para>Projects are advised to create their own <code><link
                           linkend="element-TAN-key">&lt;TAN-key&gt;</link></code> files populated
                     with commonly used vocabulary. </para>
                  <para>Using those files is a two-step process. First, the TAN-key file is declared
                     via <code><link linkend="element-key">&lt;key&gt;</link></code>. Second,
                     elements (normally in <code><link linkend="element-definitions"
                           >&lt;definitions&gt;</link></code>) can take <link
                        linkend="attribute-which"><code>@which</code></link> instead of the
                     customary IRI + name pattern. <link linkend="attribute-which"
                           ><code>@which</code></link> points to a <code><link
                           linkend="element-name">&lt;name&gt;</link></code> in the TAN-key
                     file.</para>
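                  <para>The difference between the two patterns can be sketched as follows. The
                     IRI, the name, and the <code>@xml:id</code> value are hypothetical, and the
                     declaration of the TAN-key file itself is left out; consult the individual
                     element entries for the exact content models.</para>
                  <para>
                     <programlisting>&lt;!-- customary IRI + name pattern -->
&lt;role xml:id="editor">
   &lt;IRI>tag:example.com,2018:role:editor&lt;/IRI>
   &lt;name>editor&lt;/name>
&lt;/role>

&lt;!-- shorthand: @which points to a name in a declared or standard TAN-key file -->
&lt;role xml:id="editor" which="editor"/></programlisting>
                  </para>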
                  <para>TAN includes a number of standard TAN-key files located at <link
                        xlink:href="http://textalign.net/release/TAN-2018/TAN-key/"/> and
                     documented in <xref linkend="keywords-master-list"/>. Any element that takes
                        <link linkend="attribute-which"><code>@which</code></link> can take full
                     advantage of those files, without <code><link linkend="element-key"
                           >&lt;key&gt;</link></code>.</para>
                  <para>It is strongly recommended that you depend upon only TAN-key files you have
                     written, and not those of a different project.</para>
               </section>
               <section>
                  <title>Inclusions</title>
                  <para>More powerful than TAN-keys are inclusions. Unlike other forms of inclusion
                     you may be familiar with, TAN inclusion involves only select elements, never an
                     entire file. As with keys, TAN inclusion is a two-step process. </para>
                  <para>First, a TAN file is made available for inclusion via <code><link
                           linkend="element-inclusion">&lt;inclusion></link></code>s (inside <link
                        linkend="element-head"><code>&lt;head></code></link>). Like <code><link
                           linkend="element-key">&lt;key&gt;</link></code>, an <code><link
                           linkend="element-inclusion">&lt;inclusion></link></code> does nothing on
                     its own. It merely indicates a file that may be used for inclusions. </para>
                  <para>Second, elements that allow it may take <code><link
                           linkend="attribute-include">@include</link></code>, which points to the
                           <code><link linkend="attribute-xmlid">@xml:id</link></code> reference of
                     the <code><link linkend="element-inclusion">&lt;inclusion></link></code>. In
                     the validation process, those elements will be replaced with every element of
                     that name found in the inclusion file, checked recursively (see below), and
                     ignoring duplicated elements.</para>
                  <para><code><link linkend="element-inclusion">&lt;inclusion></link></code>s are
                     critically important to the content of the TAN file, so any file with
                           <code><link linkend="element-inclusion">&lt;inclusion></link></code>s
                     that cannot be located will be regarded as being in fatal error. Because of the
                     importance of access to included files, it is strongly recommended that
                     inclusions be limited to files locally available, in the same project.</para>
                  <para>Inclusions are recursive. If a TAN file A has <code>&lt;x
                        include='B'></code> and file B has <code>&lt;x include='C D E'></code> then
                     file A will be given all <code>&lt;x></code>s found in B, C, D, and E. </para>
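                  <para>The scenario just described might be sketched as follows, using the same
                     placeholder element <code>&lt;x></code>. The <code>@xml:id</code> values are
                     hypothetical, and the content of each <code>&lt;inclusion></code> is elided
                     because it is specified elsewhere.</para>
                  <para>
                     <programlisting>&lt;!-- File A: declares B as an inclusion, then draws its &lt;x>s from it -->
&lt;inclusion xml:id="B">
   &lt;!-- metadata identifying and locating file B -->
&lt;/inclusion>
&lt;x include="B"/>

&lt;!-- File B: declares C, D, and E, then draws its &lt;x>s from them -->
&lt;inclusion xml:id="C">&lt;!-- ... -->&lt;/inclusion>
&lt;inclusion xml:id="D">&lt;!-- ... -->&lt;/inclusion>
&lt;inclusion xml:id="E">&lt;!-- ... -->&lt;/inclusion>
&lt;x include="C D E"/>

&lt;!-- On validation, the &lt;x include="B"/> in file A is replaced by every &lt;x>
     found in B, C, D, and E, with duplicates ignored. --></programlisting>
                  </para>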
                  <para>In any recursive activity, circularity is fatal. That is true for TAN
                     inclusion as well, but only within a given element name. It is perfectly legal
                     for two files to include each other, as long as they do not try to include the
                     same elements. </para>
                  <para>TAN inclusion removes elements from their original context, which means that
                     values that must be interpreted locally are converted before the elements are
                     included. For example, <link linkend="attribute-which"
                        ><code>@which</code></link> must be interpreted in light of the included
                     document's keys, not those of the including document. Similarly, different
                     numeration systems, e.g., Roman numerals, must be interpreted locally and
                     converted, before inclusion (see <xref linkend="reference_system"/>).</para>
               </section>
            </section>
            <section xml:id="source_and_see-also">
               <title>Distinguishing <code><link linkend="element-source">&lt;source></link></code>s
                  and <code><link linkend="element-see-also">&lt;see-also></link></code>s</title>
               <para>Creating and editing a class 1 TAN file frequently involves working with
                  non-TAN digital files. In the course of editing, and making the material
                  TAN-compatible, you will likely start to correct errors, to normalize conventions,
                  or to bring the transcription closer to an earlier version. At such times it may be
                  unclear how to credit the digital files.</para>
               <para>To answer this, first determine a class 1 file's <code><link
                        linkend="element-source">&lt;source></link></code>. Everything else is then
                  a <code><link linkend="element-see-also">&lt;see-also></link></code>. </para>
               <para>If you find in the course of editing that you are starting to depend upon the
                  source of your source, then that earlier version should be credited as the
                        <code><link linkend="element-source">&lt;source></link></code> and the file
                  you were using should be moved to <code><link linkend="element-see-also"
                        >&lt;see-also></link></code>.</para>
            </section>
            <section xml:id="inheritable_attributes">
               <title>Attribute inheritability and priority</title>
               <para>Many attributes are not inheritable, e.g., <code><link
                        linkend="attribute-xmlid">@xml:id</link></code>. Others are inheritable,
                  indicating something about the host element and all its descendants. When a
                  descendant has the same attribute, the default behavior is for the new attribute
                  to cancel any inherited ones, e.g., <code><link linkend="attribute-xmllang"
                        >@xml:lang</link></code>, <code><link linkend="attribute-affects-element"
                        >@affects-element</link></code>, <code><link linkend="attribute-claimant"
                        >@claimant</link></code>. In other cases, the inherited effect is additive,
                  e.g., <code><link linkend="attribute-cert">@cert</link></code>. Consult individual
                  attribute entries to understand an attribute's behavior.</para>
               <para>Some attributes in an element have priority for interpretation. <code><link
                        linkend="attribute-claimant">@claimant</link></code>, for example, has
                  priority over <code><link linkend="attribute-cert">@cert</link></code>.
                  That is, the two attributes in the same element are to be interpreted to mean:
                        "<code><link linkend="attribute-claimant">@claimant</link></code> has
                        <code><link linkend="attribute-cert">@cert</link></code> confidence about
                  the following claim:...."</para>
            </section>
            <section xml:id="defining_tokens">
               <title>Defining Words and Tokens</title>
               <para>At the heart of interaction between class 1 and class 2 files is a reference
                  system that counts or names words. This poses a problem at the outset. The term
                     <emphasis role="italic">word</emphasis> is notoriously difficult to define, no
                  matter the language. In different contexts, for example, "New York" and "didn't"
                  can each be justifiably taken to be one or two words. Furthermore, some scholars
                  consider punctuation to be words (e.g., commas in modern prose, representing
                  "and"), whereas others ignore them as being anachronistic or capricious (e.g.,
                  ancient Greek and Latin). In the end, the number of meanings for "word" reflects
                  the rich variety of scholarly disciplines.</para>
               <para>TAN adopts the proximate term <emphasis role="italic">token</emphasis>—a word
                  that is defined not according to grammar but according to a regular expression
                  (see <xref linkend="regular_expressions"/>). </para>
               <para>A TAN token is a reference pointer, not a linguistic marker. To define a token
                  in TAN does not entail any linguistic commitments. Neither editors nor users of
                  TAN data should infer that a <link linkend="element-tok"
                     ><code>&lt;tok></code></link> points to a morpheme, a lexeme, or any other
                  linguistic entity. There will frequently be a fortuitous correlation between the
                  two, but it is not guaranteed. In TAN, a token is purely a method of
                  reference.</para>
               <para>TAN was developed in service of ancient literature, where punctuation is
                  generally ignored as being late or not central to the text. Even in contemporary
                  use, most people ignore punctuation when they count words. Therefore the default
                     <link linkend="element-token-definition"
                     ><code>&lt;token-definition></code></link> defines a token as being any
                  continuous string of word characters, the soft hyphen, the zero-width space, or
                  the zero-width joiner, formally defined:</para>
               <para>
                  <programlisting>&lt;token-definition regex="[\w&amp;#xad;&amp;#x200b;&amp;#x200d;]+"/></programlisting>
               </para>
               <para>This pattern will result in a close resemblance to what is ordinarily thought
                  of as words, but perhaps with some surprises (see above, <xref
                     linkend="regular_expressions"/>). If no <link
                     linkend="element-token-definition"><code>&lt;token-definition></code></link> is
                  explicitly given, the pattern above will be assumed.</para>
               <para>If you are working with modern texts, where punctuation might be important to
                  name and number, try the built-in keyword <code>letters and
                  punctuation</code>:</para>
               <para>
                  <programlisting>&lt;token-definition regex="\w+|[^\w\s]"/></programlisting>
               </para>
               <para>This expression defines a token as a sequence of word characters or any single
                  character that is neither a word character nor a space. The string
                     "<code>(I go!)</code>"
                  (the text inside the quotation marks) would have five tokens: <code>( I go !
                     )</code>.</para>
               <para>Above are two built-in, TAN-defined <link linkend="element-token-definition"
                        ><code>&lt;token-definition></code></link>s. You may customize your own
                     <link linkend="element-token-definition"
                     ><code>&lt;token-definition></code></link> to suit your needs. But keep in mind
                  that TAN files were meant to be shared across fields and disciplines. You are
                  encouraged to define tokens in a manner customary to users of the text.
                  Specialized definitions make it less likely that your TAN file will be able to
                  mesh well with other TAN files. Two class-2 files annotating the same class-1 file
                  cannot be easily compared or synthesized if they use different definitions of
                  token.</para>
               <para>Given those caveats, consider a specialized case, where you wish to prepare
                  your transcriptions such that certain Unicode characters precisely delimit tokens
                  that are synonymous with a particular linguistic category, say lexemes. Suppose,
                  for example, you use specialized control characters (e.g., U+200C ZERO WIDTH
                  NON-JOINER and U+200D ZERO WIDTH JOINER) to mark word boundaries within the text
                  of your class 1 file. You might then create a <link
                     linkend="element-token-definition"><code>&lt;token-definition></code></link>
                  like this:</para>
               <para>
                  <programlisting>&lt;token-definition regex="[^\p{Cf}\s]+"/></programlisting>
               </para>
               <para>The statement defines a token as any continuous sequence of characters that
                  are neither whitespace nor format characters (Unicode general category
                     <code>Cf</code>).</para>
               <para>Such customized approaches may make the technique unwieldy or impossible to
                  use, thereby limiting your TAN file's interoperability and utility. If you use
                  format characters or other invisible special characters, it is recommended that
                  you write them as XML character references, e.g.,
                     <code>&amp;#x200D;</code>, so they can be seen in your file.</para>
            </section>
         </section>
      </chapter>
      <chapter xml:id="class_1">
         <title>Class-1 TAN Files, Representations of Textual Objects (Scripta)</title>
         <para>This chapter provides general background to class 1 TAN files. For detailed
            discussion of specific elements or attributes see <xref
               linkend="elements-attributes-and-patterns"/>.</para>
         <para>Class 1 TAN files preserve segmented transcriptions of books, manuscripts, papyri,
            stones, or any other objects with writing on them—collectively termed here
               <emphasis>scripta</emphasis> (sg. <emphasis>scriptum</emphasis>). Files of this class
            are the foundation of any project. No class 2 files (e.g., alignment, morphology) can be
            created without class 1 files. </para>
         <para>Transcriptions come in two different formats, identified by the root element.
                  <code><link linkend="element-TAN-T">&lt;TAN-T></link></code> is a simple, generic
            format, as close as one can get to plain text. <code>&lt;TEI></code> (also referred to
            in this manual as TAN-TEI), on the other hand, can be complex and highly expressive.
            Because the two types function almost identically, the generic TAN-T format is described
            first, followed by supplemental comments on TAN-TEI.</para>
         <section xml:id="transcription_principles">
            <title>Principles and Assumptions</title>
            <section>
               <title>General</title>
               <para>(For more general principles and assumptions applying to all TAN files, not
                  just class 1, see <xref linkend="design_principles"/>.)</para>
               <para>Class 1 formats are designed for faithful but judiciously normalized digital
                  transcriptions. Each TAN-T(EI) file is devoted exclusively to a single version of
                  a single work found in a single scriptum (text-bearing object), segmented and
                  uniquely labeled with a common reference system. Editors of TAN-T(EI) files should
                  be able to read, write, and proofread texts in the languages of the
                  transcriptions. They should understand the texts well enough to segment them and
                  label them according to the conventions used for those works. They should be able
                  to distinguish the text of a primary source from its editorial apparatus. They
                  should be familiar with normalizing conventions for texts from the period,
                  language, and culture. They should know how the transcription might be used in
                  other contexts, especially translation studies or a study of quotations.</para>
               <para>Editors need not understand everything about their texts, and they need not
                  have any specialized skill in grammar or lexicography. They need not know the
                  morphology of individual words, or how individual parts of the text have been
                  translated. Those skills should be used in other TAN formats. </para>
               <para>TAN-T(EI) editors stand at the beginning of a larger workflow for text
                  alignment. It is critical that work be published not hastily but only after
                  careful proofreading. Many transcriptions, especially those of long texts, have
                  typographical errors. Eliminating as many as possible before publication will
                  maximize the utility of a TAN-T(EI) file. On the other hand, TAN has been designed
                  with the assumption that all our files have typographical errors that can and
                  should be corrected as they are found.</para>
               <para>If you are creating a TAN-T(EI) file, you are doing so primarily to facilitate
                  alignment and annotation, which depend critically upon a stable, familiar
                  reference system. Transcription files should be segmented and labeled according to
                  a reference system that can be easily applied to other versions of the same text
                  in other languages. If possible, semantic mileposts (clauses, sentences,
                  paragraphs, chapters) should be prioritized over visual (lines, columns, pages,
                  volumes). See below on <link linkend="reference_system">reference
                  systems</link>.</para>
            </section>
            <section xml:id="domain_model">
               <title>Domain model</title>
               <para>Contributors and users of TAN files should strongly distinguish between a
                  scriptum (text-bearing object) and a conceptual work, e.g., a specific printed
                  copy of the <emphasis>Iliad</emphasis> versus the <emphasis>Iliad</emphasis>
                  conceived generally. The former has materiality (digital files are treated as
                  being material) and the latter does not. Even though both are constitutively
                  necessary for any transcription, the two are sharply differentiated in the TAN
                  format: <code><link linkend="element-source">&lt;source&gt;</link></code> and
                        <code><link linkend="attribute-src">@src</link></code> point to physical
                  exemplars; <code><link linkend="element-work">&lt;work&gt;</link></code> and
                        <code><link linkend="attribute-work">@work</link></code> to the conceptual. </para>
               <para>The distinction may remind some readers of the domain model defined by the
                  Functional Requirements for Bibliographic Records (FRBR), which identifies four
                  types of entities for what they call Group 1 (Products of intellectual &amp;
                  artistic endeavor): <emphasis>Work</emphasis>, <emphasis>Expression</emphasis>,
                     <emphasis>Manifestation</emphasis>, and <emphasis>Item</emphasis>, the first
                  pair being conceptual, non-material entities and the latter pair material ones. </para>
               <para>TAN has been designed with a slightly different domain model in mind. FRBR
                  Items are equivalent to what TAN calls <emphasis>scripta</emphasis>. Multiple
                  scripta that for all intents and purposes are indistinguishable (i.e., items
                  reproduced mechanically) are equivalent to FRBR Manifestations, but in TAN no
                  corresponding entity has been defined. It is best to think of TAN scripta as being
                  equivalent to FRBR Items, with FRBR Manifestations being sets of indistinguishable
                  TAN scripta. </para>
               <para>As for conceptual entities, TAN has been designed with the assumption that most
                  users will find the distinction between Works and Expressions to be unhelpful or
                  misleading. What one person calls a FRBR Expression another may legitimately call
                  a Work. TAN assumes that any derivation of a Work (or Works) is itself a Work,
                  which is really shorthand for <emphasis role="italic">work-version</emphasis>.
                  Thus, in this manual the term <emphasis>version</emphasis> indicates merely a type
                  of work that is known either to derive from another work or to be the basis for
                  other versions of a work. </para>
               <para>TAN avoids altogether the term <emphasis>Expression</emphasis>. Aside from the
                  issues mentioned above, the term implies a medium (without which nothing can be
                  expressed) and therefore materiality. </para>
            </section>
            <section>
               <title>One version, one work, one object, one reference system</title>
               <para><emphasis>Every TAN-T(EI) file must be restricted to a transcription of a
                     single version of a single conceptual work found on a single scriptum,
                     segmented and labeled according to a single reference system</emphasis>. </para>
               <para>This restrictive principle is critical to the success of the network. It
                  reduces the risk of confusion, simplifies the files, and shifts markup complexity
                  from an individual transcription file to the network in which that file
                  participates.</para>
               <section xml:id="textual_objects">
                  <title>One scriptum</title>
                  <para>Each TAN-T(EI) file transcribes one and only one text-bearing object or
                     scriptum. It may be a digital file, a book, a manuscript, a stone, a sign, or a
                     bottlecap. If the object you've chosen has been made mechanically and is
                     virtually indistinguishable from other objects created by the same process
                     (e.g., copies of a printed book or copies of a digital file), then the entire
                     set of copies is to be treated as a single object (an entity some librarians
                     call a manifestation). </para>
                  <para>The definition of some scripta requires an editor's discernment and judgment.
                     For example, some manuscripts have been split up, their parts now residing in
                     multiple libraries around the world; others may be a composite of older
                     manuscripts. In such cases, you may need to define your scriptum in a way that
                     might not match the way others define it. But the decision is your prerogative,
                     not theirs. You have both the right and responsibility to define your object in
                     the way that you think will most benefit users of your files.</para>
                  <para>It is a good idea to name your scriptum in <code><link
                           linkend="element-source">&lt;source></link></code> with an <code><link
                           linkend="element-IRI">&lt;IRI></link></code> value in the form of an
                        <code>http</code> URL provided by a library catalogue. This way you provide
                     a way for others, perhaps through an algorithm, to retrieve extensive,
                     structured bibliographical information. You also save yourself the hassle of
                     writing a detailed bibliographical description that your users would probably
                     not be able to import into their reference management software. If a URL cannot
                     be found for <code><link linkend="element-IRI">&lt;IRI></link></code>, you may
                     simply coin a tag URN or a UUID. Alternatively, if you find another TAN file
                     that uses the same source, it would be a good idea to adopt that name.</para>
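                  <para>As a rough sketch, a <code>&lt;source></code> named by a catalogue URL might
                     look like the following. The URL and the title are placeholders invented for
                     illustration, and the <code>&lt;name></code> child is an assumption about the
                     surrounding header markup rather than something specified in this
                     section:</para>
                  <para>
                     <programlisting>&lt;source>
   &lt;!-- hypothetical catalogue URL; substitute the record for your own scriptum -->
   &lt;IRI>http://example.org/catalogue/record/12345&lt;/IRI>
   &lt;!-- assumed human-readable label -->
   &lt;name>Placeholder title of the printed edition&lt;/name>
&lt;/source></programlisting>
                  </para>
                  <para>Where no catalogue URL can be found, the <code>&lt;IRI></code> could
                     instead hold a coined tag URN such as
                        <code>tag:example.org,2018:scriptum-placeholder</code> (again a placeholder,
                     not a real identifier).</para>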
               </section>
               <section xml:id="conceptual_works">
                  <title>One work</title>
                  <para>The transcription must be restricted to a single creative work, identified
                     by <code><link linkend="element-work">&lt;work></link></code>. </para>
                  <para>Many scripta have more than one work. Identifying and defining the creative
                     work you transcribe is, once again, your prerogative. Suppose the scriptum you
                     have is a Bible. The work you choose from that object can take whatever
                     contours you wish. Perhaps you wish to encode the entire Bible and treat it as
                     a single work. Or maybe you wish to treat only the New Testament as the work,
                     or the Tetraevangelion, or the Gospel of Matthew, or a specific episode in that
                     gospel, or simply the Beatitudes. Any definition of a work is permitted, but a
                     TAN-T(EI) file should contain nothing but the work you have defined. It should
                     be a complete representation of what is found on the object, even if only
                     partially preserved, and respect as far as is practical the order of the text
                     in the scriptum.</para>
                  <para>Well-known works may have a suitable IRI name already assigned to them, say
                     by means of a <link xlink:href="http://wiki.dbpedia.org/About">DBPedia</link>
                     entry. Most works have not been assigned IRIs or are named in IRI vocabularies
                     that are not well known. You may assign any work your own URN, through a UUID
                     or a tag URN. </para>
               </section>
               <section xml:id="work-versions">
                  <title>One version</title>
                  <para>The transcription must be restricted to a single version of the creative
                     work, identified by <code><link linkend="element-version"
                        >&lt;version></link></code> (optional). In most cases, <code><link
                           linkend="element-version">&lt;version></link></code> is unnecessary,
                     because <code><link linkend="element-work">&lt;work></link></code> in
                     conjunction with <code><link linkend="element-source">&lt;source></link></code>
                     is sufficient to identify a particular work-version. But if the source carries
                     multiple versions (e.g., a bilingual edition of a text), then <code><link
                           linkend="element-version">&lt;version></link></code> should be
                     included.</para>
                  <para>Each version from a scriptum should have its own separate TAN-T(EI) file. </para>
                  <para>Notes should be included only if they are an integral part of the primary
                     work (i.e., by the same author, not by a later editor). If you think the notes
                     to a work are important, consider putting them in their own TAN-T(EI) file, or
                     converting them to claims in a TAN-A-div file.</para>
                  <para>If you need to specify exactly where on a scriptum a version appears,
                           <code><link linkend="element-desc">&lt;desc></link></code> or <code><link
                           linkend="element-comment">&lt;comment></link></code> should be
                     used.</para>
                  <para>Very few work-versions have their own URN names. It is advisable to assign a
                     tag URN or a UUID. If the IRI you have used for <code><link
                           linkend="element-work">&lt;work></link></code> is in a namespace that you
                     own or control, then you are entitled to modify it, and you may wish merely to
                     add a suffix to the work IRI to name the version. </para>
               </section>
               <section xml:id="reference_system">
                  <title>One reference system</title>
                  <para>Every TAN transcription must be segmented into a hierarchy of uniquely
                     labeled divisions, defined in the <code><link linkend="element-body"
                           >&lt;body></link></code> through <code><link linkend="element-div"
                           >&lt;div></link></code>s and their <code><link linkend="attribute-type"
                           >@type</link></code> and <code><link linkend="attribute-n"
                        >@n</link></code> values. </para>
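                  <para>A minimal sketch of such a hierarchy follows. The <code>@type</code> values
                        (<code>ch</code>, <code>sec</code>) are hypothetical abbreviations that
                     would have to be declared elsewhere in the file, the text is placeholder text,
                     and any other attributes that <code>&lt;body></code> may require are
                     omitted:</para>
                  <para>
                     <programlisting>&lt;body>
   &lt;!-- top level: chapters -->
   &lt;div type="ch" n="1">
      &lt;!-- leaf divisions hold the transcribed text -->
      &lt;div type="sec" n="1">Text of chapter 1, section 1.&lt;/div>
      &lt;div type="sec" n="2">Text of chapter 1, section 2.&lt;/div>
   &lt;/div>
   &lt;div type="ch" n="2">
      &lt;div type="sec" n="1">Text of chapter 2, section 1.&lt;/div>
   &lt;/div>
&lt;/body></programlisting>
                  </para>
                  <para>In this sketch every leaf division is uniquely identified by the chain of
                        <code>@n</code> values leading down to it: 1 then 1, 1 then 2, and 2 then
                     1.</para>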
                  <para>Those divisions, whenever possible, should align with the reference system
                     that prevails for the work across versions or translations, what is sometimes
                     called a canonical reference system. Because even the most familiar reference
                     system admits degrees and dispute, the term <emphasis>canonical</emphasis> is
                     problematic, so <emphasis role="italic">reference system</emphasis> is
                     preferred in these guidelines. </para>
                  <para>If you have your choice, preference should be given to systems that follow
                     the semantic contours of the work, not the physical features of a particular
                     object. Chapter, paragraph, and sentence numbers are preferable to volume,
                     page, and line numbers, because other derivative versions of a work (e.g.,
                     translations, paraphrases) will only roughly, if at all, follow an
                     object-oriented reference system. </para>
                  <para>Sometimes an object-based reference system is inescapable, or is the most
                     common reference system for a work (e.g., Porphyry's commentary on the
                        <emphasis>Categories</emphasis>). It is perfectly acceptable to adopt that
                     scheme, but it may eventually entail more labor for the alignment process. </para>
                  <para>If a given work has multiple systems (e.g., the works of Plato and
                     Aristotle, which have two reference systems—semantic- and object-oriented—both
                     of which are standard and important), then the recommended practice is to
                     encode the same text twice, placing in each file a <code><link
                           linkend="element-see-also">&lt;see-also></link></code> pointing to the
                     other and a <code><link linkend="element-relationship"
                        >&lt;relationship></link></code> with the keyword <code>alternatively
                        divided edition</code> as the value of <link linkend="attribute-which"
                           ><code>@which</code></link>. A pair of alternatively divided editions can
                     usefully serve as the basis for concordances. In fact, the pair can be used as
                     the first step in converting other versions of the work from one reference
                     system to the other.</para>
                  <para>If there is a good reference system, but the divisions are overly lengthy,
                     you may introduce subdivisions. Such subdivided texts are compatible with
                     references to the older system. But there is no guarantee that the provisional
                     subdivisions you introduce will be adopted by other editors who create or edit
                     TAN versions of the same work, and in the end editors working independently
                     upon the same text may produce discordant schemes. The TAN-A-div format was
                     designed to reconcile such differences.</para>
                  <para>If there is no reference system, or if you think that the ones that exist
                     are inadequate or misguided, create one of your own. If you develop your own
                     reference system, be sure to optimize for all versions of the work, whether
                     known or not. </para>
                  <para>In the <code><link linkend="element-definitions"
                        >&lt;definitions></link></code>, at least one <code><link
                           linkend="element-div-type">&lt;div-type></link></code> must be supplied,
                     declaring the types of divisions into which the text has been segmented, to be
                     referred to by <code><link linkend="attribute-type">@type</link></code> in each
                           <code><link linkend="element-div">&lt;div></link></code>. To declare a
                           <code><link linkend="element-div-type">&lt;div-type></link></code> does
                     not require you to use it in the transcription. It is advisable to keep the
                     abbreviation you adopt in <code><link linkend="attribute-xmlid"
                        >@xml:id</link></code> brief but meaningful. </para>
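                  <para>By way of illustration only, declarations for two division types might look
                     like the sketch below. The abbreviations <code>ch</code> and <code>sec</code>
                     and the tag URNs are hypothetical, and the use of <code>&lt;IRI></code> and
                        <code>&lt;name></code> children is an assumption modeled on how other
                     entities are named in these guidelines:</para>
                  <para>
                     <programlisting>&lt;definitions>
   &lt;!-- each xml:id is the abbreviation that a div's @type will cite -->
   &lt;div-type xml:id="ch">
      &lt;IRI>tag:example.org,2018:div-type:chapter&lt;/IRI>
      &lt;name>chapter&lt;/name>
   &lt;/div-type>
   &lt;div-type xml:id="sec">
      &lt;IRI>tag:example.org,2018:div-type:section&lt;/IRI>
      &lt;name>section&lt;/name>
   &lt;/div-type>
&lt;/definitions></programlisting>
                  </para>
                  <para>A division in the transcription would then cite one of these abbreviations,
                     e.g., <code>&lt;div type="ch" n="1"></code>.</para>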
                  <para>Well-known division types already have suitable IRI names. See <xref
                        linkend="keywords-div-type"/> for a list of core TAN vocabulary for division
                     types, both common and uncommon. If you encounter a rare division type, or one
                     that needs custom specificity, you should mint your own, either in the
                     declarations or in a separate TAN-key file.</para>
                  <para>Numbering systems are a central component of reference systems. TAN
                     supports five major numeration systems:<orderedlist>
                        <listitem>
                           <para><emphasis role="bold">Arabic numerals</emphasis>. 1, 2, 3,
                              etc.</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Roman numerals</emphasis>. Values up to 5000,
                              utilizing i, v, x, l, c, d, and m, uppercase or lowercase, with
                              liberal syntactic rules (within a Roman numeral, any digit preceding
                              one of a higher value is assumed to be a subtraction from the total
                              value; all others are positive values).</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Alphabetic sequences</emphasis>. The
                              26-letter Roman alphabet, with numbers higher than 26 (or any multiple
                              of 26) beginning with the letter a incrementally repeated, e.g., y
                              (25), z (26), aa (27), bb (28), … aaa (53). Uppercase or lowercase
                              allowed.</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Arabic numerals + alphabetic
                                 sequences</emphasis>. Arabic numerals followed immediately by an
                              alphabetic sequence. The second item is to be calculated as a
                              subsequence of the first item, with the lack of a second item taking
                              highest priority. E.g., 4, 4a, 4b, 4c....</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Alphabetic sequences + Arabic
                                 numerals</emphasis>. As above, but with alphabetic sequence
                              preceding Arabic numerals.</para>
                        </listitem>
                     </orderedlist></para>
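                  <para>To make the five systems concrete, the sketch below lists one sample
                        <code>@n</code> value for each, with the reading implied by the rules above
                     given in a comment. The type abbreviation <code>sec</code> is hypothetical, and
                     the lines are independent samples rather than siblings in a single
                     file:</para>
                  <para>
                     <programlisting>&lt;div type="sec" n="14">sample text&lt;/div>   &lt;!-- Arabic numeral: 14 -->
&lt;div type="sec" n="xiv">sample text&lt;/div>  &lt;!-- Roman numeral: 10 - 1 + 5 = 14 -->
&lt;div type="sec" n="bb">sample text&lt;/div>   &lt;!-- alphabetic sequence: 26 + 2 = 28 -->
&lt;div type="sec" n="4a">sample text&lt;/div>   &lt;!-- Arabic + alphabetic: 4, then a as its subsequence; sorts after plain 4 -->
&lt;div type="sec" n="a4">sample text&lt;/div>   &lt;!-- alphabetic + Arabic: a, then 4 as its subsequence --></programlisting>
                  </para>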
                  <para>TAN file processors will attempt to convert all values of <code><link
                           linkend="attribute-n">@n</link></code> to Arabic numerals. Some values
                     are ambiguously Roman numerals or alphabetic sequences, e.g., <code>c</code> (=
                     3 or 100). Such numerals are assumed to be Roman, unless you supply a
                           <code><link linkend="element-ambiguous-letter-numerals-are-roman"
                           >&lt;ambiguous-letter-numerals-are-roman&gt;</link></code> and define it
                     as false.</para>
                  <para>There are also tools for other numeration systems, but they have not been
                     implemented in the validation process. See <code><link
                           linkend="function-letter-to-number">tan:letter-to-number</link></code>()
                     and dependencies.</para>
               </section>
            </section>
            <section xml:id="normalizing_transcriptions">
               <title>Normalizing transcriptions</title>
               <para>You should declare how you have normalized the transcription via <code><link
                        linkend="element-alter">&lt;alter></link></code> and its children, e.g.,
                        <code><link linkend="element-normalization"
                  >&lt;normalization></link></code>. (For suggestions on values of <code><link
                        linkend="element-IRI">&lt;IRI></link></code> for <code><link
                        linkend="element-normalization">&lt;normalization></link></code> see <xref
                     linkend="keywords-normalization"/>.)</para>
               <para>Generally speaking, normalization entails the suppression of things extraneous
                  to or separable from the work you have chosen. You are encouraged to omit
                  parenthetical editorial insertions (especially quotation references), stray
                  handwritten remarks, discretionary word-breaking hyphens, editorial comments,
                  inserted cross-references, and reference numerals (page numbers, section numbers,
                  etc.). If chapter 4 begins "4." or "IV" then leave out the prefatory
                  numeral—you've already indicated it in <code><link linkend="attribute-n"
                     >@n</link></code>. In addition, you should resolve ligatures and correct
                  unintended typographical errors. (Such orthographic corrections are useful to
                  those users who want to generate lexico-morphological data automatically or
                  semiautomatically.)</para>
               <para>The goal is a transcription whose text is free of the interpretive voice of
                  later editors. You should remove from the text anything that is not part of the
                  work proper and would interfere with detailed word-for-word alignment, or would
                  require extra preprocessing or postprocessing work for later users. If you are
                  segmenting a source by line, and you are required to break a word between
                  divisions, you should use either the soft hyphen (<code>&amp;#xad;</code>) or the
                  zero-width joiner (<code>&amp;#x200d;</code>) at the end of the first leaf
                        <code><link linkend="element-div">&lt;div></link></code>. TAN processors
                  that handle a leaf <code><link linkend="element-div">&lt;div></link></code> will
                  automatically normalize the space in the element, then place a space between that
                  leaf <code><link linkend="element-div">&lt;div></link></code> and the next unless
                  one of those two characters is found at the end of the first, in which case
                  the character will be deleted and the two <code><link linkend="element-div"
                        >&lt;div></link></code>s will be joined with no intervening space. For more
                  on issues regarding whitespace, see <xref linkend="whitespace"/>.</para>
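               <para>The joining rule for words broken across divisions can be illustrated with a
                  small sketch. The division type <code>line</code> is a hypothetical abbreviation
                  and the text is invented:</para>
               <para>
                  <programlisting>&lt;!-- the word "transcription" is broken across two lines -->
&lt;div type="line" n="1">a faithful tran&amp;#xad;&lt;/div>
&lt;div type="line" n="2">scription of the text&lt;/div></programlisting>
               </para>
               <para>Under the rule described above, a processor would delete the soft hyphen and
                  join the two divisions as "a faithful transcription of the text". Without the soft
                  hyphen, a space would be placed between the two divisions, giving "a faithful tran
                  scription of the text".</para>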
               <para>In a digital source, variable lengths of spacing marks (e.g., General
                  Punctuation U+2000..U+200B) should be converted to ordinary spaces, and
                  superscript combining Roman letters (U+0363..U+036F) should probably be converted
                  to their non-combining counterparts. All Unicode must be normalized to NFC forms
                  (see <xref linkend="normalization"/>). </para>
               <para>If you are working with a text with notes, distinguish those written by
                  the same person who wrote the work you're transcribing from those that are not.
                  Treat the former as part of the work proper and give each note a <code><link
                        linkend="element-div">&lt;div></link></code> with a suitable <code><link
                        linkend="attribute-type">@type</link></code> and place it after the
                        <code><link linkend="element-div">&lt;div></link></code> it annotates. It
                  will be assumed by processors of the data that, absent more specific information,
                  any <code><link linkend="element-div">&lt;div></link></code> of an annotating
                        <code><link linkend="attribute-type">@type</link></code> is an annotation of
                  the last <code><link linkend="element-div">&lt;div></link></code> that is not an
                  annotation. (Alternatively, you may use the <code>&lt;note></code> feature of
                  TAN-TEI, but bear in mind that this element will be treated by users as part of
                  the leaf div to which it belongs, not separate from it.) </para>
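               <para>A minimal sketch of the first approach follows. It assumes that a div type
                  named <code>note</code> has been declared for annotations; the type names,
                     <code>@n</code> values, and text are all hypothetical.</para>
               <programlisting language="xml">&lt;div type="para" n="1">Text of the first paragraph of the work.&lt;/div>
&lt;!-- the author's own note, placed directly after the div it annotates -->
&lt;div type="note" n="1a">Text of the author's note on paragraph 1.&lt;/div></programlisting>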
               <para>If the notes are not part of the work per se—for example, translator's notes in
                  a translation of a primary source—you should treat them as a separate work
                  altogether, and put them in a separate TAN-T(EI) file, perhaps linking the two
                  through <code><link linkend="element-see-also">&lt;see-also></link></code>. You
                  may wish to structure that file so that it mirrors the reference system of the
                  primary source, to facilitate automatic alignment between the two. </para>
               <para>Remember that the note signals in the main text and in the footnote area are
                  metadata meant to help readers link corresponding passages of texts, and should be
                  deleted. If the connective function served by the note signal is important, create
                        <code><link linkend="element-claim">&lt;claim></link></code>s in a TAN-A-div
                  file, which supports correlating comments to specific ranges of text.</para>
               <para>This principle holds true for variants in the scriptum. For example, a
                  manuscript may have correctors' marks. Or a set of footnotes (or apparatus
                  criticus) might comment on how and why the main text differs from previous
                  readings. In those cases, each set of corrections might be wholly incorporated
                  into the <code><link linkend="element-claim">&lt;claim></link></code>s of a
                  TAN-A-div file, perhaps also with a separate TAN-T file.</para>
               <para>Overall, normalization is a difficult topic, and it is not well studied. Not
                  all decisions will be clear-cut. You may justly hesitate before normalizing
                  orthography, punctuation, accentuation, or capitalization. Some aspects of Unicode
                  that lend themselves to varying conventions may need special consideration. You
                  may need to consider whether an unusual or rarely used Unicode character might be
                  misinterpreted or hinder other users. Document any decisions in the <code><link
                        linkend="element-alter">&lt;alter></link></code>. </para>
               <para>In some ambiguous areas, you can use TAN-TEI to your advantage. Suppose, for
                  example, a manuscript has reference numerals that are sui generis. That is, these
                  reference numbers do not correspond to the "canonical" reference scheme. On the
                  one hand, they are metadata, and should arguably be deleted; on the other, they
                  are part of the text, and witness to how a text was read and changed over time. A
                  middle-ground approach would move these references to TAN-TEI's
                     <code>&lt;milestone rend=""></code>. In that way, the numerals are removed from
                  the main text, yet the information is retained. Generally speaking, TEI's
                     <code>@rend</code> is an excellent way to remove something from the main text
                  without removing it from the file altogether.</para>
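               <para>For instance, a sui generis numeral might be recorded as in the following
                  sketch. The numeral, the <code>@unit</code> value, the div type, and the
                  surrounding text are hypothetical; <code>@unit</code> is included on the
                  assumption that your TEI schema expects it on
                  <code>&lt;milestone></code>.</para>
               <programlisting language="xml">&lt;div type="chapter" n="3">
   &lt;!-- the manuscript's own numeral, kept out of the running text -->
   &lt;p>End of one sentence. &lt;milestone unit="section" rend="κγ"/>Start of the next.&lt;/p>
&lt;/div></programlisting>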
            </section>
         </section>
         <section xml:id="tan-t_data">
            <title>Transcriptions</title>
            <para>The sole purpose of the <code><link linkend="element-body">&lt;body></link></code>
               of a class 1 file is to contain a segmented transcription of a single version of a
               single work from a scriptum. <code><link linkend="element-body"
                  >&lt;body></link></code> may take <code><link linkend="attribute-in-progress"
                     >@in-progress</link></code> and must take <code><link
                     linkend="attribute-xmllang">@xml:lang</link></code>, whose value names the
               language of the majority of the text. If the language changes in a descendant
                     <code><link linkend="element-div">&lt;div></link></code>, ensure that its
                     <code><link linkend="attribute-xmllang">@xml:lang</link></code> value
               (explicitly or by inheritance) indicates the language that is used.</para>
            <para><code><link linkend="element-body">&lt;body></link></code> takes one or more
                     <code><link linkend="element-div">&lt;div></link></code> elements, each of
               which governs either other <code><link linkend="element-div">&lt;div></link></code>
               elements or text (or TEI elements).</para>
            <para>The term <emphasis>leaf div</emphasis> refers to those <code><link
                     linkend="element-div">&lt;div></link></code>s that contain text and therefore
               no other <code><link linkend="element-div">&lt;div></link></code>s.</para>
            <para>Within this treelike structure of <code><link linkend="element-div"
                     >&lt;div></link></code>s, the concatenation of <code><link
                     linkend="attribute-n">@n</link></code> values, starting from the most ancestral
                     <code><link linkend="element-div">&lt;div></link></code>, provides the
                  <emphasis>flat ref</emphasis>, the reference system used by class 2 files to refer
               to parts of TAN-T(EI) files. </para>
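            <para>The following sketch illustrates these terms. All values are hypothetical, and
               the flat refs given in the comments are written with a period only for
               readability.</para>
            <programlisting language="xml">&lt;body xml:lang="eng">
   &lt;div type="chapter" n="1">
      &lt;!-- leaf divs: they contain text and no other divs -->
      &lt;div type="verse" n="1">First verse.&lt;/div>   &lt;!-- flat ref 1.1 -->
      &lt;div type="verse" n="2">Second verse.&lt;/div>  &lt;!-- flat ref 1.2 -->
   &lt;/div>
   &lt;div type="chapter" n="2">
      &lt;div type="verse" n="1">Another verse.&lt;/div> &lt;!-- flat ref 2.1 -->
   &lt;/div>
&lt;/body></programlisting>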

            <section xml:id="leaf_div_uniqueness_rule">
               <title>Flattened References, and the Leaf Div Uniqueness Rule</title>
               <para>One of the most important validation rules is the <emphasis>Leaf Div Uniqueness
                     Rule</emphasis>, which states that the flat ref for each leaf <code><link
                        linkend="element-div">&lt;div></link></code> must be unique.</para>
               <para>This rule applies only to leaf <code><link linkend="element-div"
                        >&lt;div></link></code>s and not to <code><link linkend="element-div"
                        >&lt;div></link></code>s in general, since on occasion a major textual unit
                  will be broken by another. For example, chapters 24 and 30 in the book of Proverbs
                  of the Septuagint are split and interleaved (24.1–22e [22a–e are verses not extant
                  in the Hebrew]; 30.1–14; 24.23–34; and 30.15–33).</para>
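               <para>A sketch of how that interleaving might be encoded is given below, with the
                  verses abbreviated to comments and the div type chosen only for illustration.
                  The <code>@n</code> values 24 and 30 each occur twice at the chapter level, yet
                  every leaf div beneath them would still have a unique flat ref.</para>
               <programlisting language="xml">&lt;div type="chapter" n="24">&lt;!-- verses 1 through 22e -->&lt;/div>
&lt;div type="chapter" n="30">&lt;!-- verses 1 through 14 -->&lt;/div>
&lt;div type="chapter" n="24">&lt;!-- verses 23 through 34 -->&lt;/div>
&lt;div type="chapter" n="30">&lt;!-- verses 15 through 33 -->&lt;/div></programlisting>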
            </section>
         </section>
         <section xml:id="tan-tei">
            <title>Transcriptions Using the Text Encoding Initiative (<code>&lt;TEI></code>)</title>
            <para>
               <note>
                  <para>This section is to be read in conjunction with <xref linkend="class_1"/> and
                        <xref linkend="TEI"/>, which address related technical issues.</para>
               </note>
            </para>
            <para>Some creators and editors of transcriptions will find the rather stripped-down
               TAN-T format inadequate. Some may wish to mark up the text further. Some may already
               have a library of transcriptions whose annotations are desirable to keep, even if
               uninteresting to most users. In these cases, you should use TAN-TEI, an extension to
               the Text Encoding Initiative (TEI) format, which is well known for its expressiveness,
               its stability, its flexibility, and its widespread use in scholarship.</para>
            <para>TEI was designed to be maximally expressive and flexible, to serve the detailed
               needs of humanities scholars. In serving this mission, TEI has come to define more
               than five hundred different element names, and more than two hundred attributes
               (roughly six times more than are defined in TAN). Of course, any given TEI file uses
               only a small subset of those elements and attributes, and TEI itself comes in
               different flavors, from TEI Lite, which uses only 75 attributes and 140 elements, to
               TEI All, which opens up almost the entire library. </para>
            <para>Although the TEI format is oftentimes seen as a standard, it lacks some of the
               characteristics one normally expects in a standard. It is very flexible, admits flavors
               and interpretation, and has been designed to encourage customization. Individuals and
               projects may define their own subset of TEI elements, to constrict or expand the
               allowable rules as they see fit. TAN-TEI is one of those customizations. The major
               difference is that TAN-TEI attempts to impose extra strictures not defined in TEI, to
               ensure that transcriptions are maximally likely to be interchangeable with other
               TAN-TEI files.</para>
            <para>TAN's customization of the TEI can be summarized as follows (the default namespace
               in this section is the TEI namespace,
               <code>http://www.tei-c.org/ns/1.0</code>):</para>
            <para>
               <table frame="all">
                  <title>Synopsis of TAN-TEI customization</title>
                  <tgroup cols="2">
                     <colspec colname="c1" colnum="1" colwidth="1*"/>
                     <colspec colname="c3" colnum="2" colwidth="3.21*"/>
                     <thead>
                        <row>
                           <entry>TEI element</entry>
                           <entry>summary of alteration</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry><code>&lt;TEI></code></entry>
                           <entry>
                              <itemizedlist>
                                 <listitem>
                                    <para>must have <code><link linkend="attribute-id"
                                          >@id</link></code> with IRI name</para>
                                 </listitem>
                                 <listitem>
                                    <para>should take new namespace declaration,
                                          <code>xmlns:tan="tag:textalign.net,2015:ns"</code>
                                    </para>
                                 </listitem>
                                 <listitem>
                                    <para>takes a new child element, <code><link
                                             linkend="element-head">&lt;head></link></code>, placed
                                       between <code>&lt;teiHeader></code> and
                                          <code>&lt;text></code></para>
                                 </listitem>
                              </itemizedlist>
                           </entry>
                        </row>
                        <row>
                           <entry><code>&lt;text></code></entry>
                           <entry>
                              <itemizedlist>
                                 <listitem>
                                    <para>Only the child <code><link linkend="element-body"
                                             >&lt;body></link></code> will be considered.
                                          <code>&lt;front></code> and <code>&lt;back></code> will be
                                       ignored.</para>
                                 </listitem>
                              </itemizedlist>
                           </entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-body">&lt;body></link></code></entry>
                           <entry>
                              <itemizedlist>
                                 <listitem>
                                    <para>must take <code><link linkend="attribute-xmllang"
                                             >@xml:lang</link></code></para>
                                 </listitem>
                                 <listitem>
                                    <para>may take <code><link linkend="attribute-in-progress"
                                             >@in-progress</link></code></para>
                                 </listitem>
                                 <listitem>
                                    <para>must take exclusively one or more <code><link
                                             linkend="element-div">&lt;div></link></code>s</para>
                                 </listitem>
                                 <listitem>
                                    <para>any elements or text between <code><link
                                             linkend="element-div">&lt;div></link></code>s will be
                                       ignored</para>
                                 </listitem>
                                 <listitem>
                                    <para>contents must be restricted to a single work</para>
                                 </listitem>
                                 <listitem>
                                    <para>any and all text nodes will be treated as part of the
                                       transcription</para>
                                 </listitem>
                              </itemizedlist>
                           </entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-div">&lt;div></link></code></entry>
                           <entry>
                              <itemizedlist>
                                 <listitem>
                                    <para>must take either only <code><link linkend="element-div"
                                             >&lt;div></link></code>s or no <code><link
                                             linkend="element-div">&lt;div></link></code>s at
                                       all</para>
                                 </listitem>
                                 <listitem>
                                    <para>must take <code><link linkend="attribute-type"
                                             >@type</link></code> and <code><link
                                             linkend="attribute-n">@n</link></code> (or <link
                                          linkend="attribute-include"
                                       ><code>@include</code></link>)</para>
                                 </listitem>
                              </itemizedlist>
                           </entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table>
            </para>
            <para>Like all other TAN files, the root elements of TAN-TEI files must take an
                     <code><link linkend="attribute-id">@id</link></code>, the IRI name. See above,
                  <xref linkend="tag_urn"/>.</para>
            <para>TAN-TEI files have two heads, which may strike you as odd. The TEI head and the
               TAN head were designed for different purposes. Whereas the TAN <link
                  linkend="element-head"><code>&lt;head></code></link> is meant to be brief and
               keyed to both IRIs and human-readable data, the <code>&lt;teiHeader></code> permits
               quite an expansive range of metadata, and about matters that bear only indirectly on
               the transcription (e.g., manuscript descriptions). Further,
                  <code>&lt;teiHeader></code> was designed to be read principally by humans. </para>
            <para>Processors of TAN-TEI files will in general ignore the contents of
                  <code>&lt;teiHeader></code>, since the contents are unpredictable. If your
                  <code>&lt;teiHeader></code> has any kind of metadata relevant to TAN users, you
               will need first to create a standard TAN <link linkend="element-head"
                     ><code>&lt;head></code></link> (see <xref linkend="metadata_head"/> and <xref
                  linkend="transcription_principles"/>). This conversion needs to be performed
               manually, since the two headers are incommensurate, and writing each one requires a
               different kind of mentality.</para>
            <para>In a TAN-TEI file, the TAN <code><link linkend="element-head"
                  >&lt;head></link></code> must take the TAN namespace, i.e., <code>&lt;head
                  xmlns="tag:textalign.net,2015:ns"></code> or <code>&lt;tan:head></code> if the
               prefix <code>tan:</code> has been defined in the root element.</para>
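            <para>Putting these requirements together, the skeleton of a TAN-TEI file might look
               like the following sketch. The IRI name, language, div types, and text are
               hypothetical, and the contents of both heads are omitted.</para>
            <programlisting language="xml">&lt;TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:tan="tag:textalign.net,2015:ns"
   id="tag:example.com,2018:sample-transcription">
   &lt;teiHeader>&lt;!-- TEI metadata, generally ignored by TAN processors -->&lt;/teiHeader>
   &lt;tan:head>&lt;!-- TAN metadata -->&lt;/tan:head>
   &lt;text>
      &lt;body xml:lang="lat">
         &lt;div type="book" n="1">
            &lt;div type="chapter" n="1">
               &lt;p>Text of the first chapter, with any TEI markup you wish.&lt;/p>
            &lt;/div>
         &lt;/div>
      &lt;/body>
   &lt;/text>
&lt;/TEI></programlisting>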
            <para>Within any leaf <code><link linkend="element-div">&lt;div></link></code>, you may
               use whatever TEI markup you wish, to whatever level of depth or complexity. All users
               of your TAN-TEI file will be interested in the text; only a subset will care about
               any markup within leaf <code><link linkend="element-div">&lt;div></link></code>s. For
               this reason, even if you change the value of <code><link linkend="attribute-xmllang"
                     >@xml:lang</link></code> within a leaf <code><link linkend="element-div"
                     >&lt;div></link></code>, there is no guarantee that readers or processors of
               your data will take it into account. </para>
            <para>TAN-TEI should not be used to try to represent the physical appearance of the text
               on the object. </para>
            <para>You may need to prepare a TEI file to be TAN compliant. In practice, it helps to
               think of the conversion as falling into three steps:</para>
            <para>
               <orderedlist>
                  <listitem>
                     <para>Structure: insert new processing instructions (TAN-TEI validation files);
                        adjust root element by supplying IRI name to <code><link
                              linkend="attribute-id">@id</link></code>, TAN namespace to
                           <code>@xmlns:tan</code>.</para>
                  </listitem>
                  <listitem>
                     <para>Metadata: create new <code><link linkend="element-head"
                           >&lt;head></link></code> and populate it.</para>
                  </listitem>
                  <listitem>
                     <para>Data: edit <code><link linkend="element-body">&lt;body></link></code> to
                        restrict the content to a single work; restructure <code><link
                              linkend="element-body">&lt;body></link></code> content into nesting
                              <code><link linkend="element-div">&lt;div></link></code>s with correct
                              <code><link linkend="attribute-type">@type</link></code> and
                              <code><link linkend="attribute-n">@n</link></code> values.</para>
                  </listitem>
               </orderedlist>
            </para>
            <para>It has been the experience of those who have made TEI to TAN-TEI conversions that
               step 2 is the most time-consuming. The TAN <code><link linkend="element-head"
                     >&lt;head></link></code> requires one to more carefully curate the metadata
               than does <code>&lt;teiHeader></code>. But step 3 should not be underestimated,
               either. Many people write TEI files with a focus on the original textual object, and
               they do not normalize to the level expected in a TAN file. In general, the simpler
               the TEI file, the better.</para>
         </section>
      </chapter>
      <chapter xml:id="class_2">
         <title>Class-2 TAN Files, Annotations of Texts</title>
         <para>This chapter provides general background to class 2 TAN files. For detailed
            discussion of individual elements and attributes see <xref
               linkend="elements-attributes-and-patterns"/>.</para>
         <para>TAN-A-div files provide broad, macroscopic alignment of multiple versions of any
            number of works. They also provide a place for annotating the texts through general
            claims.</para>
         <para>TAN-A-tok files provide narrow, microscopic alignment of any two class 1 files,
            identifying word-for-word or character-for-character correspondence.</para>
         <para>TAN-A-lm files support lexico-morphological (part-of-speech) annotation of either a
            single class 1 file or a language.</para>
         <para>In translation studies, it is common to use the term <emphasis>source</emphasis> (or
               <emphasis>sources</emphasis>) to refer to a translated text and the term
               <emphasis>target</emphasis> to refer to the translation. TAN, however, has been
            designed for cases where it may not be clear which is the target and which is the
            source. Further, there is a more generic use of <emphasis>source</emphasis> that takes
            precedence. In these guidelines, therefore, we avoid the term <emphasis role="italic"
               >target</emphasis> altogether, and when we use the word <emphasis role="italic"
               >source</emphasis>, we are referring only to one of the class 1 files upon which a
            class 2 alignment depends.</para>
         <section xml:id="class_2_common">
            <title>Common Elements</title>
            <para>The class 2 formats have been designed to be human readable, particularly in
               their references to class 1 files. In ordinary conversation, when referring to
               specific parts of a work, we like to cite pages, paragraphs, sentences, lines, words,
               letters, and so forth. We use relational words (e.g., "first"), and the very text
               itself. We might say, for example, "See page 4, second paragraph, the last four
               words." Or, "See page 4, second paragraph, first sentence, second occurrence of
               'pull'." </para>
            <para>Those familiar conventions are the basis for the TAN pointer syntax, and so it
               differs from other pointer systems (e.g., URLs, XPath, and XPointer). TAN pointers
               depend upon a fourfold hierarchy of: works, divisions, word tokens, and characters.
                  <emphasis>Works</emphasis>, defined above (see <xref linkend="conceptual_works"
               />), are defined by the <emphasis>source</emphasis> (which may not have more than one
               work). <emphasis>Divisions</emphasis> are defined by the <code><link
                     linkend="element-div">&lt;div></link></code> structure of each source.
                  <emphasis>Tokens</emphasis> are words of those divisions, defined according to one
               or more tokenization rules. And <emphasis>characters</emphasis> are defined as
               non-modifying codepoints in a word token. (A modifying character is always included
               with the base character it modifies.)</para>
            <para>Parts of this fourfold hierarchy—works, divisions, tokens, and characters—normally
               have familiar names. Sources can be given a meaningful abbreviated name (e.g.,
                     <code><link linkend="attribute-xmlid">xml:id</link> = "hamlet-1741"</code>);
               divisions are named according to <code><link linkend="attribute-n">@n</link></code>;
               tokens are referred to by position, by their actual values, or both (e.g.,
                     <code><link linkend="attribute-pos">pos</link> = "1 - 5", <link
                     linkend="attribute-pos">pos</link> = "last-1 - last", <link
                     linkend="attribute-val">val</link> = "hath"</code>; see <xref
                  linkend="attr_pos_and_val"/>). Characters are always identified by number (e.g.,
                     <code><link linkend="attribute-chars">chars</link> = "2, 7"</code>).</para>
            <para>This approach not only makes the syntax human readable, it also mitigates
               disruptions from corrections to the dependencies. For example, if an incorrectly
               duplicated <code><link linkend="element-div">&lt;div></link></code> is deleted,
               disruption to the reference system is isolated and does not affect the rest of the
               document.</para>
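            <para>By way of illustration, a request such as "division 4.2, the last four words"
               might be expressed with attributes along the following lines. This is a sketch
               only: the element name <code>tok</code>, the source name, and the reference are
               hypothetical placeholders (with <code>@src</code> naming the source and
                  <code>@ref</code> the division), and each class 2 format defines which elements
               actually carry these attributes.</para>
            <programlisting language="xml">&lt;!-- the last four word tokens of division 4.2 in the source named hamlet-1741 -->
&lt;tok src="hamlet-1741" ref="4.2" pos="last-3 - last"/>
&lt;!-- characters 2 and 4 of the first token "hath" in the same division -->
&lt;tok src="hamlet-1741" ref="4.2" val="hath" chars="2, 4"/></programlisting>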
            <section xml:id="class_2_metadata">
               <title>Class 2 Metadata (<code><link linkend="element-head"
                  >&lt;head></link></code>)</title>
               <para>Class 2 files share a few common features in their metadata, mostly to
                  facilitate the human-friendly reference system outlined above.</para>
               <para>All class 2 files have as their sources nothing other than class 1 files.
                  Therefore each <code><link linkend="element-source">&lt;source></link></code> must
                  take the <xref xlink:href="#digital_entity_metadata"/>.  </para>
               <para>Editors of class 2 files must be able to name or number word-tokens in a
                  transcription, via an optional <code><link linkend="element-token-definition"
                        >&lt;token-definition></link></code>. See <xref linkend="defining_tokens"
                  />.</para>
               <para>Inevitably, some class 1 sources will have differences. Perhaps works or div
                  types were not defined with the same IRIs, or perhaps one version follows an
                  idiosyncratic reference system. If sources need to be reconciled, alterations are
                  specified in <code><link linkend="element-alter">&lt;alter&gt;</link></code>,
                  which stipulates a set of actions that should be applied to the sources that have
                  been named. Alteration actions include:</para>
               <para>
                  <itemizedlist>
                     <listitem>
                        <para><code><link linkend="element-skip">&lt;skip&gt;</link></code> allows
                           you to ignore specific <code><link linkend="element-div"
                                 >&lt;div&gt;</link></code>s, deeply or shallowly.</para>
                     </listitem>
                     <listitem>
                        <para><code><link linkend="element-rename">&lt;rename&gt;</link></code>
                           allows you to rename specific <code><link linkend="element-div"
                                 >&lt;div&gt;</link></code>s.</para>
                     </listitem>
                     <listitem>
                        <para><code><link linkend="element-equate">&lt;equate&gt;</link></code>
                           allows you to provide synonyms for <code><link linkend="attribute-n"
                                 >@n</link></code> values.</para>
                     </listitem>
                     <listitem>
                        <para><code><link linkend="element-reassign">&lt;reassign&gt;</link></code>
                           allows you to move parts of leaf <code><link linkend="element-div"
                                 >&lt;div&gt;</link></code>s elsewhere. </para>
                     </listitem>
                  </itemizedlist>
               </para>
               <para>These actions allow you to reconcile sources that are somewhat at odds. Actions
                  are applied first hierarchically and then in the sequence stated above. That is,
                  the validation routine will go level by level through a given source. Any rules
                  that are found in one level will be applied (skips taking top precedence,
                  reassigns the lowest) before moving to the next level of the source. So if you
                  wish in a given source to change chapter 1 to chapter 2, any subdivisions will be
                  collated. If you wanted to do further things with (original) 1.5, you would need
                  to refer to it as 2.5, and you would also need to realize that if original 2.5
                  exists, the action will be applied to both.</para>
               <para>Each action adds time to the validation routines. On lengthy texts these can
                  become quite time-consuming. You are advised to keep <code><link
                        linkend="element-alter">&lt;alter&gt;</link></code>s to a minimum. If a
                  source has numerous alterations, you may find it less time-consuming to create a
                  new version of the source.</para>



            </section>
            <section xml:id="class_2_body">
               <title>Class 2 Data Patterns (<code><link linkend="element-body"
                     >&lt;body></link></code>)</title>
               <para>The three types of class 2 files treat different kinds of phenomena, so their
                  data structures look quite different. Nevertheless, a few elements and attributes
                  are shared by at least two class 2 formats.</para>
               <para>Many class 2 elements take <code><link linkend="attribute-src"
                     >@src</link></code> and <code><link linkend="attribute-ref">@ref</link></code>.
                        <code><link linkend="attribute-src">@src</link></code> points via ID
                  reference to one or more <code><link linkend="element-source"
                     >&lt;source></link></code>s and <code><link linkend="attribute-ref"
                     >@ref</link></code> points to one or more <code><link linkend="element-div"
                        >&lt;div></link></code>s through their <emphasis>flat ref</emphasis>,
                  perhaps substituted with their new values if <code><link linkend="element-alter"
                        >&lt;alter&gt;</link></code>s have been invoked (see <xref
                     linkend="metadata_head"/>).</para>
               <para>In the example <code><link linkend="attribute-ref">ref</link> = "1.2-4,
                     1.5"</code>, the periods are arbitrary (but the hyphen and comma, which have
                  special meanings here, are not). You may use any separating punctuation or space
                  you wish, except for hyphens and commas, which are reserved to create ranges and
                  joins. You may also use other numeral systems. </para>
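               <para>For example, each of the following attribute values would be expected to point
                  to the same divisions, differing only in the separator chosen. The values are
                  illustrative and assume a source whose <code>@n</code> values are plain
                  numerals.</para>
               <programlisting language="xml">ref="1.2-4, 1.5"
ref="1:2-4, 1:5"
ref="1 2-4, 1 5"</programlisting>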
            </section>
            <section xml:id="attr_pos_and_val">
               <title><code><link linkend="attribute-pos">@pos</link></code> and <code><link
                        linkend="attribute-val">@val</link></code></title>
               <para>To point to a token, one of three methods may be used.</para>
               <para>
                  <orderedlist>
                     <listitem>
                        <para><emphasis role="italic"><code><link linkend="attribute-pos"
                                    >@pos</link></code> alone</emphasis>. Under this method, one or
                           more digits, or the phrase <code>last</code> or <code>last-</code> plus a
                           digit, joined by hyphens or commas indicate one or more token numbers.
                           For example, <code>2, 4-6, last-2 - last</code> refers to the second,
                           fourth, fifth, sixth, antepenult, penult, and final tokens in a passage.
                           The numerical value to which the keyword <code>last</code> resolves
                           depends upon the length of each <code><link linkend="element-div"
                                 >&lt;div></link></code>.</para>
                     </listitem>
                     <listitem>
                        <para><emphasis role="italic"><code><link linkend="attribute-val"
                                    >@val</link></code> alone</emphasis>. Under this method, a
                           single token is picked by means of a string value equivalent to the
                           token. For example, <code><link linkend="attribute-val">@val</link> =
                              "bird"</code> points to the first occurrence of the token
                              <code>bird</code>.</para>
                     </listitem>
                     <listitem>
                        <para><emphasis role="italic"><code><link linkend="attribute-pos"
                                    >@pos</link></code> and <emphasis role="italic"><code><link
                                       linkend="attribute-val">@val</link></code></emphasis>
                              together.</emphasis> Under this method, specific occurrences of a token
                           are picked. For example, <code><link linkend="attribute-val"
                              >@val</link>="bird" <link linkend="attribute-pos">@pos</link>="2,
                              4"</code> picks the second and fourth occurrences of the token
                              <code>bird</code>.</para>
                     </listitem>
                  </orderedlist>
               </para>
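               <para>The three methods might look as follows in practice. This is only a sketch:
                  the source id <code>src-a</code> and the flat ref <code>1.2</code> are
                  hypothetical, and the <code>&lt;tok></code> elements are shown without the
                  parent element they would normally require.</para>
               <programlisting language="xml">&lt;!-- Method 1, @pos alone: tokens 2, 4, 5, 6, and the last three of the div -->
&lt;tok src="src-a" ref="1.2" pos="2, 4-6, last-2 - last"/>

&lt;!-- Method 2, @val alone: the first occurrence of the token "bird" -->
&lt;tok src="src-a" ref="1.2" val="bird"/>

&lt;!-- Method 3, @val and @pos together: the second and fourth occurrences of "bird" -->
&lt;tok src="src-a" ref="1.2" val="bird" pos="2, 4"/></programlisting>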
               <para>During validation, if <code><link linkend="attribute-pos">@pos</link></code> or
                        <code><link linkend="attribute-val">@val</link></code> is missing, it is
                  supplied with its default value, <code>1</code> or <code>.+</code>
                  respectively. That is, <code><link linkend="attribute-pos">@pos</link></code> by
                  default points to the first instance and <code><link linkend="attribute-val"
                        >@val</link></code> by default points to any string.</para>
               <para><code><link linkend="attribute-pos">@pos</link></code> and <code><link
                        linkend="attribute-val">@val</link></code> must be used carefully. For
                  example, the attribute combination <code>val="bird" pos="last-5"</code> will
                  produce an error if the word token <code>bird</code> does not occur at least six
                  times.</para>
               <para>It is advisable to use <code><link linkend="attribute-val">@val</link></code>,
                  and not merely <code><link linkend="attribute-pos">@pos</link></code>. If the
                  editor makes corrections to your source texts, references are more likely to
                  become corrupt, and less likely to be traceable, if there is no <code><link
                        linkend="attribute-val">@val</link></code>.</para>
            </section>
         </section>
         <section xml:id="tan-a-div">
            <title>Division-Based Annotations and Alignments (<code><link
                     linkend="element-TAN-A-div">&lt;TAN-A-div></link></code>)</title>
            <para>TAN-A-div is the format for macroscopic, division-based alignment, and is
               dedicated to aligning any number of versions of any number of works on the basis of
                     <code><link linkend="element-div">&lt;div></link></code>s, or even smaller, ad
               hoc segments in the sources invoked. </para>
            <para>A TAN-A-div file allows you to make general claims about a work, or a particular
               version of a work. </para>
            <section>
               <title>Root Element and Header</title>
               <para>The root element of a TAN division-based alignment file is <code><link
                        linkend="element-TAN-A-div">&lt;TAN-A-div></link></code>.</para>
               <para>TAN-A-div's <code><link linkend="element-head">&lt;head></link></code> has one
                  or more <code><link linkend="element-source">&lt;source></link></code>s.</para>
               <para>Any concepts that will be mentioned in the <code><link linkend="element-claim"
                        >&lt;claim&gt;</link></code>s need to be supplied in <code><link
                        linkend="element-definitions">&lt;definitions></link></code>.</para>
            </section>
            <section>
               <title>Data (<code><link linkend="element-body">&lt;body></link></code>)</title>
               <para>The <code><link linkend="element-body">&lt;body></link></code> of a TAN-A-div
                  file takes, in addition to the customary optional attributes (see <code><link
                        linkend="attribute-in-progress">@in-progress</link></code> and <xref
                     linkend="edit_stamp"/>), <code><link linkend="attribute-claimant"
                        >@claimant</link></code>, <code><link linkend="attribute-object"
                        >@object</link></code>, <code><link linkend="attribute-subject"
                        >@subject</link></code>, or <code><link linkend="attribute-verb"
                        >@verb</link></code>, stipulating the default values for any claims to
                  come.</para>
               <para>The rest of the body consists of <code><link linkend="element-claim"
                        >&lt;claim&gt;</link></code>s whose model is inspired by the Resource
                  Description Framework (RDF; see <xref linkend="rdf_and_lod"/>). RDF depends upon a
                  simple data model, where each datum consists of three items termed a subject, a
                  predicate, and an object. The first and third are thought of as nodes, and the
                  second as a connector between the nodes.<note>
                     <para>A connector, our preferred term, is frequently elsewhere called an edge,
                        but that term elicits a metaphor that is confusing and misleading. A
                        cylinder, for example, has two edges, but they don't connect anything.
                        Furthermore, "edge" implies that what's really of interest is the void
                        beyond the surface of a three-dimensional object.</para>
                  </note></para>
               <para>TAN was designed to serve scholars, who normally find RDF-like sentences
                  unsatisfactory. They lack context or qualifiers. It is unclear who made them, or
                  when, or if they were uttered with any doubt or nuance. Sometimes we wish to claim
                  a bare negation, e.g., "Aristotle was not the author of <emphasis>De
                     mundo</emphasis>"—an assertion not possible to express in RDF.</para>
               <para>A TAN <code><link linkend="element-claim">&lt;claim&gt;</link></code> adds some
                  of this nuance and complexity to RDF. Every claim must be assigned to a claimant
                  (and claims can be recursive, e.g., X claims that Y claims that Z claims
                  that...). The RDF terminology subject + predicate + object is adjusted by TAN RDF
                  to subject + verb + object. A <code><link linkend="element-claim"
                        >&lt;claim&gt;</link></code> may be restricted to a particular date or
                  place, or it may be tempered by certainty and modified with adverbs. If the object
                  is data, the data type can be restricted to a specific type and lexical form.
                  Despite being somewhat more complex than RDF, TAN-c syntax is more human readable. </para>
               <para><code><link linkend="element-claim">&lt;claim></link></code> may be used for a
                  variety of purposes, for example:</para>
               <para>
                  <itemizedlist>
                     <listitem>
                        <para>to list quotations and allusions;</para>
                     </listitem>
                     <listitem>
                        <para>to indicate which passages deal with what general subjects and
                           topics;</para>
                     </listitem>
                     <listitem>
                        <para>to connect commentary or notes from one source with another;</para>
                     </listitem>
                     <listitem>
                        <para>to indicate where other scripta have different readings (apparatus
                           criticus).</para>
                     </listitem>
                  </itemizedlist>
               </para>
               <para>These assertions are made in <code><link linkend="element-claim"
                        >&lt;claim></link></code>s whose <code><link linkend="element-subject"
                        >&lt;subject></link></code> or <code><link linkend="element-object"
                        >&lt;object></link></code> points to passages of text. Any textual
                        <code><link linkend="element-subject">&lt;subject></link></code> or
                        <code><link linkend="element-object">&lt;object></link></code> may take
                        <code><link linkend="attribute-work">@work</link></code> or <code><link
                        linkend="attribute-src">@src</link></code>. The former takes a single
                  reference to a <code><link linkend="element-source">&lt;source&gt;</link></code>,
                  but adopts the reference as a proxy to make a claim applicable to all versions of
                  the same work. <code><link linkend="attribute-src">@src</link></code> restricts
                  the claim to specific versions, not to the work as a whole.</para>
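               <para>A minimal sketch of such a claim follows. The ids <code>ed1</code>,
                     <code>quotes</code>, <code>commentary</code>, and <code>psalms-lxx</code>,
                  and the refs, are hypothetical and are assumed to be declared in the
                     <code>&lt;head></code>. The placement of <code>@verb</code> on
                     <code>&lt;claim></code> is likewise an assumption of this illustration and
                  should be checked against the schema.</para>
               <programlisting language="xml">&lt;!-- @claimant on body: the default claimant for every claim that follows -->
&lt;body claimant="ed1">
   &lt;claim verb="quotes">
      &lt;!-- @src: the claim concerns only this version of the commentary -->
      &lt;subject src="commentary" ref="3.1"/>
      &lt;!-- @work: the source id stands as proxy for all versions of the same work -->
      &lt;object work="psalms-lxx" ref="22.1"/>
   &lt;/claim>
&lt;/body></programlisting>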
            </section>
         </section>
         <section xml:id="tan-a-tok">
            <title>Token-Based Annotations and Alignments (<code><link linkend="element-TAN-A-tok"
                     >&lt;TAN-A-tok></link></code>)</title>
            <para>TAN-A-tok files provide a microscopic view of how two sources relate to each
               other. The format is intended to allow you to specify exactly where, how, and why two
               transcriptions align, and to do so on the most granular level possible. TAN-A-tok
               files also allow you to express levels of confidence or alternative opinions.</para>
            <para>Creators and editors of TAN-A-tok files should be able to read the languages of
               their sources and to explain as precisely as possible the relationship between the
               two sources. They should be prepared to think about and specify types of textual
               reuse. TAN-A-tok files tend to be more demanding to create and edit than TAN-A-div
               files are because they reflect work that is more detailed, and therefore more
               time-consuming, than simple en masse alignment of sources.</para>
            <para>Because of the detailed nature of the inquiry, token alignment is restricted to
               two texts, referred to jointly as a <emphasis role="italic">bitext</emphasis>. Each
               half of the bitext must be a TAN-T(EI) file. It is assumed that those two sources
               share some special relationship, direct or indirect, and relate through one or more
               types of textual reuse: translation, paraphrase, commentary, and so forth. Some of
               these bitexts, such as literal translations, may line up quite nicely word for word.
               Others, such as paraphrases, may line up sporadically, vaguely, ambiguously, or, in
               places, not at all. So annotating a bitext is oftentimes not easy, and requires you
               to think hard about assumptions you have made in two key areas: the relationship that
               holds between two scripta and the types of reuse that were involved in turning one
               version into the other (or a common ancestor into both).</para>
            <para><emphasis role="bold">Relationship of sources' scripta</emphasis>. What is the
               physical relationship or history that connects the two sources' scripta? Is one a
               direct descendant (copy) of the other? If not, what common ancestor do they share?
               Here you consider the material aspect of the bitext, because you are trying to answer
               how object A's text relates to object B's.</para>
            <para><emphasis role="bold">Types of reuse</emphasis>. What categories of text reuse do
               you consider operative? Such a declaration tells users of your data what paradigm you
               bring to your analysis. You may wish to keep your categories nondescript and somewhat
               vague, using loosely defined concepts such as <emphasis>translation</emphasis>,
                  <emphasis>paraphrase</emphasis>, <emphasis>quotation</emphasis>, and so forth
               without much specificity. On the other hand, you may subscribe to a detailed view of
               text reuse. Perhaps you have adopted field-specific categories such as
                  <emphasis>obligatory explicitation</emphasis>, <emphasis>optional
                  explicitation</emphasis>, <emphasis>pragmatic explicitation</emphasis>, or
                  <emphasis>translation-inherent explicitation</emphasis>. You may also wish to
               declare secondary types of reuse that may have intervened, such as <emphasis
                  role="italic">scribal omission</emphasis> or <emphasis role="italic"
                  >dittography</emphasis>. You must declare at least one type of reuse, whether
               one you define yourself or one of those built into the TAN format. See <xref
                  xlink:href="#keywords-reuse-type"/>.</para>
            <section>
               <title>Root Element and Header</title>
               <para>The root element of a token-based alignment file is <code><link
                        linkend="element-TAN-A-tok">&lt;TAN-A-tok></link></code>.</para>
               <para>The TAN-A-tok header builds upon the core and class 2 headers (see <xref
                     linkend="metadata_head"/> and <xref linkend="class_2_metadata"/>).</para>
               <para>TAN-A-tok files take exactly two <code><link linkend="element-source"
                        >&lt;source></link></code>s. The sequence is arbitrary. Each <code><link
                        linkend="element-source">&lt;source></link></code> must take an <code><link
                        linkend="attribute-xmlid">@xml:id</link></code>.</para>
               <para><code><link linkend="element-definitions">&lt;definitions></link></code> takes,
                  in addition to all the elements allowed in class 2 files (see <xref
                     linkend="class_2_metadata"/>), two elements unique to TAN-A-tok: <code><link
                        linkend="element-bitext-relation">&lt;bitext-relation></link></code> and
                        <code><link linkend="element-reuse-type">&lt;reuse-type></link></code>. The
                  former describes the genealogical relationship between each source's scripta. The
                  second attends to the qualitative aspect of the bitext relationship.</para>
            </section>
            <section>
               <title>Data (<code><link linkend="element-body">&lt;body></link></code>)</title>
               <para>The <code><link linkend="element-body">&lt;body></link></code> of a TAN-A-tok
                  file takes, in addition to the customary optional attributes (see <code><link
                        linkend="attribute-in-progress">@in-progress</link></code> and <xref
                     linkend="edit_stamp"/>), required <code><link
                        linkend="attribute-bitext-relation">@bitext-relation</link></code> and
                        <code><link linkend="attribute-reuse-type">@reuse-type</link></code>, which
                  take one or more id references from <code><link linkend="element-bitext-relation"
                        >&lt;bitext-relation></link></code> and <code><link
                        linkend="element-reuse-type">&lt;reuse-type></link></code>, indicating the
                  default values that govern the alignment. </para>
               <para><code><link linkend="element-body">&lt;body></link></code> has only one type of
                  child: one or more <code><link linkend="element-align">&lt;align></link></code>s,
                  each of which collects sets of <code><link linkend="element-tok"
                     >&lt;tok></link></code>s from one or both sources, known collectively as a
                     <emphasis role="italic">token cluster</emphasis>. Clusters may overlap, to
                  handle translations in which words fall in one-to-one, one-to-many, many-to-one,
                  and many-to-many relationships. The independence of token clusters allows you to
                  register differences of opinion about the same set of tokens. An <code><link
                        linkend="element-align">&lt;align></link></code> may take an <code><link
                        linkend="attribute-xmlid">@xml:id</link></code>, to facilitate external
                  discussions about an assertion.</para>
               <para>Nothing should be inferred from silence in a TAN-A-tok file. Unmentioned tokens
                  in either source do not represent gaps in a translation. All that can be inferred
                  is that the creators and editors of the TAN-A-tok file have said nothing about the
                  tokens. </para>
               <para>If you wish to declare that one or more words in one source were left out of a
                  translation or inserted into one—that is, words in one source have no match in the
                  other—you must do so through a <emphasis role="italic">half-null
                     alignment</emphasis>, i.e., a token cluster that has tokens from only one
                  source. A half-null alignment implies insertions or omissions.</para>
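               <para>The difference between an ordinary token cluster and a half-null alignment
                  might be sketched as follows. The source ids <code>grc</code> and
                     <code>eng</code>, the refs, and the token values are all hypothetical.</para>
               <programlisting language="xml">&lt;!-- One-to-many cluster: one token in the first source, two in the second -->
&lt;align xml:id="a1">
   &lt;tok src="grc" ref="1.1" pos="1"/>
   &lt;tok src="eng" ref="1.1" pos="1-2"/>
&lt;/align>

&lt;!-- Half-null alignment: tokens from one source only, asserting that the
     word has no match in the other source -->
&lt;align xml:id="a2">
   &lt;tok src="eng" ref="1.1" val="indeed"/>
&lt;/align></programlisting>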
               <para>A fully aligned bitext may result in a TAN-A-tok file with a very long
                        <code><link linkend="element-body">&lt;body></link></code> (in contrast to
                  the typical TAN-A-div file). That does not mean, however, that everything in a
                  source <emphasis>must</emphasis> be encoded or described. In writing and editing a
                  TAN-A-tok file you do not commit yourself to saying everything possible about the
                  bitext. You might choose to encode only a few token clusters.</para>
               <para>If there are multiple IDs in <code><link linkend="attribute-reuse-type"
                        >@reuse-type</link></code> or <code><link
                        linkend="attribute-bitext-relation">@bitext-relation</link></code>, the
                  intersection, not the union, of those values is to be understood. For example,
                     <code>reuse-type="trans para"</code> would indicate that the token cluster
                  results from a combination of translation and paraphrase. If you wish to claim
                  that the token cluster might be a translation or it might be a paraphrase, then
                  you should create two separate <code><link linkend="element-align"
                        >&lt;align></link></code>s, and add <code><link linkend="attribute-cert"
                        >@cert</link></code> to each.</para>
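               <para>The contrast between a combined value and two alternatives can be sketched as
                  follows. The ids <code>trans</code> and <code>para</code> are taken from the
                  example above; the source ids, refs, and <code>@cert</code> values are
                  hypothetical, and the assumption that <code>@cert</code> takes a decimal
                  between 0 and 1 should be checked against the schema.</para>
               <programlisting language="xml">&lt;!-- Combination: the cluster is held to be both translation and paraphrase -->
&lt;align reuse-type="trans para">
   &lt;tok src="grc" ref="1.2" pos="3"/>
   &lt;tok src="eng" ref="1.2" pos="4-5"/>
&lt;/align>

&lt;!-- Alternatives: the same cluster stated twice, once per possibility,
     each tempered by its own level of certainty -->
&lt;align reuse-type="trans" cert="0.7">
   &lt;tok src="grc" ref="1.3" pos="1"/>
   &lt;tok src="eng" ref="1.3" pos="2"/>
&lt;/align>
&lt;align reuse-type="para" cert="0.3">
   &lt;tok src="grc" ref="1.3" pos="1"/>
   &lt;tok src="eng" ref="1.3" pos="2"/>
&lt;/align></programlisting>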
            </section>
         </section>
         <section xml:id="tan-a-lm">
            <title>Lexico-Morphology</title>
            <para>TAN-A-lm files are used to associate words or word fragments with lexemes and
               morphological categories. </para>
            <para>These files have two kinds of dependencies: a class 1 source (optional) and the
               grammatical rules defined in one or more TAN-mor files. Therefore this section should
               be read in close conjunction with its companion, <xref linkend="TAN-mor"/>.</para>
            <section>
               <title>Principles and Assumptions</title>
               <para>Editors of TAN-A-lm files should understand the vocabulary and grammar of the
                  chosen languages. They should have a good sense of the rules established by the
                  lexical and grammatical authorities adopted. They should be familiar with the
                  conventions and assumptions of the TAN-mor files they have adopted.</para>
               <para>Although you must assume the point of view of a particular grammar and lexicon,
                  you need not hold to a single one. In addition, you may bring to lexical analysis
                  your own expertise and supply lexical headwords unattested in printed
                  authorities.</para>
               <para>Although TAN-A-lm files are simple, they can be laborious to write and edit, more
                  so than other types of TAN files. They can also be hard to read if the underlying
                  TAN-mor files use cryptic codes. It is customary for an editor of a TAN-A-lm file to
                  use tools to help create and edit the data.</para>
            </section>
            <section>
               <title>Root Element and Header</title>
               <para>The root element of a lexico-morphological file is TAN-A-lm.</para>
               <para>TAN-A-lm files are either source-specific or language-specific. In the case of
                  the former, <code><link linkend="element-source">&lt;source></link></code> points
                  to the one and only TAN-T(EI) file that is the object of analysis. In the case of
                  the latter, <code><link linkend="element-for-lang">&lt;for-lang></link></code> is
                  used to indicate the languages that are covered.</para>
               <para><code><link linkend="element-definitions">&lt;definitions></link></code>
                  takes the elements common to class 2 files (see <xref linkend="class_2_metadata"
                  />). It also takes two elements unique to TAN-A-lm: <code><link
                        linkend="element-lexicon">&lt;lexicon></link></code> (optional) and
                        <code><link linkend="element-morphology">&lt;morphology></link></code>
                  (mandatory). Any number of lexica and morphologies may be declared; the order is
                  inconsequential. </para>
               <para>There is, at present, no TAN format for lexica and dictionaries, although this
                  may change in the future. So even if a digital form of a dictionary is identified
                  through the <xref linkend="digital_entity_metadata"/>, validation tests do not
                  take this element into account. </para>
               <para>Because you or other TAN-A-lm editors are likely to be authorities in your own
                  right, <code><link linkend="element-person">&lt;person&gt;</link></code> can be
                  treated as if a <code><link linkend="element-lexicon">&lt;lexicon></link></code>,
                  and be referred to by <code><link linkend="attribute-lexicon"
                     >@lexicon</link></code> in the <code><link linkend="element-body"
                        >&lt;body&gt;</link></code>.</para>
            </section>
            <section>
               <title>Data (<code><link linkend="element-body">&lt;body></link></code>)</title>
               <para>The <code><link linkend="element-body">&lt;body></link></code> of a TAN-A-lm
                  file takes, in addition to the customary optional attributes found in other TAN
                  files (see <code><link linkend="attribute-in-progress">@in-progress</link></code>
                  and <xref linkend="edit_stamp"/>), <code><link linkend="attribute-lexicon"
                        >@lexicon</link></code> and <code><link linkend="attribute-morphology"
                        >@morphology</link></code>, to specify the default lexicon and
                  grammar.</para>
               <para><code><link linkend="element-body">&lt;body></link></code> has only one type of
                  child: one or more <code><link linkend="element-ana">&lt;ana></link></code>s
                  (short for analysis), each of which matches one or more tokens (<code><link
                        linkend="element-tok">&lt;tok&gt;</link></code>) to one or more lexemes or
                  morphological assertions (<code><link linkend="element-lm"
                     >&lt;lm&gt;</link></code>, which takes <code><link linkend="element-l"
                        >&lt;l&gt;</link></code>s and <code><link linkend="element-m"
                        >&lt;m&gt;</link></code>s). </para>
               <para>If due to tokenization a linguistic token must occupy more than one <code><link
                        linkend="element-tok">&lt;tok></link></code>, you may use <code><link
                        linkend="element-group">&lt;group></link></code> to group <code><link
                        linkend="element-tok">&lt;tok></link></code>s together. </para>
               <para>Elements within an <code><link linkend="element-ana">&lt;ana&gt;</link></code>
                  are distributed. That is, every combination of <code><link linkend="element-l"
                        >&lt;l&gt;</link></code> and <code><link linkend="element-m"
                        >&lt;m&gt;</link></code> (governed by <code><link linkend="element-lm"
                        >&lt;lm&gt;</link></code>) is asserted to be true for every <code><link
                        linkend="element-tok">&lt;tok></link></code>. </para>
               <para>Many TAN-A-lm files will be populated by a stylesheet or other algorithm that
                  automatically lists all possible morphological values of each token. It is advised
                  that such automatically calculated results always include <code><link
                        linkend="attribute-cert">@cert</link></code> with weighted values.</para>
            </section>
         </section>
      </chapter>
      <chapter xml:id="class_3">
         <title>Class-3 TAN Files, Varia</title>
         <para>This chapter provides general background to the elements and attributes that are
            unique to all class 3 TAN files, which are devoted to formats that do not fit the other
            two classes. For detailed discussion of specific elements and attributes, see <xref
               linkend="elements-attributes-and-patterns"/>.</para>
         <section xml:id="tan-key">
            <title>Keyword Vocabulary (<code>TAN-key</code>)</title>
            <para>All too often, a project has a set of vocabulary it draws from time and again. To
               repeat the <xref xlink:href="#pattern-iri_and_name"/> can be both tedious and
               treacherous. If a project with hundreds of TAN files decides to change or augment
               its vocabulary, it could take a long time to find and make all the changes.</para>
            <para>The TAN-key format is intended to allow a project to define the IRI + name
               patterns for things that it regularly names, to be applied to any element that takes
                  <link linkend="attribute-which"><code>@which</code></link>. For example, it is a
               suitable way to gather the IRI + name patterns for the people who worked on a
               project, or to define special kinds of div types. </para>
            <para>TAN-key files are a core part of the TAN schema, defining commonly used concepts
               in <code><link linkend="element-token-definition"
               >&lt;token-definition></link></code>, <link linkend="element-div-type"
                     ><code>&lt;div-type></code></link>s, and so forth. For a complete list of
               predefined TAN keywords, see <xref linkend="keywords-master-list"/>.</para>
            <para>For more details on how this format relates to other TAN formats, see <xref
                  linkend="inclusions-and-keys"/>.</para>
            <section>
               <title>Root Element and Head</title>
               <para>A TAN-key file has <code><link linkend="element-TAN-key"
                     >&lt;TAN-key&gt;</link></code> as the root element.</para>
               <para>The <code><link linkend="element-definitions"
                     >&lt;definitions&gt;</link></code> of a TAN-key file will be empty, or have
                        <code><link linkend="element-group-type">&lt;group-type&gt;</link></code>s. </para>
            </section>
            <section>
               <title>Data (<code><link linkend="element-body">&lt;body></link></code>)</title>
               <para>The <code><link linkend="element-body">&lt;body></link></code> of a TAN-key
                  file consists simply of <code><link linkend="element-item"
                     >&lt;item&gt;</link></code>s, perhaps gathered into groups via <code><link
                        linkend="element-group">&lt;group&gt;</link></code> or <code><link
                        linkend="attribute-group">@group</link></code>. These groups have, at
                  present, no effect upon other TAN files that import them. They have been useful,
                  however, in more advanced uses of the format, particularly in the case of the
                  standard TAN-key file for <code><link linkend="element-div-type"
                        >&lt;div-type&gt;</link></code> (<link
                     xlink:href="../TAN-key/div-types.TAN-key.xml"/>), where common types of
                  divisions have been given a rudimentary typology suitable for transformations into
                  other formats.</para>
               <para>Most frequently, a TAN-key file will contain items that have the IRI + name
                  pattern. The only exception is when it contains <code><link
                        linkend="element-token-definition"
                  >&lt;token-definition&gt;</link></code>s.</para>
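               <para>An item that follows the IRI + name pattern might be sketched as follows. The
                  IRI and the name are hypothetical, and the element names
                     <code>&lt;IRI></code> and <code>&lt;name></code> are assumed here from the
                  pattern's name; see <xref xlink:href="#pattern-iri_and_name"/> for the
                  authoritative definition.</para>
               <programlisting language="xml">&lt;body>
   &lt;item>
      &lt;IRI>tag:example.com,2018:person:jane-doe&lt;/IRI>
      &lt;name>Jane Doe&lt;/name>
   &lt;/item>
&lt;/body>

&lt;!-- In another TAN file that imports this key, the name can then stand in
     for the full IRI + name pattern (the xml:id is also hypothetical): -->
&lt;person xml:id="jdoe" which="Jane Doe"/></programlisting>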
            </section>
         </section>
         <section xml:id="TAN-mor">
            <title>Morphological Concepts and Patterns (<code>TAN-mor</code>)</title>
            <para>TAN-mor files are used to describe the grammatical morphological features of a
               given language, to assign codes to those features, and to define rules governing the
               application of those codes. The format allows specificity, flexibility, and
               responsiveness. Assertions in the format may be doubted, rules may be expressed as
               contingent upon other conditions, and warnings and error messages may be sent to
               users who have used a pattern incorrectly, or not in accordance with best
               practices.</para>
            <para>The TAN-mor format is a kind of Schematron for the grammar of human languages. You
               specify the categories and codes for a given language, then you may create tests to
               define invalid uses of those codes. Those tests are attached to reports and
               assertions allowing editors of TAN-A-lm files to see not only if the rules have been
               violated, but why, and exactly where.</para>
            <para>This section should be read in close conjunction with <xref linkend="tan-a-lm"
               />.</para>
            <section>
               <title>Principles and Assumptions</title>
               <para>Certain assumptions and recommendations are made regarding morphology files,
                  complementing the more general ones; see <xref linkend="design_principles"
                  />.</para>
               <para>TAN-mor files are restricted exclusively to describing the categories and rules
                  for the grammar of a natural language. Editors of these files should be well
                  versed in the grammar of the languages they are describing.</para>
               <para>The TAN-mor format has been designed with the assumption that patterns of word
                  inflection and formation can be categorized, classified, named, and described. It
                  has also been assumed that scholars may reasonably differ, perhaps radically, on
                  how categories are defined and applied. TAN-mor is meant to allow those
                  differences to be declared. It is up to other users to decide whether or not to
                  adopt them.</para>
               <para>The TAN-mor format has also been designed to cater to two different approaches
                  to morphological codes: categorized or uncategorized. </para>
               <para>Codes that are categorized are interpreted according not only to code but to
                  position. For example, the categorized codes adopted by Perseus for morphological
                  analysis of Greek, Latin, and other highly inflected languages stipulate ten
                  categories, with the first two being the major and minor parts of speech, and the
                  subsequent categories devoted to person, number, tense, and so forth. Each word
                  that is analyzed must have a value, even if null, and the position of the code is
                  important.</para>
               <para>Uncategorized codes simply give each grammatical feature a unique code, to
                  be applied in any permitted sequence and combination. This approach is viable for
                  any language (including highly inflected ones such as Greek or Latin), but it is
                  most often found in tagging sets for languages that are not highly inflected,
                  e.g., the Brown and Penn sets for English.</para>
            </section>
            <section>
               <title>Root Element and Header</title>
               <para>The root element of a morphological rule file is <code><link
                        linkend="element-TAN-mor">&lt;TAN-mor></link></code>.</para>
               <para>Zero or more <code><link linkend="element-source">&lt;source></link></code>
                  elements describe the grammars or related works that account for the rules
                  declared in the TAN file. If the rules are not based upon any published work, then
                        <code><link linkend="element-source">&lt;source></link></code> may be
                  omitted. Any TAN-mor file without a source will be assumed to be based upon the
                  personal knowledge of the <code><link linkend="element-person"
                     >&lt;person></link></code>s who edited the file.</para>
               <para><code><link linkend="element-definitions">&lt;definitions></link></code> is
                  populated with the grammatical <link linkend="element-feature"
                        ><code>&lt;feature></code></link>s that are considered operative. If a
                  particular discipline customarily uses codes that are not allowed in <code><link
                        linkend="attribute-xmlid">@xml:id</link></code>, you may wish to create an
                        <code><link linkend="element-alias">&lt;alias&gt;</link></code>. </para>
            </section>
            <section>
               <title>Data (<code><link linkend="element-body">&lt;body></link></code>)</title>
               <para>The <code><link linkend="element-body">&lt;body></link></code> of a TAN-mor
                  file takes the customary optional attributes found in other TAN files (see
                        <code><link linkend="attribute-in-progress">@in-progress</link></code> and
                     <xref linkend="edit_stamp"/>). </para>
               <para>The children of <code><link linkend="element-body">&lt;body></link></code>
                  begin with one or more <code><link linkend="element-for-lang"
                     >&lt;for-lang></link></code>s, followed by any number of <code><link
                        linkend="element-where">&lt;where&gt;</link></code>s (containing <code><link
                        linkend="element-assert">&lt;assert></link></code>s or <code><link
                        linkend="element-report">&lt;report></link></code>s) or <code><link
                        linkend="element-category">&lt;category></link></code>s (if relying upon
                  structured codes). </para>
               <para><code><link linkend="element-category">&lt;category></link></code>, used for
                  structured codes, sorts <link linkend="element-feature"
                     ><code>&lt;feature></code></link>s into groups, assigning them <code><link
                        linkend="attribute-code">@code</link></code> values that are unique within
                  the <code><link linkend="element-category">&lt;category></link></code>.</para>
               <para><code><link linkend="element-assert">&lt;assert></link></code>s and <code><link
                        linkend="element-report">&lt;report></link></code>s are used to declare
                  rules that must be followed, or must never be followed, by any dependent TAN-A-lm
                  file. </para>
               <para>An <code><link linkend="element-assert">&lt;assert></link></code> or
                        <code><link linkend="element-report">&lt;report></link></code> will be
                  checked only if the conditions in the enclosing <code><link
                        linkend="element-where">&lt;where&gt;</link></code> are met in the context
                  of a given <code><link linkend="element-m">&lt;m></link></code> in a dependent
                  TAN-A-lm file:</para>
               <para>
                  <itemizedlist>
                     <listitem>
                        <para><code><link linkend="attribute-m-matches">@m-matches</link></code>:
                                 <code><link linkend="element-m">&lt;m></link></code> matches the
                           pattern (regular expression). </para>
                     </listitem>
                     <listitem>
                        <para><code><link linkend="attribute-tok-matches"
                           >@tok-matches</link></code>: one of the values of <code><link
                                 linkend="element-tok">&lt;tok></link></code> in the given
                                 <code><link linkend="element-ana">&lt;ana&gt;</link></code> matches
                           the pattern (regular expression).</para>
                     </listitem>
                     <listitem>
                        <para><code><link linkend="attribute-m-has-features"
                              >@m-has-features</link></code>: <code><link linkend="element-m"
                                 >&lt;m></link></code> has the specified features.</para>
                     </listitem>
                     <listitem>
                        <para><code><link linkend="attribute-m-has-how-many-features"
                                 >@m-has-how-many-features</link></code>: <code><link
                                 linkend="element-m">&lt;m></link></code> has the given number of
                           features.</para>
                     </listitem>
                  </itemizedlist>
               </para>
               <para>An <code><link linkend="element-assert">&lt;assert></link></code> also has one
                  or more of the truth conditions above. If the test proves false in a given
                        <code><link linkend="element-m">&lt;m></link></code> then the <code><link
                        linkend="element-m">&lt;m></link></code> will be marked as erroneous and the
                  message included by the <code><link linkend="element-assert"
                     >&lt;assert></link></code> should be returned.</para>
               <para><code><link linkend="element-report">&lt;report></link></code> has the same
                  effect, but the role of the test is the opposite: the error and message will be
                  returned only if the test proves true.</para>
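               <para>The fragment below is a minimal, hypothetical sketch of how the elements
                  described above might fit together in a <code>&lt;body></code>. The language
                  code, the feature names (<code>noun</code>, <code>verb</code>,
                  <code>present</code>), the codes, the <code>@type</code> attribute on the
                  categorized features, the regular expressions, and the messages are all
                  assumptions made for illustration. The fragment has not been validated against
                  the schema, so consult the element and attribute definitions before adapting
                  it.</para>
               <programlisting language="xml"><![CDATA[<body>
   <!-- Assumed language code -->
   <for-lang>grc</for-lang>

   <!-- Structured codes: features are grouped, and each @code is unique
        within its category. Feature names and codes are hypothetical. -->
   <category>
      <feature type="noun" code="n"/>
      <feature type="verb" code="v"/>
   </category>

   <!-- The enclosed rule is checked only for an <m> that has the feature "noun" -->
   <where m-has-features="noun">
      <!-- report: the error is returned if the test proves TRUE -->
      <report m-has-features="present">A noun should not be assigned a tense.</report>
   </where>

   <!-- The enclosed rule is checked only for an <m> whose code begins with "v" -->
   <where m-matches="^v">
      <!-- assert: the error is returned if the test proves FALSE -->
      <assert m-matches="^v.{9}$">A verb code should fill all ten positions.</assert>
   </where>
</body>]]></programlisting>
               <para>In this sketch the <code>&lt;where></code> narrows the context, and the
                  enclosed <code>&lt;assert></code> or <code>&lt;report></code> supplies the test
                  and the message shown to the editor of the dependent TAN-A-lm file.</para>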
            </section>
         </section>
         <section xml:id="catalog-files">
            <title>TAN Catalog Files (<code>collection</code>)</title>
            <para>TAN catalog files are intended to facilitate the discovery of relevant TAN files
               and to support the XPath function <code>collection()</code>. They catalog or index any
               or all TAN files within a local directory and perhaps its subdirectories. </para>
            <para>These catalog files must always be named <code>catalog.tan.xml</code>. They depart
               from all other TAN files in their structure. They have no namespace. They have
               neither body nor head. Rather, they are patterned after the catalog.xml description
               provided by Saxonica (<link
                  xlink:href="https://www.saxonica.com/documentation9.5/sourcedocs/collections.html"
               />).</para>
            <para>Passing any XML file to the stylesheet <code>/do things/populate/populate TAN
                  catalog file.xsl</code> will automatically generate one of these files.</para>
            <para>The root element of a catalog file is <code><link linkend="element-collection"
                     >&lt;collection></link></code>, with children <code><link linkend="element-doc"
                     >&lt;doc></link></code>s that hold simple metadata about the TAN files that are
               in a directory and its subdirectories. Only TAN files may be registered in a
                     <code><link linkend="element-doc">&lt;doc></link></code>.</para>
         </section>
      </chapter>
      <xi:include href="inclusions/elements-attributes-and-patterns.xml"/>
      <xi:include href="inclusions/keywords.xml"/>
   </part>
   <part xml:id="working_with_tan">
      <title>Working with the Text Alignment Network</title>
      <chapter>
         <title>Best Practices in Working with TAN Files</title>
         <para>In this chapter we discuss ways to manage, create, edit, and share TAN files. The
            material discussed here is non-normative. That is, these are suggestions based upon the
            experience of TAN users. </para>
         <section>
            <title>Local Setup</title>
            <para>TAN files may be set up in any kind of structure one wishes, but because those
               files are meant to be shared and interlinked, it is beneficial to use similar local
               conventions, so that relative URLs remain intact from one person's system to another.
               It is especially important that collections be able to "talk" to each other via local
               URLs in <code><link linkend="attribute-href">@href</link></code>, so it is a good
               idea to name collection subdirectories as predictably as possible.</para>
            <para>Below is one way to organize the subdirectories of a typical setup for local TAN
               work:</para>
            <para>
               <itemizedlist>
                  <listitem>
                     <para><code>library-</code>[abbreviated name of creator 1]</para>
                     <para>
                        <itemizedlist>
                           <listitem>
                              <para>[abbreviated name of collection 1]—TAN-T(EI) files here</para>
                              <para>
                                 <itemizedlist>
                                    <listitem>
                                       <para><code>TAN-A-div</code> (for TAN-A-div files)</para>
                                    </listitem>
                                    <listitem>
                                       <para><code>TAN-A-tok</code> (for TAN-A-tok files)</para>
                                    </listitem>
                                    <listitem>
                                       <para>[etc.]</para>
                                    </listitem>
                                 </itemizedlist>
                              </para>
                           </listitem>
                           <listitem>
                              <para>[abbreviated name of collection 2]</para>
                           </listitem>
                           <listitem>
                              <para>[etc.]</para>
                           </listitem>
                        </itemizedlist>
                     </para>
                  </listitem>
                  <listitem>
                     <para><code>library-</code>[abbreviated name of creator 2]</para>
                  </listitem>
                  <listitem>
                     <para><code>output</code>—saved results from transformations, tests</para>
                  </listitem>
                  <listitem>
                     <para><code>pre-TAN</code>—third-party files to be used to populate TAN files,
                        or to be converted into them</para>
                  </listitem>
                  <listitem>
                     <para><code>TAN-2018</code> —the core TAN files, downloaded from the website
                        or the Git repository</para>
                  </listitem>
                  <listitem>
                     <para><code>stylesheets</code>—stylesheets you have created</para>
                  </listitem>
                  <listitem>
                     <para><code>tools</code>—third-party tools</para>
                  </listitem>
               </itemizedlist>
            </para>
            <para>Under this approach, you create a library subdirectory for each provider or
               creator (including one for yourself). For any TAN corpus you publish, you should
               advise what name should be used for the library subdirectory. Likewise, for any TAN
               corpus you download, you should use the library name suggested by the provider. </para>
            <para>Any time you create or download a collection of TAN files, you save them in a
               subdirectory within the creator's library subdirectory. Once again, you should advise
               on the name to be used, and use the names that are advised. </para>
            <para>If you use Git, it is advisable to make each collection its own Git repository. If
               you use GitHub, it is advisable to use your username for the library
               subdirectory.</para>
            <para>This two-step approach to subdirectories anticipates cases where different people
               will want to encode the same body of texts, particularly heavily quoted collections
               that will commonly be given very brief, descriptive names, e.g., <code>bible</code>,
                  <code>quran</code>.</para>
            <para>When you name class 1 files (the filename, not the IRI name; see <xref
                  xlink:href="#iri_name"/>), it is a good idea to start with an acronym or
               abbreviation for the work, followed by the language code, the editor's last name, and the
               date when the source scriptum was created or published. If a work lends itself to
               multiple reference schemes, you may need to include that in the filename. Some examples:<itemizedlist>
                  <listitem>
                     <para><code>ar.cat.grc.1949.minio-paluello-sem.xml</code> (Aristotle's
                        Categories, in Greek, 1949, edition by Minio Paluello, following a reference
                        system based on semantic units [paragraphs, sentences, independent
                        clauses]).</para>
                  </listitem>
                  <listitem>
                     <para><code>apocr.eng.kjv.1760.xml</code> (apocrypha, English, King James
                        Version, 1760 edition)</para>
                  </listitem>
                  <listitem>
                     <para><code>tlg0059.tlg031.perseus-grc1-Pl.Ti.xml</code> (Plato's Timaeus in
                        Greek)</para>
                  </listitem>
               </itemizedlist></para>
            <para>Class 2 files are tougher. Because they bring two or more files or concepts
               together, filenames could become very long or unpredictably structured. At this time,
               the best recommendation is to make sure that each class 2 file is put into a
               subdirectory, separate from class 1 files, and given a brief but meaningful name that
               points to the research question that motivated its creation. Some examples:<itemizedlist>
                  <listitem>
                     <para><code>ar.cat.grc.1949.minio-paluello-sem-TAN-LM-sample.xml</code>
                        (lexico-morphology for Aristotle's Categories, in Greek)</para>
                  </listitem>
                  <listitem>
                     <para><code>nt.grc-syr.selections.TAN-A-tok.xml</code> (word-for-word
                        correspondences between the Syriac and Greek New Testaments)</para>
                  </listitem>
                  <listitem>
                     <para><code>plato.TAN-A-div.xml</code></para>
                  </listitem>
               </itemizedlist></para>
            <para>Class 3 files are a bit easier. It is recommended that TAN-mor files begin with the
               language code, then an acronym for the person or group responsible for creating the
               features. TAN-key files are written generally to serve a specific project or
               collection, so the collection name and the TAN type should suffice. Examples:<itemizedlist>
                  <listitem>
                     <para><code>ar.cat.TAN-key.xml</code></para>
                  </listitem>
                  <listitem>
                     <para><code>eng.kalvesmaki.com,2014.1.xml</code> (tagging scheme #1 for
                        English)</para>
                  </listitem>
               </itemizedlist></para>
            <para>If you have a local copy of someone else's TAN collection, and you wish to create
               TAN files that depend on them, you are in all likelihood going to use relative URLs
               to copies of the files stored on your local drive. It is recommended that you also
               include absolute URLs through secondary <code><link linkend="element-location"
                     >&lt;location></link></code>s. The validation routine checks only the first
               document available. From time to time, you might comment out the first <code><link
                     linkend="element-location">&lt;location></link></code> and run the validation
               process again. If you share your dependent TAN file with someone else who does not
               have a local copy of the collection, the second <code><link
                     linkend="element-location">&lt;location></link></code>, with the absolute URL,
               will point to the original copy of the document.</para>
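            <para>A minimal sketch of this arrangement follows. The enclosing
               <code>&lt;source></code>, the directory names, and the
               <code>example.org</code> address are assumptions for illustration, and any other
               children or attributes the schema requires are left out.</para>
            <programlisting language="xml"><![CDATA[<source>
   <!-- IRI + name and any other required content omitted -->

   <!-- First: the local copy, via a relative URL. Validation checks only
        the first document that is available. -->
   <location href="../../library-other/bible/example.xml"/>

   <!-- Second: the original copy, via an absolute URL, for users who do
        not have the collection on their local drive. -->
   <location href="http://example.org/bible/example.xml"/>
</source>]]></programlisting>
            <para>Commenting out the first <code>&lt;location></code> in such a fragment makes the
               validation routine fall through to the absolute URL.</para>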
            <para>In a given project, you are likely to repeat basic information, particularly
                     <code><link linkend="element-person">&lt;person></link></code>, <code><link
                     linkend="element-role">&lt;role></link></code>, and <code><link
                     linkend="element-work">&lt;work&gt;</link></code>. If you find yourself
               repeating elements, such as those that follow the
                  <xref linkend="pattern-iri_and_name"/>, consider moving them to a TAN-key file.
               It is almost always preferable to develop TAN-keys before resorting to <code><link
                     linkend="element-inclusion">&lt;inclusion></link></code>s. Sorting out lines of
               inclusion can be confusing.</para>
         </section>
         <section>
            <title>Creating and populating TAN files</title>
            <para>TAN is a representational format. Every TAN file models some source.</para>
            <para>If those sources are non-digital, it is a relatively straightforward task to
               create and populate a TAN file. You just start editing everything by hand. In some
               cases, you might get a head start through a rough computer algorithm. For example,
               optical character recognition (OCR) on an edition might give you a dirty but useful
               start for a TAN-T file. Or OCR on an index might get you the outlines of a TAN-A-div
               file that indexes all quotations. Despite the computer's assistance, the majority of
               the task is converting non-digital claims into digital ones, and the manual effort is
               central.</para>
            <para>In many other cases, you are trying to take something that already exists
               digitally and convert it into the TAN format. In these cases, it is advised to think
               of the problem computationally, and do your best to resist the urge to manually edit
               anything.</para>
            <para>Suppose you find a Word file, a web page, or plain text that can serve as the
               basis for a TAN file. A common first impulse is to copy the desired content, paste it
               into the body of your TAN file, and then begin to manually correct and change things.
               You may find that you made a major mistake that cannot, at that point, be undone.
               Perhaps you have accidentally deleted all punctuation when you didn't mean to. Or you
               eliminated line breaks that were useful signals about where <code><link
                     linkend="element-div">&lt;div></link></code>s should be separated. Even if all
               goes well, after all that hard work you might find out that the pre-TAN data
               source has been updated, with errors corrected. If any significant time has elapsed,
               you may have forgotten what procedure you followed to convert the data. And even if
               you remember, you have to repeat the steps again, and plan for the next time when the
               pre-TAN source is updated. Or you find yourself making piecemeal corrections.</para>
            <para>For all these reasons, it is recommended that you set up an XSLT-based workflow to
               convert the data to TAN. When you find mistakes such as those described above, no
               harm is done. You can adjust your algorithm and re-run the process as often as you
               need, each time getting better and better results. This approach requires extra
               initial work. That is, you will need to get to know XSLT (or an alternative) well.
               Establishing a good transformation process can be time consuming. But the investment
               pays off in the long run. The routines you write for one set of files might save you
               some work for the next.</para>
            <para>Under this method, you should begin the process by creating a template TAN file
               that resembles, even if skeletally, your desired output. You then write XSLT-based
               rules that (1) make alterations to the input, (2) infuse the altered input into the
               template, then (3) save the new file. This method has been used successfully to
               handle several different kinds of conversion, including ones where the source files
               are updated very frequently. In such cases, the traditional cut-paste-and-edit method
               is not only unproductive; it is foolish.</para>
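            <para>The three steps above can be sketched in XSLT 2.0 as follows. This is only an
               illustration under stated assumptions, not a stylesheet from the TAN project: the
               three parameters and their default paths are hypothetical, the input is assumed to
               be plain text with one division per line, and <code>type="line"</code> stands in
               for whatever division type your template declares.</para>
            <programlisting language="xml"><![CDATA[<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
   <!-- Hypothetical parameters; adjust them to your own setup -->
   <xsl:param name="input-uri" select="'../pre-TAN/source.txt'"/>
   <xsl:param name="template-uri" select="'template.xml'"/>
   <xsl:param name="output-uri" select="'../output/result.xml'"/>

   <!-- (1) Alter the input: split the text into lines, dropping blank ones -->
   <xsl:variable name="lines"
      select="tokenize(unparsed-text($input-uri), '\r?\n')[normalize-space(.)]"/>

   <!-- Entry point (e.g., Saxon's -it:main option) -->
   <xsl:template name="main">
      <!-- (3) Save the new file -->
      <xsl:result-document href="{$output-uri}">
         <xsl:apply-templates select="doc($template-uri)"/>
      </xsl:result-document>
   </xsl:template>

   <!-- Copy the template file unchanged by default -->
   <xsl:template match="@* | node()">
      <xsl:copy>
         <xsl:apply-templates select="@* | node()"/>
      </xsl:copy>
   </xsl:template>

   <!-- (2) Infuse the altered input into the body of the template -->
   <xsl:template match="*:body">
      <!-- Reuse the namespace of the template's own body element -->
      <xsl:variable name="ns" select="namespace-uri(.)"/>
      <xsl:copy>
         <xsl:apply-templates select="@*"/>
         <xsl:for-each select="$lines">
            <xsl:element name="div" namespace="{$ns}">
               <xsl:attribute name="type">line</xsl:attribute>
               <xsl:attribute name="n" select="position()"/>
               <xsl:value-of select="normalize-space(.)"/>
            </xsl:element>
         </xsl:for-each>
      </xsl:copy>
   </xsl:template>
</xsl:stylesheet>]]></programlisting>
            <para>If the pre-TAN source is later updated, or a mistake turns up in the way lines are
               split, the rule under step (1) can be adjusted and the whole process run
               again.</para>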
            <para>Writing transformations may seem laborious at first, because of how difficult it
               is to think about how best to handle and manipulate a TAN file. But there is a good
               chance that the labor you have in mind has already been done for you in the built-in
               TAN functions (see <xref xlink:href="#variables-keys-functions-and-templates"/>). See
               also the files provided under the subdirectory <code>/do things</code>.</para>
         </section>
         <section>
            <title>Sharing TAN files</title>
            <para>TAN files have been designed to be shared. Although individual TAN files are
               likely to be valuable on their own, even when removed from their context (e.g., via
               an email attachment), they may be critically crippled without their dependencies. As
               a result, TAN files are most likely to be distributed or published in groups, as
               collections.</para>
            <para>One way to distribute a collection is by making it available as a repository via
               Git or some other version control software (VCS). This approach has many advantages.
               The files become available to whoever wants them, and the editorial history is
               preserved. VCS features and tools are extremely fast and useful, and they allow users
               to modify TAN collections without impacting the original source.</para>
            <para>Collections may also be distributed through shared syncing services (e.g., Drive,
               Box, or Dropbox) or put on a server. In the latter case, it may be difficult for
               users to browse a collection. In that case, you may wish to expose the collection as
               a compressed ZIP archive. This saves on your own bandwidth, and it still exposes the
               files for XML processing. But a ZIP archive is not suitable for linking from one TAN
               file to another, nor is it appropriate as a <code><link
                     linkend="element-master-location">&lt;master-location></link></code>. Unpacking
               a compressed file requires writing to the disk, which is treated as a security risk
               during validation, and so is disallowed. Such zipped archives are good ways to
               distribute a collection, but they should not be treated as a primary
               repository.</para>
         </section>
         <section xml:id="tan-stylesheets-and-function-library">
            <title>Doing things with TAN files</title>
            <para>The TAN format is not an end in itself. Indeed, there is no point to any file
               format, unless you can do things with it. TAN was designed to allow users to do
               unusual and interesting things. <code>/do things</code>, a major subdirectory in the
               project file, is populated with folders named with actions you might want to perform
               on a TAN file, and they contain XSLT stylesheets that fall into that area of
               activity.</para>
            <para>Those stylesheets are the front end of a long process that begins with TAN
               validation. Whenever you validate a TAN file, the Schematron validation file (the
               companion to the RELAX-NG validation file) is invoked. But that Schematron file is
               small, and the majority of the work is done by a very large library of XSLT
               stylesheets that resolve and expand the document, marking its errors along the
               way. </para>
            <para>That extensive library of XSLT we call here the <emphasis>function
                  library</emphasis> (we use both words, to distinguish the collection from
               individual, generic functions). The function library provides definitive
               interpretations of the TAN format, marking parts that are in error. The function
               library is also an important step to creating your own tools or stylesheets,
               anticipating, as it does, many things you might want to do with a TAN file. Certain
               considerations that have been put into the design of the function library are worth
               noting.</para>
            <para>First, the function library has a structure similar to that of the RELAX-NG
               schemas. That is, the primary access point is through one of the XSLT files named
               after a primary TAN format. You may also wish to include (or import) the extra
               functions, <link
                  xlink:href="http://textalign.net/release/TAN-2018/functions/TAN-extra-functions.xsl"
               />.</para>
            <para>During Schematron validation, it is quite common for the computer to calculate all
               global variables, even those that are unused. Therefore the function library defines
               only those global variables that are central to the validation process. </para>
            <para>The most complex and important global variables are the two principal
               transformations to the TAN file itself, <code><link linkend="variable-self-resolved"
                     >$self-resolved</link></code> and <code><link linkend="variable-self-expanded"
                     >$self-expanded</link></code>. </para>
            <para><code><link linkend="variable-self-resolved">$self-resolved</link></code> is the
               result of changing the TAN file through some key steps, including (1) stamping the
               original URI of the file as <code>@base-uri</code> in the root element, (2) converting
               all numeration systems to Arabic numerals, (3) replacing all elements that have <link
                  linkend="attribute-include"><code>@include</code></link> with resolved forms of
               the element, (4) replacing elements with <link linkend="attribute-which"
                     ><code>@which</code></link> with their resolved IRI + name form, and (5) stamping
               elements with <code>@q</code> and a number representing the nth place of that element
               relative to its original siblings (included elements are given the <code>@q</code> of
               their host element). If any errors arise, the relevant information is placed in the
               resolved file as an <code>&lt;error></code> or <code>&lt;warning></code>, based upon
               the <link xlink:href="../functions/errors/TAN-errors.xml">master list of
                  errors</link>. <code>@q</code>, <code>@base-uri</code>, and other newly introduced
               attributes and elements are not defined by the TAN schema.</para>
            <para><code><link linkend="variable-self-expanded">$self-expanded</link></code> is the
               result of putting the file through a series of expansions. As noted earlier, there
               are three levels of Schematron validation—terse, normal, and verbose—and there are
               three corresponding levels of <code><link linkend="variable-self-expanded"
                     >$self-expanded</link></code>. Expansion is intended chiefly to support
               validation, and so checks for errors. It does so by normalizing the text, converting
               each attribute to one or more elements (one per value), checking id references, and
               doing a number of other activities.</para>
            <para>For a class 2 file, <code><link linkend="variable-self-expanded"
                     >$self-expanded</link></code> includes not only an expansion of itself, but an
               expansion of its dependencies (TAN-T or TAN-mor). When taken to the verbose level, a
               TAN-A-div file will include in its <code><link linkend="variable-self-expanded"
                     >$self-expanded</link></code> special documents with a root element
                  <code>&lt;TAN-T-merge></code>. Each work has one TAN-T-merge file, which collates
               all the relevant sources into a single reference structure.</para>
            <para>All these expansions provide an excellent starting point for conversion into other
               formats.</para>
            <para>The next most important global variables deal with referred files:</para>
            <para>
               <table frame="all">
                  <title>Global variables for referred files</title>
                  <tgroup cols="4">
                     <colspec colname="c1" colnum="1" colwidth="1.0*"/>
                     <colspec colname="c2" colnum="2" colwidth="1.0*"/>
                     <colspec colname="c3" colnum="3" colwidth="1.0*"/>
                     <colspec colname="newCol4" colnum="4" colwidth="1*"/>
                     <thead>
                        <row>
                           <entry/>
                           <entry>Raw (first document available)</entry>
                           <entry>Resolved</entry>
                           <entry>Expanded</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry><code><link linkend="element-inclusion"
                                 >&lt;inclusion></link></code></entry>
                           <entry><code><link linkend="variable-inclusions-1st-da"
                                    >$inclusions-1st-da</link></code></entry>
                           <entry><code><link linkend="variable-inclusions-resolved"
                                    >$inclusions-resolved</link></code></entry>
                           <entry>—</entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-key">&lt;key></link></code></entry>
                           <entry><code><link linkend="variable-keys-1st-da"
                                 >$keys-1st-da</link></code></entry>
                           <entry><code><link linkend="variable-keys-resolved"
                                 >$keys-resolved</link></code></entry>
                           <entry><code><link linkend="variable-keys-expanded"
                                 >$keys-expanded</link></code></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-source"
                              >&lt;source></link></code></entry>
                           <entry><code><link linkend="variable-sources-1st-da"
                                    >$sources-1st-da</link></code></entry>
                           <entry><code><link linkend="variable-sources-resolved"
                                    >$sources-resolved</link></code></entry>
                           <entry><code><link linkend="variable-self-expanded"
                                 >$self-expanded</link>[position() gt 1]</code></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-see-also"
                              >&lt;see-also></link></code></entry>
                           <entry><code><link linkend="variable-see-alsos-1st-da"
                                    >$see-alsos-1st-da</link></code></entry>
                           <entry><code><link linkend="variable-see-alsos-resolved"
                                    >$see-alsos-resolved</link></code></entry>
                           <entry>—</entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table>
            </para>
            <para>The column labeled "raw" lists variables that hold the first documents available,
               without alteration. Variables in the next column hold the resolved form, following
               the same process described above for <code><link linkend="variable-self-resolved"
                     >$self-resolved</link></code>. The resolved forms of <code><link
                     linkend="element-inclusion">&lt;inclusion></link></code> and <code><link
                     linkend="element-see-also">&lt;see-also></link></code> are sufficient for
               validation, so they do not have expanded versions. Expanded sources are always found after
               the first document in <code><link linkend="variable-self-expanded"
                     >$self-expanded</link></code>.</para>
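            <para>The relationship between <code>$self-expanded</code> and its expanded sources can
               be sketched as follows, for a class 2 file. This is an illustration under stated
               assumptions, not a definitive implementation: the import path is a hypothetical
               placeholder, the variable names prefixed with <code>my-</code> and the output
               elements are invented for the example, and each item in
                  <code>$self-expanded</code> is assumed to be a document node.</para>
            <programlisting language="xml"><![CDATA[<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
   <!-- Hypothetical path: replace with the function library file that suits the input -->
   <xsl:import href="../functions/TAN-A-div-functions.xsl"/>
   <xsl:output method="xml" indent="yes"/>

   <!-- The first document is the expansion of the file itself -->
   <xsl:variable name="my-host-expanded" select="$self-expanded[1]"/>
   <!-- Every later document is an expanded dependency (the Expanded column for sources) -->
   <xsl:variable name="my-sources-expanded" select="$self-expanded[position() gt 1]"/>

   <xsl:template match="/">
      <!-- Illustrative output only: report the root element name of each expanded document -->
      <summary>
         <host root="{name($my-host-expanded/*[1])}"/>
         <xsl:for-each select="$my-sources-expanded">
            <source position="{position()}" root="{name(*[1])}"/>
         </xsl:for-each>
      </summary>
   </xsl:template>
</xsl:stylesheet>]]></programlisting>
            <para>Binding the two parts to separate variables keeps later templates from having to
               repeat the positional test.</para>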
            <para>These global variables have been described above very generally. To understand
               better how their values are calculated, please consult the function library.</para>
            <para>The other components of the function library (the functions, keys, and
               templates) cannot be described conveniently or succinctly here, but they are
               critical to building successful stylesheets that transform TAN files. The next chapter
               provides a comprehensive, detailed view of how they work.</para>
         </section>
      </chapter>
      <xi:include href="inclusions/variables-keys-functions-and-templates.xml"/>
      <xi:include href="inclusions/errors.xml"/>
   </part>
</book>
