<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://docbook.org/xml/5.0/rng/docbook.rng" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="http://docbook.org/xml/5.0/rng/docbook.rng" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<book xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink"
   xmlns:xi="http://www.w3.org/2001/XInclude" version="5.0">
   <info>
      <title>Guide to the Text Alignment Network, Version 2020</title>
      <legalnotice>
         <info>
            <title>Text Alignment Network: Official Guidelines</title>
            <copyright>
               <year>2015-present</year>
               <holder>Joel Kalvesmaki</holder>
            </copyright>
            <author>
               <personname>Joel Kalvesmaki</personname>
               <email>kalvesmaki@gmail.com</email>
            </author>
         </info>
         <remark>All software, code, and dependencies (/applications, /functions, /schemas,
            /vocabularies) are released under a GNU General Public License, <link
               xlink:href="https://opensource.org/licenses/GPL-3.0"
               >https://opensource.org/licenses/GPL-3.0</link>.</remark>
         <remark>All other materials (such as this document), unless otherwise specified, are
            licensed under a Creative Commons Attribution 4.0 International License: <link
               xlink:href="http://creativecommons.org/licenses/by/4.0/"
               >http://creativecommons.org/licenses/by/4.0/</link>.
         </remark>
      </legalnotice>
      <revhistory>
         <info>
            <releaseinfo>Latest stable version: <link
                  xlink:href="http://textalign.net/release/TAN-2020/guidelines/"
                  >http://textalign.net/release/TAN-2020/guidelines/</link>.</releaseinfo>
            <releaseinfo>Development version: <link
                  xlink:href="https://github.com/textalign/TAN-2020/tree/dev"
                  >https://github.com/textalign/TAN-2020/tree/dev</link>.</releaseinfo>
         </info>
         <revision>
            <revnumber>Version 2020 (alpha)</revnumber>
            <date>2020-08-13</date>
            <revdescription>
               <para>Formats: <link
                     xlink:href="http://textalign.net/release/TAN-2020/guidelines/xhtml/index.xhtml"
                     >HTML</link> • <link
                     xlink:href="http://textalign.net/release/TAN-2020/guidelines/pdf/TAN-2020-guidelines.pdf"
                     >PDF</link> • <link
                     xlink:href="http://textalign.net/release/TAN-2020/guidelines/main.xml"
                     >Docbook</link> (master)</para>
               <warning>
                  <para>In case of contradictions, apparent or not, between these guidelines and the
                     core TAN files, priority should be given first to the RELAX-NG schemas (compact
                     syntax), then to the functions, and finally to these guidelines.</para>
               </warning>
            </revdescription>
         </revision>
      </revhistory>
   </info>
   <part xml:id="general_overview">
      <title>General Overview</title>
      <chapter>
         <title>Introduction</title>
         <section xml:id="tan_definition">
            <title>Definition and Purpose</title>
            <para>The Text Alignment Network (TAN) is a suite of highly regulated XML formats
               designed to maximize the syntactic and semantic interoperability of texts,
               annotations, and language resources.</para>
            <para>TAN is particularly suited to aligning texts with multiple versions (copies,
               translations, paraphrases), and to annotating quotations, translation clusters
               (word-to-word), and lexicomorphological features. Simple, modular, and networked, the
               TAN format allows users, working independently and collaboratively, to find, create,
               edit, study, align, and share their texts and annotations. The extensive validation
               rules are integrated into a library of functions that definitively interpret the
               format and provide a foundation for third-party tools and applications.</para>
            <para>Although expressive of scholarly nuance and complexity, the TAN format has been
               designed to benefit everyone, scholars and non-scholars alike, and can be used
               broadly for reading, teaching, publishing, research, analysis, and language learning.
            </para>
         </section>
         <section>
            <title>Rationale and Purpose</title>
            <para>Scholars working with texts frequently need to work with numerous versions. Some
               texts have been lost in their original form and can be studied only through later
               translations, paraphrases, or fragmentary quotations. Even when an original survives,
               its later versions are often worth study, revealing as they do something of how
               words, concepts, and works were preserved, altered, or combined by generations and
               cultures who created, read, and circulated the versions.</para>
            <para>Such textual comparison requires texts whose words, sentences, paragraphs, and
               other segments are aligned. Such alignment can be challenging. Some versions might be
               defective, or follow an idiosyncratic sequence. One editor may have divided the text
               according to a system not easily applied to other versions. Identifying which words
               or phrases in a translation and its original correspond to each other might result in
               complex, overlapping spans. And even larger segments such as sentences and paragraphs
               may not line up well. Further, every version of a text is part of a much larger,
               complex history of text reuse, and a complete study of that context requires
               engagement with other works and other languages, and collaboration across projects
               and fields of study.</para>
            <para>Text Alignment Network (TAN) XML facilitates the exchange of multiple versions of
               texts and annotations on those texts. TAN syntax is suitable for humans to read and
               edit, expressive enough to allow scholars to register doubt and nuance, and
               sufficiently structured to permit complex computer-based queries across independent
               datasets. TAN is not a single format, but rather a suite of formats, built modularly.
               Each format is dedicated to a particular task, requiring editors to declare their
               views or assumptions about language and texts in a structured manner, so that other
               users of the data (whether human or computer) can decide whether the data meets their
               needs. Because nearly all TAN data must be expressed in a way that computers can parse,
               the information can be used in semantic web applications (see <xref
                  linkend="rdf_and_lod"/>).</para>
            <para>TAN has been designed to support two kinds of scholarly activity: <emphasis
                  role="bold">creation</emphasis> and <emphasis role="bold"
               >research</emphasis>.</para>
            <para>When we <emphasis role="bold">create</emphasis> our primary sources or analyze
               them, we normally want what we create to be useful to our colleagues. TAN was
               designed to assist scholarly creative activities such as:</para>
            <para>
               <itemizedlist>
                  <listitem>
                     <para>Creating and sharing a transcription of a particular version of a textual
                        work so that it is more likely to align with any other TAN version of that
                        text created by someone else;</para>
                  </listitem>
                  <listitem>
                     <para>Creating an index of quotations that is semantically rich and can be
                        applied to any other version of the quoting or quoted works;</para>
                  </listitem>
                  <listitem>
                     <para>Specifying exactly (e.g., word-for-word) where a source and its
                        translation correspond, even with overlapping or ambiguous relationships, or
                        where doubt or alternative possibilities of alignment need to be
                        expressed;</para>
                  </listitem>
                  <listitem>
                     <para>Listing the grammatical features of every word in a text or a language in
                        a way that allows it to be compared easily against other languages and
                        texts.</para>
                  </listitem>
               </itemizedlist>
            </para>
            <para>Shared TAN files form a decentralized, interoperable corpus of texts, a kind of
               Internet of primary sources and annotations. As this TAN-compliant corpus spreads
               into different linguistic, chronological, and geographical regions, third-party tools
               and applications can expand the repertoire of <emphasis role="bold"
                  >research</emphasis> questions beyond any single corpus, to help scholars
               fruitfully investigate broader, comparative questions such as:<itemizedlist>
                  <listitem>
                     <para>For classical Greek texts, how were words with the root -ιστημι ("stand")
                        translated into ancient Latin? In what specific ways did the vocabulary of
                        technical terms shift from pre-Christian translations into later, Christian
                        ones?</para>
                  </listitem>
                  <listitem>
                     <para>How does the reformed Chinese translation technique for Sanskrit Buddhist
                        texts, attested by Dao An (312-385 CE), compare to reforms in the seventh
                        and eighth centuries of Syriac translations of Greek texts?</para>
                  </listitem>
                  <listitem>
                     <para>How do Arabic translations of Greek texts from the Abbasid period differ
                        from contemporaneous translations from Sanskrit into Arabic?</para>
                  </listitem>
                  <listitem>
                     <para>Can an anonymous English translation of a modern French novel be
                        identified with known translators from that period?</para>
                  </listitem>
                  <listitem>
                     <para>How do present-day translations of official United Nations documents
                        differ across languages?</para>
                  </listitem>
               </itemizedlist></para>
            <para>Neither the TAN format nor its applications answer such questions, but they can be
               used to begin answering them.</para>
            <para>TAN differs from other text formats such as HTML, Microsoft Word, PDF, or Docbook.
               Each of those formats is interoperable only in the sense that any file can be
               reliably opened and displayed by the same software. Despite such software
               compatibility, the content, structured by each user, looks very different from one
               file to the next. If you receive from different people two versions of a particular
               literary work in the same format, there is little likelihood that you could
               align them without a lot of extra work. These are presentation formats, designed to
               let the creator use his or her imagination to shape, structure, and present the
               material in highly stylized, creative ways. The formats are laissez faire, concerned
               mainly to ensure that each component is rendered properly, without regard for the
               meaning of those components. </para>
            <para>Creating a text in TAN is like opening a word processor and telling it, "I don't
               care how the text looks. I want to ensure that it is in a meaningful structure that
               corresponds to any other version of that text. The appearance, which could take
               thousands of directions, can be worried about later." The closest analogue to the TAN
               formats is the XML format developed by the Text Encoding Initiative, whose design
               catalyzed and continues to inspire the development of TAN. TAN adopts and extends the
               TEI validation rules, to make them more rigorous and penetrating, to support
               cross-project interoperability. One of the TAN formats <emphasis>is</emphasis>
               modestly customized TEI. (For more on comparisons between TAN and TEI see <xref
                  linkend="TEI"/>.)</para>
            <para>Some other caveats:<itemizedlist>
                  <listitem>
                     <para>Although TAN comes with an extensive library of functions and templates,
                        it is not what most people think of as a tool or application. It does not
                        provide a graphic interface to create, edit, or display TAN-compliant files,
                        nor does it dictate how such tools should behave. Rather, it allows
                        programmers (especially XML developers) to create customized applications
                        and tools. If you are working with an XML editor like oXygen, your editing
                        experience will be greatly enhanced by the TAN function library.</para>
                  </listitem>
                  <listitem>
                     <para>The TAN formats are specialized. They are not meant to replace other
                        common text formats such as TEI, Docbook, and so forth, or other alignment
                        formats such as XLIFF or TMX. Converting a TAN file into these formats is
                        usually straightforward, but will usually entail loss. Conversely, most
                        conversions from one of these formats into TAN will not entail loss, but
                        will be imperfect or incomplete, because the TAN format requires data that
                        will be missing, or not easily identifiable. Conversion must be given
                        careful thought, and can only be semiautomated.</para>
                  </listitem>
                  <listitem>
                     <para>Each TAN format has a restricted field of inquiry, defined and explained
                        in these guidelines. TAN is not suitable for unsupported research interests,
                        e.g., marking a transcription to imitate its presentation in a particular
                        print edition.</para>
                  </listitem>
                  <listitem>
                     <para>TAN files are optimized for legibility and readability, and may be
                        inefficient in certain contexts and applications. The extensive TAN
                        validation routines—essential to aiding interoperability—can be taxing to
                        run on numerous or large files. There are work-arounds, explained in the
                        guidelines. Many applications will perform better when TAN files are
                        pre-processed. See <xref xlink:href="#tan-applications"/>.</para>
                  </listitem>
               </itemizedlist></para>
         </section>
         <section xml:id="tan_participation">
            <title>Participation</title>
            <para>Changes are made regularly to TAN, mainly in its <link
                  xlink:href="https://github.com/textalign/TAN-2020/tree/dev">development
                  branch</link>. If you have a TAN library, sharing it with other participants,
               particularly via Git, will help developers test any changes that have been made to
               the function library, and encourage others to contribute to your project.</para>
            <para>Participants in testing, using, and developing the Text Alignment Network are
               welcome. Our core purpose is to develop and maintain, in descending order of
               importance, the schemas, functions, guidelines, and applications. Inquiries about
               participation should be sent to the project director, <link
                  xlink:href="http://kalvesmaki.com/">Joel Kalvesmaki</link>, by email: director at
               textalign.net.</para>
            <para>Official announcements are made by <link
                  xlink:href="http://groups.google.com/group/textalign?hl=en">email (Google
                  Group)</link> and by <link xlink:href="https://twitter.com/textalign"
                  >Twitter</link>.</para>
         </section>
      </chapter>
      <chapter xml:id="gentle_guide">
         <title>Starting off with the TAN Format</title>
         <para>If you are new to markup languages, or unfamiliar or uncomfortable with acronyms and
            technical terms such as <emphasis role="italic">XML</emphasis>, <emphasis role="italic"
               >RDF</emphasis>, <emphasis role="italic">XPath</emphasis>, and
               <emphasis>Unicode</emphasis>, you should start with this chapter, which uses a simple
            example to illustrate the steps typically taken to create and edit TAN files, and to
            gently introduce important technical terms. By the end of this chapter, you will have a
            sense of how to create and edit a small collection of TAN transcriptions and
               alignments.<note xml:id="transcription_and_transliteration">
               <para>In the TAN system, a <emphasis>transcription</emphasis> is a plain digital text
                  that replicates a text found somewhere else, usually reproducing its script and
                  spelling. The following—"In pluribus unum"—is a (partial) transcription of a
                  United States dollar. The term should be distinguished from a
                     <emphasis>transliteration</emphasis>, which is a transcription rendered in a
                  script other than the original. For example, εν πλουριμπυς ουνεμ would be a Greek
                  transliteration of the previous transcription.</para>
            </note></para>
         <para>The chapter touches on a number of general concepts that are discussed only briefly.
            If you find the concept new or confusing, follow the prompts for further reading, to get
            better grounded in a particular topic or technology. If you are already familiar with
            basic markup concepts, you should nevertheless at least skim through the chapter,
            because some familiar concepts get handled by TAN in its own special way.</para>
         <section>
            <title>Creating TAN Transcription and Alignment Data</title>
            <para>Let us take a simple example, that of aligning two English versions of the nursery
               rhyme <emphasis role="italic">Ring-a-ring-a-roses</emphasis>, sometimes known as
                  <emphasis role="italic">Ring around the Rosie</emphasis>. Our goal here is to
               publish two versions of the nursery rhyme in the TAN format so that they are most
               likely alignable with any other TAN version of the poem that might appear.<note>
                  <para>Although the TAN examples below look much like files in the
                        <code>examples</code> subdirectory of the TAN library, they have been
                     adjusted, to explain the formats better.</para>
               </note></para>
            <para>We begin by finding previously published versions that haven't been digitized. In
               this case we have taken an interest in the versions published in <link
                  xlink:href="http://lccn.loc.gov/12032709">1881</link> and <link
                  xlink:href="http://lccn.loc.gov/87042504">1987</link> (one published in the UK and
               the other, the US). Each of these books has other rhymes, but we've decided to focus
               upon one nursery rhyme, so we type up (transcribe) that poem and nothing else:<table
                  frame="all">
                  <title>Ring around the Rosie</title>
                  <tgroup cols="2">
                     <colspec colname="c1" colnum="1" colwidth="1.0*"/>
                     <colspec colname="c2" colnum="2" colwidth="1.0*"/>
                     <thead>
                        <row>
                           <entry>1881 (U.K.) version</entry>
                           <entry>1987 (U.S.) version</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry>
                              <para>Ring-a-ring-a-roses,</para>
                              <para>A pocket full of posies;</para>
                              <para>Hush! Hush! Hush! Hush!</para>
                              <para>We're all tumbled down.</para>
                           </entry>
                           <entry>
                              <para>Ring-a-round the rosie,</para>
                              <para>A pocket full of posies,</para>
                              <para>Ashes! Ashes!</para>
                              <para>We all fall down.</para>
                           </entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table></para>
            <para>We must be sure to save each of the two transcriptions as plain text. Do not
               bother with a word processor (Word, OpenOffice, Google Docs, and so forth), which is
               too fancy for our needs. Word processors sometimes generate erroneous data, even when
               you export to plain text. And we are not concerned with italics, colors, fonts,
               margins, and so forth. We would be better off with a <link
                  xlink:href="http://en.wikipedia.org/wiki/Text_editor">text editor</link>, which
               opens and saves only text. But even those do not check to see if the rules of the TAN
               format have been followed. So the best tool is an <link
                  xlink:href="http://en.wikipedia.org/wiki/XML_editor">XML editor</link>, which like
               a text editor takes and creates only text. An XML editor is designed to follow the
               rules of XML, and so saves a lot of typing, and prevents many errors. More important,
               an XML editor will tell us when our TAN file is invalid, and will provide important
               help as we edit.<note>
                  <para>Software suitable for your needs comes in many styles and prices. In
                     addition to the links in the paragraph above, you may wish to visit the
                     comparative lists published on Wikipedia for both <link
                        xlink:href="http://en.wikipedia.org/wiki/Comparison_of_text_editors">text
                        editors</link> and <link
                        xlink:href="http://en.wikipedia.org/wiki/Comparison_of_XML_editors">XML
                        editors</link>. TAN was developed using <link
                        xlink:href="https://www.oxygenxml.com">oXygen</link>, which is very
                     powerful. If you are a new user, you are likely to find it overwhelming. Take
                     advantage of tutorials and documentation associated with the XML editor you
                     have chosen. </para>
               </note></para>
            <para>Our first task is to get these two versions into separate files with the
               appropriate markup. Each TAN transcription file has two major parts: a head and a
               body. For now, we focus on only the second part, the body, as well as a few of the
               necessary preliminary lines that stand at the opening of the file, before both the
               head and the body. First, the 1881 (U.K.) version:
               <programlisting><emphasis role="bold">&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.rnc" 
    type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.sch" 
    type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2020" 
    id="tag:parkj@textalign.net,2015:ring01">
    &lt;head>
    . . . . . . .
    &lt;/head>
    &lt;body xml:lang="eng">
        &lt;div type="line" n="1"></emphasis>Ring-a-ring-a-roses,<emphasis role="bold">&lt;/div>
        &lt;div type="line" n="2"></emphasis>A pocket full of posies;<emphasis role="bold">&lt;/div>
        &lt;div type="line" n="3"></emphasis>Hush! Hush! Hush! Hush!<emphasis role="bold">&lt;/div>
        &lt;div type="line" n="4"></emphasis>We're all tumbled down.<emphasis role="bold">&lt;/div>
    &lt;/body>
&lt;/TAN-T></emphasis></programlisting>
               And now the 1987 (U.S.) version:
               <programlisting><emphasis role="bold">&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.rnc" 
   type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.sch" 
   type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2020" 
   id="tag:parkj@textalign.net,2015:ring02">
   &lt;head>
   . . . . . . .
   &lt;/head>
   &lt;body xml:lang="eng">
      &lt;div type="l" n="1"></emphasis>Ring-a-round the rosie,<emphasis role="bold">&lt;/div>
      &lt;div type="l" n="2"></emphasis>A pocket full of posies,<emphasis role="bold">&lt;/div>
      &lt;div type="l" n="3"></emphasis>Ashes! Ashes!<emphasis role="bold">&lt;/div>
      &lt;div type="l" n="4"></emphasis>We all fall down.<emphasis role="bold">&lt;/div>
   &lt;/body>
&lt;/TAN-T></emphasis></programlisting>
            </para>
            <para>The examples above are <emphasis role="bold">eXtensible Markup Language</emphasis>
                  (<emphasis role="bold">XML</emphasis>). XML lets you take a text or a collection
               of data and structure it with angle brackets, <code>&lt;</code> and <code>></code>.
               In the examples above, such markup is in boldface.</para>
            <para>Each file begins with a <emphasis role="bold">prolog</emphasis>, the first few
               lines that begin with <code>&lt;?</code>. The first line simply states that what
               follows is an XML document. The next four lines in each example hold two <emphasis
                  role="bold">processing instructions</emphasis> that point to the <emphasis
                  role="bold">schemas</emphasis>: files that will be used to check to see whether or
               not our XML follows TAN rules, a process called <emphasis role="bold"
                  >validation</emphasis>. We will skip the details of those first five lines. They
               will be identical, or nearly so, from one TAN file to the next. We can simply cut and
               paste them when we want to start a new TAN file.</para>
            <para>After the prolog comes an <emphasis role="bold">opening tag</emphasis>, signified
               by an angle bracket followed by a letter, here <code><link linkend="element-TAN-T"
                     >&lt;TAN-T></link></code>. That opening tag, <code>&lt;TAN-T...></code>, is
               answered by a <emphasis role="bold">closing tag</emphasis>, <code>&lt;/TAN-T></code>,
               the last line. An opening tag and a closing tag mark the beginning and the end of one
               of the most important parts of an XML document, the <emphasis role="bold"
                  >element</emphasis>. For now, you can think of an element as a chunk of data.
               Every element is marked by a pair of tags. In this example, <code><link
                     linkend="element-head">&lt;head></link></code> is answered by
                  <code>&lt;/head></code>, <code><link linkend="element-body"
                  >&lt;body></link></code> by <code>&lt;/body></code> and each
                  <code>&lt;div...></code> by <code>&lt;/div></code>. Any element that has an
               opening tag must have a closing tag. If an element doesn't have anything between its
               opening and closing tags, the two of them can be collapsed into a single tag. That
               is, <code>&lt;a>&lt;/a></code> can be simplified to <code>&lt;a/></code> (such empty
               elements are illustrated below).</para>
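            <para>As a minimal sketch of that shorthand, the two lines below mean the same thing to
               an XML parser when each is read on its own. The element name <code>a</code> is only
               a placeholder and is not part of any TAN format.</para>
            <programlisting>&lt;a>&lt;/a>
&lt;a/></programlisting>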
            <para>Elements and processing instructions are two of the seven basic XML ingredients,
               called <emphasis role="bold">nodes</emphasis>. The other five node types are text,
               comment, attribute, namespace, and document, some of which we will meet below. The
               element is arguably the most important type of node, because you will see it most
               often, and it is absolutely required for something to be XML. Every XML file must have
               at least one element.</para>
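            <para>The following sketch, which is illustrative only and is not a TAN file, shows
               where several of these node types appear. All of the names in it
               (<code>example-instruction</code>, <code>poem</code>, <code>line</code>,
               <code>n</code>, and the namespace) are hypothetical. The second line is a comment and
               the third is a processing instruction; <code>poem</code> and <code>line</code> are
               elements; <code>n="1"</code> is an attribute; <code>xmlns="..."</code> declares a
               namespace; and the words inside <code>line</code> are text. The document node is the
               unseen container that holds the entire file.</para>
            <programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;!-- A comment: a note for human readers. -->
&lt;?example-instruction option="value"?>
&lt;poem xmlns="tag:example.org,2020:ns" n="1">
    &lt;line>Ring-a-ring-a-roses,&lt;/line>
&lt;/poem></programlisting>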
            <para>Elements nest within or beside each other, but they never overlap or interlock.
               That is, you <emphasis>cannot</emphasis> have
               <code>&lt;a>&lt;b>&lt;/a>&lt;/b></code>. The prohibition on overlapping elements is
               one of the cardinal rules of XML, and one of its most discussed aspects. The
               no-overlap rule keeps XML files tidy, and makes it easier for developers to write
               efficient applications. </para>
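            <para>The rule can be sketched with three small fragments, each meant to be read on its
               own. The element names <code>a</code>, <code>b</code>, and <code>p</code> are
               placeholders. An XML parser should accept the first two fragments and report an
               error on the third.</para>
            <programlisting>&lt;!-- b nests inside a: allowed -->
&lt;a>&lt;b>&lt;/b>&lt;/a>

&lt;!-- a and b sit beside each other inside a parent: allowed -->
&lt;p>&lt;a>&lt;/a>&lt;b>&lt;/b>&lt;/p>

&lt;!-- a and b overlap: not allowed -->
&lt;a>&lt;b>&lt;/a>&lt;/b></programlisting>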
            <para>Any two nearby elements relate to each other, either by one nesting inside the
               other, or by one being adjacent to the other. Because of this, every XML file can be
               thought of as a tree, with the root at the trunk and the nested elements as branches,
               terminating in metaphorical leaves—the elements that do not contain elements. It is
               helpful to use the tree metaphor when we describe the path we take, toward either the
               leaves or the root. In these guidelines, we may use the terms <emphasis role="italic"
                  >rootward</emphasis> and <emphasis role="italic">leafward</emphasis> when we want
               to trace movement up and down the levels of hierarchy in an XML document (you may
               also hear the corresponding terms <emphasis>outermost</emphasis> and
                  <emphasis>innermost</emphasis>). The metaphor is strengthened by the XML rule that
               there can be only one <emphasis role="bold">root element</emphasis>, i.e., the
               element that contains all other elements and is contained by none. In our examples
               above the root element is <code>TAN-T</code>.</para>
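            <para>The tree for the 1881 example can be sketched as follows. This is a simplified
               outline, not a valid TAN file: the prolog and all attributes have been left out so
               that the shape is easier to see, and the contents of <code>head</code> remain
               omitted, as in the listing above. Reading from the first line downward and inward
               is moving leafward; reading back out toward <code>TAN-T</code> is moving
               rootward.</para>
            <programlisting>&lt;TAN-T>                                   &lt;!-- root element: contains all the others -->
    &lt;head>                                &lt;!-- contents omitted -->
    &lt;/head>
    &lt;body>                                &lt;!-- one step leafward from TAN-T -->
        &lt;div>Ring-a-ring-a-roses,&lt;/div>     &lt;!-- a leaf: contains no elements -->
        &lt;div>A pocket full of posies;&lt;/div>
        &lt;div>Hush! Hush! Hush! Hush!&lt;/div>
        &lt;div>We're all tumbled down.&lt;/div>
    &lt;/body>
&lt;/TAN-T></programlisting>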
            <para>An XML document tree can also be profitably thought of as a family. Family names
               provide the most common terminology to describe how elements relate to each other. In
               our examples above, <code><link linkend="element-TAN-T">&lt;TAN-T></link></code> is
               the <emphasis role="bold">parent</emphasis> of <code><link linkend="element-body"
                     >&lt;body></link></code>, and <code><link linkend="element-body"
                     >&lt;body></link></code> is the parent of the four <code><link
                     linkend="element-div">&lt;div></link></code> elements. Likewise, each
                     <code><link linkend="element-div">&lt;div></link></code> is the <emphasis
                  role="bold">child</emphasis> of <code><link linkend="element-body"
                     >&lt;body></link></code>, and <code><link linkend="element-body"
                     >&lt;body></link></code> is the child of <code><link linkend="element-TAN-T"
                     >&lt;TAN-T></link></code>. Distant parental relationships can be described with
               the terms <emphasis role="bold">ancestor</emphasis> and <emphasis role="bold"
                  >descendant</emphasis>. <code><link linkend="element-TAN-T"
                  >&lt;TAN-T></link></code> is the ancestor of every element it encompasses, and
               every element encompassed by <code><link linkend="element-TAN-T"
                  >&lt;TAN-T></link></code> is its descendant. Paratactic relationships are also
               important. <code><link linkend="element-head">&lt;head></link></code> and <code><link
                     linkend="element-body">&lt;body></link></code> are <emphasis role="bold"
                  >siblings</emphasis> to each other, and every <code><link linkend="element-div"
                     >&lt;div></link></code> is a sibling to every other <code><link
                     linkend="element-div">&lt;div></link></code>. The terms "following" and
               "preceding" are the most common ways to describe the relationship of one sibling to
               another.</para>
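            <para>The family terms can be sketched on a bare skeleton of a transcription file.
               Attributes and content are omitted here for brevity, so this is an illustration of
               the relationships only, not a valid TAN-T
               file:<programlisting>&lt;TAN-T>       &lt;!-- root element; parent of head and body; ancestor of everything below -->
    &lt;head/>   &lt;!-- child of TAN-T; preceding sibling of body -->
    &lt;body>    &lt;!-- child of TAN-T; following sibling of head; parent of each div -->
        &lt;div/>   &lt;!-- child of body; descendant of TAN-T; sibling of the other divs -->
        &lt;div/>
        &lt;div/>
        &lt;div/>   &lt;!-- a leaf: it contains no other elements -->
    &lt;/body>
&lt;/TAN-T></programlisting>Reading
               from any <code>&lt;div></code> toward <code>&lt;TAN-T></code> is rootward movement;
               reading in the opposite direction is leafward.</para>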
            <para>Inside of the opening tags for the <code><link linkend="element-TAN-T"
                     >&lt;TAN-T></link></code>, <code><link linkend="element-body"
                  >&lt;body></link></code>, and <code><link linkend="element-div"
                  >&lt;div></link></code> elements are stretches of text: a word followed by an
               equals sign, then something within quotation marks. These stretches of text are
               called <emphasis role="bold">attributes</emphasis>. On the left side of the equals
               sign is the attribute name, and on the right side, within the quotation marks, is the
               attribute value. <code><link linkend="element-TAN-T">&lt;TAN-T></link></code> has
               three attributes, <code>@xmlns</code>, <code><link linkend="attribute-TAN-version"
                     >@TAN-version</link></code>, and <code><link linkend="attribute-id"
                  >@id</link></code> (when in prose we talk about an attribute, we normally preface
               the name with <code>@</code>). We will skip <code>@xmlns</code> for now. It looks
               like an attribute, but it's really a pseudo-attribute, because it specifies the
                  <emphasis role="bold">namespace</emphasis> of the XML file. Namespaces are an
               important but advanced topic, not discussed in this chapter. (See <xref
                  xlink:href="#namespace"/>.)</para>
            <para>The value of <code><link linkend="attribute-TAN-version"
                  >@TAN-version</link></code> indicates that the 2020 version of TAN is being used. </para>
            <para><code><link linkend="attribute-id">@id</link></code> is quite important. Every TAN
               file has an <code><link linkend="attribute-id">@id</link></code> that uniquely names
               and permanently identifies the document itself. It should not be changed, even if we
               make edits. If you change the filename or a copy of it winds up being incorporated
               into another project, a stable <code><link linkend="attribute-id">@id</link></code>
               will be quite important for finding it. An <code><link linkend="attribute-id"
                     >@id</link></code> should be unique. The only time it should be repeated in a
               file is when you are referring to another version of the same file.</para>
            <para>The value of <code><link linkend="attribute-id">@id</link></code> must always be
               what is called a tag uniform resource name (tag URN). A tag URN begins with
                  <code>tag:</code>, followed by an email address or domain name that we own or
               owned. It is okay to use an obsolete address or domain; its purpose is to allow users
               to identify you, perhaps centuries from now, not to contact you, although that might
               be a nice side benefit. After that email address or domain name comes a comma (no
               spaces) and a date on which we owned it, in the form of numbers for the year, year +
               month, or year + month + day, each item joined by hyphens, e.g., 2014-12-31. If we
               leave off a day value, it is assumed to be the first of the month; if we leave off
               the month value it is assumed to be January. </para>
            <para>In the examples above, <code>parkj@textalign.net,2015</code> points to our fictive
               self, Jenny Park, who owned that particular email address on the stroke of midnight
               (Coordinated Universal Time) January 1, 2015. After that comes a colon, and then any
               name we wish to assign to the file. </para>
            <para>We have anticipated a simple collection of texts, so we've called the files
                  <code>ring01</code> and <code>ring02</code>. If we run out of names, or want to
               restart, we can simply use a new email-date preface, e.g.,
                  <code>parkj@textalign.net,2015-01-02</code>. Or we could change the way we build
               our tag URNs.</para>
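            <para>The anatomy of a tag URN can be sketched as follows. The first value is assembled
               from the pieces described above; the second, which reuses the name
                  <code>ring01</code> under a later date, is a hypothetical
               illustration:<programlisting>&lt;!-- tag: + email address or domain name + comma + date + colon + name of our choosing -->
id="tag:parkj@textalign.net,2015:ring01"

&lt;!-- Hypothetical: the same name under a new email-date preface -->
id="tag:parkj@textalign.net,2015-01-02:ring01"</programlisting>Because
               the two prefaces differ, the two values are distinct identifiers, even though both
               end in <code>ring01</code>.</para>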
            <para>Tag URNs are very useful. You do not need permission to create a tag URN. You
               don't need to register them with anyone. Hundreds of years from now, when that email
               will be defunct or perhaps owned by someone else, users might still be able to
               identify who was responsible for creating the file. And that email address or domain
               can be recycled by the new owners, decades from now, to create their own tag
               URNs.</para>
            <para>The element <code><link linkend="element-body">&lt;body></link></code> contains
               our transcription. <code><link linkend="attribute-xmllang">@xml:lang</link></code>,
               required, specifies the principal language of the transcribed text. We use the
               standard 3-letter abbreviation for English. We could have used <code>en</code>, but
               2-letter abbreviations support only a relative handful of languages. (See <xref
                  xlink:href="#language"/> for more.) </para>
            <para>Our transcription has been divided into four <code><link linkend="element-div"
                     >&lt;div></link></code> elements. How we divide up the work is entirely up to
               us. But we must make sure that every bit of text is enclosed by a leaf <code><link
                     linkend="element-div">&lt;div></link></code> (i.e., one that contains no other
                     <code><link linkend="element-div">&lt;div></link></code>). Every <code><link
                     linkend="element-div">&lt;div></link></code> must be the parent of only other
                     <code><link linkend="element-div">&lt;div></link></code>s, or none at all. No
                     <code><link linkend="element-div">&lt;div></link></code> may mix text and other
               elements. An exception is made for text that is nothing but space (the space bar, the
               tab, or the new line). Space-only text can be mixed with elements as needed, which
               means that a TAN file can be indented as you like without changing its meaning. </para>
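            <para>As a brief sketch of these rules, consider the following fragments. The
                  <code>stanza</code> division type and the placeholder text are hypothetical,
               introduced only for illustration; they do not come from our example
               files.<programlisting>&lt;!-- Allowed: the outer div contains only other divs; all text sits in leaf divs -->
&lt;div type="stanza" n="1">
    &lt;div type="line" n="1">First line of text&lt;/div>
    &lt;div type="line" n="2">Second line of text&lt;/div>
&lt;/div>

&lt;!-- Not allowed: the outer div mixes text with a child div -->
&lt;div type="stanza" n="1">
    Stray text
    &lt;div type="line" n="1">First line of text&lt;/div>
&lt;/div></programlisting>In
               the first fragment, the line breaks and indentation between the
                  <code>&lt;div></code> tags are space-only text, and so are permitted.</para>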
            <para>The values of <code><link linkend="attribute-type">@type</link></code> and
                     <code><link linkend="attribute-n">@n</link></code> indicate, respectively, the
               type of division and the name of the division. We have used <code>line</code> in the
               first example, but we could easily have also used <code>l</code> (as we did in the
               second) or <code>ln</code> or any other phrase that we think will make intuitive
               sense to other users. The value is arbitrary, but leads to meaning that is not
               arbitrary (we will see how and why below). We have used Arabic numerals for the
               values of <code><link linkend="attribute-n">@n</link></code>, but the value, once
               again, could have been anything. Here we've opted for a reference system that seems
               intuitive and will most likely apply to multiple versions of the work. But the Arabic
               numerals are not required. We could have used Roman numerals, or some other numbering
               or naming scheme that is standard in the field.</para>
            <para>Aside from the <code><link linkend="element-head">&lt;head></link></code> element
               (discussed later), that's all we need in the TAN-T transcription. We can now move to
               alignment and annotation.</para>
            <para>The TAN-A format allows us to align and annotate as many transcriptions as we
               wish, and to make claims about them. Let's begin, once again temporarily skipping
                     <code><link linkend="element-head">&lt;head></link></code>. Significant
               differences from the previous two TAN-T files are
               emphasized:<programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2020/schemas/<emphasis role="bold">TAN-A.rnc</emphasis>" 
    type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2020/schemas/<emphasis role="bold">TAN-A.sch</emphasis>" 
    type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;<emphasis role="bold">TAN-A</emphasis> xmlns="tag:textalign.net,2015:ns" TAN-version="2020" 
    id="tag:parkj@textalign.net,2015:<emphasis role="bold">ring-alignment</emphasis>">
    &lt;head>
    . . . . . . .
    &lt;/head>
    <emphasis role="bold">&lt;body/>
</emphasis>&lt;/TAN-A></programlisting></para>
            <para>In the prolog, the first line is identical to the first line of our transcription
               files. The second and third lines, the processing instructions, are identical, except
               that <code>href</code> points to the validation files specific to the TAN-A format.
               Even the fourth line looks like the two TAN-T files, other than the new name for the
               root element, <code><link linkend="element-TAN-A">&lt;TAN-A></link></code>, and the
               new value for <code><link linkend="attribute-id">@id</link></code>.</para>
            <para>The penultimate line, <code>&lt;body/></code>, is an empty element, and is
               equivalent to an opening tag immediately followed by a closing tag, i.e., <code><link
                     linkend="element-body">&lt;body></link>&lt;/body></code>. The alternative form,
                  <code>&lt;body/></code>, is a shorter and easier way to indicate that an element
               contains nothing. It will become apparent, when we discuss <code><link
                     linkend="element-head">&lt;head></link></code> below, why our <code><link
                     linkend="element-body">&lt;body></link></code> can be empty.</para>
            <para>The other kind of alignment, TAN-A-tok, takes a bit more work, because we must
               first identify words that correspond with each other. Even before we do that, we need
               to decide what kind of relationship holds between the two texts. Let us pretend, for
               the sake of example, that the 1987 version is a direct descendant (and therefore
               variation) of the 1881 one. So our task is to show exactly what words or phrases in
               the older version correspond to those of the newer one. We will simplify in this
               case, and assume an interest only in words with letters, and not punctuation (some
               linguists legitimately treat punctuation as words in their own right). The term word
               is notoriously difficult to define, so we will call them <emphasis>tokens</emphasis>,
               to avoid false connotations (hence the name of the file, TAN-A-tok, to refer to
               alignment of tokens).</para>
            <para>We now create a TAN-A-tok
               file:<programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2020/schemas/<emphasis role="bold">TAN-A-tok.rnc</emphasis>" 
    type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2020/schemas/<emphasis role="bold">TAN-A-tok.sch</emphasis>" 
    type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;<emphasis role="bold">TAN-A-tok</emphasis> xmlns="tag:textalign.net,2015:ns" TAN-version="2020" 
    id="tag:parkj@textalign.net,2015:<emphasis role="bold">TAN-A-tok,ring01+ring02</emphasis>">
    &lt;head>
    . . . . . . .
    &lt;/head>
    &lt;body <emphasis role="bold">reuse-type="general_adaptation" bitext-relation="B-descends-from-A"</emphasis>>
        <emphasis role="bold">&lt;!-- Examples of picking tokens by number -->
        &lt;align>
            &lt;tok src="ring1881" ref="1" pos="1"/>
            &lt;tok src="ring1987" ref="1" pos="1"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="1" pos="2"/>
            &lt;tok src="ring1987" ref="1" pos="2"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="1" pos="3"/>
            &lt;tok src="ring1987" ref="1" pos="3"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="1" pos="4"/>
            &lt;tok src="ring1987" ref="1" pos="4"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="1" pos="5"/>
            &lt;tok src="ring1987" ref="1" pos="5"/>
        &lt;/align>
        &lt;!-- Examples of picking tokens by value -->
        &lt;align>
            &lt;tok src="ring1881" ref="2" val="A"/>
            &lt;tok src="ring1987" ref="2" val="A"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="2" val="pocket"/>
            &lt;tok src="ring1987" ref="2" val="pocket"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="2" val="full"/>
            &lt;tok src="ring1987" ref="2" val="full"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="2" val="of"/>
            &lt;tok src="ring1987" ref="2" val="of"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="2" val="posies"/>
            &lt;tok src="ring1987" ref="2" val="posies"/>
        &lt;/align>
        &lt;!-- Examples of picking ranges of tokens -->
        &lt;align>
            &lt;tok src="ring1881" ref="3" pos="1, 2"/>
            &lt;tok src="ring1987" ref="3" pos="1"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="3" pos="3 - 4"/>
            &lt;tok src="ring1987" ref="3" pos="2"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="4" pos="1"/>
            &lt;tok src="ring1987" ref="4" pos="1"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="4" pos="2"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="4" pos="3"/>
            &lt;tok src="ring1987" ref="4" pos="2"/>
        &lt;/align>
        &lt;!-- Examples of using "last" -->
        &lt;align>
            &lt;tok src="ring1881" ref="4" pos="last-1"/>
            &lt;tok src="ring1987" ref="4" pos="last-1"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="4" pos="last"/>
            &lt;tok src="ring1987" ref="4" pos="last"/>
        &lt;/align></emphasis>
    &lt;/body>
&lt;/TAN-A-tok></programlisting></para>
            <para>Once again, the first four lines, the prolog and root element, should look
               familiar, with the only significant changes being the names of the validation files,
               the name of the root element (<code><link linkend="element-TAN-A-tok"
                     >&lt;TAN-A-tok></link></code>), and the value of <code><link
                     linkend="attribute-id">@id</link></code>.</para>
            <para>The heart of the data is <code><link linkend="element-body"
                  >&lt;body></link></code>, which has two key attributes, <code><link
                     linkend="attribute-reuse-type">@reuse-type</link></code>, which describes the
               activity that was performed to change one version into the other, and <code><link
                     linkend="attribute-bitext-relation">@bitext-relation</link></code>, which
               specifies how one book relates to the other. Our two values,
                  <code>general_adaptation</code> and <code>B-descends-from-A</code>, are arbitrary
               names that we define in the <code><link linkend="element-head"
                  >&lt;head></link></code> (discussed later). (To understand the concepts behind
               reuse types and bitext relations, see <xref linkend="tan-a-tok"/>).</para>
            <para>You will also notice some lines that begin <code>&lt;!--</code> and end
                  <code>--></code>. These are <emphasis role="bold">comments</emphasis>, and can be
               placed within or beside any element, and can enclose any text we like, including line
               breaks.</para>
            <para><code><link linkend="element-body">&lt;body></link></code> is the parent of one or
               more <code><link linkend="element-align">&lt;align></link></code> elements, each of
               which correlates a set of tokens in each of the two texts, pointed to by its
                     <code><link linkend="element-tok">&lt;tok></link></code> children. Each
                     <code><link linkend="element-tok">&lt;tok></link></code> has, in this example,
               three attributes. <code><link linkend="attribute-src">@src</link></code> takes a
               nickname (an <code><link linkend="attribute-id">@id</link></code> reference) that
               points to one of the two transcriptions; we have used <code>ring1881</code> and
                  <code>ring1987</code> for our two texts, but we could have just as easily used
               anything else such as <code>a</code> and <code>b</code>, or <code>uk</code> and
                  <code>us</code>. <code><link linkend="attribute-ref">@ref</link></code> has a
               value that points to a specific <code><link linkend="element-div"
                  >&lt;div></link></code> in the source TAN-T transcription; and <code><link
                     linkend="attribute-pos">@pos</link></code> or <code><link
                     linkend="attribute-val">@val</link></code> specify which token is intended,
               either by word number (<code><link linkend="attribute-pos">@pos</link></code>) or
               text of the actual word (<code><link linkend="attribute-val">@val</link></code>).
               Either technique is fine, and <code><link linkend="attribute-pos">@pos</link></code>
               and <code><link linkend="attribute-val">@val</link></code> can be mixed, as in the
               example. It is generally a good idea to use <code><link linkend="attribute-val"
                     >@val</link></code>, because if the underlying transcription changes in that
               location, <code><link linkend="attribute-val">@val</link></code> might help someone
               repair it; with <code><link linkend="attribute-pos">@pos</link></code> alone, you
               can't. You may also notice that the comma and hyphen can be used in <code><link
                     linkend="attribute-pos">@pos</link></code> to point to multiple words within
               the same <code><link linkend="element-div">&lt;div></link></code>, and that
                  <code>last</code> and <code>last-X</code> (where <code>X</code> is a digit) can be
               used to point to a token by position counting from the end of a <code><link
                     linkend="element-div">&lt;div></link></code>.</para>
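            <para>To make the pointing techniques concrete, suppose, purely for illustration, that
               a source nicknamed <code>x</code> has a <code>&lt;div></code> numbered 3 whose text
               consists of the five tokens <code>one two three four five</code>. Both the nickname
               and the text are hypothetical. Under the rules just described, we would expect each
                  <code>&lt;tok></code> below to pick out the tokens named in its
               comment:<programlisting>&lt;tok src="x" ref="3" pos="1"/>        &lt;!-- one -->
&lt;tok src="x" ref="3" pos="1, 2"/>     &lt;!-- one, two -->
&lt;tok src="x" ref="3" pos="2 - 4"/>    &lt;!-- two, three, four -->
&lt;tok src="x" ref="3" pos="last"/>     &lt;!-- five -->
&lt;tok src="x" ref="3" pos="last-1"/>   &lt;!-- four -->
&lt;tok src="x" ref="3" val="three"/>    &lt;!-- three -->
</programlisting>If
               a word were later inserted at the start of the <code>&lt;div></code>, every
               position-based pointer would shift by one, whereas the value-based pointer would
               still name the word that was intended.</para>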
            <para>Each <code><link linkend="element-align">&lt;align></link></code> can establish
               one-to-one, one-to-many, many-to-one, or many-to-many relationships between tokens
               from the two texts. A token may feature in multiple <code><link
                     linkend="element-align">&lt;align></link></code> elements. And if an
                     <code><link linkend="element-align">&lt;align></link></code> has <code><link
                     linkend="element-tok">&lt;tok></link></code> elements belonging to only one
               source, such as in the fourth-to-last <code><link linkend="element-align"
                     >&lt;align></link></code> above, we have what is called, in these guidelines, a
                  <emphasis>one-sided alignment</emphasis>. This one-sided alignment indicates that
               the second word of line four of the 1881 version is excluded from the act that we
               have called <code>general_adaptation</code>. If this were a translation, it would be as if we
               were saying that this word was excluded from the translation. (A one-sided alignment
               containing tokens only of the later source might point to words that the translator
               added, i.e., what in translation studies is called
               <emphasis>explicitation</emphasis>.) </para>
            <para>A one-sided alignment should not be confused with silence. As creators of this
               file, we make no claim to providing an exhaustive account, and we are under no
               obligation to indicate every word-for-word correspondence. If we fail to mention
               certain words, all that can be implied is that we opted not to say anything about
               them.</para>
            <para>We could have aligned the two texts in different ways. Perhaps further study will
               reveal that we were in error to associate the second "ring" with "round" in line 1.
               We can make corrections, even after publication, and notify other users of our data
               about the change. There are also ways to express doubt or alternative opinions, and to
               credit (or blame) the person making the assertion. We can even correlate fragments of
               tokens (letters, prefixes, infixes, or suffixes). All these more advanced uses are
               discussed at <xref xlink:href="#tan-a-tok"/>.</para>
         </section>
         <section>
            <title>The Principles of TAN Metadata (<code><link linkend="element-head"
                     >&lt;head></link></code>)</title>
            <para>At this point, we have finished four TAN files: two transcriptions (TAN-T), one
               macro-alignment file (TAN-A), and one micro-alignment file (TAN-A-tok). We've avoided
               discussing the <code><link linkend="element-head">&lt;head></link></code> in each of
               them until now. Before getting into details, we need to cover some important
               concepts.</para>
            <para>Unlike <code><link linkend="element-body">&lt;body></link></code>, which carries
               the raw data, <code><link linkend="element-head">&lt;head></link></code> contains
               what is oftentimes called <emphasis role="bold">metadata</emphasis>. That is,
                     <code><link linkend="element-head">&lt;head></link></code> contains data about
               the data that is in <code><link linkend="element-body">&lt;body></link></code>.
               Because the TAN format is intended primarily to serve scholars, and because the
               format is heavily regulated (that is, there are numerous validation rules that
               supplement the standard XML ones), the metadata requirements are stricter than they
               are for Word documents, HTML, TEI, or other formats you might know better. Scholars
               who find our file expect to know some things about it before they can responsibly use
               it. For example, what are the sources we have used? Who produced the data? When? What
               changes or adjustments have been made? What licenses govern the use of the data? The
               questions are not difficult to answer, but they require thought, care, and some time
               to answer.</para>
            <para>Some metadata questions apply only to one TAN format. For example, in a TAN-A-tok
               file, we ask what relationship holds between the two sources. But that question makes
               no sense for a TAN-T file, which is merely a transcription. Some questions apply
               universally across all TAN files, no matter what kind of data. The TAN formats have
               been designed so that <code><link linkend="element-head">&lt;head></link></code>
               handles common metadata consistently across each format. This reduces potential
               confusion, and helps other people using our data to find the information they want.
               More important, what we write in one file can be referenced by another, without
               duplication, and so will reduce the chance of errors.</para>
            <para>Another TAN principle is that each <code><link linkend="element-head"
                     >&lt;head></link></code> should focus exclusively upon the scope of the data in
                     <code><link linkend="element-body">&lt;body></link></code>, and not on other
               things. For example, in a TAN-T file, we are concerned only about the transcription,
               so our metadata too should be concerned only with the transcription. We should
               indicate its source, but because our file is not about the source itself, we don't
               need to describe it further. We are not library catalogers, nor should we be. A TAN-T
               file is for transcribing, not for curating bibliographical data. Our obligation is
               merely to point a reader to complete and authoritative information, found
               elsewhere.</para>
            <para>TAN was also designed under the principle that all metadata should be useful to
               both humans and computers. For our example above, we must describe the work we have
               chosen (<emphasis role="italic">Ring around the Rosie</emphasis>) in a way that is
               comprehensible not just to the reader but to the computer.</para>
            <para>Take for example the 1881 book we have used for our first transcription. For the
               human reader we can write something like "Kate Greenaway, <emphasis>Mother
                  Goose</emphasis>, New York, G. Routledge and sons [1881]". But this human-readable
               string is too complex and syntactically opaque for computers and algorithms. A more
               computer-friendly identifier is the International Standard Book Number (ISBN),
               which distinguishes, for example, the 1984 version of <emphasis>Mother
                  Goose</emphasis> illustrated by Kayoko Okumura from the one of the same year
               illustrated by William Joyce. The ISBNs for the Okumura version, 0671493159, and for
               Joyce's, 0394865340, can be converted into machine-actionable strings called
                  <emphasis role="bold">uniform resource names</emphasis> (<emphasis role="bold"
                  >URNs</emphasis>), in this case
                  <code>urn:isbn:0-671493159</code> and <code>urn:isbn:0-394865340</code>. (Our 1881
               version was published before the ISBN program was introduced. We will see below
               another way to name it.)</para>
            <para>There are different URNs for different things: journals (via ISSNs,
                  <code>urn:issn:...</code>), articles (DOIs, <code>urn:doi:...</code>), movies
               (ISANs, <code>urn:isan:...</code>), and so forth, which means that anyone can use
               them to refer unambiguously to a particular kind of thing. URN naming schemes must be
               registered with the Internet Assigned Numbers Authority (IANA) to ensure permanent,
               persistent, unique names for various types of things. (See <link
                  xlink:href="https://www.iana.org/assignments/urn-namespaces/urn-namespaces.xhtml"
                  >IANA's registry</link> and <xref xlink:href="#variable-official-urn-namespaces"/>
               for a complete list of official URN schemes.)</para>
            <para>All URNs are simply names. They don't tell you where an object is. To provide a
               unique <emphasis role="italic">location</emphasis>, however, we have the perhaps more
               familiar <emphasis role="bold">uniform resource locators</emphasis> (<emphasis
                  role="bold">URLs</emphasis>), e.g., <code>http://academia.edu</code>. Like URNs,
               URLs are also centrally regulated, with individuals or organizations buying the
               rights to domain names from a central registry (usually through a third-party
               vendor).</para>
            <para>Both URNs and URLs can be thought of as the same type of thing, namely, a
                  <emphasis role="bold">uniform resource identifier</emphasis> (<emphasis
                  role="bold">URI</emphasis>), sometimes called an <emphasis role="bold"
                  >internationalized resource identifier</emphasis> (<emphasis role="bold"
                  >IRI</emphasis>). An IRI is an extension of the URI that allows any alphabet in Unicode, not
               just Latin. URIs/IRIs are, in essence, nothing more than the set of all URNs and
               URLs. These four acronyms are easily confused and conflated, even by veterans. URIs
               and IRIs are basically the same thing, and they encompass URNs and URLs, a
               relationship and function that can be remembered by the last letter in each acronym:
                  UR<emphasis role="bold">I</emphasis>s/IR<emphasis role="bold">I</emphasis>s
                  <emphasis role="bold">I</emphasis>ncorporate both <emphasis role="bold"
                  >L</emphasis>ocators (UR<emphasis role="bold">L</emphasis>) and <emphasis
                  role="bold">N</emphasis>ames (UR<emphasis role="bold">N</emphasis>).</para>
            <para>If those acronyms are confusing, don't worry. For our purposes, they are pretty
               much all the same, and from this point onward we'll stick with the term IRI (unless
               we really mean a location to find a file, which we'll call a URL).</para>
            <para>IRIs are essential to a system frequently called the <emphasis role="bold"
                  >semantic web</emphasis> or <emphasis role="bold">linked (open) data</emphasis>,
               which relies upon IRIs as the basis for a simple universal data model. The semantic
               web allows people to make assertions in a way that computers can "understand." If
               people, working independently, happen to use the same IRIs to describe the same
               things, then computers can be programmed to make associations between disparate,
               heterogeneous datasets. For example, if one scholar claims through IRIs that X is the
               mother of Y, and another claims in a different dataset that Y is the mother of Z, a
               computer can infer that X is the grandmother of Z, without the two scholars being
               aware of each other's work. When many scholars begin to use IRIs in their data, the
               result is a network that allows us or anyone else to discover connections across
               disciplines and projects, and make inferences that transcend any single
               project.</para>
            <para>TAN has been designed to be semantic-web friendly, and so requires in its
                     <code><link linkend="element-head">&lt;head></link></code> almost all data to
               be not just human-readable but also computer-readable, normally as an IRI. </para>
            <para>Our first task, then, in writing the <code><link linkend="element-head"
                     >&lt;head></link></code> sections of our four TAN files is to look for IRI
               vocabulary that will be familiar to those most likely to use our files. In trying to
               find suitable IRIs, we will find that the persons, things, and concepts we want to
               describe will range from the highly familiar to the unfamiliar.</para>
            <para><emphasis role="italic">Highly familiar</emphasis>: The two books that provide the
               basis of our transcription are catalogued and generally well known. A number of
               services provided by librarians provide controlled IRI vocabularies that can be used
               by anyone to unambiguously identify a particular version of a book. <link
                  xlink:href="http://www.worldcat.org">WorldCat</link> (run by OCLC) and the <link
                  xlink:href="http://catalog.loc.gov">Library of Congress</link> are good examples.
               In our case, we have found Library of Congress IRIs for both editions of
                  <emphasis>Mother Goose</emphasis>: <code>http://lccn.loc.gov/12032709</code> and
                  <code>http://lccn.loc.gov/87042504</code>. Observe that these two IRIs are also,
               perhaps confusingly, URLs (locations). If we paste these strings into our Web
               browser, we retrieve a record that describes the book. This locator does not lead us
               to the book itself, only to information <emphasis role="italic">about</emphasis> the
               book. Nevertheless, the Library of Congress has decided to make this URL also a name
               for the book, which means that it does double duty, both as a location for a Web page
               and a name for a book. Anyone who owns a domain name can designate a URL as a name
               for an object, a practice that can easily confuse anyone new to the semantic web,
               because such URLs name in reality two types of things: an entity and a web resource
               to learn more about that entity. The idea is that hundreds of years from now, when
               the web page no longer exists, the name will still be valid. </para>
            <para>In the TAN system, you can apply as many IRIs to a concept as you like. In fact,
               it is a good practice to find and add as many IRIs as you think worthwhile, just in
               case someone can't figure out what you're trying to identify. Just make sure that any
               IRI you copy unambiguously points to the thing you have in mind.</para>
            <para>We now have IRIs for the sources. Let's now find an IRI for the work, <emphasis
                  role="italic">Ring around the Rosie</emphasis>. The work is widely known, and even
               has a <link xlink:href="http://en.wikipedia.org/wiki/Ring_a_Ring_o%27_Roses"
                  >Wikipedia entry</link>. That Wikipedia entry is useful to us. The Universities of
               Leipzig and Mannheim and Openlink Software have collaborated on a project called
                  <link xlink:href="http://wiki.dbpedia.org/About">DBPedia</link>, which provides a
               unique IRI for every Wikipedia entry in the major languages. The DBPedia IRI in this
               case is <code>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</code>. Once again,
               this is both a name and a locator. It names a specific, intangible, abstract work,
               namely, a nursery rhyme that we've called <emphasis>Ring around the Rosie</emphasis>,
               no matter what specific version. But if you put that IRI into your browser, you will
               get back more information about that named object.</para>
            <para><emphasis role="italic">Familiar to specialists</emphasis>: We will need to have
               IRIs for some of the people who edited the file. Here we're not interested in the
               authors of the books we transcribed. We are interested in identifying the people who
               helped make the TAN file itself. Most people who write and edit TAN files will not be
               well-known, public figures. If they are, and if they are famous enough to have a
               Wikipedia entry, then a DBPedia IRI could be used. Or if some of the contributors are
               also published authors, there is a good chance that they are listed in the databases
               of either <link xlink:href="http://viaf.org">VIAF</link> or <link
                  xlink:href="http://isni.org">ISNI</link>, both of which publish unique IRIs for
               authors, editors, and other persons central to the publications held in the world's
               libraries. </para>
            <para>Most contributors to TAN files, however, will not be listed in these databases. In
               those cases, we can name these participants with an IRI that we "own." We have
               already done something like this by assigning tag URNs to our four TAN files (the
               value of <code><link linkend="attribute-id">@id</link></code> in the root element).
               Our editors can do the same thing. If a student Robin Smith has been helping with
               proofreading, Robin can take an email address (even one that doesn't work any more)
               and a date when the email address was used and construct a tag URN such as
                  <code>tag:smith.robin@example.com,2012:self</code>. This has a slight drawback in
               that we cannot type this string into our browser to find out more about this
               particular Robin, but it at least allows us to assign a name that will not be
               confused as another Robin Smith, for example the one identified by ISNI as
                  <code>http://isni.org/isni/0000000043306406</code>. (If we want to go a step
               further, Robin could mint a URN from a domain name that she owns, and set up a linked
               data service that offers more information, human- and computer-readable. But this is
               not required, and it can be a hassle to set up and maintain.)</para>
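            <para>Looking ahead to the metadata written in the next section, here is a minimal
               sketch of how Robin's tag URN might be declared in a TAN file. The
               <code>xml:id</code> value <code>smith</code> is a hypothetical choice made only for
               illustration; the IRI is the tag URN constructed
               above:<programlisting>&lt;person xml:id="smith">
    &lt;!-- tag URN: "tag:" + email address or domain name + "," + date + ":" + a name of our choosing -->
    &lt;IRI>tag:smith.robin@example.com,2012:self&lt;/IRI>
    &lt;name>Robin Smith&lt;/name>
&lt;/person></programlisting></para>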
            <para>Let's take a more difficult challenge for locating an IRI, that of describing the
                     <code><link linkend="attribute-bitext-relation">@bitext-relation</link></code>
               in our TAN-A-tok file. <code><link linkend="attribute-bitext-relation"
                     >@bitext-relation</link></code> draws from the discipline of stemmatology,
               which studies how manuscripts were copied from each other, and tries to place these
               manuscripts in a chain of transmission, a kind of historical stemma (tree). We have
               to find an IRI that describes the relationship that we claim holds between two
               text-bearing objects. Making that clear is important, because our perspective about
               the relationship between the two books affects the decisions we make when we align
               words, and other scholars using our files will want to know the assumptions we had
               when we aligned the two texts. </para>
            <para>For the sake of illustration we posit that the version published in the 1987
                  <emphasis>Mother Goose</emphasis> is a direct but not immediate descendant of the
               1881 version. Because no suitable IRI vocabulary yet exists for the relationships
               between texts, TAN itself has coined an IRI that can be used by anyone wishing to
               declare that, given two ordered sources, the second descends from the first through
               an unknown number of intermediaries:
                  <code>tag:textalign.net,2015:bitext-relation:a/x+/b</code>. (The arbitrary symbol
                  <code>/</code> signifies a step from one version to the next, and the
                  <code>x+</code> represents one or more intermediate versions.) We'll use that one
               for now.</para>
            <para>We face a similar issue when thinking about text reuse, <code><link
                     linkend="attribute-reuse-type">@reuse-type</link></code>. Here we are concerned
               with creative activities such as translation, paraphrase, adaptation, and so forth.
               We generally consider the 1987 version to be an adaptation of the 1881 version. And
               there are no stable, well-published IRI vocabularies for text reuse. So we adopt an
               IRI that is part of TAN's standard vocabulary,
                  <code>tag:textalign.net,2015:reuse-type:adaptation:general</code>.</para>
            <para>In the previous two cases, we could have come up with our own vocabulary. But the
               idea behind the semantic web is to use common, familiar vocabulary whenever possible.
               That's the same principle that drew us to structure and label the poem in four
               consecutively numbered lines. We adopt conventions we expect others will likely
               follow. The built-in TAN vocabulary simply gives us a convenient lingua franca for
               describing some important but abstract concepts. For other examples of IRIs coined by
               TAN, see <xref linkend="vocabularies-master-list"/>.</para>
            <para><emphasis role="italic">Generally unfamiliar</emphasis>: Some things or concepts
               will be unfamiliar to nearly everyone, perhaps even us. If we plan to refer to that
               thing or concept often, it is preferable to coin a tag URN, as described above. But
               in some cases, we might find that a tag URN we minted for some concept or thing was,
               in hindsight, misleading or poorly constructed, because we had only superficially
               thought about the category. If we wish to avoid such situations, we can assign a
               randomly generated IRI called a universally unique identifier (UUID), e.g.,
                  <code>urn:uuid:3fd9cece-b246-4556-b229-48f22a5ae2e0</code>. UUID URNs are very
               useful. The likelihood that a randomly generated UUID will be identical to any
               existing UUID is astronomically small, making UUIDs reliably unique names for
               anything (barring someone copying and reusing that UUID URN to name some other object
               or concept). Numerous free UUID generators can be found online.</para>
            <para>To humans, a UUID on its own is meaningless, unmemorable, and rather ugly. But it
               is a start. We always have the option, later, of supplementing it with other IRIs.
               It's perfectly fine to assign multiple IRIs to one object or concept. But the reverse
               is never true. One should never use one IRI to identify more than one object or
               concept.</para>
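            <para>A minimal sketch of that progression, assuming a hypothetical role that has no
               established IRI. Everything here is invented for illustration: the
               <code>xml:id</code>, the name, and the tag URN. The UUID is merely the sample string
               quoted above; in practice you would generate a fresh one. The first IRI is the UUID
               assigned at the outset, and the second is a more readable IRI added later to name
               the very same
               concept:<programlisting>&lt;role xml:id="checker">
    &lt;!-- assigned first, before the concept was well understood -->
    &lt;IRI>urn:uuid:3fd9cece-b246-4556-b229-48f22a5ae2e0&lt;/IRI>
    &lt;!-- added later as a synonym; both IRIs name one and the same concept -->
    &lt;IRI>tag:parkj@textalign.net,2015:role:line-break-checker&lt;/IRI>
    &lt;name>checker of line breaks&lt;/name>
&lt;/role></programlisting></para>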
         </section>
         <section>
            <title>Creating TAN Metadata (<code><link linkend="element-head"
               >&lt;head></link></code>)</title>
            <para>Now that we have explored various IRI vocabularies for concepts related to our
               files concerning <emphasis>Ring-a-ring-a-roses</emphasis>, we can complete the
               metadata in our four TAN files. Let us start with the TAN-T file of the 1881
               version:<programlisting>&lt;TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2020" 
    id="tag:parkj@textalign.net,2015:ring01">
    &lt;head>
        <emphasis role="bold">&lt;name>TAN transcription of Ring a Ring o' Roses&lt;/name>
        &lt;master-location 
            href="http://textalign.net/release/TAN-2020/examples/ring-o-roses.eng.1881.xml"/>
        &lt;license licensor="park">
            &lt;IRI>http://creativecommons.org/licenses/by/4.0/&lt;/IRI>
            &lt;name>Attribution 4.0 International&lt;/name>
        &lt;/license>
        &lt;work>
            &lt;IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses&lt;/IRI>
            &lt;name>"Ring a Ring o' Roses" or "Ring Around the Rosie"&lt;/name>
        &lt;/work>
        &lt;source>
            &lt;IRI>http://lccn.loc.gov/12032709&lt;/IRI>
            &lt;name>Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]&lt;/name>
        &lt;/source>
        &lt;vocabulary-key>
            &lt;person xml:id="park">
                &lt;IRI>tag:parkj@textalign.net,2015:self&lt;/IRI>
                &lt;name>Jenny Park&lt;/name>
            &lt;/person>
            &lt;div-type xml:id="line">
                &lt;IRI>http://dbpedia.org/resource/Line_(poetry)&lt;/IRI>
                &lt;name>line of poetry&lt;/name>
            &lt;/div-type>
            &lt;role xml:id="creator">
                &lt;IRI>http://schema.org/creator&lt;/IRI>
                &lt;name xml:lang="eng">creator&lt;/name>
            &lt;/role>
        &lt;/vocabulary-key>
        &lt;file-resp who="park"/>
        &lt;resp roles="creator" who="park"/>
        &lt;change when="2014-08-13" who="park">Started file&lt;/change>
        &lt;to-do/></emphasis>
    &lt;/head>
    . . . . . . .
&lt;/TAN-T></programlisting></para>
            <para><code><link linkend="element-name">&lt;name></link></code>, the human-readable
               counterpart to the <code><link linkend="attribute-id">@id</link></code> that is
               inside the root element, can be anything. And we can supply more than one <code><link
                     linkend="element-name">&lt;name></link></code>, in case we wish to provide
               alternative names of the file in different spellings or languages.</para>
            <para>One or more <code><link linkend="element-master-location"
                     >&lt;master-location></link></code>s provide URLs where master versions of the
               file are kept (and maintained). We provide this as a courtesy to others who might be
               using our data. Anyone who validates their local copy of the file will be warned if
               it does not match the master version, and they will be told of the most recent
               changes. This lets us silently and conveniently notify other users of changes. We do
               not have to keep track of the users of our file, and users do not have to pester us
               with questions about what changed when.</para>
            <para><code><link linkend="element-master-location">&lt;master-location></link></code>
               is mandatory only if we are finished with our to-do list, which is specified at
                     <code><link xlink:href="#xml">&lt;to-do></link></code>. If that element is
               empty, then we imply that we do not know of anything further that should be done to
               the file. Conversely, any elements in <code><link xlink:href="#xml"
                  >&lt;to-do></link></code> specify what remains to be done, and details will be
               returned to other users. That way you can release data that is useful but not
               completely perfect, and let users know about its deficiencies.</para>
            <para>One day the link in <code><link linkend="element-master-location"
                     >&lt;master-location></link></code> will be dead. But perhaps a copy of our
               file will be in circulation in other quarters. The document <code><link
                     linkend="attribute-id">@id</link></code> in the root element provides a way to
               identify and find files, independent of links.</para>
            <para><code><link linkend="element-license">&lt;license></link></code> specifies the
               license under which we are releasing our data. This element has nothing to do with
               the copyright of the source we have used (although, having been published in 1881,
               the book is clearly in the public domain). That is, we are specifying what rights are
               attached to the data, not its source; the license states what strictures, if any, we
               have placed on the content in <code><link linkend="element-body">&lt;body></link></code>. In this
               example, we have released the data under a Creative Commons license. The child
               element <code><link linkend="element-IRI">&lt;IRI></link></code> specifies a Creative
               Commons IRI, and <code><link linkend="element-name">&lt;name></link></code> is the
               human-readable form.</para>
            <para><code><link xlink:href="#attribute-licensor">@licensor</link></code> specifies who
               has granted the license, in this case our fictive Jenny Park (see below).</para>
            <para>The conjunction of <code><link linkend="element-IRI">&lt;IRI></link></code> and
                     <code><link linkend="element-name">&lt;name></link></code>, the <emphasis
                  role="bold">IRI + name pattern</emphasis>, recurs throughout TAN files. Together they
               provide identifiers for <emphasis role="bold">vocabulary items</emphasis>. In an
               element that takes the IRI + name pattern, we may include as many children
                     <code><link linkend="element-IRI">&lt;IRI></link></code>s or <code><link
                     linkend="element-name">&lt;name></link></code>s as we like. But if we do so, we
               are stating that they are synonymous, i.e., that they all name the same thing. (Once
               again, an IRI is unique, so it should never be used to identify more than one
               thing.)</para>
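            <para>As a small sketch of that rule, the <code>&lt;work></code> from the example above
               could be rewritten so that its single name is split into two synonymous ones. This
               variant is offered only as an illustration and is not taken from the published
               example
               files:<programlisting>&lt;work>
    &lt;IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses&lt;/IRI>
    &lt;!-- two names, one thing: each is declared a synonym of the other and of the IRI -->
    &lt;name>Ring a Ring o' Roses&lt;/name>
    &lt;name>Ring Around the Rosie&lt;/name>
&lt;/work></programlisting></para>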
            <para><code><link linkend="element-work">&lt;work></link></code> uses the IRI + name
               pattern to name the work we have chosen to transcribe. <code><link
                     linkend="element-source">&lt;source></link></code> points, through its IRI +
               name pattern, to a computer- and human-readable description of the book we have
               chosen. </para>
            <para><code><link linkend="element-vocabulary-key">&lt;vocabulary-key></link></code>
               contains vocabulary that we are using in our file. Inside, we can place more
               vocabulary items, and attach locally unique ids. For example, an IRI + name pattern
               is used for <code><link linkend="element-person">&lt;person></link></code>, which
               identifies through a tag URN Jenny Park. The value of <code><link
                     linkend="attribute-xmlid">@xml:id</link></code> allows us to use
                  <code>park</code> any time we want to mention Jenny. In fact, we already have, at
                     <code><link xlink:href="#attribute-licensor">@licensor</link></code>. Any
               mention of <code>park</code> will point to the appropriate item in <code><link
                     linkend="element-vocabulary-key">&lt;vocabulary-key></link></code>.</para>
            <para>There are a few other parts of <code><link linkend="element-vocabulary-key"
                     >&lt;vocabulary-key></link></code>. <code><link linkend="element-div-type"
                     >&lt;div-type></link></code> specifies an IRI + name pattern for line
               divisions, and the value of <code><link linkend="attribute-xmlid"
                  >@xml:id</link></code> means that we can use <code>line</code> any time we want to
               invoke the concept. Similarly we have a <code><link linkend="element-role"
                     >&lt;role></link></code>. The <code><link linkend="element-IRI"
                  >&lt;IRI></link></code> value of <code><link linkend="element-role"
                     >&lt;role></link></code> comes from the vocabulary of <link
                  xlink:href="http://schema.org">schema.org</link>, which is maintained by Bing,
               Google, and Yahoo! in conjunction with the W3C (the nonprofit organization dedicated
               to universal Internet standards), but we could have used Dublin Core or some other
               IRI vocabulary describing behaviors, responsibilities, and roles.</para>
            <para>After the <code><link linkend="element-vocabulary-key"
                  >&lt;vocabulary-key></link></code>, we get into parts of the file that specify who
               did what, when. First is a <code><link xlink:href="#element-file-resp"
                     >&lt;file-resp></link></code>, whose value of <code><link
                     linkend="attribute-who">@who</link></code>, <code>park</code>, indicates that
               Jenny Park is the one primarily responsible for the file. <code><link
                     linkend="element-resp">&lt;resp></link></code> specifies further who was
               responsible for doing what.<note>
                  <para>If you decide to modify someone else's TAN file, you should credit / blame
                     yourself for the changes. Your first order of business should be to add a
                           <code><link linkend="element-person">&lt;person></link></code> to the
                           <code><link linkend="element-vocabulary-key"
                        >&lt;vocabulary-key></link></code>, identifying yourself. You can then
                     either add a <code><link linkend="element-change">&lt;change></link></code>
                     (see below) or a <code><link linkend="element-resp">&lt;resp></link></code>
                     (you might need to specify a <code><link linkend="element-role"
                           >&lt;role></link></code> in the <code><link
                           linkend="element-vocabulary-key">&lt;vocabulary-key></link></code>). You
                     should not change the document's <code><link linkend="attribute-id"
                        >@id</link></code>, unless your changes are so significant that it becomes
                     altogether a new document. TAN does not try to broker the age-old problem of
                     determining when a thing that undergoes changes becomes something altogether
                     different. Use your best intuition.</para>
               </note></para>
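            <para>A rough sketch of what such a credit might look like, assuming that a second
               editor, Robin Smith, corrects Jenny Park's file. The <code>xml:id</code> value
               <code>smith</code>, the date, and the description of the change are all hypothetical.
               The document's <code>@id</code> is left
               untouched:<programlisting>&lt;vocabulary-key>
    . . . . . . .
    &lt;!-- the new editor identifies herself -->
    &lt;person xml:id="smith">
        &lt;IRI>tag:smith.robin@example.com,2012:self&lt;/IRI>
        &lt;name>Robin Smith&lt;/name>
    &lt;/person>
&lt;/vocabulary-key>
. . . . . . .
&lt;change when="2014-08-13" who="park">Started file&lt;/change>
&lt;!-- the new editor takes credit (or blame) for her own change -->
&lt;change when="2014-09-02" who="smith">Corrected a typographical error&lt;/change></programlisting></para>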
            <para>Remember that <code><link linkend="element-head">&lt;head></link></code> is
               focused on the data, not its sources, so the claim that Jenny Park is the creator
               pertains only to the data. No inference should be made about who was responsible for
               the printed source. If someone wants to know anything about the book, they should
               pursue the IRI identifier we have provided under <code><link linkend="element-source"
                     >&lt;source></link></code>.</para>
            <para><code><link linkend="element-change">&lt;change></link></code> has attributes
                     <code><link linkend="attribute-when">@when</link></code> and <code><link
                     linkend="attribute-who">@who</link></code> to specify who made the change and
               when. The value of <code><link linkend="attribute-when">@when</link></code> is always
               a date or a date + time, formatted according to ISO 8601 syntax:
                  <code>[YYYY]-[MM]-[DD]</code> or <code>[YYYY]-[MM]-[DD]T[HH]:[MM]:[SS]</code>.
                     <code><link linkend="attribute-who">@who</link></code> always carries an IDref
               that points to a person or organization. <code><link linkend="element-change"
                     >&lt;change></link></code> does not take the IRI + name pattern, or even any
               children at all.</para>
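            <para>The two permitted forms of <code>@when</code> can be sketched side by side. The
               first entry is the one from our file; the second, with its time and description, is
               hypothetical and shown only to illustrate the date + time
               syntax:<programlisting>&lt;!-- date only -->
&lt;change when="2014-08-13" who="park">Started file&lt;/change>
&lt;!-- date + time, joined by the letter T -->
&lt;change when="2014-08-13T16:45:00" who="park">Proofread against the source&lt;/change></programlisting></para>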
            <para>So now we have finished one transcription file's metadata. The next one will look
               similar, but we'll take a couple of
               shortcuts:<programlisting>&lt;TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2020" 
    id="tag:parkj@textalign.net,2015:ring02">
    &lt;head>
      &lt;name>TAN transcription of <emphasis role="bold">Ring around the Rosie</emphasis>&lt;/name>
      &lt;master-location>ring-o-roses.eng.<emphasis role="bold">1987.xml</emphasis>&lt;/master-location>
      &lt;license <emphasis role="bold">which="by 4.0"</emphasis> licensor="park"/>
      &lt;work>
         &lt;IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses&lt;/IRI>
         &lt;name>Ring around the Rosie&lt;/name>
      &lt;/work>
      &lt;source>
         &lt;IRI><emphasis role="bold">http://lccn.loc.gov/87042504</emphasis>&lt;/IRI>
         &lt;name><emphasis role="bold">Mother Goose, from nursery to literature / by Gloria T. Delamar, 1987.</emphasis>&lt;/name>
      &lt;/source>
      <emphasis role="bold">&lt;adjustments>
         &lt;normalization which="no hyphens"/>
      &lt;/adjustments></emphasis>
      &lt;vocabulary-key>
         <emphasis role="bold">&lt;div-type xml:id="l" which="line (verse)"/></emphasis>
         &lt;person xml:id="park" roles="creator">
            &lt;IRI>tag:parkj@textalign.net,2015:self&lt;/IRI>
            &lt;name xml:lang="eng">Jenny Park&lt;/name>
         &lt;/person>
      &lt;/vocabulary-key>
      &lt;resp roles="creator" who="park"/>
      &lt;change when="2014-10-24" who="park">Started file&lt;/change>
      <emphasis role="bold">&lt;comment when="2014-10-24" who="park">See p. 39 of source.&lt;/comment></emphasis>
      &lt;to-do/>
   &lt;/head>
   . . . . . .
&lt;/TAN-T></programlisting></para>
            <para>In this example, <code><link linkend="element-name">&lt;name></link></code>,
                     <code><link linkend="element-master-location"
                  >&lt;master-location></link></code>, and <code><link linkend="element-source"
                     >&lt;source></link></code> have been modified to describe this file. Note, we
               haven't had to change <code><link linkend="element-work"
               >&lt;work></link></code>.</para>
            <para><code><link linkend="element-license">&lt;license></link></code> looks different,
               but in reality it is identical to our previous example, and that is because the IRI +
               name pattern has been replaced with <link linkend="attribute-which"
                     ><code>@which</code></link>. You may replace any IRI + name pattern with <link
                  linkend="attribute-which"><code>@which</code></link>; its value should match a
                     <code><link linkend="element-name">&lt;name></link></code> in customized or
               standard vocabulary (a TAN-voc file). In TAN's standard vocabulary for licenses (see
                  <xref xlink:href="#vocabularies-licenses"/>) is the following item:</para>
            <para>
               <programlisting>&lt;<emphasis role="bold">TAN-voc</emphasis> xmlns="tag:textalign.net,2015:ns" TAN-version="2020" 
   id="tag:textalign.net,2015:<emphasis role="bold">tan-voc:licenses</emphasis>">
    . . . . . . .
   &lt;body <emphasis role="bold">affects-element="license"</emphasis>>
      <emphasis role="bold">&lt;item>
         &lt;IRI>http://creativecommons.org/licenses/by/4.0/&lt;/IRI>
         &lt;IRI>tag:textalign.net,2015:license:by/4.0/&lt;/IRI>
         &lt;name>by 4.0&lt;/name>
         &lt;desc>attribution 4.0 international&lt;/desc>
      &lt;/item></emphasis>
    . . . . . . .
   &lt;/body>
&lt;/TAN-voc></programlisting>
            </para>
            <para>Because the validation rules for TAN-voc files require every <code><link
                     linkend="element-name">&lt;name></link></code> to be unique, that element can
               be treated as a unique identifier, similar to <code><link linkend="attribute-xmlid"
                     >@xml:id</link></code>. We could have repeated the <code><link
                     linkend="element-license">&lt;license></link></code> from the previous TAN-T
               file. But the <link linkend="attribute-which"><code>@which</code></link> method is
               much quicker and cleaner.</para>
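            <para>To make the equivalence concrete, the two forms used in our TAN-T files can be set
               side by side. Both are copied from the examples above, and both resolve to the same
               vocabulary
               item:<programlisting>&lt;!-- long form: IRI + name pattern, as in the 1881 file -->
&lt;license licensor="park">
    &lt;IRI>http://creativecommons.org/licenses/by/4.0/&lt;/IRI>
    &lt;name>Attribution 4.0 International&lt;/name>
&lt;/license>

&lt;!-- short form, as in the 1987 file: @which matches the &lt;name> "by 4.0" in the standard vocabulary -->
&lt;license which="by 4.0" licensor="park"/></programlisting></para>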
            <para>Before <code><link linkend="element-vocabulary-key"
                  >&lt;vocabulary-key></link></code> comes a new element, <code><link
                     linkend="element-adjustments">&lt;adjustments></link></code>, which contains a
                     <code><link linkend="element-normalization">&lt;normalization></link></code>
               statement whose <link linkend="attribute-which"><code>@which</code></link> says
                  <code>no hyphens</code>. That too points to a standard TAN vocabulary for
               normalizations that provides an item with an IRI + name pattern for eliminating
               discretionary hyphens (see <xref xlink:href="#vocabularies-normalizations"/>):</para>
            <para>
               <programlisting>&lt;<emphasis role="bold">TAN-voc</emphasis> xmlns="tag:textalign.net,2015:ns" TAN-version="2020" id="tag:textalign.net,2015:<emphasis role="bold">tan-voc:normalizations</emphasis>">
    . . . . . . .
   &lt;body <emphasis role="bold">affects-element="normalization"</emphasis>>
      <emphasis role="bold">&lt;item>
         &lt;IRI>tag:textalign.net,2015:normalization:hyphens-discretionary-removed&lt;/IRI>
         &lt;name>no hyphens&lt;/name>
         &lt;desc>Discretionary word-break line-end hyphens have been deleted.&lt;/desc>
      &lt;/item></emphasis>
    . . . . . . .
   &lt;/body>
&lt;/TAN-voc></programlisting>
            </para>
            <para>As you might have inferred, the element <code><link
                     linkend="element-normalization">&lt;normalization></link></code> specifies how
               we have changed the data, namely, that we have opted to remove word-break line-end
               hyphenation. In other transcriptions we could use <code><link
                     linkend="element-normalization">&lt;normalization></link></code> to declare
               other kinds of changes we felt compelled to make, such as removing editorial comments
               or footnote signals. A healthy list of <code><link linkend="element-normalization"
                     >&lt;normalization></link></code>s is a courtesy to users of our data, some of
               whom might passionately care about keeping or removing line-end hyphenation. </para>
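            <para>A minimal sketch of such a list follows. The first <code><link
                     linkend="element-normalization">&lt;normalization></link></code> is the one
               from our example. The second is hypothetical: its IRI and name are invented here
               purely for illustration, and it follows the IRI + name pattern instead of <link
                  linkend="attribute-which"><code>@which</code></link>:</para>
            <para>
               <programlisting>&lt;adjustments>
   &lt;normalization which="no hyphens"/>
   &lt;!-- hypothetical item: this IRI and name are invented for illustration -->
   &lt;normalization>
      &lt;IRI>tag:parkj@textalign.net,2015:normalization:footnote-signals-removed&lt;/IRI>
      &lt;name>Footnote signals have been removed&lt;/name>
   &lt;/normalization>
&lt;/adjustments></programlisting>
            </para>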
            <para>Back to our example. <code><link linkend="element-div-type"
                  >&lt;div-type></link></code> has a new value for <code><link
                     linkend="attribute-xmlid">@xml:id</link></code>, the letter <code>l</code>, and
               in it too the IRI + name pattern has been replaced by <link linkend="attribute-which"
                     ><code>@which</code></link>, whose value, <code>line (poetry)</code>, is a
               standard vocabulary item (see <xref xlink:href="#vocabularies-div-types"/>).</para>
            <para>There is also a new <code><link linkend="element-comment"
                  >&lt;comment></link></code> element, which is built much the same as <code><link
                     linkend="element-change">&lt;change></link></code>. (A <code><link
                     linkend="element-change">&lt;change></link></code>, after all, is just a
               comment about what has been changed.)</para>
            <para>That seems to be all there is. But if you've been attentive, you will have noticed
               that <code><link linkend="element-role">&lt;role></link></code> from our first TAN-T
               file (inside <code><link linkend="element-vocabulary-key"
                  >&lt;vocabulary-key></link></code>) is missing. That's because we don't need it,
               based on the same principle that lets us resolve <link linkend="attribute-which"
                     ><code>@which</code></link>. A vocabulary <code><link linkend="element-name"
                     >&lt;name></link></code> can be invoked not only in <link
                  linkend="attribute-which"><code>@which</code></link>, but in any attribute that
               points to values of <code><link linkend="attribute-xmlid">@xml:id</link></code>, in
               this case <code><link linkend="attribute-roles">@roles</link></code>. There is
               already a standard TAN vocabulary item with the <code><link linkend="element-name"
                     >&lt;name></link></code>
               <code>creator</code>, so we can use it directly without having to go through an
               intermediate vocabulary item with an <code><link linkend="attribute-xmlid"
                     >@xml:id</link></code>. If we had defined something else in <code><link
                     linkend="element-vocabulary-key">&lt;vocabulary-key></link></code> with a
                     <code><link linkend="attribute-xmlid">@xml:id</link></code> of
                  <code>creator</code>, that item would take precedence and override the built-in
               TAN vocabulary. But we haven't, so the standard TAN vocabularies are the
               default.</para>
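            <para>To illustrate the precedence just described, here is a sketch of a <code><link
                     linkend="element-head">&lt;head></link></code> fragment in which a locally
               defined <code><link linkend="element-role">&lt;role></link></code> claims the
                  <code><link linkend="attribute-xmlid">@xml:id</link></code>
               <code>creator</code>. The IRI and name are hypothetical, invented only for this
               illustration. With such a definition in place, <code>roles="creator"</code> would
               resolve to the local item rather than to the standard TAN vocabulary:</para>
            <para>
               <programlisting>&lt;vocabulary-key>
   &lt;!-- hypothetical local definition that takes over the id "creator" -->
   &lt;role xml:id="creator">
      &lt;IRI>tag:parkj@textalign.net,2015:role:creator&lt;/IRI>
      &lt;name>creator (project-specific definition)&lt;/name>
   &lt;/role>
   &lt;person xml:id="park" which="Jenny Park"/>
&lt;/vocabulary-key>
&lt;!-- "creator" now points to the local role above, not the standard item -->
&lt;resp roles="creator" who="park"/></programlisting>
            </para>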
         </section>
         <section>
            <title>Building TAN Vocabulary</title>
            <para>The first TAN-T transcription had a longer <code><link linkend="element-head"
                     >&lt;head></link></code> than the second one did, and that is because for the
               former we used an explicit method, that of specifying every IRI and name, and then in
               the latter adopted shortcuts that took advantage of TAN vocabulary. TAN vocabularies
               are meant not merely to be a convenience; they are intended to avoid problems that
               beset projects that create many files with repeated data patterns. When (not if) you
               make changes to one file you have to remember all the other places where you might
               need to make the same changes. The old programmer's adage "Don't repeat yourself"
               (DRY) is operative here. If there is a repeating data pattern, put it in one master
               place, and let the other files point to that pattern. When we make changes, we do so
               only at a single place.</para>
            <para>The previous examples drew from standard TAN vocabulary, which is written in one
               of the other TAN formats, TAN-voc. There is a whole collection of standard TAN-voc
               files in the project subdirectory called <code>vocabularies</code>. We can write our
               own TAN-voc files, to collect the vocabulary items that we will use repeatedly from
               one file to the next. For example:</para>
            <para>
               <programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="../../schemas/<emphasis role="bold">TAN-voc.rnc</emphasis>" type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="../../schemas/<emphasis role="bold">TAN-voc.sch</emphasis>" type="application/xml" 
    schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2020" 
    id="tag:parkj@textalign.net,2015:<emphasis role="bold">TAN-voc:standard</emphasis>">
    &lt;head>
        <emphasis role="bold">&lt;name>Keywords for TAN files edited by Jenny Park&lt;/name></emphasis>
        &lt;license licensor="park" which="by 4.0"/>
        &lt;vocabulary-key>
            <emphasis role="bold">&lt;person which="Jenny Park" xml:id="park"/></emphasis>
        &lt;/vocabulary-key>
        &lt;file-resp who="park"/>
        &lt;resp roles="creator" who="park"/>
        &lt;change when="2019-10-08" who="park">Started file&lt;/change>
        &lt;to-do>
            <emphasis role="bold">&lt;comment when="2020-01-04" who="park">Need to check files for new vocabulary items.&lt;/comment></emphasis>
        &lt;/to-do>
    &lt;/head>
    &lt;body>
        <emphasis role="bold">&lt;group affects-element="person">
            &lt;item>
                &lt;IRI>tag:parkj@textalign.net,2015:self&lt;/IRI>
                &lt;name xml:lang="eng">Jenny Park&lt;/name>
            &lt;/item>
        &lt;/group></emphasis>
        <emphasis role="bold">&lt;item affects-element="work">
            &lt;IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses&lt;/IRI>
            &lt;name>Ring a Ring o' Roses&lt;/name>
            &lt;name>Ring Around the Rosie&lt;/name>
        &lt;/item></emphasis>
    &lt;/body>
&lt;/TAN-voc></programlisting>
            </para>
            <para>In this example case, updates have been made to <code><link linkend="attribute-id"
                     >@id</link></code> and <code><link linkend="element-name"
                  >&lt;name></link></code>, and a <code><link linkend="element-comment"
                     >&lt;comment></link></code> has been added to <code><link xlink:href="#xml"
                     >&lt;to-do></link></code>. The most significant difference is the <code><link
                     linkend="element-body">&lt;body></link></code>, which has two <code><link
                     linkend="element-item">&lt;item&gt;</link></code>s, one of which is wrapped in
               a <code><link linkend="element-group">&lt;group></link></code>. Each <code><link
                     linkend="attribute-affects-element">@affects-element</link></code> specifies
               one or more names of elements that the enclosed items affect, and the <code><link
                     linkend="element-item">&lt;item&gt;</link></code>s have the standard IRI + name
               pattern. <code><link linkend="element-group">&lt;group></link></code>s may nest as
               you like.</para>
            <para>The difference between a grouped and ungrouped <code><link linkend="element-item"
                     >&lt;item&gt;</link></code> is purely a matter of taste and convenience. The
               example above illustrates both methods.</para>
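            <para>As a minimal sketch of nesting, the fragment below places one <code><link
                     linkend="element-group">&lt;group></link></code> inside another. The two-level
               arrangement and the labels in the comments are hypothetical, chosen only to show
               the shape; to avoid assuming anything about inheritance, each <code><link
                     linkend="element-group">&lt;group></link></code> states its own <code><link
                     linkend="attribute-affects-element">@affects-element</link></code>:</para>
            <para>
               <programlisting>&lt;body>
    &lt;!-- hypothetical outer group: everyone connected with the project -->
    &lt;group affects-element="person">
        &lt;!-- hypothetical inner group: editors -->
        &lt;group affects-element="person">
            &lt;item>
                &lt;IRI>tag:parkj@textalign.net,2015:self&lt;/IRI>
                &lt;name xml:lang="eng">Jenny Park&lt;/name>
            &lt;/item>
        &lt;/group>
    &lt;/group>
&lt;/body></programlisting>
            </para>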
            <para>The <code><link linkend="element-vocabulary-key">&lt;vocabulary-key></link></code>
               has a <code><link linkend="element-person">&lt;person></link></code> whose <link
                  linkend="attribute-which"><code>@which</code></link> points to the body of the
               first <code><link linkend="element-item">&lt;item&gt;</link></code>. That is, a
               TAN-voc file can use its own vocabulary, without repeating it in <code><link
                     linkend="element-vocabulary-key">&lt;vocabulary-key></link></code>.</para>
            <para>Let's return to the <code><link linkend="element-head">&lt;head></link></code>s of
               our two TAN-T files, and see how to incorporate our new TAN-voc vocabulary
               file.</para>
            <para>
               <programlisting>&lt;TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2020" 
    id="tag:parkj@textalign.net,2015:ring01">
    &lt;head>
        &lt;name>TAN transcription of Ring a Ring o' Roses&lt;/name>
        &lt;master-location 
            href="http://textalign.net/release/TAN-2020/examples/ring-o-roses.eng.1881.xml"/>
<emphasis role="bold">        &lt;license which="by 4.0" licensor="park"/>
        &lt;work which="Ring around the Rosie"/>
</emphasis>        &lt;source>
            &lt;IRI>http://lccn.loc.gov/12032709&lt;/IRI>
            &lt;name>Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]&lt;/name>
        &lt;/source>
<emphasis role="bold">        &lt;vocabulary>
           &lt;IRI>tag:parkj@textalign.net,2015:TAN-voc:standard&lt;/IRI>
           &lt;name>Vocabulary for TAN files edited by Jenny Park&lt;/name>
           &lt;location href="TAN-voc/park-projects.TAN-voc.xml" accessed-when="2020-01-10"/>
        &lt;/vocabulary>
</emphasis>        &lt;vocabulary-key><emphasis role="bold">
            &lt;person xml:id="park" which="Jenny Park"/>
            &lt;div-type xml:id="line" which="line (verse)"/>
</emphasis>        &lt;/vocabulary-key>
        &lt;file-resp who="park"/>
        &lt;resp roles="creator" who="park"/>
        &lt;change when="2014-08-13" who="park">Started file&lt;/change>
        &lt;to-do/>
    &lt;/head>
    . . . . . . .
&lt;/TAN-T></programlisting>
            </para>
            <para>
               <programlisting>&lt;TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2020" 
    id="tag:parkj@textalign.net,2015:ring02">
    &lt;head>
      &lt;name>TAN transcription of Ring around the Rosie&lt;/name>
      &lt;master-location>ring-o-roses.eng.1987.xml&lt;/master-location>
      &lt;license which="by 4.0" licensor="park"/>
      <emphasis role="bold">&lt;work which="Ring around the Rosie"/></emphasis>
      &lt;source>
         &lt;IRI>http://lccn.loc.gov/87042504&lt;/IRI>
         &lt;name>Mother Goose, from nursery to literature / by Gloria T. Delama, 1987.&lt;/name>
      &lt;/source>
      <emphasis role="bold">&lt;vocabulary>
         &lt;IRI>tag:parkj@textalign.net,2015:TAN-voc:standard&lt;/IRI>
         &lt;name>Vocabulary for TAN files edited by Jenny Park&lt;/name>
         &lt;location href="TAN-voc/park-projects.TAN-voc.xml" accessed-when="2020-01-10"/>
      &lt;/vocabulary></emphasis>
      &lt;adjustments>
         &lt;normalization which="no hyphens"/>
      &lt;/adjustments>
      &lt;vocabulary-key>
         <emphasis role="bold">&lt;div-type xml:id="l" which="line (verse)"/></emphasis>
         <emphasis role="bold">&lt;person xml:id="park" which="Jenny Park"/></emphasis>
      &lt;/vocabulary-key>
      &lt;resp roles="creator" who="park"/>
      &lt;change when="2014-10-24" who="park">Started file&lt;/change>
      &lt;comment when="2014-10-24" who="park">See p. 39 of source.&lt;/comment>
      &lt;to-do/>
   &lt;/head>
   . . . . . .
&lt;/TAN-T></programlisting>
            </para>
            <para>In each TAN-T file, a new <code><link linkend="element-vocabulary"
                     >&lt;vocabulary&gt;</link></code> points to the project TAN-voc vocabulary file
               we have just created. Along with the customary IRI + name pattern is a new element,
                     <code><link linkend="element-location">&lt;location></link></code>, which
               specifies where the digital file was accessed and when (through <code><link
                     linkend="attribute-accessed-when">@accessed-when</link></code>). We may include
               as many of these <code><link linkend="element-location">&lt;location></link></code>
               elements as we wish, with the most preferred or reliable one at the top. The
               validation process will consult only the first one that leads to an available
               document. The <code><link linkend="attribute-accessed-when"
                  >@accessed-when</link></code> value is important, because the validator will look
               for changes in the file since we last accessed it, and if any changes are found a
               warning with a summary of the changes will be returned. It is then up to us to
               determine if the alterations merit any action on our part.</para>
            <para>Similarly, anyone using or depending upon our file will be notified of any
               changes we make, through the same validation process.</para>
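            <para>The use of several <code><link linkend="element-location"
                  >&lt;location></link></code> elements, described above, might be sketched as
               follows. The first is the one from our example; the second is a hypothetical
               fallback copy whose address is invented for illustration, and it would be consulted
               only if the first did not lead to an available document:</para>
            <para>
               <programlisting>&lt;vocabulary>
   &lt;IRI>tag:parkj@textalign.net,2015:TAN-voc:standard&lt;/IRI>
   &lt;name>Vocabulary for TAN files edited by Jenny Park&lt;/name>
   &lt;!-- preferred copy, listed first -->
   &lt;location href="TAN-voc/park-projects.TAN-voc.xml" accessed-when="2020-01-10"/>
   &lt;!-- hypothetical fallback copy; this URL is invented for illustration -->
   &lt;location href="http://example.org/TAN-voc/park-projects.TAN-voc.xml" 
      accessed-when="2020-01-10"/>
&lt;/vocabulary></programlisting>
            </para>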
            <para>Once the <code><link linkend="element-vocabulary">&lt;vocabulary&gt;</link></code>
               is in place, we can draw from our predefined vocabulary. Hence, these revised
               versions of the <code><link linkend="element-head">&lt;head></link></code>s are a bit
               more compact and easier to read. The longer the TAN file, the more noticeable the
               improvement. And when our library grows into dozens of files, we'll be grateful that
               a change that affects all the files needs to be made only once.</para>
            <para>Now that we have created the metadata for our transcriptions, let's turn to the
               alignment files. Those <code><link linkend="element-head">&lt;head></link></code>s
               will look slightly different, because they are not concerned with transcriptions per
               se. We start with the TAN-A
               file:<programlisting>&lt;TAN-A xmlns="tag:textalign.net,2015:ns" TAN-version="2020" 
    id="tag:parkj@textalign.net,2015:<emphasis role="bold">ring-alignment</emphasis>">
    &lt;head>
       &lt;name><emphasis role="bold">div-based alignment of multiple versions of Ring o Roses</emphasis>&lt;/name>
       &lt;master-location href="<emphasis role="bold">http://textalign.net/release/TAN-2020/examples/TAN-A/ringoroses.div.1.xml</emphasis>"/>
       &lt;license which="by_4.0" licensor="park"/>
       <emphasis role="bold">&lt;source xml:id="eng-uk">
          &lt;IRI>tag:parkj@textalign.net,2015:ring01&lt;/IRI>
          &lt;name>Transcription of ring around the roses in English (UK)&lt;/name>
          &lt;location href="../ring-o-roses.eng.1881.xml" accessed-when="2015-03-10"/>
       &lt;/source>
       &lt;source xml:id="eng-us">
          &lt;IRI>tag:parkj@textalign.net,2015:ring02&lt;/IRI>
          &lt;name>Transcription of ring around the roses in English (US)&lt;/name>
          &lt;location href="../ring-o-roses.eng.1987.xml" accessed-when="2014-08-13"/>
       &lt;/source></emphasis>
       &lt;vocabulary-key>
          &lt;person xml:id="park" which="Jenny Park"/>
       &lt;/vocabulary-key>
       &lt;resp who="park" roles="creator"/>
       &lt;change when="2014-08-14" who="park">Started file&lt;/change>
       &lt;to-do>
          &lt;comment when="2018-08-09-04:00" who="park">Finish file.&lt;/comment>
       &lt;/to-do>
    &lt;/head>
    . . . . . .
&lt;/TAN-A></programlisting></para>
            <para>Much of the code above will look similar to the previous two examples. The file's
                     <code><link linkend="element-name">&lt;name></link></code> and <code><link
                     linkend="element-master-location">&lt;master-location></link></code> are
               updated. Just as TAN-T files have <code><link linkend="element-source"
                     >&lt;source></link></code>s, so TAN-A files do as well, except that those
               sources are always TAN-T transcription files, and they take the IRI + name + location
               pattern we saw above in <code><link linkend="element-vocabulary"
                     >&lt;vocabulary&gt;</link></code>. Because alignment files take only TAN
               transcription files as sources, each <code><link linkend="element-source"
                     >&lt;source></link></code>'s <code><link linkend="element-IRI"
                  >&lt;IRI></link></code> always takes the <code><link linkend="attribute-id"
                     >@id</link></code> value of the target TAN-T transcription file. <code><link
                     linkend="element-name">&lt;name></link></code> is arbitrary. It may replicate
               exactly the title found in the transcription file, or it may be modified, perhaps to
               harmonize better with the descriptions of the other source names. Our TAN-A file
               could have any number of <code><link linkend="element-source"
                  >&lt;source></link></code>s, and not necessarily for the same work. The order in
               which we put the <code><link linkend="element-source">&lt;source></link></code>s does
               not necessarily mean anything. </para>
            <para>This <code><link linkend="element-head">&lt;head></link></code> explains why the
                     <code><link linkend="element-body">&lt;body></link></code> of our TAN-A file is
               allowed to be empty. We have already specified which sources are to be aligned and
               where they are to be found. Any user or processor of a TAN-A file may assume that
               every <code><link linkend="element-div">&lt;div></link></code> in every source should
               be automatically aligned upon the basis of shared values of <code><link
                     linkend="attribute-n">@n</link></code>.</para>
            <para>Meanwhile we turn to our fourth file, TAN-A-tok, whose <code><link
                     linkend="element-head">&lt;head></link></code> might look like
               this:<programlisting>&lt;TAN-A-tok xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:TAN-A-tok,ring01+ring02">
    &lt;head>
        &lt;name><emphasis role="bold">token-based alignment of two versions of Ring o Roses</emphasis>&lt;/name>
        &lt;master-location href="<emphasis role="bold">http://textalign.net/release/TAN-2020/examples/TAN-A-tok/ringoroses.01+02.token.1.xml</emphasis>"/>
        &lt;license which="<emphasis role="bold">by-nc-nd_4.0</emphasis>" rights-holder="park"/>
        <emphasis role="bold">&lt;token-definition src="ring1881 ring1987" which="letters"/></emphasis>
        &lt;source xml:id="eng-uk">
            &lt;IRI>tag:parkj@textalign.net,2015:ring01&lt;/IRI>
            &lt;name>Transcription of ring around the roses in English (UK)&lt;/name>
            &lt;location href="../ring-o-roses.eng.1881.xml" accessed-when="2015-03-10"/>
        &lt;/source>
        &lt;source xml:id="eng-us">
            &lt;IRI>tag:parkj@textalign.net,2015:ring02&lt;/IRI>
            &lt;name>Transcription of ring around the roses in English (US)&lt;/name>
            &lt;location href="../ring-o-roses.eng.1987.xml" accessed-when="2014-08-13"/>
        &lt;/source>
        &lt;vocabulary-key>
            <emphasis role="bold">&lt;bitext-relation xml:id="B-descends-from-A" which="a/x+/b"/></emphasis>
            <emphasis role="bold">&lt;token-definition src="ring1881 ring1987" which="letters"/></emphasis>
            &lt;person xml:id="park" which="Jenny Park"/>
        &lt;/vocabulary-key>
        &lt;change when="2015-01-20" who="park">Started file&lt;/change>
    &lt;/head>
    . . . . . .
&lt;/TAN-A-tok></programlisting></para>
            <para>The TAN-A-tok <code><link linkend="element-head">&lt;head></link></code> looks
               similar to the previous examples, except that <code><link
                     linkend="element-vocabulary-key">&lt;vocabulary-key></link></code> has some new
               content.</para>
            <para><code><link linkend="element-bitext-relation">&lt;bitext-relation></link></code>
               states through <link linkend="attribute-which"><code>@which</code></link> or an IRI +
               name pattern the stemmatic relationship we think holds between the two sources. We
               have used <link linkend="attribute-which"><code>@which</code></link> and the value
                  <code>a/x+/b</code>, pointing to a standard TAN vocabulary item for <link
                  xlink:href="#vocabularies-bitext-relations">bitext relations</link>:</para>
            <para>
               <programlisting>&lt;TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2020" 
   id="tag:textalign.net,2015:tan-voc:bitext-relation">
. . . . . .
        &lt;item>
            &lt;IRI>tag:textalign.net,2015:bitext-relation:a/x+/b&lt;/IRI>
            <emphasis role="bold">&lt;name>a/x+/b&lt;/name></emphasis>
            &lt;desc>direct descent, B descends from A, one or more mediaries&lt;/desc>
        &lt;/item>
. . . . . .
&lt;/TAN-voc></programlisting>
            </para>
            <para><code><link linkend="element-token-definition">&lt;token-definition></link></code>
               specifies how we have defined our word tokens. <code><link linkend="attribute-src"
                     >@src</link></code> has more than one value, specifying that the same
               tokenization rule should be applied to both sources. <link linkend="attribute-which"
                     ><code>@which</code></link> points to this standard TAN vocabulary item:</para>
            <para>
               <programlisting>&lt;TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2020" 
   id="tag:textalign.net,2015:tan-voc:tokenizations">
. . . . . .
        &lt;item>
            <emphasis role="bold">&lt;token-definition pattern="[\w&amp;#xad;​&amp;#x200b;&amp;#x200d;]+"/></emphasis>
            &lt;name>letters&lt;/name>
            &lt;name>letters only&lt;/name>
            &lt;name>general word characters only&lt;/name>
            &lt;name>general ignore punctuation&lt;/name>
            &lt;name>gwo&lt;/name>
            &lt;desc>General tokenization pattern for any language, words only. Non-letters 
                such as punctuation are ignored.&lt;/desc>
        &lt;/item>
. . . . . .
&lt;/TAN-voc></programlisting>
            </para>
            <para>Up until now, all vocabulary items have taken the IRI + name pattern. The one
               above does not have an IRI, only a <code><link linkend="element-token-definition"
                     >&lt;token-definition></link></code> with a <code><link
                     linkend="attribute-pattern">@pattern</link></code>. The value of <code><link
                     linkend="attribute-pattern">@pattern</link></code>, which may look like
               gibberish, is a <emphasis role="bold">regular expression</emphasis>. "Regular" here
               does not mean ordinary; rather it derives from the Latin <emphasis>regula</emphasis>,
               rule. Regular expressions are rule-based patterned text searches. This particular
               pattern says that a token is defined as any contiguous string of word characters
                  (<code>\w</code>), soft hyphens (<code>&amp;#xad;</code>), zero-width spaces
                  (<code>&amp;#x200b;</code>), or zero-width joiners (<code>&amp;#x200d;</code>).
               This is TAN's default tokenization pattern, and it will be assumed for any TAN-A-tok
               file that lacks a <code><link linkend="element-token-definition"
                     >&lt;token-definition></link></code>. TAN adopts this default because in
               ordinary conversation, when we refer to the nth word in a sentence, we most often
               ignore punctuation marks. For more on token definitions see <xref
                  xlink:href="#defining_tokens"/> and <xref
                  xlink:href="#vocabularies-token-definitions"/>. See also <xref
                  xlink:href="#regular_expressions"/>.</para>
            <para>In our <code><link linkend="element-vocabulary-key"
                  >&lt;vocabulary-key></link></code> we could have also included a <code><link
                     linkend="element-reuse-type">&lt;reuse-type></link></code>, but we have
               intentionally omitted it here, because we have <code>&lt;body
                  bitext-relation="B-descends-from-A" reuse-type="general_adaptation"></code>. The
               value for <code><link linkend="attribute-reuse-type">@reuse-type</link></code>,
                  <code>general_adaptation</code>, corresponds to a <code><link
                     linkend="element-name">&lt;name></link></code> in a standard TAN vocabulary
               item for reuse types. We don't need to invoke a <code><link
                     linkend="element-reuse-type">&lt;reuse-type></link></code> in the <code><link
                     linkend="element-vocabulary-key">&lt;vocabulary-key></link></code> because we
               have opted not to give it an <code><link linkend="attribute-xmlid"
                  >@xml:id</link></code>. Notice that <code>general_adaptation</code> has an
               underscore instead of a space. That's because <code><link
                     linkend="element-reuse-type">&lt;reuse-type></link></code> can take multiple
               values, which are signified by spaces. We could have used a hyphen instead of an
               underscore, if we preferred. The values of <code><link linkend="element-name"
                     >&lt;name></link></code> are never case-sensitive, and the space, hyphen, and
               underscore are treated as equivalent.</para>
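            <para>To make that equivalence concrete, each of the following opening tags would
               point to the same vocabulary item, given the rules just stated. They are
               alternatives, shown together only for comparison; a single file would use just one
               of them:</para>
            <para>
               <programlisting>&lt;!-- alternatives, not meant to appear together in one file -->
&lt;body bitext-relation="B-descends-from-A" reuse-type="general_adaptation">
&lt;body bitext-relation="B-descends-from-A" reuse-type="general-adaptation">
&lt;body bitext-relation="B-descends-from-A" reuse-type="General_Adaptation"></programlisting>
            </para>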
         </section>
         <section>
            <title>Aligning across Projects</title>
            <para>We now have a collection of five TAN files: two TAN-T transcriptions, a TAN-A
               alignment/annotation file, a TAN-A-tok word-for-word alignment file, and a TAN-voc
               file for vocabulary shared across the files. </para>
            <para>Let us imagine what it might be like to connect our TAN collection to a TAN file
               made by someone else. Let us assume that we have found elsewhere, in a German
               project, a TAN transcription of a work that looks quite similar to our
               own:<programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.rnc" 
   type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.sch" 
   type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:hans@beispiel.com,2014:ringel">
   &lt;head>
      &lt;name>TAN Transkription, Ringelreihen mit Riederfallen&lt;/name>
      &lt;master-location>http://beispiel.com/TAN-T/ringel.xml&lt;/master-location>
      &lt;license>
         &lt;IRI>http://creativecommons.org/licenses/by/4.0/&lt;/IRI>
         &lt;name>Creative Commons Namensnennung 4.0 International Lizenz&lt;/name>
         &lt;desc>Dieses Werk ist lizenziert unter einer Creative Commons Namensnennung 4.0
            International Lizenz.&lt;/desc>
      &lt;/license>
      &lt;licensor who="schmidt"/>
      &lt;work>
         &lt;IRI>tag:beispiel.com,2014:texte:holderbusch&lt;/IRI>
         &lt;name>"Die Kinder auf dem Holderbusch"&lt;/name>
      &lt;/work>
      &lt;version>
         &lt;IRI>urn:uuid:31648039-3dbb-49b9-b66e-9bd2cd11630e&lt;/IRI>
         &lt;name>zweite Version&lt;/name>
      &lt;/version>
      &lt;numerals priority="letters"/>
      &lt;source>
         &lt;IRI>http://www.worldcat.org/oclc/4574384&lt;/IRI>
         &lt;name>Franz Magnus Böhme, Deutsches Kinderlied und Kinderspiel: Volksüberlieferungen 
            aus allen Landen deutscher Zunge, gesammelt, geordnet und mit Angabe der Quellen. 
            Leipzig, 1897.&lt;/name>
      &lt;/source>
      &lt;adjustments>
         &lt;normalization>
            &lt;IRI>tag:kalvesmaki@gmail.com,2014:normalization:hyphens-discretionary-off&lt;/IRI>
            &lt;name>Keine Bindestriche&lt;/name>
         &lt;/normalization>
      &lt;/adjustments>
      &lt;vocabulary-key>
         &lt;div-type xml:id="Zeile">
            &lt;IRI>http://dbpedia.org/resource/Gedichtzeile&lt;/IRI>
            &lt;name>Gedichtzeile&lt;/name>
         &lt;/div-type>
         &lt;div-type which="poem" xml:id="Gedicht"/>
         &lt;person xml:id="schmidt" roles="Produzent">
            &lt;IRI>tag:hans@beispiel.com,2014:selbst&lt;/IRI>
            &lt;name xml:lang="eng">Hans Schmidt&lt;/name>
         &lt;/person>
         &lt;role xml:id="Produzent">
            &lt;IRI>http://schema.org/producer&lt;/IRI>
            &lt;name xml:lang="eng">Produzent&lt;/name>
         &lt;/role>
      &lt;/vocabulary-key>
      &lt;file-resp who="schmidt"/>
      &lt;resp who="schmidt" roles="Produzent"/>
      &lt;change when="2014-08-13" who="schmidt">Anfang&lt;/change>
      &lt;comment when="2014-08-13" who="schmidt">unten auf der Z. 438, recht&lt;/comment>
      &lt;to-do/>
   &lt;/head>
   &lt;body xml:lang="deu">
      &lt;div type="Gedicht" n="1">
         &lt;div type="Zeile" n="a">Ringel, Ringel, Reihe!&lt;/div>
         &lt;div type="Zeile" n="b">Sind der Kinder dreie,&lt;/div>
         &lt;div type="Zeile" n="c">Sitzen auf dem Holderbuch,&lt;/div>
         &lt;div type="Zeile" n="e">Schreien alle: husch, husch, husch!&lt;/div>
      &lt;/div>
   &lt;/body>
&lt;/TAN-T></programlisting></para>
            <para>It seems that this 19th-century German version is quite similar to our two English
               versions. We have some alignment options open to us. Two more sets of word-for-word
               alignments would be interesting, but remember, just because we find a text that
               nicely aligns with others does not mean that we <emphasis role="italic"
                  >must</emphasis> align them, or that for a given alignment we must align
                  <emphasis>everything</emphasis>. In this case, we choose not to worry about
               word-for-word alignments, and we focus here only on the TAN-A alignment, so that, for
               example, we can use the built-in TAN application to display the three versions in
               parallel, to study more closely the relationships between them.</para>
            <para>To that end, we first observe some differences between this transcription and our
               other two. First, the value of <code><link linkend="element-work"
                  >&lt;work></link></code> is not the one we have given our two versions. Second,
                     <code><link linkend="element-numerals">&lt;numerals></link></code> specifies by
               its value for <code><link linkend="attribute-priority">@priority</link></code> that
               any ambiguous numerals should be interpreted as letter numerals, not Roman (that's
               important, e.g., for a <link linkend="element-div"><code>&lt;div></code></link> with
               an <code><link linkend="attribute-n">@n</link></code> value <code>c</code>, which
               could be interpreted to mean 3 or the Roman numeral for 100). Next, the lines are
               wrapped in a <link linkend="element-div"><code>&lt;div></code></link> for the whole
               poem (<code>Gedicht</code>) and they have been lettered instead of numbered. And
               last, the editor seems to have made a typographical error, making the last line
                  <code>e</code> instead of the expected <code>d</code>. These five differences
               typify inconsistencies one commonly finds in digital texts from different projects of
               the same work.<note>
                  <para>There are a few other differences in this third transcription that do not
                     affect our alignment. <code><link linkend="element-version"
                        >&lt;version></link></code> is used to distinguish different versions of the
                     same work found on the same text-bearing object. That is, if we are
                     transcribing a bilingual edition, we can use <code><link
                           linkend="element-version">&lt;version></link></code> to specify which of
                     the two versions we are encoding. Notice that the <code><link
                           linkend="element-IRI">&lt;IRI></link></code> value is a UUID. In this
                     case the editor was not prepared to deploy a formal IRI naming scheme (perhaps
                     using a tag URN) that would be satisfactory for work-versions. Also, the
                           <code><link linkend="element-div-type">&lt;div-type></link></code> is
                     defined as <code>http://dbpedia.org/resource/Gedichtzeile</code> (Gedichtzeile
                     = line of poetry), so it doesn't intersect with our IRIs for the vocabulary
                     item <code>line</code>. But <code><link linkend="element-div-type"
                           >&lt;div-type></link></code> is not used to align versions, and
                     validation isn't affected, so we do not concern ourselves here with trying to
                     reconcile the different IRIs. </para>
               </note></para>
            <para>These are points we can easily reconcile in our TAN-A file, which we now expand to
               include the German version. We make the following adjustments
               (emphasized):<programlisting>&lt;TAN-A xmlns="tag:textalign.net,2015:ns" TAN-version="2020" 
    id="tag:parkj@textalign.net,2015:ring-alignment">
    &lt;head>
       &lt;name>div-based alignment of multiple versions of Ring o Roses&lt;/name>
       &lt;master-location href="http://textalign.net/release/TAN-2020/examples/TAN-A/ringoroses.div.1.xml"/>
       &lt;license which="by_4.0" licensor="park"/>
       &lt;source xml:id="eng-uk">
          &lt;IRI>tag:parkj@textalign.net,2015:ring01&lt;/IRI>
          &lt;name>Transcription of ring around the roses in English (UK)&lt;/name>
          &lt;location href="../ring-o-roses.eng.1881.xml" accessed-when="2015-03-10"/>
       &lt;/source>
       &lt;source xml:id="eng-us">
          &lt;IRI>tag:parkj@textalign.net,2015:ring02&lt;/IRI>
          &lt;name>Transcription of ring around the roses in English (US)&lt;/name>
          &lt;location href="../ring-o-roses.eng.1987.xml" accessed-when="2014-08-13"/>
       &lt;/source>
       <emphasis role="bold">&lt;source xml:id="ger">
          &lt;IRI>tag:beispiel.com,2014:ringel&lt;/IRI>
          &lt;name>Transcription of an ancestor of Ring around the roses in German&lt;/name>
          &lt;location accessed-when="2014-08-22">http://beispiel.com/TAN-T/ringel.xml&lt;/location>
          &lt;location accessed-when="2014-08-22">../TAN-T/ring-o-roses.deu.1897.xml&lt;/location>
       &lt;/source>
       &lt;adjustments src="ger">
          &lt;skip div-type="Gedicht"/>
          &lt;rename n="e" by="-1"/>
       &lt;/adjustments></emphasis>
       &lt;vocabulary-key>
          &lt;person xml:id="park" which="Jenny Park"/>
          <emphasis role="bold">&lt;alias id="ring" idrefs="ger eng-us"/></emphasis>
       &lt;/vocabulary-key>
       &lt;resp who="park" roles="creator"/>
       &lt;change when="2014-08-14" who="park">Started file&lt;/change>
       <emphasis role="bold">&lt;change when="2014-08-22" who="park">Added German version.&lt;/change></emphasis>
       &lt;to-do>
          &lt;comment when="2018-08-09-04:00" who="park">Finish file.&lt;/comment>
       &lt;/to-do>
    &lt;/head>
    . . . . . .
&lt;/TAN-A></programlisting></para>
            <para>The first major change is the insertion of a third <code><link
                     linkend="element-source">&lt;source></link></code>, pointing to the new file
               and specifying its name and IRI. Note that two <code><link linkend="element-location"
                     >&lt;location></link></code>s have been provided, one for the original and
               another for a local copy we have saved. Validation will take into account only the
               first document available. If we wanted to work primarily off our local copy, we would
               have put that <code><link linkend="element-location">&lt;location></link></code>
               first. By placing it second, we allow the validation engine to work primarily off the
               master version and therefore look for updates and changes. If that version is
               unavailable, validation will be made against the second, local copy.</para>
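            <para>As a minimal sketch of the alternative just described, the two
                  <code>&lt;location></code> elements could simply be swapped so that the local
               copy is consulted first. The paths and dates are those of the example above;
               writing each location with <code>@href</code>, as in the two English sources, is
               an assumption about the preferred
               syntax:<programlisting>&lt;source xml:id="ger">
   &lt;IRI>tag:beispiel.com,2014:ringel&lt;/IRI>
   &lt;name>Transcription of an ancestor of Ring around the roses in German&lt;/name>
   &lt;!-- local copy first: validation works primarily off this file -->
   &lt;location href="../TAN-T/ring-o-roses.deu.1897.xml" accessed-when="2014-08-22"/>
   &lt;!-- master copy second: consulted only if the local copy is unavailable -->
   &lt;location href="http://beispiel.com/TAN-T/ringel.xml" accessed-when="2014-08-22"/>
&lt;/source></programlisting>The
               trade-off, presumably, is that updates to the master file would go unnoticed
               until the local copy is refreshed.</para>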
            <para><code><link linkend="element-adjustments">&lt;adjustments></link></code> specifies
               through its <code><link linkend="attribute-src">@src</link></code> that only the
               German version should be adjusted by the contained instructions. The enclosed
                     <code><link linkend="element-skip">&lt;skip&gt;</link></code> says, in effect,
               to ignore the wrapping <link linkend="element-div"><code>&lt;div></code></link> for
               purposes of alignment. The <code><link linkend="element-rename"
                  >&lt;rename></link></code> takes care of the apparent typographical error, and
               anchors the German version to the U.S. one. Note that our <code><link
                     linkend="element-rename">&lt;rename></link></code> identifies the line by
                  <code>e</code>, just as the German version does, but we could equally have used
                  <code>5</code>, or even the Roman numeral <code>v</code>, had we wished to. Every
               TAN file's numeration system is evaluated locally, independent of any external files.
               We need not reconcile the <code>a</code>, <code>b</code>, and <code>c</code>
               <code><link linkend="attribute-n">@n</link></code> values in the German version,
               because these will be automatically treated as equivalent to <code>1</code>,
                  <code>2</code>, and <code>3</code>. The TAN format supports four numeration
               systems other than Arabic numerals: Roman numerals (uppercase or lowercase),
               alphabetic numerals (a, b, c, ..., z, aa, bb, ...), digit-alphabet combinations
               (e.g., 1a, 1e, 4g), and alphabet-digit combinations (e.g., a4, a5, b5). The last two
               systems are interpreted as two-tier numbering systems.</para>
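            <para>To illustrate the point about locally evaluated numerals, the following sketch
               lists three forms of the same instruction. On the assumption, taken from the
               discussion above, that <code>e</code>, <code>5</code>, and <code>v</code> all
               resolve to the number 5 within the TAN-A file, any one of them (only one would be
               used in practice) renumbers the last German line so that it aligns as line
               4:<programlisting>&lt;adjustments src="ger">
   &lt;skip div-type="Gedicht"/>
   &lt;!-- alphabetic numeral -->
   &lt;rename n="e" by="-1"/>
   &lt;!-- Arabic numeral (alternative form) -->
   &lt;!-- &lt;rename n="5" by="-1"/> -->
   &lt;!-- Roman numeral (alternative form) -->
   &lt;!-- &lt;rename n="v" by="-1"/> -->
&lt;/adjustments></programlisting></para>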
            <para>The second major change, to address the German version's different value of
                     <code><link linkend="element-work">&lt;work></link></code>, is the addition of
               an <code><link linkend="element-alias">&lt;alias></link></code>, which allows us to
               assign one or more vocabulary items a common id. Wherever the value <code>ring</code>
               is used, it stands in for <code>ger</code> and <code>eng-us</code>, which point to
               the two TAN-T files. Every TAN-T file has only one work and only one written source,
               so if you wish to make a claim about a particular work or source, you can use a
               TAN-T's id as a surrogate. Thus, if we make claims in our TAN-A file about a written
               source or a work, <code>ring</code> asserts the claim to be true for the works
               pointed to by both the German and the U.S. versions. (We do not need to specifically
               mention <code>eng-uk</code> in the <code><link linkend="element-alias"
                     >&lt;alias></link></code>, since it has the same work IRI as the U.S. version
               does.) <note>
                  <para>Alternatively, instead of <code><link linkend="element-alias"
                           >&lt;alias></link></code>, we could simply have adjusted our TAN-voc
                     file, adding the German version's <code><link linkend="element-IRI"
                           >&lt;IRI></link></code> value to the appropriate vocabulary item, and use
                     that id.</para>
               </note></para>
            <para>The last major insertion is a new <code><link linkend="element-change"
                     >&lt;change></link></code>, documenting when we made the alterations. Its
                     <code><link linkend="attribute-when">@when</link></code> effectively updates
               the version of our TAN-A file.</para>
            <para>With these additions, the German version is now aligned with the other two. We
               could have made our work simpler just by directly modifying our local copy of the
               German version. But such changes would not have affected the master copy. What
               happens when the owner of the German file makes changes? At that point we would
               struggle to integrate the changes in our forked copy. And we would have to repeat
               that exercise every time the German file was updated. By keeping our local copy of
               the German file unchanged, and making simple adjustments in our TAN-A file, we can
               keep our local copy synchronized with the master file and yet make the adjustments
               needed to coordinate with ours.</para>
            <para>The purpose statement in these guidelines says that TAN was "designed to <emphasis
                  role="bold">maximize</emphasis> the syntactic and semantic interoperable alignment
               and exchange of texts, annotations, and language resources across projects." Here we
               see the importance of the qualifier "maximize." In no world will there ever be (nor
               should there be, it seems) a single, standard, canonical way to divide a given work.
               The TAN format does not change that reality. Rather, it provides a convergent
               ecosystem in which different practices can be easily reconciled, to help editors and
               authors enhance cross-project interoperability without artificially forcing
               conformity, or suppressing legitimately different outlooks.</para>
            <para>Perhaps Hans Schmidt, the producer of the German version, can be contacted (e.g.,
               through his tag URN). We do so, and we suggest that he modify the version to make it
               align better. Perhaps he has reasons for labeling the lines with letters, and perhaps
               he is reluctant to explicitly identify this poem with <emphasis role="italic">Ring
                  around the Rosie</emphasis>. That is within his rights. But the conversation might
               lead to our pointing out that <code>n="e"</code> should probably be
                  <code>n="d"</code> and that there is an apparent typographic error in the third
               line. Or perhaps we're the ones in error. (The original, printed book has the poem
               twice on page 438, one with the spelling "Holderbuch" at line 3, the other,
               "Holderbusch".) If Schmidt chooses to correct his master file, he can add a new
                     <code><link linkend="element-change">&lt;change></link></code>, and thereby
               tacitly notify anyone else using the file that corrections have been made.</para>
            <para>At this point we have a network of six TAN files, five from our collection and one
               from outside. Although simple and small, this network could be extended to address
               some creative and complex research questions. Applications based on XSLT stylesheets
               could be used to automatically align the versions for reading and study, or to
               perform statistical analysis. </para>
            <para>What you've read so far is only a cursory introduction to TAN features. Study the
               rest of these guidelines, as well as example TAN libraries, and you will find
               numerous ways to develop TAN files, and to use them to enhance your research,
               teaching, and writing.</para>
         </section>
      </chapter>
   </part>
   <part xml:id="detailed_description">
      <title>Detailed Description</title>
      <partintro>
         <para>This part of the guidelines provides a detailed description of the design and
            structure of the formats of the Text Alignment Network. The material follows the
            organization of the schema files (kept in the <code>schemas</code> subdirectory), so
            both can be read in tandem.</para>
         <para><xref linkend="concepts_common"/> outlines, in a non-technical way, the principles
            and technical foundations of the TAN format.</para>
         <para><xref linkend="class_common"/>, <xref linkend="class_1"/>, <xref linkend="class_2"/>,
            and <xref linkend="class_3"/> describe each TAN format, by class. Each chapter starts
            with theoretical or scholarly context before explaining technical points. </para>
         <para>The chapters in this part are meant to provide a narrative companion to the much more
            detailed technical appendixes, <xref linkend="elements-attributes-and-patterns"/> and
               <xref linkend="vocabularies-master-list"/>, which are derived from the master schemas
            and vocabularies.</para>
         <para>The chapters in this part of the guidelines should be read selectively, not
            consecutively. They have been written with the assumption that you have already read the
            previous part (<xref linkend="general_overview"/>) and that you have already started to
            create or edit a TAN collection.</para>
         <para>Because readers will come from different specialties, all acronyms, abbreviations,
            and concepts are defined and explained, albeit tersely. Concepts or technologies are
            discussed only insofar as they affect the use of TAN; suggestions for further reading
            are provided for those who want a more thorough introduction to a topic. </para>
      </partintro>
      <chapter xml:id="concepts_common">
         <title>General Underpinnings</title>
         <para>This chapter retains something of the introductory spirit of the previous one by
            providing an overview of the fundamental principles and technologies behind TAN. The
            goal is to explain the principles behind the design of the format. Although this chapter
            assumes on your part no prior knowledge of any particular technology, it is also not
            meant to be a tutorial. Links to further reading will take you to good introductory
            material.</para>
         <section xml:id="design_principles">
            <title>Design Principles</title>
            <para>The TAN formats have been designed around a few basic principles:</para>
            <para><emphasis role="bold">Scholarly habits</emphasis>
               <itemizedlist>
                  <listitem>
                     <para>Be patient.</para>
                  </listitem>
                  <listitem>
                     <para>Simplify.</para>
                  </listitem>
                  <listitem>
                     <para>Stay focused.</para>
                  </listitem>
                  <listitem>
                     <para>Don't be redundant.</para>
                  </listitem>
                  <listitem>
                     <para>Don't state the obvious.</para>
                  </listitem>
                  <listitem>
                     <para>Use familiar conventions.</para>
                  </listitem>
               </itemizedlist></para>
            <para>
               <emphasis role="bold">Scholarly freedom</emphasis>
               <itemizedlist>
                  <listitem>
                     <para>Express doubt.</para>
                  </listitem>
                  <listitem>
                     <para>Offer alternatives.</para>
                  </listitem>
                  <listitem>
                     <para>Exercise independence.</para>
                  </listitem>
                  <listitem>
                     <para>Invite interdependence.</para>
                  </listitem>
               </itemizedlist></para>
            <para>
               <emphasis role="bold">Scholarly responsibility</emphasis>
               <itemizedlist>
                  <listitem>
                     <para>Declare your assumptions.</para>
                  </listitem>
                  <listitem>
                     <para>Make your work citable.</para>
                  </listitem>
                  <listitem>
                     <para>Satisfy scholars' expectations:</para>
                     <itemizedlist>
                        <listitem>
                           <para>Who did what when?</para>
                        </listitem>
                        <listitem>
                           <para>What are your sources?</para>
                        </listitem>
                        <listitem>
                           <para>How do you define your terms?</para>
                        </listitem>
                        <listitem>
                           <para>What alterations have you made to your sources?</para>
                        </listitem>
                        <listitem>
                           <para>What rights do I have to use your material?</para>
                        </listitem>
                     </itemizedlist>
                  </listitem>
               </itemizedlist></para>
            <para>
               <emphasis role="bold">General utility</emphasis>
               <itemizedlist>
                  <listitem>
                     <para>Use stable technology.</para>
                  </listitem>
                  <listitem>
                     <para>Keep design predictable, consistent.</para>
                  </listitem>
                  <listitem>
                     <para>Make each datum human readable.</para>
                  </listitem>
                  <listitem>
                     <para>Make each datum computer actionable.</para>
                  </listitem>
               </itemizedlist>
            </para>
         </section>
         <section>
            <title>Format Organization</title>
            <para>The Text Alignment Network is a modular suite of XML encoding formats, each one
               designed for a specific type of textual data, divided into three classes: texts
               (class 1), text alignments and annotations (class 2), and everything else (class 3). </para>
            <para><emphasis role="bold">Class 1</emphasis>, representations of textual objects,
               consists solely of transcription files. (See <link
                  xlink:href="#transcription_and_transliteration">note on transcriptions versus
                  transliterations</link>.) Each transcription file contains the text of a single
               work from a single text-bearing object (which we term <emphasis>scriptum</emphasis>;
               see <xref xlink:href="#domain_model"/>), whether physical or digital. There are two
               types of transcription file: a standard generic format (TAN-T) and a customization of
               TEI All (TAN-TEI). These two types are differentiated by the root element,
                     <code><link linkend="element-TAN-T">&lt;TAN-T></link></code> and
                  <code>&lt;TEI></code> respectively. </para>
            <para><emphasis role="bold">Class 2</emphasis> files encode claims about class-1 texts
               and align them. There are two types of alignment file, one for broad, general
               alignments and another for granular, word-for-word alignments. The former, with <code><link
                     linkend="element-TAN-A">&lt;TAN-A></link></code> as the root element, aligns
               any number (one or more) of class-1 files, and allows a wide variety of claims about
               those files. The latter, <code><link linkend="element-TAN-A-tok"
                     >&lt;TAN-A-tok></link></code>, aligns only pairs of class-1 files.
               Lexico-morphology files, <code><link linkend="element-TAN-A-lm"
                  >&lt;TAN-A-lm></link></code>, are used to encode the lexical and morphological (or
               part-of-speech) forms of individual words from a single class-1 file, or of a
               language in general.</para>
            <para><emphasis role="bold">Class 3</emphasis> covers everything else. <code><link
                     linkend="element-TAN-mor">&lt;TAN-mor></link></code> is used to define the
               grammatical categories or features of a given language and to specify rules for
               lexico-morphological codes in dependent TAN-A-lm files. <code><link
                     linkend="element-TAN-voc">&lt;TAN-voc></link></code> collects and labels
               vocabulary items used in other TAN files. TAN catalog files have the root element
                     <code><link linkend="element-collection">&lt;collection></link></code>, and
               they index locally available TAN files, and selective parts of their metadata.</para>
            <para>This modular approach relies upon a <emphasis role="italic">stand-off</emphasis>
               approach to annotation or markup. In the alternative method, <emphasis role="italic"
                  >inline</emphasis> markup, an annotation is inserted directly into a
               transcription, e.g., <code>&lt;p>He said &lt;quote>"Jump!"&lt;/quote>&lt;/p></code>,
               where the inner element <code>&lt;quote></code> annotates the third word. Most TEI
               and HTML files rely upon inline annotation. In stand-off annotation, <code>&lt;p>He
                  said "Jump!"&lt;/p></code> would be left as-is, and somewhere else there would be
               an annotation that states that the third word is a quotation. If the stand-off
               annotation is in the same file, it is an <emphasis>internal stand-off</emphasis>
               annotation. If the annotation is in a different file, it is an <emphasis>external
                  stand-off</emphasis> annotation.</para>
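            <para>The difference can be sketched with a deliberately generic example. The element
               and attribute names below (<code>annotation</code>, <code>target</code>,
                  <code>token</code>, <code>type</code>) are hypothetical, chosen only for
               illustration, and are not TAN
               syntax:<programlisting>&lt;!-- transcription: left as-is -->
&lt;p xml:id="p1">He said "Jump!"&lt;/p>

&lt;!-- stand-off annotation: the third word of p1 is a quotation -->
&lt;annotation target="#p1" token="3" type="quotation"/></programlisting>If
               the <code>annotation</code> element sits in the same file as the
               <code>p</code>, the annotation is internal; if it is moved to a file of its own,
               it is external.</para>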
            <para>TAN depends upon external stand-off annotation, which provides several benefits: <itemizedlist>
                  <listitem>
                     <para>An editor can focus on a limited set of closely related questions.</para>
                  </listitem>
                  <listitem>
                     <para>A source text without inline annotations is less cluttered, and therefore
                        easier to read, than one with inline annotations. </para>
                  </listitem>
                  <listitem>
                     <para>Editors can work on separate annotation files based upon the same master
                        transcription file, even if they have very different research
                        interests.</para>
                  </listitem>
                  <listitem>
                     <para>Complementary or competing annotations can be made, even in the same
                        file, and those annotations may point to concurrent or overlapping spans of
                        text (a major problem for inline annotation, where according to XML rules
                        no element may interlock or overlap with another).</para>
                  </listitem>
                  <listitem>
                     <para>A corpus of stand-off external annotation files becomes, collectively, a
                        complex dataset, supporting lines of research that might not have been
                        anticipated by any single project.</para>
                  </listitem>
                  <listitem>
                     <para>Editorial labor can be conducted without central coordination, as
                        individuals work at their own pace, independently.</para>
                  </listitem>
                  <listitem>
                     <para>When an error is found in a transcription file, it can be corrected in a
                        single place, in the master. Anyone using a copy of that master file will be
                        notified in the validation process of changes that have been made and they
                        can deal with them accordingly. </para>
                  </listitem>
                  <listitem>
                     <para>Any data file can be updated independent of any other that points to it,
                        or to which it points.</para>
                  </listitem>
                  <listitem>
                     <para>The cross-file links required in stand-off annotation turn the files
                        into a network, which can then be combined and transformed in any number of
                        ways to produce a wide variety of derivative documents (e.g., collated
                        versions, statistical analysis).</para>
                  </listitem>
               </itemizedlist></para>
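            <para>The point about overlapping spans can be illustrated with a generic sketch (the
               element and attribute names are hypothetical, chosen only for illustration, and
               are not TAN syntax). Two annotations, possibly by different editors, cover spans
               that overlap at the second word. Expressed inline, the two elements would
               interlock and the file would no longer be well-formed
               XML:<programlisting>&lt;p xml:id="p1">He said "Jump!"&lt;/p>

&lt;!-- editor A: words 1 to 2 form the narrative frame -->
&lt;annotation target="#p1" from-token="1" to-token="2" type="frame"/>
&lt;!-- editor B: words 2 to 3 form a verb phrase -->
&lt;annotation target="#p1" from-token="2" to-token="3" type="phrase"/></programlisting></para>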
            <para>The stand-off approach works toward a principle often valued in computer science,
               that of the disaggregation of data. That is, in a master format, data should be
               simple and not entangled with other data. It can later be reaggregated in all kinds
               of ways, but that is an end product, not the way data should be managed. It is
               analogous to the way any well-run kitchen keeps its ingredients separate, until it is
               time to cook or bake a variety of products, at which time a few disaggregated
               ingredients can be combined in a variety of ways.</para>
            <para>Stand-off annotation is not without problems and vulnerabilities. Files might be
               altered or altogether deleted, rendering pointers in dependent files meaningless. An
               editor may find that not having the annotated text in the same place as the
               annotation is an inconvenience. These are important challenges, but TAN validation
               rules have been designed to mitigate such problems.</para>
         </section>
         <section xml:id="assumptions_creating_data">
            <title>Assumptions in the Creation of TAN Data</title>
            <para>All creators and users of TAN files are expected to share a few basic
               assumptions.</para>
            <para>First, all TAN-compliant data is to be understood as largely
                  <emphasis>derivative</emphasis>. That is, data files express no originality or
               creativity independent of their sources (but see below about interpretation).
               TAN-compliant data must be created with the intent of adhering as closely as possible
               to some model or archetype. For example, a transcription is assumed to replicate
               faithfully some earlier digital edition or text-bearing material object (e.g., stone,
               papyrus, manuscript, printed book for written text; audiovisual media for oral or
               performative texts). Morphological files and alignment files should describe as
               clearly and as reliably as possible their source transcriptions. <emphasis>In
                  creating and publishing a TAN file you claim to have offered a good-faith
                  representation or description of something; in using a TAN file, you hold the
                  creator to that expectation.</emphasis></para>
            <para>Second, all core TAN files are <emphasis>interpretive</emphasis>. That is, they
               are permeated by editorial assumptions and opinions that might not be shared by
               everyone. If there is any semblance of originality or creativity in a TAN file it
               is in that interpretive outlook. For example, if you edit a transcription file you
               must decide how to handle unusual letterforms and other visible marks. Your decisions
               will be influenced by your perspective on the original text and its native writing
               system, and how you interpret and use Unicode. If you write an alignment file, you
               must make decisions about what factors caused one text to be transformed into
               another. Lexicomorphological files require you to commit to one or more grammars and
               dictionaries, which adopt certain perspectives on language, and you must discern how
               best to handle cases of vagueness and ambiguity. No TAN file ever stands completely
               outside the interpretive act. <emphasis>In creating and publishing a TAN file you
                  claim to have disclosed as best you can the assumptions behind your interpretive
                  outlook; in using a TAN file, you hold the creator to that
               expectation.</emphasis></para>
            <para>Third, all core TAN files are <emphasis>applicable</emphasis>. That is, the
               interpretive impulse is assumed to be coupled with an equally strong desire to make
               the data as useful to as many users as possible, even those who may not share your
               assumptions or interpretation. A creator of a transcription file, for example, should
               normalize and segment texts with a minimum of idiosyncrasies, adopting the most
               widely used reference systems, so as to optimize the alignment process. Morphological
               files should depend whenever possible upon commonly accepted grammars and lexica.
               Alignment files should work with comprehensible categories of text reuse. No TAN file
               will always be applicable to everyone, but it should be as applicable to as many as
               possible, as often as possible. <emphasis>In creating a TAN file you claim to use
                  common, shared conventions whenever possible, and to note any departures; in using
                  a TAN file, you hold the creator to that expectation.</emphasis></para>
            <para xml:id="accuracy-precision-comprehensiveness">Fourth, TAN data is to be considered
                  <emphasis>accurate, but not necessarily precise or exhaustive</emphasis>. For
               example, if a TAN-A file claims that the opening of Plato's
                  <emphasis>Republic</emphasis> book 3 quotes from Homer's
                  <emphasis>Iliad</emphasis>, the claim is true and accurate, but is neither precise
               nor exhaustive. There are parts of the opening of book 3 that are certainly not
               quotations, and most parts of the <emphasis>Iliad</emphasis> are not quoted in the
                  <emphasis>Republic</emphasis>. A token-for-token alignment of two texts might be
               selective, and focus only on the points of interest to the editor. Although the TAN
               formats permit a great deal of both precision and comprehensiveness, neither is
               mandated, unless explicitly required by a specific part of the TAN specifications.
                  <emphasis>In creating a TAN file you claim to make accurate assertions; in using a
                  TAN file, you should hold the creator to that expectation, but you must assess for
                  yourself how precise and complete it is.</emphasis></para>
         </section>
         <section>
            <title>Core Technology</title>
            <para>TAN depends upon a set of relatively stable technologies. Those technologies and
               the underlying terminology are briefly explained below, with attention paid to
               interpretive decisions that affect validation rules. References to further reading
               will lead you to better and more thorough introductions elsewhere. </para>
            <section xml:id="unicode">
               <title>Unicode</title>
               <section>
                  <title>What is it?</title>
                  <para>Unicode is the worldwide standard for the encoding, representation, and
                     exchange of digital texts. The standard is maintained by a nonprofit consortium
                     whose goal is to represent all the world's writing systems, living and
                     historical. The Unicode standard allows us to share texts in any alphabet
                     reliably, regardless of how that text is rendered (e.g., fonts,
                     display).</para>
                  <para>With more than 128,000 characters, Unicode is almost as complex as human
                     writing itself. The entire sequence of characters is divided into blocks, each
                     one reserved, more or less, for a particular alphabet or group of characters.
                     Within each block, characters may be grouped further. Each character is
                     assigned a single number called a codepoint.</para>
                  <para>Codepoints are numbered according to the hexadecimal system (base 16), which
                     uses the digits 0 through 9 and the letters A through F. (The decimal number 10
                     is hexadecimal A; decimal 11 = hex B; decimal 16 = hex 10; decimal 79 = hex
                     4F.) It is helpful to think of Unicode as a very long table of sixteen columns,
                     a glyph in each square; this is illustrated nicely <link
                        xlink:href="http://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF"
                        >in this article</link>.</para>
                  <para>It is common to refer to Unicode characters by their value and perhaps by
                     their name. The value customarily starts "U+" and continues with the
                     hexadecimal value, usually at least four hexadecimal characters. When the
                     official Unicode name is given, it is normally in uppercase. Examples:</para>
                  <para>
                     <table frame="all">
                        <title>Unicode characters</title>
                        <tgroup cols="3">
                           <colspec colname="c1" colnum="1" colwidth="1.0*"/>
                           <colspec colname="c2" colnum="2" colwidth="1.0*"/>
                           <colspec colname="c3" colnum="3" colwidth="1.0*"/>
                           <thead>
                              <row>
                                 <entry>Character</entry>
                                 <entry>Unicode value</entry>
                                 <entry>Unicode name</entry>
                              </row>
                           </thead>
                           <tbody>
                              <row>
                                 <entry>" " (space)</entry>
                                 <entry>U+0020</entry>
                                 <entry>SPACE</entry>
                              </row>
                              <row>
                                 <entry>®</entry>
                                 <entry>U+00AE</entry>
                                 <entry>REGISTERED SIGN</entry>
                              </row>
                              <row>
                                 <entry>ю</entry>
                                 <entry>U+044E</entry>
                                 <entry>CYRILLIC SMALL LETTER YU</entry>
                              </row>
                           </tbody>
                        </tgroup>
                     </table>
                  </para>
                  <para>In an XML file, nearly any Unicode codepoint may be used, either by typing
                     or pasting the character directly, or by using <emphasis role="bold">XML
                        entities</emphasis>. An XML entity is a proxy for some other text, marked by
                     an ampersand, some text, and then the semicolon, e.g., <code>&amp;amp;</code>
                     for the ampersand or <code>&amp;lt;</code> for &lt;. To access specific Unicode
                     characters an entity may start <code>&amp;#x</code> followed by the hexadecimal
                     codepoint and a semicolon (if you prefer the decimal version, leave off the x).
                     For example, the XML entity <code>&amp;#x44E;</code> (or, in decimal,
                        <code>&amp;#1102;</code>) is a proxy for the Cyrillic small letter yu.</para>
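                  <para>As a minimal sketch, the following fragment writes the Cyrillic small letter
                     yu in three equivalent ways. The element names <code>&lt;sample></code> and
                        <code>&lt;c></code> are hypothetical and belong to no TAN format; an XML
                     parser reports the same character for all three.</para>
                  <programlisting>&lt;sample>
   &lt;!-- typed or pasted directly -->
   &lt;c>ю&lt;/c>
   &lt;!-- hexadecimal character reference -->
   &lt;c>&amp;#x44E;&lt;/c>
   &lt;!-- decimal character reference: hex 44E = decimal 1102 -->
   &lt;c>&amp;#1102;&lt;/c>
&lt;/sample></programlisting>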
               </section>
               <section xml:id="normalization">
                  <title>Unicode Normalization</title>
                  <para>Unicode rules provide guidance on how text should be normalized, to identify
                     equivalent variations. For example, the character o (U+006F: LATIN SMALL LETTER
                     O) followed by the combining accent ¨ (U+0308: COMBINING DIAERESIS) should be
                     treated as identical in meaning to the single character ö (U+00F6: LATIN SMALL
                     LETTER O WITH DIAERESIS). There are two codepoints that could be used for the
                     Greek question mark (;), and normalization converts the less preferred
                     codepoint to the other.</para>
                  <para>TAN validation rules require all data to be normalized according to the
                     Unicode NFC algorithm (the most common of the four normalization methods). Any
                     text in a TAN file that is not NFC normalized will be marked as invalid. A
                     supplied Schematron Quick Fix will let users automatically normalize text (for
                     editing environments that support Schematron Quick Fixes).</para>
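                  <para>The kind of test involved can be sketched with the standard XPath function
                        <code>normalize-unicode()</code>. The XSLT 2.0 stylesheet below is only an
                     illustration, not the TAN validation code: it reports every text node whose
                     content would change under NFC normalization, and it does not inspect
                     attribute values.</para>
                  <programlisting>&lt;xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    &lt;xsl:output method="text"/>
    &lt;!-- Report each text node that differs from its own NFC form. -->
    &lt;xsl:template match="/">
        &lt;xsl:for-each
            select="//text()[not(codepoint-equal(., normalize-unicode(., 'NFC')))]">
            &lt;xsl:value-of select="concat('Not NFC: ', ., '&amp;#xA;')"/>
        &lt;/xsl:for-each>
    &lt;/xsl:template>
&lt;/xsl:stylesheet></programlisting>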
               </section>
               <section xml:id="unicode-characters-with-special-interpretation">
                  <title>Unicode characters with special interpretation</title>
                  <para>The characters U+200B ZERO WIDTH SPACE, U+200D ZERO WIDTH JOINER, and U+00AD
                     SOFT HYPHEN placed at the end of a leaf <link linkend="element-div"
                           ><code>&lt;div></code></link>, perhaps followed by space that will be
                     ignored (see below), signal that the text is to be joined with any subsequent
                     text (i.e., the next leaf <link linkend="element-div"
                        ><code>&lt;div></code></link>). Accordingly, any TAN function that needs to
                     extract text from a <link linkend="element-div"><code>&lt;div></code></link>
                     structure will delete the U+200B, U+200D, or U+00AD character and its trailing
                     space. (By contrast, text from a leaf <link linkend="element-div"
                           ><code>&lt;div></code></link> that does not end this way will be
                     space-normalized, then appended by a single space.) Because these characters
                     are difficult to distinguish visually from spaces and hyphens, any output based
                     on the character mapping of the core functions should replace these characters
                     with their XML entities, <code>&amp;#x200b;</code>, <code>&amp;#x200d;</code>,
                     and <code>&amp;#xad;</code>.</para>
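                  <para>A small illustration of the rule, with the soft hyphen written as an XML
                     entity so that it is visible. The <code>n</code> values and the sample text are
                     invented for this sketch, and other attributes a real file may need are
                     omitted.</para>
                  <programlisting>&lt;!-- Ends in U+00AD: the text is joined directly to the next leaf div -->
&lt;div n="1">The scribe was inter&amp;#xad;&lt;/div>
&lt;div n="2">rupted here.&lt;/div>
&lt;!-- Extracted text: "The scribe was interrupted here. " -->

&lt;!-- No special final character: a single space is appended to each leaf div -->
&lt;div n="3">The scribe was&lt;/div>
&lt;div n="4">interrupted here.&lt;/div>
&lt;!-- Extracted text: "The scribe was interrupted here. " --></programlisting>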
                  <para>Much has been written about the different ways U+00AD SOFT HYPHEN has been
                     or should be used and interpreted. Debate will no doubt continue. In designing
                     TAN, we have adopted the position that the soft hyphen marks a place in a word
                     where a line break has occurred, is allowed to occur, or both. In situations
                     where the text is printed or displayed, any soft hyphen that does not mark a
                     word that breaks across lines should not be displayed.</para>
               </section>
               <section xml:id="combining_characters">
                  <title>Combining characters</title>
                  <para>At the core level of conformance, Unicode does not dictate whether combining
                     characters (accents, modifying symbols) should be counted independently or as
                     part of a base character, nor do core XML technologies. In most cases, this
                     point is negligible. But it can affect regular expressions and XPath
                     expressions (see below). </para>
                  <para>Two of the class-2 formats allow the counting of characters. Such counting
                     is assumed to be made exclusively of individual non-combining characters (each
                     perhaps followed by one or more combining characters). Therefore one character
                     is defined as the regular expression <code>\P{M}\p{M}*</code>, bound to global
                     variable <xref xlink:href="#variable-char-reg-exp"/>. Any numerical reference
                     made in a TAN file to an individual character, i.e., through <code><link
                           linkend="attribute-chars">@chars</link></code>, will be found by counting
                     only non-combining characters. When the nth character is requested, TAN
                     functions will return the nth base character along with any combining
                     characters that immediately follow. </para>
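                  <para>The counting rule can be sketched in XSLT 2.0 as follows. The function
                        <code>ex:chars</code> and its namespace are hypothetical names chosen for
                     this illustration; this is not the implementation found in the TAN function
                     library.</para>
                  <programlisting>&lt;xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="tag:example.com,2020:sketch"
    exclude-result-prefixes="xs ex">
    &lt;xsl:output method="text"/>

    &lt;!-- Hypothetical helper: split a string into base characters, each with
         its trailing combining characters. The braces are doubled because the
         regex attribute is an attribute value template. -->
    &lt;xsl:function name="ex:chars" as="xs:string*">
        &lt;xsl:param name="text" as="xs:string"/>
        &lt;xsl:analyze-string select="$text" regex="\P{{M}}\p{{M}}*">
            &lt;xsl:matching-substring>
                &lt;xsl:sequence select="."/>
            &lt;/xsl:matching-substring>
        &lt;/xsl:analyze-string>
    &lt;/xsl:function>

    &lt;xsl:template match="/">
        &lt;!-- q + U+0304 COMBINING MACRON has no precomposed form, so it stays
             two codepoints under NFC but counts as a single character. -->
        &lt;xsl:variable name="sample" select="'aq&amp;#x304;b'"/>
        &lt;!-- Three characters against four codepoints -->
        &lt;xsl:value-of select="count(ex:chars($sample)), string-length($sample)"
            separator=" "/>
        &lt;xsl:text>&amp;#xA;&lt;/xsl:text>
        &lt;!-- The second character is q together with its macron -->
        &lt;xsl:value-of select="ex:chars($sample)[2]"/>
    &lt;/xsl:template>
&lt;/xsl:stylesheet></programlisting>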
                  <para>TAN rules stipulate that combining characters must have a preceding base
                     character. Any <link linkend="element-div"><code>&lt;div></code></link> that,
                     after any initial space, starts with a combining character will be marked as
                     invalid. See also <xref linkend="reg_exp_and_comb_chars"/>.</para>
               </section>
               <section xml:id="deprecated-unicode-points">
                  <title>Unicode points not allowed</title>
                  <para>Because TAN files are not scriptum-oriented (see <xref
                        xlink:href="#domain_model"/>), the following characters will generate an
                     error if found in a TAN file:</para>
                  <para>
                     <itemizedlist>
                        <listitem>
                           <para>U+00A0 NO-BREAK SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2000 EN QUAD</para>
                        </listitem>
                        <listitem>
                           <para>U+2001 EM QUAD</para>
                        </listitem>
                        <listitem>
                           <para>U+2002 EN SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2003 EM SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2004 THREE-PER-EM SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2005 FOUR-PER-EM SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2006 SIX-PER-EM SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2007 FIGURE SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2008 PUNCTUATION SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2009 THIN SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+200A HAIR SPACE</para>
                        </listitem>
                     </itemizedlist>
                  </para>
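                  <para>Because U+2000 through U+200A form a continuous range, the list above can be
                     expressed as a single character class. The XSLT 2.0 stylesheet below is a
                     sketch of such a test, not the TAN validation code itself, and it looks only at
                     text nodes.</para>
                  <programlisting>&lt;xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    &lt;xsl:output method="text"/>
    &lt;!-- U+00A0 plus the range U+2000 through U+200A -->
    &lt;xsl:template match="/">
        &lt;xsl:for-each
            select="//text()[matches(., '[&amp;#xA0;&amp;#x2000;-&amp;#x200A;]')]">
            &lt;xsl:value-of select="concat('Disallowed space in: ', ., '&amp;#xA;')"/>
        &lt;/xsl:for-each>
    &lt;/xsl:template>
&lt;/xsl:stylesheet></programlisting>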
               </section>
               <section>
                  <title>Further Reading</title>
                  <para>
                     <itemizedlist>
                        <listitem>
                           <para><link xlink:href="http://unicode.org">Unicode
                              Consortium</link></para>
                        </listitem>
                        <listitem>
                           <para><link xlink:href="http://en.wikipedia.org/wiki/Unicode"
                                 >Unicode</link> (Wikipedia)</para>
                        </listitem>
                     </itemizedlist>
                  </para>
               </section>
            </section>
            <section xml:id="xml">
               <title>eXtensible Markup Language (XML)</title>
               <section>
                  <title>What is it?</title>
                  <para>Defined by the W3C, the eXtensible Markup Language (XML) is a markup
                     language that can be extended to allow anyone to define the structure and
                     rules of a document type. For a quick, simple introduction to XML see <xref
                        linkend="gentle_guide"/>.</para>
               </section>
               <section>
                  <title>Schemas and validation</title>
                  <para>Validation files are found in the <code>schemas</code> subdirectory.</para>
                  <para>Each TAN file is validated by two types of schema files, one dealing with
                     major rules concerning structure and data type, written in RELAX-NG, the other
                     with more complex, detailed rules, written in Schematron.</para>
                  <para>The RELAX-NG rules are written primarily in compact syntax
                        (<code>*.rnc</code>), and then converted to XML syntax (<code>*.rng</code>).
                     For TAN-TEI, the special format One Document Does it All
                        (<code>TAN-TEI.odd</code>) is used to adjust the rules for TEI All. The ODD
                     file is then processed by TEI stylesheets into compact and XML RELAX-NG
                     formats.</para>
                  <para>The Schematron files are generally quite short. The primary work is done by
                     a substantial function library written in XSLT. For the most part, the
                     Schematron files simply point to the TAN function library, and handle its
                     results. For a detailed overview of this process, see <xref
                        xlink:href="#validating_tan_files"/> and <xref
                        linkend="tan-stylesheets-and-function-library"/>.</para>
                  <para>Some validation engines that process a valid TAN-compliant TEI file may
                     return an error something like <code>conflicting ID-types for attribute "who"
                        of element "comment" from namespace "tag:textalign.net,2015:ns"</code>. Such
                     a message alerts you to the fact that by mixing TEI and TAN namespaces, you
                     open yourself up to the possibility of conflicting <code>xml:id</code> values.
                     It is your responsibility to ensure that you have not assigned duplicate
                     identifiers. An XML editor may be configured to ignore this discrepancy. (In
                     oXygen XML editor go to Options > Preferences... > XML > XML Parser > RELAX NG
                     and uncheck the box ID/IDREF.)</para>
               </section>
               <section xml:id="whitespace">
                  <title>Space characters and normalization</title>
                  <para>By default in XML, unless otherwise specified, consecutive space characters
                     (space, tab, newline, and carriage return) are considered equivalent to a
                     single space. This gives editors the freedom to format XML documents as they
                     like, balancing human readability against compactness. In XML, <emphasis
                        role="bold">space normalization</emphasis> is performed by stripping leading
                     and trailing whitespace and replacing sequences of one or more whitespace
                     characters with a single space, <code>&amp;#x20;</code>. </para>
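                  <para>For illustration, the two elements below carry the same text once space
                     normalization is applied; the XPath function <code>normalize-space()</code>
                     returns <code>In the beginning</code> for each. The element name
                        <code>&lt;p></code> is hypothetical.</para>
                  <programlisting>&lt;p>
   In    the
   beginning
&lt;/p>

&lt;p>In the beginning&lt;/p></programlisting>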
                  <para>All TAN formats assume space normalization, with an extra caveat for leaf
                           <code><link linkend="element-div">&lt;div></link></code>s. Initial space
                     is always stripped. If a leaf <code><link linkend="element-div"
                        >&lt;div></link></code> ends in the soft hyphen or the zero width joiner
                     (see <xref linkend="unicode-characters-with-special-interpretation"/>) the
                     character is suppressed along with any ending space, otherwise the text ends in
                     a single space character (whether or not there are space characters in the leaf
                           <code><link linkend="element-div">&lt;div></link></code> itself).</para>
                  <para>If retention of multiple spaces or spaces of specific sizes is important for
                     your files and research, then you should not be working with the TAN format,
                     which cannot be used to replicate the appearance of a scriptum (see <xref
                        xlink:href="#domain_model"/>). Pure TEI (and not TAN-TEI) is a better
                     alternative, since it allows for a literal use of space, and supports the
                     creation of scriptum-oriented XML files.</para>
                  <para>For more on space see guidance in <link
                        xlink:href="https://www.w3.org/TR/REC-xml/#sec-white-space">the W3C
                        recommendation</link>.</para>
               </section>
               <section xml:id="non-mixed_content">
                  <title>Mixed, non-mixed, and semi-mixed content</title>
                  <para>In many popular XML formats such as TEI, XHTML, and DocBook, some elements
                     allow a mixture of elements and nonspace text as children, e.g.,
                        <code>&lt;div>Some &lt;span>text&lt;/span>&lt;/div></code>. These are called
                        <emphasis role="bold">mixed content</emphasis> models. The TAN formats,
                     aside from TAN-TEI, are committed to a <emphasis role="bold">non-mixed
                        content</emphasis> model, e.g., <code>&lt;div>&lt;span>Some
                        &lt;/span>&lt;span>text&lt;/span>&lt;/div></code>. Nonspace text nodes and
                     elements are never siblings. The practical effect of this decision is that TAN files
                     may be indented as you like, and whitespace text may be placed anywhere,
                     without altering the meaning.</para>
                  <para>An expanded TAN file (see <xref linkend="validating_tan_files"/>) may
                     include what we term a <emphasis role="bold">semi-mixed content</emphasis>
                     model, in which any element may have one and only one nonspace text node along
                     with any children elements. That nonspace text node may appear at the beginning
                     or the end of the children nodes.</para>
               </section>
            </section>
            <section xml:id="namespace">
               <title>Namespaces</title>
               <section>
                  <title>What are they?</title>
                  <para>XML allows users to create document types of whatever kind. One person may
                     wish to use the element <code>&lt;band></code> to refer to a musical group;
                     another might use this element to encode radio frequencies. Perhaps someone
                     wishes to mention a musical group and a radio frequency in the same document,
                     which would entail mixing two very different types of <code>&lt;band></code>.
                     XML allows users to mix vocabularies, even when those vocabularies use the same
                     element names. Disambiguation is accomplished by associating an element name
                     with a kind of family name. That family name is an IRI (see <xref
                        linkend="IRIs_and_linked_data"/> below). The actual full name of an element,
                     then, is the local name plus the IRI that qualifies its meaning, e.g.,
                        <code>band{http://music-example.com/terms/}</code> and
                        <code>band{http://frequency-example.com/terms/}</code>. </para>
                  <para>The IRI—the family name—is called the <emphasis>namespace</emphasis>, a term
                     that is understandably vague or confusing to many, because it has nothing to do
                     with space. </para>
                  <para>Namespaces can be declared in an XML document. When they appear, they look a
                     lot like attributes. (They aren't.) They take the form
                        <code>xmlns="http://music-example.com/terms/"</code> (this defines the
                        <emphasis role="bold">default namespace</emphasis>) or
                        <code>xmlns:[PREFIX]="http://frequency-example.com/terms/"</code> (this
                     assigns a namespace to a prefix) placed inside an opening tag. For example,
                        <code>&lt;band xmlns="http://music-example.com/terms/">...&lt;/band></code>
                     declares <code>http://music-example.com/terms/</code> to be the default
                     namespace for <code>&lt;band></code> and all descendants, unless explicitly
                     overridden. </para>
                  <para>To return to our example, different <code>&lt;band></code>s can be combined
                     through namespaces:</para>
                  <programlisting>&lt;band xmlns="http://music-example.com/terms/">
    &lt;band xmlns="http://frequency-example.com/terms/">
        ...
    &lt;/band>
&lt;/band>

&lt;band xmlns="http://music-example.com/terms/" 
    xmlns:e2="http://frequency-example.com/terms/">
    &lt;e2:band >
        ...
    &lt;/e2:band>
&lt;/band>

&lt;e1:band xmlns:e1="http://music-example.com/terms/" 
    xmlns:e2="http://frequency-example.com/terms/">
    &lt;e2:band >
        ...
    &lt;/e2:band>
&lt;/e1:band></programlisting>
               </section>
               <section>
                  <title>TAN namespace and prefix</title>
                  <para>The TAN namespace is <emphasis role="bold"
                           ><code>tag:textalign.net,2015:ns</code></emphasis>. The recommended
                     prefix is <emphasis role="bold"><code>tan</code></emphasis>. The namespace does
                     not change from one version of TAN to another.</para>
                  <para>The TAN-TEI format uses as its default the TEI namespace, <code><link
                           xlink:href="http://www.tei-c.org/ns/1.0"/></code>, normally given the
                     prefix <emphasis role="bold"><code>tei</code></emphasis>. But in a TAN-TEI
                     file, the <code>head</code> and its descendants are in the TAN
                     namespace.</para>
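                  <para>The switch between the two namespaces might look like the following
                     skeleton. It is a sketch only: the choice of root element, the placement of
                        <code>head</code>, and the elided content are assumptions made for
                     illustration, and the schemas remain the authority on structure.</para>
                  <programlisting>&lt;TEI xmlns="http://www.tei-c.org/ns/1.0">
   &lt;!-- TEI namespace, declared as the default -->
   &lt;teiHeader>...&lt;/teiHeader>
   &lt;!-- TAN namespace, redeclared as the default for head and its descendants -->
   &lt;head xmlns="tag:textalign.net,2015:ns">
      ...
   &lt;/head>
   &lt;!-- back in the TEI namespace -->
   &lt;text>...&lt;/text>
&lt;/TEI></programlisting>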
               </section>
            </section>
            <section xml:id="TEI">
               <title>The Text Encoding Initiative</title>
               <section>
                  <title>What is it?</title>
                  <para>The Text Encoding Initiative (TEI; <link
                        xlink:href="http://www.tei-c.org/index.xml"/>) is a consortium of scholars and
                     scholarly organizations that maintains the rules and documentation behind a
                     collection of XML formats intended for encoding texts. TEI files have been
                     widely used by libraries, museums, publishers, and individual scholars to
                     prepare and publish texts for online research, teaching, and preservation. In
                     addition to the guidelines themselves, the Consortium provides a variety of
                        <link xlink:href="http://www.tei-c.org/Support/Learn/">resources</link> and
                        <link xlink:href="http://members.tei-c.org/Events">training events</link>
                     for learning TEI, information on <link
                        xlink:href="http://www.tei-c.org/Activities/Projects/">projects using the
                        TEI</link>, a <link
                        xlink:href="http://www.tei-c.org/Activities/SIG/Education/tei_bibliography.xml"
                        >bibliography of TEI-related publications</link>, and <link
                        xlink:href="http://www.tei-c.org/Tools/">software</link>.</para>
                  <para>TEI gave the impetus for the creation of TAN, and continues to inspire its
                     development. TEI was designed to be highly customizable, to suit the needs of
                     individuals or communities of practice. One of the TAN formats, TAN-TEI, is one
                     such customization, based as it is on an ODD file that is in the same directory
                     as the rest of the schemas. TAN-TEI schemas are generated on the basis of the
                     official TEI All schema that is available at the time of release. </para>
                  <para>TAN-TEI files and standard, out-of-the-box TEI All files are not
                     automatically interchangeable. TAN-TEI expects all metadata to be human- and
                     computer-readable, whereas TEI metadata is geared primarily to human
                     readability. TAN-TEI tightly regulates the structure of the text, whereas TEI
                     allows for a variety of structures. In any conversion process to and from TEI
                     and TAN-TEI, some human intervention is required, and conversion in either
                     direction may entail loss.</para>
                  <para>For more about the strictures placed upon the TEI All schema see <xref
                        linkend="tan-tei"/>. See also <xref linkend="class_common"/> and <xref
                        linkend="class_1"/>.</para>
               </section>
               <section>
                  <title>Further reading</title>
                  <para>
                     <itemizedlist>
                        <listitem>
                           <para><link xlink:href="http://www.tei-c.org/">Text Encoding
                                 Initiative</link></para>
                        </listitem>
                     </itemizedlist>
                  </para>
               </section>
            </section>
            <section xml:id="data_types">
               <title>Data types</title>
               <para>Being written purely in XML technologies, TAN uses data types defined in the
                  W3C's <link xlink:href="https://www.w3.org/TR/xmlschema-2/">official
                     specifications</link>, e.g., strings, booleans, integers. The following data
                  types require some special comments.</para>
               <section xml:id="language">
                  <title>Languages</title>
                  <para>TAN adopts for language identification Best Current Practice (BCP) 47, which
                     standardizes identifiers for languages and scripts. For most users of TAN, this
                     will be a simple two- or three-letter abbreviation, sometimes supplemented with
                     hyphen-separated subtags designating a script, a region, or both (in that
                     order). For example, <code>eng</code>, <code>eng-GB</code>, and
                        <code>eng-Cyrl-GB</code>
                     refer, respectively, to English (in general), English from the United Kingdom,
                     and English from the United Kingdom written in the Cyrillic script. As a
                     general rule, values of this type should begin with a three-letter language
                     code, preferably lowercase. (The two-letter codes cover only a few dozen
                     languages; the three-letter codes support thousands of them.)</para>
                  <para>ISO codes for human languages appear in <code><link
                           linkend="attribute-xmllang">@xml:lang</link></code> and <code><link
                           linkend="element-for-lang">&lt;for-lang></link></code>. The former states
                     what language the enclosed text is in. The latter is an empty element that
                     simply points to a specific language. For example, <code><link
                           linkend="element-for-lang">&lt;for-lang></link></code> in the context of
                     a TAN-mor file indicates which languages the file was written for.</para>
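                  <para>A brief illustration of language codes in <code>@xml:lang</code>. The
                     element name <code>&lt;passage></code> is hypothetical; which elements may take
                     the attribute is governed by the schema of each TAN format.</para>
                  <programlisting>&lt;!-- English, in general -->
&lt;passage xml:lang="eng">Sing, goddess, the wrath&lt;/passage>
&lt;!-- Ancient Greek -->
&lt;passage xml:lang="grc">μῆνιν ἄειδε&lt;/passage>
&lt;!-- Serbian written in the Cyrillic script -->
&lt;passage xml:lang="srp-Cyrl">...&lt;/passage></programlisting>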
                  <para>TAN has several global variables and functions useful for working with
                     language codes. See <xref xlink:href="#vkft-TAN-language"/>.</para>
                  <para>For more information, see one of the following:<itemizedlist>
                        <listitem>
                           <para>BCP 47 <link xlink:href="http://tools.ietf.org/rfc/bcp/bcp47"
                                 >official specifications</link></para>
                        </listitem>
                        <listitem>
                           <para>BCP 47 <link
                                 xlink:href="http://www.w3.org/TR/xmlschema11-2/#language">technical
                                 details</link></para>
                        </listitem>
                     </itemizedlist></para>
               </section>
               <section xml:id="date_and_datetime">
                  <title>Dates and times</title>
                  <para>For dates and dates + times, TAN adopts the corresponding XML data types,
                     which follow ISO syntax. That syntax begins with years (the largest unit) and
                     ends with days, seconds, or fractions of seconds (the smallest).</para>
                  <para>The simplest date takes this form: <code>YYYY-MM-DD</code>. If a time is
                     included, it is specified by continuing the string, first with a <code>T</code>
                     (for time), then the form <code>hh:mm:ss.sss(Z|[-+]hh:mm)</code>. For example,
                     <code>2016-09-20T20:38:27.141-04:00</code> is an ISO date-time for Tuesday,
                     September 20, 2016, at 8:38 p.m. in the Eastern Time Zone (here, four hours
                     behind UTC).</para>
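                  <para>As a minimal sketch, assuming an XSLT 2.0 or later stylesheet in which the
                     prefix <code>xs</code> is bound to the XML Schema namespace, the date-time
                     above can be cast to the corresponding XML data type and then normalized to
                     UTC. The variable name <code>sample</code> is hypothetical, chosen only for
                     illustration:<programlisting>&lt;xsl:variable name="sample" as="xs:dateTime"
   select="xs:dateTime('2016-09-20T20:38:27.141-04:00')"/>
&lt;!-- Shift the value to a zero offset (UTC); yields 2016-09-21T00:38:27.141Z -->
&lt;xsl:value-of select="adjust-dateTime-to-timezone($sample, xs:dayTimeDuration('PT0S'))"/>
&lt;!-- Keep only the date, in the sample's own time zone; yields 2016-09-20-04:00 -->
&lt;xsl:value-of select="xs:date($sample)"/></programlisting></para>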
                  <para>More reading:<itemizedlist>
                        <listitem>
                           <para><link xlink:href="https://www.w3.org/TR/xmlschema-2/#dateTime">W3C
                                 specification</link></para>
                        </listitem>
                        <listitem>
                           <para><link xlink:href="https://en.wikipedia.org/wiki/ISO_8601">Wikipedia
                                 entry on ISO 8601</link></para>
                        </listitem>
                     </itemizedlist></para>
               </section>
            </section>
            <section xml:id="IRIs_and_linked_data">
               <title>Identifiers and Their Use (IRIs, URIs, URLs, URNs, UUIDs)</title>
               <para>TAN makes extensive use of the following identifiers:</para>
               <para>
                  <itemizedlist>
                     <listitem>
                        <para><emphasis>IRI</emphasis>: Internationalized Resource Identifier, a
                           generalization of the URI system, allowing the use of Unicode; <link
                              xlink:href="http://www.ietf.org/rfc/rfc3987.txt">defined by RFC
                              3987</link></para>
                     </listitem>
                     <listitem>
                        <para><emphasis>URI</emphasis>: Uniform Resource Identifier, a string of
                           characters used to identify a name or a resource; <link
                              xlink:href="https://tools.ietf.org/html/rfc3986">defined by RFC
                              3986</link></para>
                     </listitem>
                     <listitem>
                        <para><emphasis>URL</emphasis>: Uniform Resource Locator, a URI that
                           identifies a Web resource and the communication protocol for retrieving
                           the resource.</para>
                     </listitem>
                     <listitem>
                        <para><emphasis>URN</emphasis>: Uniform Resource Name, a term that
                           originally referred to persistent names that used a bare
                              <code>urn:</code> scheme, but is now applied to a variety of systems
                           that have registered with the IANA. URNs are generally best thought of as
                           a subset of URIs.</para>
                     </listitem>
                     <listitem>
                        <para><emphasis>UUID</emphasis>: Universally Unique Identifier, a
                           computer-generated 128-bit number that may be attached as an identifier
                           to any entity. UUIDs can be built into a URN by prefixing them with
                              <code>urn:uuid:</code>.</para>
                     </listitem>
                  </itemizedlist>
               </para>
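               <para>The following values are purely illustrative (none comes from an actual TAN
                  file). They suggest what each kind of identifier can look like when placed in the
                     <code>&lt;IRI></code> element used elsewhere in these guidelines. The UUID is
                  the sample value given in RFC 4122; the <code>example.org</code> addresses are
                  placeholders:<programlisting>&lt;!-- IRI containing a non-ASCII character -->
&lt;IRI>http://example.org/résumé&lt;/IRI>
&lt;!-- URL (which is also a URI) -->
&lt;IRI>https://example.org/people/dave-smith&lt;/IRI>
&lt;!-- URN in the isbn namespace -->
&lt;IRI>urn:isbn:0451450523&lt;/IRI>
&lt;!-- UUID expressed as a URN -->
&lt;IRI>urn:uuid:f81d4fae-7dec-11d0-a765-00a0c90e6ab8&lt;/IRI></programlisting></para>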
               <para>The TAN format makes extensive use of all the above. See also <xref
                     xlink:href="#tag_urn"/>.</para>
               <section xml:id="rdf_and_lod">
                  <title>Resource Description Framework (RDF) and Linked Open Data</title>
                  <section>
                     <title>What are they?</title>
                     <para>Identifiers are used in many contexts for many purposes. One such purpose
                        is called Linked Open Data (LOD), also known as the Semantic Web, which aims
                        to network data across projects. It relies upon a very simple data model
                        called Resource Description Framework (RDF), recommended by the World Wide
                        Web Consortium (W3C). The term "Resource"—the R in RDF—refers to any person,
                        place, concept—anything at all, whether you think of it as a resource or
                        not. "Description" is overly specific, too, since RDF was designed to
                        support general assertions, descriptive or not. Perhaps it is easiest to
                        think of RDF as a standardized way to make assertions, as if the name were
                        simply "Assertion Framework."</para>
                     <para>The RDF data model rests upon the concept of a statement, made of three
                        parts: subject, predicate, and object. Subjects and predicates take
                        identifiers that name things. The object may take an identifier or just
                        data. As people independently identify concepts with the same URLs, they
                        create RDF datasets that can be combined, synthesized, and compared. RDF
                        statements found across the web allow inferences no individual project could
                        ever anticipate. </para>
                     <para>The Semantic Web recommends the use of URLs as identifiers. That way, if
                        a computer encounters a URL naming a concept, it can be programmed to go to the
                        web resource and retrieve other RDF statements, recursively. So URL
                        identifiers look like a web page address (e.g., <code>http://...</code>),
                        but they are first and foremost names for things. Ideally, those URLs will
                        still name those things after the domain name expires and the web resource
                        cannot be found. </para>
                     <para>Although RDF statements must be made of only three components, it is
                        possible in a roundabout way to create more complex assertions. In one
                        technique, the assertion itself is given a URL, and then RDF statements are
                        made about the assertion. Such assertions are in some cases not easily
                        integrated with other RDF statements. Users who query an RDF database will
                        not find relevant complex RDF statements unless they build their queries to
                        anticipate such situations (or the query engine has been customized).</para>
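                     <para>The technique just described is commonly called reification. The
                        following RDF/XML sketch shows the general shape. It assumes placeholder
                           <code>example.org</code> URLs and a hypothetical property,
                           <code>ex:assertedBy</code>, neither of which belongs to any real
                        vocabulary:<programlisting>&lt;rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:ex="http://example.org/terms/">
   &lt;!-- The assertion itself is given a URL... -->
   &lt;rdf:Statement rdf:about="http://example.org/assertions/1">
      &lt;rdf:subject rdf:resource="http://example.org/texts/y"/>
      &lt;rdf:predicate rdf:resource="http://example.org/terms/author"/>
      &lt;rdf:object rdf:resource="http://example.org/people/x"/>
      &lt;!-- ...so that a further statement can be made about it. -->
      &lt;ex:assertedBy rdf:resource="http://example.org/people/z"/>
   &lt;/rdf:Statement>
&lt;/rdf:RDF></programlisting></para>
                     <para>A query that looks only for the plain three-part statement about text y
                        and person x will not find this structure, which is the integration problem
                        noted above.</para>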
                  </section>
                  <section>
                     <title>TAN Claims and RDF</title>
                     <para>Much of TAN can be converted to RDF statements. In fact, TAN may be one
                        of the most human-friendly ways to read and write RDF. For example, consider
                        how one might express "Person X's name is 'Dave Smith'." Compare this
                        snippet (taken from <link
                           xlink:href="http://linkeddatabook.com/editions/1.0/"/>), written in
                        Turtle, the RDF syntax generally regarded as the most human-readable,
                        ...<programlisting>@prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#> . 
@prefix foaf: &lt;http://xmlns.com/foaf/0.1/> . 

&lt;http://biglynx.co.uk/people/dave-smith> 
rdf:type foaf:Person ; 
foaf:name "Dave Smith" .</programlisting></para>
                     <para>...with the TAN
                        equivalent:<programlisting>&lt;person>
   &lt;IRI>http://biglynx.co.uk/people/dave-smith&lt;/IRI>
   &lt;name>Dave Smith&lt;/name>
&lt;/person></programlisting></para>
                     <para>These TAN and RDF expressions are interchangeable. </para>
                     <para>For more complex claims, however, it is not yet clear whether every
                        assertion in TAN can be losslessly converted to the RDF model. The
                        difficulty arises most often in the TAN-A <code><link linkend="element-claim"
                              >&lt;claim></link></code>, which is designed to allow scholarly
                        assertions and claims that are difficult or impossible to express in RDF.
                        For example, RDF does not allow one to say "Person X is not the author of
                        text Y," but TAN does. </para>
                     <para>TAN claims can also be quite complex. Whereas the standard RDF claim
                        consists of three components—subject, predicate, object—most TAN claims have
                        more. Every TAN claim must have at the minimum: a claimant (no RDF
                        counterpart; the person, organization, or algorithm that asserts the claim),
                        a subject (counterpart to RDF subject), and a verb (counterpart to RDF
                        predicate). Verbs can be defined to permit, require, or disallow other claim
                        components, such as adverbs or objects, many of which are permitted by
                        default. Most TAN claims involve more than three components, so converting a
                        TAN claim to RDF requires creating a complex RDF statement (see previous
                        section).</para>
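                     <para>The following sketch suggests how the negative claim mentioned above
                        might combine more than three components. The attribute names and every
                        value (<code>scholar1</code>, <code>personX</code>,
                           <code>is-author-of</code>, <code>textY</code>) are assumptions made for
                        illustration, not a statement of the schema; consult <code><link
                              linkend="element-claim">&lt;claim></link></code> for the authoritative
                        syntax:<programlisting>&lt;!-- Hypothetical: claimant + subject + adverb + verb + object, five components rather than three -->
&lt;claim claimant="scholar1" subject="personX" adverb="not" verb="is-author-of" object="textY"/></programlisting></para>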
                     <para>Many TAN claims involve textual subjects or objects. References to parts
                        of text can be quite complex, and they must be made with reference to other
                        entities. It is doubtful whether a given specific textual subject or object can
                        be satisfactorily reduced to an unambiguous IRI, because such an IRI would
                        need to include a mechanism to resolve the meaning of the syntax. Such an
                        IRI must not only explain the work's reference system, but also identify the
                        chosen version, scriptum, and perhaps token definition and numeration
                        system. Many texts have more than one "canonical" reference system, so an
                        IRI might point to two different textual passages, thereby breaking a
                        cardinal rule of IRIs: although an entity may be given multiple IRIs, it is
                           <emphasis>never</emphasis> acceptable for an IRI to be ambiguous.</para>
                     <para>For more details see <xref linkend="TAN-A"/> and <code><link
                              linkend="element-claim">&lt;claim></link></code>.</para>
                  </section>
                  <section>
                     <title>Further reading</title>
                     <para>
                        <itemizedlist>
                           <listitem>
                              <para><link xlink:href="https://www.w3.org/RDF/">W3C
                                    recommendation</link></para>
                           </listitem>
                           <listitem>
                              <para><link xlink:href="http://linkeddata.org/">Linked
                                 Data</link></para>
                           </listitem>
                           <listitem>
                              <para><link xlink:href="http://lov.okfn.org/dataset/lov/">Linked Open
                                    Vocabularies</link></para>
                           </listitem>
                        </itemizedlist>
                     </para>
                  </section>
               </section>
               <section xml:id="tag_urn">
                  <title>Tag URNs</title>
                  <para>TAN files make extensive use of tag URNs (see <xref
                        xlink:href="#IRIs_and_linked_data"/>). In fact, TAN's namespace is a tag URN
                        (<xref linkend="namespace"/>). A <link xlink:href="http://www.taguri.org"
                        >tag URN</link> has two parts:</para>
                  <para>
                     <orderedlist>
                        <listitem>
                           <para><emphasis role="bold">Namespace.</emphasis>
                              <code>tag:</code> + an e-mail address or domain name owned by the
                              person or organization that has authorized the creation of the TAN
                              file + <code>,</code> + an arbitrary day on which that address or
                              domain name was owned + <code>:</code>. The day is expressed in the
                              form <code>YYYY-MM-DD</code>, <code>YYYY-MM</code>, or
                                 <code>YYYY</code>. A missing <code>MM</code> or <code>DD</code> is
                              implicitly assigned the value of <code>01</code>.</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Name of the TAN file.</emphasis> An arbitrary
                              string, unique within the chosen namespace, that the namespace owner
                              picks as a label for the entire file and related versions. It can be
                              the same as the filename, but it is good practice to choose something
                              else, because filenames are liable to change. You should pick a name
                              that is at least somewhat intelligible to human readers.</para>
                        </listitem>
                     </orderedlist>
                  </para>
                  <para>Although you may use any tag URN coined by someone else, you may create a
                     tag URN only in namespaces you own.</para>
                  <para>Great care must be taken in choosing the name, because you are the sole
                     guarantor of its uniqueness. <emphasis role="italic">It is permissible for
                        something to have multiple identifiers, but never acceptable for an
                        identifier to name more than one thing.</emphasis> It is a good practice to
                     keep a master checklist of tag URNs you have created. If you find yourself
                     forgetting, or think you run the risk of creating duplicate tag URNs, you
                     should start afresh by creating a new namespace for your tag URNs, if only by
                     changing the date in the tag URN namespace.</para>
                  <para>
                     <example>
                        <title>Tag URNs</title>
                        <programlisting>tag:jan@example.com,1999-01-31:TAN-T001
tag:example.com,2001-04:hamlet-tan-t
tag:evagriusponticus.net,2014:tan-a-lm:Evagrius_Praktikos_grc_Guillaumonts
tag:bbrb@example.org,1995-04-01:pos-grc</programlisting>
                        <para>The first example comes from someone who owned the email address
                              <code>jan@example.com</code> on January 31, 1999 (at the stroke of
                           midnight, Coordinated Universal Time). The other examples follow a
                           similar logic. The namespaces of the second and third examples are tied
                           to the owners of specific domain names. The <code>2014</code> in the third
                           example is shorthand for the first second of January 1, 2014.</para>
                     </example>
                  </para>
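                  <para>The two-part structure can be checked roughly with a regular expression. The
                     sketch below assumes an XSLT 2.0 or later stylesheet and uses a deliberately
                     simplified pattern held in a hypothetical variable,
                        <code>tag-urn-sketch</code>. It is not the pattern TAN itself applies, and
                     it does not verify that the month and day are real calendar
                     values:<programlisting>&lt;!-- tag: + optional mailbox + domain + , + date + : + name -->
&lt;xsl:variable name="tag-urn-sketch" as="xs:string"
   select="'^tag:(\S+@)?[a-zA-Z0-9.\-]+,\d{4}(-\d{2}){0,2}:\S+$'"/>
&lt;!-- true: the namespace and the name are both present -->
&lt;xsl:value-of select="matches('tag:example.com,2001-04:hamlet-tan-t', $tag-urn-sketch)"/>
&lt;!-- false: the namespace lacks the comma and date -->
&lt;xsl:value-of select="matches('tag:example.com:hamlet-tan-t', $tag-urn-sketch)"/></programlisting></para>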
                  <para>TAN has adopted tag URNs over URLs for several reasons:</para>
                  <para>
                     <itemizedlist>
                        <listitem>
                           <para><emphasis role="bold">Permanence.</emphasis> Authors of TAN data
                              are creating files that are meant to be relevant for decades and
                              centuries from now, well after most domain names today have changed
                              ownership or fallen into obsolescence, and well after the creators are
                              dead. URLs are not built for such permanence. </para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Responsibility.</emphasis> The TAN format
                              requires every piece of data to be attributable to someone (a person,
                              a group of persons, or an algorithm). A tag URN connects the
                              identifier with the responsible person or group. URLs cannot provide
                              such support.</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Accessibility.</emphasis> Tag URNs have
                              almost no barriers. They can be created by anyone who has an email
                              address. No one has to register with a central authority. You can
                              begin naming anything you want, any time you want, without seeking
                              anyone's approval, and without paying anything.</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Ease</emphasis>. Tag URNs are easy to use.
                              Many potential TAN authors never have owned a domain name, and never
                              will. Further, many of those who do own domain names cannot or do not
                              wish to configure, populate, maintain, and troubleshoot servers with
                              the referral mechanisms recommended by Semantic Web advocates (see
                                 <xref xlink:href="#rdf_and_lod"/>).</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Scholarly citation norms</emphasis>. In the
                              Semantic Web, the conflation of URL <emphasis>qua</emphasis> name with
                              URL <emphasis>qua</emphasis> location is considered by many a virtue
                              because the single string does double duty, both naming the resource
                              and pointing to a location where more can be learned. Although the
                              combination is elegant from an engineering perspective, it is
                              confusing to others: URLs are commonly thought to be purely locations
                              for data, not names for things. It also goes against an important
                              principle in scholarly bibliographies, namely, the name of a cited
                              publication should always be distinguished from where it might be
                              found. In scholarly citation practice, a name and a location should
                              always be disambiguated.</para>
                        </listitem>
                     </itemizedlist>
                  </para>
                  <para>Further reading:<itemizedlist>
                        <listitem>
                           <para><link xlink:href="https://tools.ietf.org/html/rfc4151">RFC
                                 4151</link>, the official definition of tag URNs</para>
                        </listitem>
                     </itemizedlist></para>
               </section>
            </section>
            <section xml:id="regular_expressions">
               <title>Regular Expressions</title>
               <para>Regular expressions are patterns for searching text. The term <emphasis
                     role="italic">regular</emphasis> here does not mean ordinary. Rather, alluding
                  to the Latin root <emphasis role="italic">regula</emphasis> (rule), it refers to a
                  rule-based method of finding and replacing text through patterns. Regular
                  expressions come in different flavors, and have several layers of complexity. TAN
                  regular expressions adhere closely to the <link
                     xlink:href="http://www.w3.org/TR/xslt-30/#regular-expressions">recommendation
                     of XSLT 3.0</link> (XML Schema Datatypes plus some extensions), as outlined in
                     <link xlink:href="http://www.w3.org/TR/xpath-functions-30/#regex-syntax">XPath
                     Functions 3.0</link>. <caution>
                     <para>XML Schema Datatypes define regular expressions differently than does
                        Perl, whose flavor of regular expression is one of the most common. For
                        example, the pipe symbol, |, is treated as a word character in XML regular
                        expressions (<code>\w</code>) but as a non-word character in Perl.
                        Conversely, the underscore is a non-word character in XML regular
                        expressions but a word character in Perl. For convenience,
                        here are the word classifications for codepoints U+0020..U+00FF according to
                        XML (and therefore TAN):</para>
                     <para><emphasis role="bold">Word characters </emphasis>(<code>\w</code>):
                           <code>$ + 0 1 2 3 4 5 6 7 8 9 &lt; = > A B C D E F G H I J K L M N O P Q
                           R S T U V W X Y Z ^ ` a b c d e f g h i j k l m n o p q r s t u v w x y z
                           | ~ ¢ £ ¤ ¥ ¦ ¨ © ª ¬ ® ¯ ° ± ² ³ ´ µ ¸ ¹ º ¼ ½ ¾ À Á Â Ã Ä Å Æ Ç È É Ê Ë
                           Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð
                           ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ</code>
                     </para>
                     <para><emphasis role="bold">Non-word characters </emphasis>(<code>\W</code>):
                           <code>! " # % &amp; ' ( ) * , - . / : ; ? @ [ \ ] _ { } ¡ § « ­ ¶ · »
                           ¿</code></para>
                     <para>Some of these decisions about what is word-like and what isn't may seem
                        counterintuitive or wrong. But at this point complaining will not change the
                        conventions. The distinction is a legacy that will endure. Just familiarize
                        yourself with decisions that look admittedly arbitrary.</para>
                  </caution></para>
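               <para>These classifications can be checked directly. A minimal sketch, assuming an
                  XSLT 2.0 or later processor that follows the XML Schema definition of
                     <code>\w</code>:<programlisting>&lt;!-- true: the pipe is a symbol, and symbols count as word characters -->
&lt;xsl:value-of select="matches('|', '^\w$')"/>
&lt;!-- false: the underscore is punctuation, hence a non-word character -->
&lt;xsl:value-of select="matches('_', '^\w$')"/>
&lt;!-- true: the same underscore belongs to the complementary class -->
&lt;xsl:value-of select="matches('_', '^\W$')"/></programlisting></para>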
               <para>A regular expression search pattern is treated just like a conventional search
                  pattern until the computer reaches a special character: <code>. [ ] \ | - ^ $ ? *
                     + { } ( )</code>. Here is a brief key to how those special characters behave in
                  regular expressions when they are first found. (Some of these special characters
                  change their meaning if they are found inside square brackets; on this point, see
                  the recommended reading below):</para>
               <para>
                  <table frame="all">
                     <title>Special characters in regular expressions</title>
                     <tgroup cols="2">
                        <colspec colname="c1" colnum="1" colwidth="1*"/>
                        <colspec colname="c2" colnum="2" colwidth="12.33*"/>
                        <thead>
                           <row>
                              <entry>Symbol</entry>
                              <entry>Meaning</entry>
                           </row>
                        </thead>
                        <tbody>
                           <row>
                              <entry><code>.</code></entry>
                              <entry>any character</entry>
                           </row>
                           <row>
                              <entry><code>|</code></entry>
                              <entry>or (union)</entry>
                           </row>
                           <row>
                              <entry><code>?</code></entry>
                              <entry>zero or one</entry>
                           </row>
                           <row>
                              <entry><code>*</code></entry>
                              <entry>zero or more</entry>
                           </row>
                           <row>
                              <entry><code>+</code></entry>
                              <entry>one or more</entry>
                           </row>
                           <row>
                              <entry><code>[ ]</code></entry>
                              <entry>a class of characters</entry>
                           </row>
                           <row>
                              <entry><code>( )</code></entry>
                              <entry>a group</entry>
                           </row>
                           <row>
                              <entry><code>\w</code></entry>
                              <entry>any word character</entry>
                           </row>
                           <row>
                              <entry><code>\W</code></entry>
                              <entry>any nonword character</entry>
                           </row>
                           <row>
                              <entry><code>\s</code></entry>
                              <entry>any of the four standard spacing characters: space (U+0020),
                                 tab (U+0009), newline (U+000A), carriage return (U+000D)</entry>
                           </row>
                           <row>
                              <entry><code>\S</code></entry>
                              <entry>anything not a spacing character</entry>
                           </row>
                           <row>
                              <entry><code>\d</code></entry>
                              <entry>any decimal digit (0-9, as well as the digits of other
                                 scripts)</entry>
                           </row>
                           <row>
                              <entry><code>\D</code></entry>
                              <entry>anything not a digit</entry>
                           </row>
                           <row>
                              <entry><code>\p{IsGujarati}</code></entry>
                              <entry>any character from the Unicode block named Gujarati</entry>
                           </row>
                           <row>
                              <entry><code>^</code></entry>
                              <entry>beginning of a line or string (doesn't capture any
                                 characters)</entry>
                           </row>
                           <row>
                              <entry><code>$</code></entry>
                              <entry>end of a line or string (doesn't capture any
                                 characters)</entry>
                           </row>
                           <row>
                              <entry><code>\\</code></entry>
                              <entry>backslash (an escaped escape character)</entry>
                           </row>
                           <row>
                              <entry><code>\^</code></entry>
                              <entry>a caret sign (must be escaped with the \)</entry>
                           </row>
                           <row>
                              <entry><code>\$</code></entry>
                              <entry>dollar sign (escaped)</entry>
                           </row>
                           <row>
                              <entry><code>\(</code></entry>
                              <entry>opening parenthesis (escaped)</entry>
                           </row>
                           <row>
                              <entry><code>\[</code></entry>
                              <entry>opening square bracket (escaped)</entry>
                           </row>
                        </tbody>
                     </tgroup>
                  </table>
               </para>
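               <para>To illustrate escaping, the following sketch uses the standard XPath function
                     <code>replace()</code> on a made-up string. In the pattern, an escaped
                  character stands for itself; in the replacement string, <code>$1</code> refers to
                  the first
                  group:<programlisting>&lt;!-- \$ matches a literal dollar sign; yields: Total (net): USD 5 -->
&lt;xsl:value-of select="replace('Total (net): $5', '\$(\d+)', 'USD $1')"/>
&lt;!-- \( and \) match literal parentheses; yields: Total net: $5 -->
&lt;xsl:value-of select="replace('Total (net): $5', '\((\w+)\)', '$1')"/></programlisting></para>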
               <para>Some examples:</para>
               <table frame="all">
                  <title>Examples of Regular Expressions</title>
                  <tgroup cols="3">
                     <colspec colname="newCol1" colnum="1" colwidth="1*"/>
                     <colspec colname="c1" colnum="2" colwidth="1.48*"/>
                     <colspec colname="c2" colnum="3" colwidth="6.59*"/>
                     <thead>
                        <row>
                           <entry>Expression</entry>
                           <entry>Meaning</entry>
                           <entry>What the expression matches when applied to "Wi-fi, good. A_hem*
                              isn't!"</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry><code>^.+$</code></entry>
                           <entry>one whole line of characters</entry>
                           <entry>"Wi-fi, good. A_hem* isn't!"</entry>
                        </row>
                        <row>
                           <entry><code>[ae]</code></entry>
                           <entry>a or e</entry>
                           <entry>"e"</entry>
                        </row>
                        <row>
                           <entry><code>[a-e]</code></entry>
                           <entry>a, b, c, d, or e</entry>
                           <entry>"d", "e"</entry>
                        </row>
                        <row>
                           <entry><code>[^ae]+</code></entry>
                           <entry>one or more characters that are anything except a or e</entry>
                           <entry>"Wi-fi, good. A_h", "m* isn't!"</entry>
                        </row>
                        <row>
                           <entry><code>.i</code></entry>
                           <entry>any character followed by i.</entry>
                           <entry>"Wi", "fi", " i"</entry>
                        </row>
                        <row>
                           <entry><code>(.i)</code></entry>
                           <entry>when a character followed by an i is found treat it as a capture
                              group (used only in a search pattern)</entry>
                           <entry>"Wi", "fi", " i"</entry>
                        </row>
                        <row>
                           <entry><code>[aeiou]\w*</code></entry>
                           <entry>any lowercase vowel along with every word character that
                              follows</entry>
                           <entry>"i", "i", "ood", "em", "isn"</entry>
                        </row>
                        <row>
                           <entry><code>[t*].</code></entry>
                           <entry>any t or * and the following character</entry>
                           <entry>"* ", "t!" Note that the asterisk, if inside a character class,
                              represents itself.</entry>
                        </row>
                        <row>
                           <entry><code>\s+</code></entry>
                           <entry>one or more space characters</entry>
                           <entry>" ", " ", " "</entry>
                        </row>
                        <row>
                           <entry><code>\w+</code></entry>
                           <entry>one or more word characters</entry>
                           <entry>"Wi", "fi", "good", "A_hem", "isn", "t"</entry>
                        </row>
                        <row>
                           <entry><code>\W+</code></entry>
                           <entry>one or more nonword characters</entry>
                           <entry>"-", ", ", ". ", "* ", "'", "!"</entry>
                        </row>
                        <row>
                           <entry><code>[^q]+</code></entry>
                           <entry>one or more characters that are not a q</entry>
                           <entry>"Wi-fi, good. A_hem* isn't!"</entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table>
               <para>The examples above provide a taste of how regular expressions are constructed
                  and read.</para>
               <warning xml:id="reg_exp_and_comb_chars">
                  <title>Regular Expressions and Combining Characters</title>
                  <para>A regular expression might be ambiguous in the context of combining
                     characters. Suppose we have a string of three characters, áb (i.e., an acute
                     accent over the a, <code>&amp;#x61;&amp;#x301;&amp;#x62;</code>). The regular
                     expression <code>a.</code> will include the b in some search engines but not
                     in others.</para>
                  <para>Unicode has differentiated three levels of support for regular expressions
                     (see <link xlink:href="http://www.unicode.org/reports/tr18/">official
                        report</link>). TAN guarantees only level-one conformance. Combining
                     characters fall in level two. In TAN, character counts depend exclusively upon
                     base characters, not combining ones (see <xref linkend="combining_characters"
                     />).</para>
               </warning>
               <para>TAN includes several functions that usefully extend XML regular expressions.
                  See <xref xlink:href="#vkft-regex-ext-tan"/>.</para>
               <para>Further reading:<itemizedlist>
                     <listitem>
                        <para>Various <link
                              xlink:href="http://www.google.com/search?q=tutorial+regular+expressions"
                              >tutorials on Regular Expressions</link></para>
                     </listitem>
                     <listitem>
                        <para>Wikipedia, <link
                              xlink:href="http://en.wikipedia.org/wiki/Regular_expression">Regular
                              Expressions</link></para>
                     </listitem>
                     <listitem>
                        <para><link xlink:href="http://www.w3.org/TR/xslt-30/#regular-expressions"
                              >Regular Expressions in XSLT 3.0</link></para>
                     </listitem>
                     <listitem>
                        <para><link xlink:href="http://www.unicode.org/reports/tr18/">Unicode and
                              Regular Expressions</link></para>
                     </listitem>
                     <listitem>
                        <para><link xlink:href="http://www.w3.org/TR/xmlschema-2/#regexs">XML Schema
                              Datatypes</link></para>
                     </listitem>
                     <listitem>
                        <para><link
                              xlink:href="http://www.balisage.net/Proceedings/vol25/html/Kalvesmaki01/BalisageVol25-Kalvesmaki01.html"
                              >A New \u: Extending XPath Regular Expressions for Unicode</link>
                        </para>
                     </listitem>
                  </itemizedlist></para>
            </section>
         </section>
      </chapter>
      <chapter xml:id="class_common">
         <title>Patterns and Structures Common to All TAN Encoding Formats</title>
         <para>This chapter provides general background to the elements and attributes that are
            common to all TAN files. For more detailed discussion, see <xref
               linkend="elements-attributes-and-patterns"/>.</para>
         <para>This chapter does not discuss TAN catalog files, on which see <xref
               linkend="catalog-files"/>.</para>
         <section xml:id="patterns">
            <title>Common Patterns</title>
            <section xml:id="pattern-iri_and_name">
               <title>IRI + name Pattern</title>
               <para>Both humans and computers need to read and write TAN metadata. Very often what
                  is readable to humans is unreadable to computers, and vice versa. So the TAN
                  format requires that all metadata be provided whenever possible in both forms.
                  Although this rule may appear to introduce redundancy and therefore opportunities
                  for error, the clarity is critical. It is the only way at present to ensure that
                  any person or algorithm that approaches the data can parse and use it. In
                  addition, doubly expressed metadata provides a safeguard much like a checksum:
                  human- and computer-readable descriptions should agree. Any discrepancy signals
                  an error that should be checked.</para>
               <para>Some metadata, such as that inside <code><link linkend="element-comment"
                        >&lt;comment></link></code> or <code><link linkend="element-change"
                        >&lt;change></link></code>, are neither easily nor profitably translated
                  into a computer-actionable string. In such cases only the human-readable form is
                  required. Other metadata involve regular expressions (e.g., <code><link
                        linkend="attribute-pattern">@pattern</link></code>) or ISO-compliant dates
                  (e.g., <code><link linkend="attribute-when">@when</link></code>), both of which
                  are well formed and are usually human-legible. Such data are not repeated,
                  although they may be explained via <code><link linkend="element-desc"
                        >&lt;desc></link></code> or <code><link linkend="element-comment"
                        >&lt;comment></link></code>.</para>
               <para>Those exceptions aside, all other metadata take what is called the <emphasis
                     role="italic">IRI + name</emphasis> pattern: one or more <code><link
                        linkend="namespace">&lt;IRI></link></code>s and <code><link
                        linkend="element-name">&lt;name></link></code>s and zero or more <code><link
                        linkend="element-desc">&lt;desc></link></code>s. This is the core pattern
                  for nearly all TAN vocabulary items.</para>
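               <para>As a minimal sketch of the pattern, a vocabulary item for a person might look
                  like the following. The <code>xml:id</code> value, the tag URN, the name, and the
                  description are hypothetical placeholders chosen for illustration, not taken from
                  any real TAN file.</para>
               <para>
                  <programlisting>&lt;person xml:id="editor1">
   &lt;!-- computer-readable form -->
   &lt;IRI>tag:example.com,2014:person:editor1&lt;/IRI>
   &lt;!-- human-readable form -->
   &lt;name>Jane Example&lt;/name>
   &lt;!-- optional further description -->
   &lt;desc>Hypothetical editor, used here for illustration only&lt;/desc>
&lt;/person></programlisting>
               </para>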
            </section>
            <section xml:id="digital_entity_metadata">
               <title>Digital Entity Metadata Pattern</title>
               <para>Some entities identified by the <xref linkend="pattern-iri_and_name"/> will be
                  digital resources. In those cases, the IRI + name Pattern is extended.</para>
               <para>There must be one or more <code><link linkend="element-location"
                        >&lt;location></link></code>s, with <code><link linkend="attribute-href"
                        >@href</link></code> and <code><link linkend="attribute-accessed-when"
                        >@accessed-when</link></code>, which signal where the resource is and when
                  it was last consulted. In validation, only the first document available will be
                  used. Extra <code><link linkend="element-location">&lt;location></link></code>s
                  might prove helpful for applications.</para>
               <para>There may be an optional <code><link linkend="element-checksum"
                        >&lt;checksum></link></code>, to more accurately specify which version of a
                  file was consulted.</para>
               <para>If the entity is a TAN file, then <code><link linkend="namespace"
                        >&lt;IRI></link></code> (one and only one) must be a valid tag URN that
                  matches the <code><link linkend="attribute-id">@id</link></code> value of the TAN
                  file being referred to. If the entity is not a TAN file, then any IRI may be used,
                  including its resolved URL.</para>
               <para><code><link linkend="attribute-accessed-when">@accessed-when</link></code>
                  indicates when a file was last accessed. During validation, the target file will
                  be checked. Any changes before that date will be ignored, and any after will be
                  reported, normally as warnings. See <xref xlink:href="#versioning-tan-files"
                  />.</para>
               <para>All these requirements may seem excessive, since in other contexts (HTML, TEI),
                  one needs simply a link, via <code>@href</code> or <code>@src</code>. But TAN
                  files are meant to remain valid long after their creation, when a <code><link
                        linkend="attribute-href">@href</link></code> may point to a broken link. An
                        <code><link linkend="namespace">&lt;IRI></link></code> might allow one to
                  find a missing file, and it also provides a check in case the original file has
                  been deleted and another, with a different name, has taken its place.</para>
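               <para>The extended pattern might be sketched as follows for a reference to another TAN
                  file. The wrapper element <code>&lt;source></code> is an assumption made for
                  illustration, as are the tag URN, the name, the paths, and the dates. The optional
                     <code>&lt;checksum></code> is left out because its content is not described in
                  this section.</para>
               <para>
                  <programlisting>&lt;source>
   &lt;!-- must match the @id of the TAN file being referred to -->
   &lt;IRI>tag:example.com,2014:tan-t:sample-text&lt;/IRI>
   &lt;name>Hypothetical TAN-T transcription of a sample text&lt;/name>
   &lt;!-- in validation, only the first document available is used -->
   &lt;location href="../transcriptions/sample-text.xml" accessed-when="2020-06-01"/>
   &lt;location href="http://example.com/tan/sample-text.xml" accessed-when="2020-06-01"/>
&lt;/source></programlisting>
               </para>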
            </section>
            <section xml:id="edit_stamp">
               <title>Edit Stamp</title>
               <para>Most TAN elements allow for an optional edit stamp, an <code><link
                        linkend="attribute-ed-who">@ed-who</link></code> and an <code><link
                        linkend="attribute-ed-when">@ed-when</link></code>, stating who created or
                  edited the enclosed data and when. Neither attribute is allowed without the other. </para>
               <para><code><link linkend="attribute-ed-when">@ed-when</link></code> is one of the
                  attributes that help determine a file's version. See <xref
                     xlink:href="#versioning-tan-files"/>.</para>
               <para>An edit stamp is much like a <code><link linkend="element-change"
                        >&lt;change></link></code> without a description. The attributes simply mark
                  the element where a change has been made. If a description of the alteration is
                  considered necessary, <code><link linkend="element-change"
                     >&lt;change></link></code> should be used, perhaps in addition to the edit
                  stamp.</para>
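               <para>A minimal illustration of an edit stamp follows. It assumes that
                     <code>editor1</code> identifies a person declared elsewhere in the file and that
                  the date is written in ISO form. Both values are hypothetical.</para>
               <para>
                  <programlisting>&lt;!-- both attributes must appear together -->
&lt;name ed-who="editor1" ed-when="2020-06-01">Jane Q. Example&lt;/name></programlisting>
               </para>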
            </section>
         </section>
         <section xml:id="structure">
            <title>Overall Structure</title>
            <para>All TAN-compliant files, no matter the type or class, follow a common basic
               structure: (1) a prolog with at least two processing instruction nodes; (2) a root
               element; and (3) a head, a body, and an optional teiHeader and tail.</para>
            <para><emphasis role="italic">Prolog and processing instruction nodes</emphasis>: The
               standard prolog of every XML file should begin: <code>&lt;?xml version="1.0"
                  encoding="UTF-8"?></code>
               <note>
                  <para>XML version 1.1 is a permissible alternative, and
                        <code>encoding="UTF-8"</code> is optional.</para>
               </note></para>
            <para>After that come two processing instructions specifying the two schema files
               required for validation:<itemizedlist>
                  <listitem>
                     <para><code>&lt;?xml-model href="[PATH]/[ROOT-ELEMENT-NAME].rn[g OR
                           c]"?></code></para>
                  </listitem>
                  <listitem>
                     <para><code>&lt;?xml-model
                        href="[PATH]/[ROOT-ELEMENT-NAME].sch"?></code></para>
                  </listitem>
               </itemizedlist></para>
            <para>The first processing instruction node points to the RELAX-NG schema that declares
               the major, structural rules. The second points to the finely tuned rules, written in
               Schematron. Both processing instructions are required, except in systems where those
               processing instructions are implicitly understood (e.g., an oXygen project or
               framework). <code>[PATH]</code> represents the pathname to the schema file, whether
               local or on a server, and <code>[ROOT-ELEMENT-NAME]</code> stands for the name of the
               file's root element (the element that is the ancestor of all other elements in the
               document and the descendant of none). It is your choice whether you use
                  <code>.rnc</code> or <code>.rng</code> as the extension for the RELAX-NG schema.
               The former is the compact syntax and the latter, the XML format. They are equivalent.
               The schemas are written initially in the compact syntax, then converted to the XML
               format.</para>
            <para>TAN files permit three different levels of validation: <code>terse</code>,
                  <code>normal</code>, and <code>verbose</code>. A phase may be specified with a
               pseudoattribute <code>phase</code> in the prolog, e.g., <code>&lt;?xml-model
                  href="TAN-A.sch" phase="verbose"?></code>. But it is customary not to specify the
               phase, since most users will want to pick the level of validation desired at a given
               time. Verbose takes the longest time, and terse the shortest. Verbose provides the
               most feedback, terse the least. But some files will not show any difference in
               results from one phase to the next. For more on validation, see <xref
                  xlink:href="#validating_tan_files"/>.</para>
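            <para>Put together, the prolog of a TAN-T file might look like the sketch below. The
               relative path <code>../schemas/</code> is a hypothetical location for the schema
               files, and no phase is specified, following the custom described above. Some
               validation tools may expect further pseudoattributes in the processing instructions
               (e.g., <code>schematypens</code>); they are omitted here, in keeping with the
               templates given above.</para>
            <para>
               <programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="../schemas/TAN-T.rnc"?>
&lt;?xml-model href="../schemas/TAN-T.sch"?></programlisting>
            </para>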
            <para><emphasis role="italic">Root element</emphasis>: The name of the root element
               identifies the type of TAN file:<table frame="all">
                  <title>Root TAN elements</title>
                  <tgroup cols="3">
                     <colspec colname="c1" colnum="1" colwidth="1.19*"/>
                     <colspec colname="c2" colnum="2" colwidth="1.19*"/>
                     <colspec colname="newCol3" colnum="3" colwidth="1*"/>
                     <thead>
                        <row>
                           <entry>Root element name</entry>
                           <entry>Type of data</entry>
                           <entry>TAN class</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry><code><link linkend="element-TAN-T"
                              >&lt;TAN-T></link></code></entry>
                           <entry>plain text transcriptions</entry>
                           <entry><link linkend="class_1">1</link></entry>
                        </row>
                        <row>
                           <entry><code>&lt;TEI></code></entry>
                           <entry>TEI transcriptions</entry>
                           <entry><link linkend="class_1">1</link></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-TAN-A"
                              >&lt;TAN-A></link></code></entry>
                           <entry>division-based alignments and annotations</entry>
                           <entry><link linkend="class_2">2</link></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-TAN-A-tok"
                                 >&lt;TAN-A-tok></link></code></entry>
                           <entry>token-based alignments</entry>
                           <entry><link linkend="class_2">2</link></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-TAN-A-lm"
                              >&lt;TAN-A-lm></link></code></entry>
                           <entry>lexico-morphological annotations</entry>
                           <entry><link linkend="class_2">2</link></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-TAN-mor"
                              >&lt;TAN-mor></link></code></entry>
                           <entry>part of speech / morphology patterns</entry>
                           <entry><link linkend="class_3">3</link></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-TAN-voc"
                              >&lt;TAN-voc></link></code></entry>
                           <entry>glossaries</entry>
                           <entry><link linkend="class_3">3</link></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-collection"
                                 >&lt;collection></link></code></entry>
                           <entry>catalog of TAN files</entry>
                           <entry><link linkend="class_3">3</link></entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table><note>
                  <para><code><link linkend="element-collection">&lt;collection></link></code> is
                     provided here only to complete the table. None of the material in this chapter
                     applies to this special class 3 format. See <xref linkend="catalog-files"
                     />.</para>
               </note></para>
            <para>Each root element takes a mandatory <code><link linkend="attribute-id"
                  >@id</link></code> and <code><link linkend="attribute-TAN-version"
                     >@TAN-version</link></code>. On <code><link linkend="attribute-id"
                  >@id</link></code>, see below. <code><link linkend="attribute-TAN-version"
                     >@TAN-version</link></code> must be <code>2020</code>, the current version of
               TAN.</para>
            <para>All TAN elements take the namespace <code>tag:textalign.net,2015:ns</code>. In
               most cases, this value is placed in the root element. (The only exceptions are
               TAN-TEI transcription files, which take as a default namespace
                  <code>http://www.tei-c.org/ns/1.0</code> everywhere but in <code>/TEI/head</code>,
               which takes the TAN namespace.) For more about namespaces, see <xref
                  linkend="namespace"/>.</para>
            <para><emphasis>Root element children:</emphasis> Most root elements take two mandatory
               children: <code><link linkend="element-head">&lt;head></link></code> and <code><link
                     linkend="element-body">&lt;body></link></code>, the latter containing data and
               the former, metadata (data about the data). Root elements of TAN-TEI files take three
               children: <code>&lt;teiHeader></code>, <code><link linkend="element-head"
                     >&lt;head></link></code>, and <code>&lt;text></code>. The apparent duplication
               of a head element is necessary: the <code>&lt;teiHeader></code> does not satisfy TAN
               metadata requirements. See <xref linkend="tan-tei"/>. </para>
            <para>All TAN files may take one final optional child, <link linkend="element-tail"
                     ><code>&lt;tail></code></link>, a private use element that allows any
               well-formed XML. It was introduced initially to experiment with methods of improving
               the efficiency of validation and applications, but it can be used for a variety of
               tasks or applications. Nothing in a TAN file should be dependent upon the <link
                  linkend="element-tail"><code>&lt;tail></code></link>. That is, if you are editing
               a TAN file and you add a <link linkend="element-tail"><code>&lt;tail></code></link>,
               assume that it will be disregarded by other users. Similarly, you may delete any TAN
               file's <link linkend="element-tail"><code>&lt;tail></code></link> without
               consequence.</para>
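            <para>The overall shape described in this section can be sketched as a skeleton. The
               value of <code>id</code> is a hypothetical tag URN, and the comments stand in for
               content that a real file would have to supply. The skeleton is not meant to
               validate as it stands.</para>
            <para>
               <programlisting>&lt;TAN-T xmlns="tag:textalign.net,2015:ns"
   id="tag:example.com,2014:tan-t:sample-text" TAN-version="2020">
   &lt;head>
      &lt;!-- metadata (data about the data) -->
   &lt;/head>
   &lt;body>
      &lt;!-- data -->
   &lt;/body>
   &lt;tail>
      &lt;!-- optional, private use: any well-formed XML -->
   &lt;/tail>
&lt;/TAN-T></programlisting>
            </para>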
            <section xml:id="tan-file-id">
               <title>Identifying TAN files: <code><link linkend="attribute-id"
                  >@id</link></code></title>
               <para>Every TAN file requires in its root element an <code><link
                        linkend="attribute-id">@id</link></code>, which must take the form of a tag
                  URN (see <xref linkend="tag_urn"/> for syntax). The file's <code><link
                        linkend="attribute-id">@id</link></code> is the primary way other TAN files
                  will refer to it, and it may be used in RDFa, JSON-LD, and linked open data (see
                     <xref linkend="IRIs_and_linked_data"/>).</para>
               <para>A tag URN begins with a namespace component, and concludes with the identifying
                  string. The namespace of <code><link linkend="attribute-id">@id</link></code> must
                  match at least one other tag URN namespace from the <code><link
                        linkend="element-IRI">&lt;IRI></link></code> of a <code><link
                        linkend="element-person">&lt;person></link></code> identified by <code><link
                        linkend="element-file-resp">&lt;file-resp&gt;</link></code>. See <xref
                     xlink:href="#responsibility"/>.</para>
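               <para>The two fragments below illustrate the matching requirement under assumed
                  values. Both tag URNs are hypothetical and share the namespace
                     <code>tag:example.com,2014</code>. The <code>&lt;file-resp></code> that
                  identifies the person is not shown, because its syntax is not covered in this
                  section.</para>
               <para>
                  <programlisting>&lt;!-- Fragment 1: opening tag of the root element -->
&lt;TAN-T xmlns="tag:textalign.net,2015:ns"
   id="tag:example.com,2014:tan-t:sample-text" TAN-version="2020">

&lt;!-- Fragment 2: elsewhere in the file, the person identified by &lt;file-resp> -->
&lt;person xml:id="editor1">
   &lt;IRI>tag:example.com,2014:person:editor1&lt;/IRI>
   &lt;name>Jane Example&lt;/name>
&lt;/person></programlisting>
               </para>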
               <para>In choosing a value for <code><link linkend="attribute-id">@id</link></code>
                  you might borrow the filename, but you probably should not, since files are
                  frequently renamed, often with good reason. A TAN file's <code><link
                        linkend="attribute-id">@id</link></code> should not be changed, especially
                  after public release. The name should remain permanent and stable, even after
                  updates. </para>
               <para>On occasion during editing, it will become clear that revisions are so deep
                  that the file is altogether a different kind of thing. If a previous version has
                  been published, then coining a new <code><link linkend="attribute-id"
                     >@id</link></code>
                  <emphasis>is</emphasis> advised, to make a clean break. You may always document
                  the connection by supplying <code><link linkend="element-predecessor"
                        >&lt;predecessor&gt;</link></code>, which establishes a line of
                  ancestry.</para>
               <para>If you take someone else's data and alter it, then you should <emphasis
                     role="italic">not</emphasis> change the <code><link linkend="attribute-id"
                        >@id</link></code>. To ensure that you are credited with any revisions you
                  make to the file (if you are allowed—see <code><link linkend="element-license"
                        >&lt;license&gt;</link></code>), you should add yourself as a <link
                     linkend="element-person"><code>&lt;person></code></link> and then document your
                  alterations through <link linkend="element-change"><code>&lt;change></code></link>
                  or <link linkend="attribute-ed-when"><code>@ed-when</code></link> and <link
                     linkend="attribute-ed-who"><code>@ed-who</code></link>. You might also add a
                        <code><link linkend="element-predecessor">&lt;predecessor&gt;</link></code>
                  element, pointing to a version of the file that predates your intervention.</para>
               <para>The <code><link linkend="attribute-id">@id</link></code> is the only
                  file-specific metadatum positioned outside <code><link linkend="element-head"
                        >&lt;head></link></code>. It is placed as rootward in the document as
                  possible to emphasize that it names the entire document.</para>
            </section>
            <section>
               <title xml:id="versioning-tan-files">TAN file versions</title>
               <para>The version of a TAN file is identified by the most recent date in a file's
                        <code><link linkend="attribute-when">@when</link></code>, <code><link
                        linkend="attribute-ed-when">@ed-when</link></code>, and <code><link
                        linkend="attribute-accessed-when">@accessed-when</link></code>. </para>
               <para>Whenever you change a TAN file that has already been published, provide at
                  least an edit stamp (<xref linkend="edit_stamp"/>) in the part of the file you
                  changed or in a <code><link linkend="element-comment">&lt;comment></link></code>
                  or <code><link linkend="element-change">&lt;change></link></code>, so that anyone
                  validating a TAN file dependent upon yours will be warned that changes have been
                  made. The user may then either continue to process the file (the changes may be
                  minor or inconsequential) or pause and see if anything on their end needs to be
                  changed. </para>
            </section>
         </section>
         <section xml:id="inheritable_attributes">
            <title>Attribute inheritability and priority</title>
            <para>Some attributes affect not merely their parent element but all their parent's
               descendants. This phenomenon is called <emphasis>inheritability</emphasis>.</para>
            <para>Most attributes are non-inheritable. That is, the attribute relates only to the
               parent element. Examples: <code><link linkend="attribute-xmlid"
               >@xml:id</link></code>, <code><link linkend="attribute-flags">@flags</link></code>.
               If TAN schema documentation for an attribute does not state anything about the
               inheritability of an attribute's values, it should be treated as
               non-inheritable.</para>
            <para>Most inheritable attributes are weakly inheritable. That is, inheritance stops at
               any descendant that has the same attribute. For example, an element with <code><link
                     linkend="attribute-xmllang">@xml:lang</link></code> set to <code>eng</code>
               specifies that its text nodes are in English, but it might contain another element
               whose <code><link linkend="attribute-xmllang">@xml:lang</link></code> is set to
                  <code>lat</code>. In that case, the text of the inner element will be marked as
               Latin, not English.</para>
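            <para>Weak inheritance can be illustrated with a simplified fragment. It assumes that
                  <code>@xml:lang</code> is permitted on the elements shown, and it omits other
               attributes a real <code>&lt;div></code> would carry, so it is not meant to validate
               as it stands.</para>
            <para>
               <programlisting>&lt;div xml:lang="eng">
   &lt;!-- no @xml:lang of its own: inherits eng -->
   &lt;div>This text is treated as English.&lt;/div>
   &lt;!-- its own @xml:lang stops the inheritance of eng -->
   &lt;div xml:lang="lat">Hic textus Latinus est.&lt;/div>
&lt;/div></programlisting>
            </para>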
            <para>Other inherited attributes are cumulative. That is, their values somehow combine.
               For example, if an element with <code><link linkend="attribute-cert"
                  >@cert</link></code> wraps another, and each one has a <code><link
                     linkend="attribute-cert">@cert</link></code> value of <code>0.5</code>, it
               means that the wrapped element qualifies any claim it participates in as being only
               25% certain (compounded perhaps by other elements in the claim that are not
               completely certain). <code><link linkend="attribute-n">@n</link></code> in a
                     <code><link linkend="element-div">&lt;div></link></code> is indirectly
               cumulative for the purposes of resolving values of <code><link
                     linkend="attribute-ref">@ref</link></code>. Any given <code><link
                     linkend="element-div">&lt;div></link></code> has one or more implied
               references, formed by all permutations of concatenating values of inherited
                     <code><link linkend="attribute-n">@n</link></code>s.</para>
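            <para>As a rough illustration of how <code>@n</code> accumulates, consider the nested
               divisions below. The values are hypothetical and other attributes are omitted. The
               innermost <code>&lt;div></code> has an implied reference built from the inherited
               values 1, 2, and 3, in that order. How those values are delimited in a
                  <code>@ref</code> is not shown here.</para>
            <para>
               <programlisting>&lt;div n="1">
   &lt;div n="2">
      &lt;!-- implied reference combines 1, 2, and 3 -->
      &lt;div n="3">Text of the innermost division.&lt;/div>
   &lt;/div>
&lt;/div></programlisting>
            </para>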
            <para>Cumulative inherited attributes are infrequent. The documentation must be studied
               to understand how each one behaves.</para>
            <para>Some attributes take priority over others. This is important for
               interpretation. <code><link linkend="attribute-claimant">@claimant</link></code>, for
               example, has priority over <code><link linkend="attribute-cert">@cert</link></code>.
               That is, the two attributes in the same element are to be interpreted to mean:
                     "<code><link linkend="attribute-claimant">@claimant</link></code> has
                     <code><link linkend="attribute-cert">@cert</link></code> confidence about the
               following claim:...." (It does not mean that one is uncertain whether the claimant
               made such-and-such a claim.)</para>
         </section>
         <section xml:id="defining_tokens">
            <title>Defining Words and Tokens</title>
            <para>At the heart of interaction between class-1 and class-2 files is the need to
               identify words. This poses a problem at the outset. The term <emphasis role="italic"
                  >word</emphasis> is notoriously difficult to define, no matter the context or
               language. For example, "New York" and "didn't" can each be justifiably claimed to be
               one or two words. Furthermore, some scholars consider punctuation to be words (e.g.,
               commas in modern prose, representing "and"), whereas others ignore them as being
               anachronistic or capricious (e.g., commas inserted by a medieval scribe or a modern
               scholar into ancient Greek and Latin texts). In the end, the number of meanings for
               "word" reflects the diversity of scholarship.</para>
            <para>TAN follows the field of corpus linguistics and avoids <emphasis>word</emphasis>
               in favor of the proximate term <emphasis role="italic">token</emphasis>—one or more
               characters defined not according to grammar but according to a regular expression
               (see <xref linkend="regular_expressions"/>). </para>
            <para>In TAN, a token is purely a string definition, used as a segmenting and pointing
               mechanism. To define a token in TAN does not entail any linguistic commitments.
               Neither editors nor users of TAN data should infer that a <link linkend="element-tok"
                     ><code>&lt;tok></code></link> points to a morpheme, a lexeme, or any other
               linguistic entity. There will frequently be a fortuitous correlation between the two,
               but it is not guaranteed. </para>
            <para>TAN was developed with a concern for ancient literature, where punctuation is
               generally ignored as being late or not central to the text. Happily, even in
               contemporary use, most people ignore punctuation when they count words. Therefore the
               default <link linkend="element-token-definition"
                  ><code>&lt;token-definition></code></link> defines a token as being any continuous
               string of word characters (<code>\w</code>), the soft hyphen, the zero-width space,
               or the zero-width joiner, formally defined by <xref
                  xlink:href="#variable-token-definition-default"/>:</para>
            <para>
               <programlisting>&lt;token-definition regex="[\w&amp;#xad;&amp;#x200b;&amp;#x200d;]+"/></programlisting>
            </para>
            <para>This pattern closely resembles what is ordinarily thought of as words, but perhaps
               with some surprises (see above, <xref linkend="regular_expressions"/>). If no <link
                  linkend="element-token-definition"><code>&lt;token-definition></code></link> is
               explicitly given, the default token definition above will be used.</para>
            <para>If you are working with modern texts, where you may need to identify and count
               punctuation marks, try the built-in keyword <code>letters and punctuation</code>:</para>
            <para>
               <programlisting>&lt;token-definition regex="[\w&amp;#xad;&amp;#x200b;&amp;#x200d;]+|[^\w&amp;#xad;&amp;#x200b;&amp;#x200d;\s]"/></programlisting>
            </para>
            <para>This expression defines a token as either a sequence of word characters or any
               single character that is neither a word character nor white space. The string
                  <code>(I go!)</code> would have five tokens: <code>( I go ! )</code>.</para>
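            <para>As a rough illustration of how the two definitions differ, the listing below
               notes how the string <code>(I go!)</code> is segmented under each one. It assumes
               that the keyword is invoked through <link linkend="attribute-which"
                  ><code>@which</code></link>, as with other vocabulary items; consult the entry for
                  <link linkend="element-token-definition"
                  ><code>&lt;token-definition></code></link> for the authoritative syntax.</para>
            <para>
               <programlisting>&lt;!-- default definition: 2 tokens: I | go -->
&lt;!-- letters and punctuation: 5 tokens: ( | I | go | ! | ) -->
&lt;token-definition which="letters and punctuation"/></programlisting>
            </para>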
            <para>For other standard TAN token definitions see <xref
                  xlink:href="#vocabularies-token-definitions"/><link
                  linkend="element-token-definition"><code>&lt;token-definition></code></link>s. You
               may customize your own <link linkend="element-token-definition"
                     ><code>&lt;token-definition></code></link>. But keep in mind that TAN files
               were meant to be shared across fields and disciplines. You should define tokens in a
               way users of your texts expect. Specialized definitions make it difficult to compare
               the data in your TAN file with that in others. Two class-2 files annotating the same
               class-1 file cannot be easily compared or synthesized if they use different token
               definitions.</para>
         </section>
         <section xml:id="metadata_head">
            <title>Metadata (<code><link linkend="element-head">&lt;head></link></code>)</title>
            <para>No matter how much one TAN format differs from another, the metadata follows the
               same basic structure. Anyone getting a TAN file, no matter its class or type, is
               assumed to want to know, and therefore to find easily and predictably, the following:<orderedlist>
                  <listitem>
                     <para>the stable name of the file;</para>
                  </listitem>
                  <listitem>
                     <para>its version;</para>
                  </listitem>
                  <listitem>
                     <para>its sources;</para>
                  </listitem>
                  <listitem>
                     <para>other files upon which it depends or otherwise has an important
                        relationship;</para>
                  </listitem>
                  <listitem>
                     <para>the most significant moments in the editorial history;</para>
                  </listitem>
                  <listitem>
                     <para>the linguistic or scholarly conventions that have been adopted in
                        creating and editing the data;</para>
                  </listitem>
                  <listitem>
                     <para>the license, i.e., who holds what rights to the data, and what kind of
                        reuse is allowed;</para>
                  </listitem>
                  <listitem>
                     <para>the persons, organizations, or entities that helped create the data, and
                        the roles played by each.</para>
                  </listitem>
               </orderedlist></para>
            <para>To meet these needs completely, consistently, and predictably, the
                     <code><link linkend="element-head">&lt;head></link></code>, a mandatory child
               of the root element, follows a common pattern across all TAN formats, which
               simplifies work across large numbers and types of TAN files. The TAN <code><link
                     linkend="element-head">&lt;head></link></code>, intended to be concise and
               focused, compels you to provide metadata for the data governed by <code><link
                     linkend="element-body">&lt;body></link></code>, but it does not accommodate
               metadata for the metadata. That is, your metadata should focus on the data itself and
               not on other things. For example, <code><link linkend="element-head"
                  >&lt;head></link></code> requires you to name the people who helped create or edit
               the data, but you are not expected to describe them. Merely give good
                     <code><link linkend="element-IRI">&lt;IRI></link></code>s that point to
               authoritative sources of background information.<note>
                  <para>The principles above explain why the TEI extension of TAN requires two
                     heads, one for TEI and the other for TAN. The <code>&lt;teiHeader></code>
                     supports the creation of metadata that has little to no relevance to the
                     content of <code><link linkend="element-body">&lt;body></link></code>, has its
                     own unique structure, requires very few metadata, and is not
                     designed to incorporate IRIs. Although <code>&lt;teiHeader></code> and TAN's
                           <code><link linkend="element-head">&lt;head></link></code> have overlap,
                     they cannot be mapped onto each other. Each one serves its own purpose, so both
                     must be retained.</para>
               </note></para>
            <para>In what follows we provide a general overview of the TAN <code><link
                     linkend="element-head">&lt;head></link></code>, focusing on its general
               structure, and some of the principles that affect other parts of the TAN
               ecosystem.</para>
            <section>
               <title>Key Information</title>
               <para>Key information about the file as a whole is the first section of a <code><link
                        linkend="element-head">&lt;head></link></code>. This includes <code><link
                        linkend="element-name">&lt;name></link></code>, perhaps one or more
                        <code><link linkend="element-desc">&lt;desc></link></code>s, and perhaps one
                  or more <code><link linkend="element-master-location"
                     >&lt;master-location></link></code>s, which point to locations for
                  authoritative versions. <code><link linkend="element-master-location"
                        >&lt;master-location></link></code> is optional, but not if <code><link
                        linkend="element-to-do">&lt;to-do&gt;</link></code> (see below) is
                  empty.</para>
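               <para>A minimal sketch of this opening section might look like the following. The
                  name, description, and URL are hypothetical, and the use of <code>@href</code> on
                     <code><link linkend="element-master-location"
                     >&lt;master-location></link></code> is an assumption to be checked against the
                  element's entry.</para>
               <para>
                  <programlisting>&lt;head>
   &lt;name>Example transcription of a short poem&lt;/name>
   &lt;desc>Hypothetical file, used only to illustrate the opening of a head.&lt;/desc>
   &lt;master-location href="http://example.com/tan/poem1.xml"/>
   &lt;!-- declarations, networked files, and the remaining sections follow -->
&lt;/head></programlisting>
               </para>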
            </section>
            <section xml:id="key_declarations">
               <title>Key Declarations</title>
               <para>Each <code><link linkend="element-head">&lt;head></link></code> in a TAN file
                  has a declaration section, pertaining to how the file should be used: <code><link
                        linkend="element-license">&lt;license></link></code> and <code><link
                        linkend="element-numerals">&lt;numerals&gt;</link></code>.</para>
               <para><code><link linkend="element-license">&lt;license></link></code> stipulates the
                  license(s) under which the persons listed in its <code><link
                        linkend="attribute-licensor">@licensor</link></code> are releasing the data.
                     <emphasis role="bold">The license applies only to the data in <code><link
                           linkend="element-body">&lt;body></link></code>, not to its
                     sources.</emphasis> The distinction is important, and helpful. It is much
                  easier for you to decide and state the rights and license behind your own work
                  than to do so for that of others. Declaring who holds what rights over your
                  source(s) may be not only difficult but risky, and is therefore optional, best
                  handled in a <code><link linkend="element-desc">&lt;desc&gt;</link></code> or
                        <code><link linkend="element-comment">&lt;comment></link></code>.</para>
               <para>When using a TAN file, you should investigate the entire chain of rights. You
                  may find discrepancies between the license of a TAN file and that of its sources.
                  For example, you might create a thorough lexico-morphological analysis of a
                  20th-century novel, and legitimately release the TAN data under a public domain
                  license, even though the novel itself is under copyright. Users must be aware of
                  and respect licenses, and know that the license in a TAN file may not be the
                  license of its sources. </para>
               <para>TAN adopts the Creative Commons licenses as its default license vocabulary. See
                     <xref linkend="vocabularies-licenses"/>.</para>
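               <para>By way of illustration, a license declaration might be as short as the line
                  below. The license name <code>by_4.0</code> is assumed to be one of the names in
                  the standard license vocabulary, and <code>park</code> is a hypothetical id for a
                  person declared elsewhere in the <code><link linkend="element-head"
                     >&lt;head></link></code>.</para>
               <para>
                  <programlisting>&lt;!-- applies to the data in body, not to the sources -->
&lt;license which="by_4.0" licensor="park"/></programlisting>
               </para>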
               <para><code><link linkend="element-numerals">&lt;numerals&gt;</link></code> may be
                  used to declare whether an ambiguous numeral should be interpreted as an
                  alphabetic numeral or a Roman numeral (default). See the entry for <code><link
                        linkend="element-numerals">&lt;numerals&gt;</link></code> as well as the
                     <link xlink:href="#numeration-systems">section on numeration
                  systems</link>.</para>
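               <para>For example, a label such as <code>c</code> is ambiguous: read as a Roman
                  numeral it means 100, but read as an alphabetic numeral (assuming a = 1, b = 2, and
                  so on) it means 3. A file that prefers the alphabetic reading might declare
                  something like the following; the attribute name and value shown here are
                  assumptions, so verify them against the entry for <code><link
                        linkend="element-numerals">&lt;numerals&gt;</link></code>.</para>
               <para>
                  <programlisting>&lt;!-- assumed syntax: treat ambiguous numerals such as c as alphabetic (3), not Roman (100) -->
&lt;numerals priority="letters"/></programlisting>
               </para>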
               <para>Many TAN files allow in this section <code><link
                        linkend="element-token-definition">&lt;token-definition></link></code>,
                  which specifies a definition for tokens, perhaps tailored via <code><link
                        linkend="attribute-src">@src</link></code> to a specific class-2 file. See
                     <xref xlink:href="#defining_tokens"/> and <code><link
                        linkend="element-token-definition"
                  >&lt;token-definition></link></code>.</para>
            </section>
            <section xml:id="inclusions-and-vocabularies">
               <title>Networked Files</title>
               <para>The third major section of <code><link linkend="element-head"
                     >&lt;head></link></code> accommodates links and references to other files. Some
                  files are essential to processing the TAN file, while others are less
                  important.</para>
               <para>The two most critical types of files are marked by <code><link
                        linkend="element-inclusion">&lt;inclusion></link></code> and <code><link
                        linkend="element-vocabulary">&lt;vocabulary></link></code>. The files
                  pointed to by these elements should be considered constituent parts of the
                  dependent TAN file. In the validation process, failure to access one will be
                  treated as a fatal error.</para>
               <para><code><link linkend="element-inclusion">&lt;inclusion></link></code> and
                        <code><link linkend="element-vocabulary">&lt;vocabulary></link></code> were
                  developed to reduce duplication (and therefore potential error) in collections of
                  TAN files. Many if not most TAN files are created alongside or in the context of a
                  project, where certain data patterns are repeated. Explicit repetition from one
                  file to the next makes them prone to error. Changes might be made in one file but
                  not in another, introducing version conflicts. <code><link
                        linkend="element-inclusion">&lt;inclusion></link></code> and <code><link
                        linkend="element-vocabulary">&lt;vocabulary></link></code> provide a
                  specialized method of inclusion that leads to cleaner, smaller files.</para>
               <para>In general, you should first try using <code><link linkend="element-vocabulary"
                        >&lt;vocabulary></link></code>. If that element does not do what you want,
                  then try <code><link linkend="element-inclusion">&lt;inclusion></link></code>. It
                  is normally easier to diagnose a complex set of <code><link
                        linkend="element-vocabulary">&lt;vocabulary></link></code>s than a complex
                  set of <code><link linkend="element-inclusion"
                  >&lt;inclusion></link></code>s.</para>
               <section>
                  <title>Vocabularies</title>
                  <para>Oftentimes, from one file to the next, an editor needs to refer repeatedly
                     to a common set of things, e.g., manuscripts, works of literature, or persons
                     who helped edit the files. </para>
                  <para>Projects are advised to create their own <code><link
                           linkend="element-TAN-voc">&lt;TAN-voc&gt;</link></code> files populated
                     with commonly used vocabulary. Once set up, the TAN-voc file must be linked to
                     via a <code><link linkend="element-vocabulary">&lt;vocabulary></link></code> in
                     the <code><link linkend="element-head">&lt;head></link></code> of other TAN
                     files. Vocabulary items can then be invoked either by pointing to <code><link
                           linkend="element-name">&lt;name&gt;</link></code> values, or by assigning
                     an <code><link linkend="attribute-xmlid">@xml:id</link></code> to a vocabulary
                     item placed in the <link linkend="element-head"><code>&lt;head></code></link>'s
                           <code><link linkend="element-vocabulary-key"
                        >&lt;vocabulary-key></link></code>. If you draw upon <code><link
                           linkend="element-name">&lt;name&gt;</link></code>, you may make
                     alterations to capitalization, and hyphens, spaces, and underscores are treated
                     as interchangeable. Capitalization and spelling of <code><link
                           linkend="attribute-xmlid">@xml:id</link></code>, however, must be
                     strictly followed.</para>
                  <para>Vocabulary (TAN-voc) files tend to require frequent change and expansion, so
                     it is recommended that you depend upon only those TAN-voc files that are part
                     of your project, and not those from a different project.</para>
                  <para>In the host file, any attribute that takes multiple IDrefs, e.g.,
                           <code><link linkend="attribute-who">@who</link></code>, <code><link
                           linkend="attribute-type">@type</link></code>, <code><link
                           linkend="attribute-subject">@subject</link></code>, may mix
                     references to vocabulary items via <code><link linkend="attribute-xmlid"
                           >@xml:id</link></code> or <code><link linkend="element-name"
                           >&lt;name&gt;</link></code>. Because spaces in such attributes are
                     reserved to delimit multiple values, any space in a
                           <code><link linkend="element-name">&lt;name&gt;</link></code> must be
                     replaced with an underscore or hyphen. A <link linkend="attribute-which"
                           ><code>@which</code></link> in the host file, however, can take no more
                     than one value, so using spaces is fine. </para>
                  <para><emphasis role="bold"><code><link linkend="attribute-id">@id</link></code>
                        and <code><link linkend="attribute-xmlid">@xml:id</link></code> are
                        case-sensitive, and do not allow spaces. <link linkend="attribute-which"
                              ><code>@which</code></link> and therefore <code><link
                              linkend="element-name">&lt;name&gt;</link></code> are not
                        case-sensitive, and the space, hyphen, and underscore are
                        equivalent.</emphasis></para>
                  <para><emphasis role="bold">If you point to </emphasis><code><link
                           linkend="attribute-id"><emphasis role="bold"
                        >@id</emphasis></link></code><emphasis role="bold"> or
                           </emphasis><code><link linkend="attribute-xmlid"><emphasis role="bold"
                              >@xml:id</emphasis></link></code><emphasis role="bold"> you must
                        respect case and punctuation. If you are pointing to a
                           </emphasis><code><link linkend="element-name"><emphasis role="bold"
                              >&lt;name&gt;</emphasis></link></code><emphasis role="bold"> you can
                        ignore case, and you should probably replace the space with a
                     _.</emphasis></para>
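                  <para>The rules above can be sketched as follows. The file name, IRI, date, and
                     person are hypothetical, and the child elements shown inside <code><link
                           linkend="element-vocabulary">&lt;vocabulary></link></code> are an
                     assumption about its usual content; see the element's entry for the exact
                     model.</para>
                  <para>
                     <programlisting>&lt;!-- in head: link to a project TAN-voc file -->
&lt;vocabulary>
   &lt;IRI>tag:example.com,2020:tan-voc:project&lt;/IRI>
   &lt;name>Project vocabulary&lt;/name>
   &lt;location href="project.TAN-voc.xml" accessed-when="2020-06-01"/>
&lt;/vocabulary>

&lt;!-- in vocabulary-key: @which may keep its spaces, because it takes one value -->
&lt;person xml:id="park" which="Jenny Park"/>

&lt;!-- elsewhere: by id (case-sensitive) or by name (spaces replaced) -->
&lt;resp who="park" roles="editor"/>
&lt;resp who="jenny_park" roles="editor"/></programlisting>
                  </para>
                  <para>Under these assumptions, <code>who="Park"</code> would fail to match the
                     id, because ids are case-sensitive, whereas <code>who="Jenny-Park"</code>
                     would still match the name.</para>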
                  <para>TAN includes a number of standard vocabulary (TAN-voc) files for a variety
                     of concepts commonly used in textual scholarship (see <xref
                        linkend="vocabularies-master-list"/>). For example, there are more than one
                     hundred types of textual divisions that can be invoked simply by using their
                     names (see <xref xlink:href="#vocabularies-div-types"/>).</para>
                  <para><code><link linkend="element-vocabulary">&lt;vocabulary></link></code>
                     itself may take <link linkend="attribute-which"><code>@which</code></link>, but
                     only to point to one of the extra TAN vocabularies listed in <xref
                        xlink:href="#vocabularies-vocabularies"/>. This restriction avoids some
                     complexity in the validation routine. See <xref linkend="extra_n_vocabulary"/>
                     on how to use this feature.</para>
                  <para>Files pointed to by <code><link linkend="element-vocabulary"
                           >&lt;vocabulary></link></code> are considered an essential part of any
                     TAN file. Failure to find the target file will throw a fatal error during
                     validation.</para>
               </section>
               <section>
                  <title>Inclusions</title>
                  <para>Whereas vocabularies do not change the host document, inclusions do. Unlike
                     other forms of inclusion you might be familiar with, TAN inclusion is targeted
                     at select elements, <emphasis>never</emphasis> an entire file. TAN inclusion is
                     a two-step process. </para>
                  <para>First, a TAN file is linked to, and therefore made available for inclusion,
                     via <code><link linkend="element-inclusion">&lt;inclusion></link></code>s
                     (inside <link linkend="element-head"><code>&lt;head></code></link>). Like
                           <code><link linkend="element-vocabulary"
                     >&lt;vocabulary&gt;</link></code>, an <code><link linkend="element-inclusion"
                           >&lt;inclusion></link></code> does nothing on its own. It merely points
                     to a file that is eligible for inclusions. No actual inclusions occur until the
                     next step.</para>
                  <para>Second, select parts of the included file are invoked in the dependent file.
                     To do so, insert an element X in a valid location, but with nothing but
                           <code><link linkend="attribute-include">@include</link></code>, with one
                     or more values (space-delimited) pointing to the <code><link
                           linkend="attribute-xmlid">@xml:id</link></code> values of the <code><link
                           linkend="element-inclusion">&lt;inclusion></link></code>s desired. In the
                     validation process, that element will be replaced with every element X found in
                     the inclusion file, resolved recursively, and ignoring duplications (deeply
                     equal elements).</para>
                  <para>For example, a TAN-T file might have a <code>&lt;div
                     include="poem1"></code>. The validation routine will replace that element with
                     every rootmost <code><link linkend="element-div">&lt;div></link></code> in the
                     included file called <code>poem1</code>. </para>
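                  <para>The two steps might be sketched as follows. The IRI, name, file location,
                     and date are hypothetical, and the child elements of <code><link
                           linkend="element-inclusion">&lt;inclusion></link></code> are an
                     assumption about its usual content.</para>
                  <para>
                     <programlisting>&lt;head>
   &lt;!-- other head content omitted -->
   &lt;!-- step 1: make the file available for inclusion -->
   &lt;inclusion xml:id="poem1">
      &lt;IRI>tag:example.com,2020:poem1&lt;/IRI>
      &lt;name>Poem 1&lt;/name>
      &lt;location href="poem1.xml" accessed-when="2020-06-01"/>
   &lt;/inclusion>
&lt;/head>
&lt;body>
   &lt;!-- step 2: replaced during validation by every rootmost div in poem1.xml -->
   &lt;div include="poem1"/>
&lt;/body></programlisting>
                  </para>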
                  <para>Any host file that includes elements from another file inherits any
                     associated vocabulary, and along with it <code><link linkend="attribute-xmlid"
                           >@xml:id</link></code> values. This may result in errors if there are any
                     resultant conflicts in IDrefs. </para>
                  <para>TAN inclusion is very practical for texts. Textual works commonly nest
                     inside each other. By setting up your class-1 files as a series of inclusions,
                     you can reduce validation time, both in the file and in class-2 files that
                     depend upon the transcriptions. See the <code>examples</code> subdirectory for
                     a case of the Gospels including the Sermon on the Mount including the Lord's
                     Prayer. </para>
                  <para>The inclusion technique is also especially useful for vocabulary (TAN-voc)
                     files. A single master TAN-voc file can include other vocabulary files, each
                     devoted to a particular type of item (e.g., one for works, one for scripta).
                     Project files then need to link merely to the master TAN-voc file.</para>
                  <para>You can include a TAN file that itself includes other TAN files. Inclusion
                     is recursive. In any recursive system, circularity is fatal. That is true for
                     TAN inclusion as well, but only within the scope of specified element names. It
                     is perfectly legal for two files to include each other, as long as they do not
                     try to include (directly or indirectly) the same elements, or try to consult
                     each other to resolve any vocabulary.</para>
                  <para>Files pointed to by <code><link linkend="element-inclusion"
                           >&lt;inclusion></link></code> are considered an essential part of any TAN
                     file. Failure to find the target file will throw a fatal error during
                     validation.</para>
               </section>
               <section xml:id="other_related_files">
                  <title>Other Related Files</title>
                  <para>Other files can be specified. The more that are mentioned, the richer the
                     network. <code><link linkend="element-predecessor"
                        >&lt;predecessor&gt;</link></code> and <code><link
                           linkend="element-successor">&lt;successor&gt;</link></code> point to
                     versions of the file that precede and postdate it. </para>
                  <para><code><link linkend="element-source">&lt;source&gt;</link></code> is another
                     type of related file, but it may or may not link to another file. In class-2
                     files <code><link linkend="element-source">&lt;source&gt;</link></code> always
                     points to a class-1 TAN file. In class-1 and class-3 files, <code><link
                           linkend="element-source">&lt;source&gt;</link></code> may point either to
                     a file or to a scriptum (see <xref xlink:href="#domain_model"/>).</para>
                  <para><code><link linkend="element-see-also">&lt;see-also></link></code> can be
                     used to point to any file that has some relationship to a TAN file. The
                     required <code><link linkend="attribute-relationship"
                        >@relationship</link></code> points to one or more <code><link
                           linkend="element-relationship">&lt;relationship&gt;</link></code>
                     vocabulary items. There is no standard TAN vocabulary for relationships.
                     Normally, when a file-to-file relationship is considered important, it becomes
                     a full-fledged element.</para>
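                  <para>Because there is no standard vocabulary for relationships, a file that uses
                           <code><link linkend="element-see-also">&lt;see-also></link></code> will
                     normally need to define the relationship itself. The following is only a
                     sketch: every IRI, name, and location is hypothetical, and the content models
                     shown are assumptions to be checked against the element entries.</para>
                  <para>
                     <programlisting>&lt;see-also relationship="transl">
   &lt;IRI>tag:example.com,2020:poem1-english&lt;/IRI>
   &lt;name>English translation of Poem 1&lt;/name>
   &lt;location href="poem1-eng.xml" accessed-when="2020-06-01"/>
&lt;/see-also>

&lt;!-- in vocabulary-key: the locally defined relationship -->
&lt;relationship xml:id="transl">
   &lt;IRI>tag:example.com,2020:relationship:translation-of&lt;/IRI>
   &lt;name>translation of&lt;/name>
&lt;/relationship></programlisting>
                  </para>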
                  <para>Some TAN formats allow special types of related files (e.g., <code><link
                           linkend="element-redivision">&lt;redivision&gt;</link></code> and
                           <code><link linkend="element-model">&lt;model&gt;</link></code> for
                     class-1 files). See metadata descriptions under specific classes or formats. </para>
               </section>
            </section>
            <section xml:id="adjustments">
               <title>Adjustments</title>
               <para>The fourth major section of <code><link linkend="element-head"
                     >&lt;head></link></code>, which is optional, consists of <code><link
                        linkend="element-adjustments">&lt;adjustments></link></code>, which
                  specifies changes that have been made (class 1), or should be made (class 2), to
                  the sources. </para>
               <para>In class-1 files, these consist of <code><link linkend="element-normalization"
                        >&lt;normalization></link></code>s and <code><link linkend="element-replace"
                        >&lt;replace&gt;</link></code>s; see <xref
                     xlink:href="#normalizing_transcriptions"/>. </para>
               <para>Class-2 files allow <code><link linkend="element-skip"
                     >&lt;skip&gt;</link></code>, <code><link linkend="element-rename"
                        >&lt;rename&gt;</link></code>, <code><link linkend="element-equate"
                        >&lt;equate&gt;</link></code>, and <code><link linkend="element-reassign"
                        >&lt;reassign&gt;</link></code> as adjustments; see <xref
                     xlink:href="#class_2_metadata"/>.</para>
            </section>
            <section>
               <title>Local vocabulary items and ID assignments: <code><link
                        linkend="element-vocabulary-key">&lt;vocabulary-key></link></code></title>
               <para>The fifth major part of <code><link linkend="element-head"
                     >&lt;head></link></code>, <code><link linkend="element-vocabulary-key"
                        >&lt;vocabulary-key></link></code>, allows you to declare any specific
                  vocabulary items relevant for the file. It also allows you to take vocabulary
                  items existing in other TAN-voc files (whether defined in <code><link
                        linkend="element-vocabulary">&lt;vocabulary&gt;</link></code> or standard
                  TAN vocabulary), and assign them <code><link linkend="attribute-xmlid"
                        >@xml:id</link></code>s that are valid only in the current file. Anything in
                        <code><link linkend="element-vocabulary-key"
                     >&lt;vocabulary-key></link></code> will overwrite default TAN vocabulary, but
                  not any TAN-voc files pointed to via <code><link linkend="element-vocabulary"
                        >&lt;vocabulary&gt;</link></code>.</para>
               <para>These id assignments can be supplemented with <code><link
                        linkend="element-alias">&lt;alias&gt;</link></code>es, which are used to
                  assign an id to one or more ids. This practice resembles what text editors do when
                  naming groups of manuscripts. Each manuscript is given a siglum, say a single
                  lowercase Greek or Latin letter, and the manuscripts are grouped together into
                  families, with each family given its own siglum, say an uppercase letter. If the
                  editor wishes to indicate that a whole family of manuscripts departs from a
                  particular reading, the family siglum is all that is needed. An <code><link
                        linkend="element-alias">&lt;alias&gt;</link></code> works much the same way,
                  and can be used for any vocabulary items. For example, if a textual division can
                  be legitimately called both a rubric and a heading, you could assign
                     <code>rubr</code> and <code>hd</code> as ids in the <code><link
                        linkend="element-vocabulary-key">&lt;vocabulary-key></link></code> to the
                  vocabulary items for the rubric and the heading, and then insert <code>&lt;alias
                     xml:id="rubrichead" idrefs="rubr hd"&gt;</code>. Then, in that file,
                     <code>&lt;div n="1" type="rubrichead"></code> would identify that <code><link
                        linkend="element-div">&lt;div></link></code> as being both a rubric and a
                  heading.</para>
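               <para>Laid out in full, the rubric and heading example might look like this. It
                  assumes that the vocabulary in use has division types named <code>rubric</code>
                  and <code>heading</code>.</para>
               <para>
                  <programlisting>&lt;vocabulary-key>
   &lt;div-type xml:id="rubr" which="rubric"/>
   &lt;div-type xml:id="hd" which="heading"/>
   &lt;!-- one id that stands for both division types -->
   &lt;alias xml:id="rubrichead" idrefs="rubr hd"/>
&lt;/vocabulary-key>

&lt;!-- in body -->
&lt;div n="1" type="rubrichead">Text of the division&lt;/div></programlisting>
               </para>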
               <para>Unlike other similar attributes, the <code><link linkend="attribute-idrefs"
                        >@idrefs</link></code> of an <code><link linkend="element-alias"
                        >&lt;alias&gt;</link></code> cannot point to the <code><link
                        linkend="element-name">&lt;name&gt;</link></code> value of vocabulary items.
                  They can point only to the id references of locally defined instances of
                        <code><link linkend="attribute-xmlid">@xml:id</link></code>. This
                  restriction reduces confusion, and avoids some complexity in the resolution and
                  validation of a TAN file.</para>
               <para><code><link linkend="element-alias">&lt;alias&gt;</link></code>es may recurse,
                  as long as there is no circularity. That is, <code><link
                        linkend="attribute-idrefs">@idrefs</link></code> in an <code><link
                        linkend="element-alias">&lt;alias&gt;</link></code> may refer to any
                        <code><link linkend="attribute-xmlid">@xml:id</link></code> or <code><link
                        linkend="attribute-id">@id</link></code>, whether it belongs to a vocabulary
                  item or to another <code><link linkend="element-alias">&lt;alias&gt;</link></code>.</para>
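               <para>Recursion might be sketched as follows. The ids <code>paratext</code> and
                     <code>colophon</code> are hypothetical, and <code>colophon</code> is assumed
                  to be defined elsewhere in the same <code>&lt;vocabulary-key&gt;</code>.</para>
               <programlisting><![CDATA[<!-- A first alias groups two vocabulary items. -->
<alias xml:id="rubrichead" idrefs="rubr hd"/>

<!-- A second alias points to the first alias and to one more vocabulary item,
     so it resolves to three items: rubr, hd, and colophon. -->
<alias xml:id="paratext" idrefs="rubrichead colophon"/>

<!-- Circular, and therefore not permitted:
     <alias xml:id="a" idrefs="b"/>
     <alias xml:id="b" idrefs="a"/> -->]]></programlisting>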
               <para>In most cases <code><link linkend="element-alias">&lt;alias&gt;</link></code>
                  should refer to items of the same type. In a few situations mixed groups do not
                  pose a problem, for example mixing <code><link linkend="element-person"
                        >&lt;person&gt;</link></code>s, <code><link linkend="element-algorithm"
                        >&lt;algorithm&gt;</link></code>s, and <code><link
                        linkend="element-organization">&lt;organization&gt;</link></code>s. TAN
                  validation will indicate whether mixed typology introduces errors.</para>
               <para>Because <code><link linkend="attribute-xmlid">@xml:id</link></code> may not
                  contain certain types of characters, such as common punctuation marks, and because
                        <code><link linkend="element-alias">&lt;alias&gt;</link></code> must be able
                  to coin unusual ids (especially for grammatical features), <code><link
                        linkend="attribute-id">@id</link></code> may be used instead of <code><link
                        linkend="attribute-xmlid">@xml:id</link></code> in <code><link
                        linkend="element-alias">&lt;alias&gt;</link></code>.</para>
            </section>
            <section xml:id="responsibility">
               <title>Responsibility</title>
               <para>The sixth section of a <code><link linkend="element-head"
                     >&lt;head></link></code> declares who is responsible for the file. It consists
                  of a <code><link linkend="element-file-resp">&lt;file-resp&gt;</link></code> and
                  one or more <code><link linkend="element-resp">&lt;resp></link></code>s. The
                  persons, organizations, or algorithms pointed to in <code><link
                        linkend="element-file-resp">&lt;file-resp&gt;</link></code> must include at
                  least one who has a tag URN whose namespace matches the namespace in the tag URN
                  of the root element's <code><link linkend="attribute-id">@id</link></code>. </para>
               <para>This requirement strengthens the effort to make sure that each TAN file is
                  associated with the person or persons who are or were responsible for the file.
                        <code><link linkend="element-person">&lt;person></link></code>s so
                  identified by <code><link linkend="element-file-resp"
                     >&lt;file-resp&gt;</link></code> are called primary agents, and are bound to
                  the global variable <code><link linkend="variable-primary-agents"
                        >$primary-agents</link></code>. If a claim is made in a TAN file, and no
                        <code><link linkend="attribute-claimant">@claimant</link></code> is
                  explicitly declared, it is assumed that the <code><link
                        linkend="variable-primary-agents">$primary-agents</link></code> are making
                  the claim.</para>
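               <para>The namespace requirement can be illustrated with a short sketch. The
                  attributes <code>who</code> and <code>roles</code>, the ids, the role
                     <code>creator</code> (assumed to be defined as a vocabulary item elsewhere),
                  and the tag URNs are all assumptions made for illustration. Other required parts
                  of the file are omitted.</para>
               <programlisting><![CDATA[<!-- The tag URN namespace of the root @id is "example.com,2020". -->
<TAN-T id="tag:example.com,2020:sample-transcription">
   <head>
      <!-- ...earlier sections of the head omitted... -->
      <vocabulary-key>
         <!-- This person's tag URN shares the namespace "example.com,2020",
              so the person can satisfy the requirement. -->
         <person xml:id="editor1">
            <IRI>tag:example.com,2020:self</IRI>
            <name>Example Editor</name>
         </person>
      </vocabulary-key>
      <!-- editor1 becomes a primary agent of the file. -->
      <file-resp who="editor1"/>
      <resp who="editor1" roles="creator"/>
      <!-- ...later sections of the head omitted... -->
   </head>
   <!-- ...body omitted... -->
</TAN-T>]]></programlisting>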
            </section>
            <section>
               <title>Change log</title>
               <para>The change log, the seventh section of the <code><link linkend="element-head"
                        >&lt;head></link></code>, consists of one or more <code><link
                        linkend="element-change">&lt;change></link></code>s, which provide a partial
                  history of the file. The entire history is calculated from every attribute that
                  has a date or dateTime value, and it can be fetched via the function <code><link
                        linkend="function-get-doc-history">tan:get-doc-history</link>()</code> or
                  the global variable <code><link linkend="variable-doc-history"
                     >$doc-history</link></code>.</para>
               <para>The change log is an effective way to communicate with those who might use your
                  files. In all likelihood, a user will download a local copy from the master
                  location. You might later make changes or updates to your master copy. Anyone
                  depending upon a copy will be warned, during Schematron validation, of each
                        <code><link linkend="element-change">&lt;change></link></code> that postdates
                  the value of their <code><link linkend="attribute-accessed-when"
                     >@accessed-when</link></code>. If you have introduced an important or
                  disruptive change, you can mark your <code><link linkend="element-change"
                        >&lt;change></link></code> with <code><link linkend="attribute-flag"
                        >@flag</link></code>, which allows the following values: <code>warning</code>
                  (default value), <code>error</code>, <code>info</code>, <code>fatal</code>. By
                  marking a change as <code>info</code>, you lower the level of a change's
                  importance; <code>error</code> raises it. The value <code>fatal</code> will halt
                  the validation process in the dependent file altogether.</para>
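               <para>A change log using these flags might be sketched as follows. The attributes
                     <code>when</code> and <code>who</code>, the id <code>editor1</code>, the dates,
                  and the messages are assumptions made for illustration.</para>
               <programlisting><![CDATA[<!-- Minor correction: flagged as info, to lower its importance. -->
<change when="2020-02-10" who="editor1" flag="info">Corrected two typographical
   errors in division 3.</change>

<!-- No @flag, so the default value, warning, applies. -->
<change when="2020-03-01" who="editor1">Revised the punctuation of division 5.</change>

<!-- Disruptive change: flagged as error, to raise its importance. -->
<change when="2020-04-15" who="editor1" flag="error">Renumbered the divisions of
   the second half; references in dependent files must be checked.</change>]]></programlisting>
               <para>Under these assumptions, a dependent file whose <code>@accessed-when</code> is
                     <code>2020-02-15</code> would be notified of the second and third changes but
                  not the first, because only those two postdate that value.</para>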
               <para>If you receive change messages during validation, and you want to stop them,
                  merely update the value of <code><link linkend="attribute-accessed-when"
                        >@accessed-when</link></code> to the current date.</para>
            </section>
            <section>
               <title>Pending work</title>
               <para>The last section of a <code><link linkend="element-head"
                     >&lt;head></link></code> lists all pending tasks that yet need to be applied to
                  a file. These are itemized as a list of <code><link linkend="element-comment"
                        >&lt;comment></link></code>s in <code><link linkend="element-to-do"
                        >&lt;to-do&gt;</link></code>. A file with an empty <code><link
                        linkend="element-to-do">&lt;to-do&gt;</link></code> is assumed to be no
                  longer in progress, so there must be a <code><link
                        linkend="element-master-location">&lt;master-location></link></code>
                  provided.</para>
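               <para>The two states might be sketched as follows. The attributes <code>when</code>
                  and <code>who</code>, the id <code>editor1</code>, and the text of the comment are
                  assumptions made for illustration.</para>
               <programlisting><![CDATA[<!-- A file still in progress: each pending task is a comment. -->
<to-do>
   <comment when="2020-03-01" who="editor1">Divisions 10 through 12 still need to
      be proofread against the scriptum.</comment>
</to-do>

<!-- A file with no pending tasks: the element is present but empty. -->
<to-do/>]]></programlisting>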
               <para>Like the change log, the <code><link linkend="element-to-do"
                        >&lt;to-do&gt;</link></code> effectively communicates cautionary notes to
                  those who might use your files. Anyone depending upon a copy will be warned,
                  during Schematron validation, of each item in the list. The report is not
                  dependent upon when the file was last consulted (<code><link
                        linkend="attribute-accessed-when">@accessed-when</link></code>), because
                  these are standing, unresolved issues. </para>
               <para>One benefit of <code><link linkend="element-to-do">&lt;to-do&gt;</link></code>
                  is that a clear account of what remains to be done will encourage people to
                  release their material earlier than normal, because other users will have fair
                  warning about what is imperfect or incomplete.</para>
            </section>
         </section>
      </chapter>
      <chapter xml:id="class_1">
         <title>Class-1 TAN Files, Representations of Textual Objects (Scripta)</title>
         <para>This chapter provides general background to class-1 TAN files and their elements and
            attributes. For detailed discussion see <xref linkend="elements-attributes-and-patterns"
            />.</para>
         <para>Class-1 TAN files preserve segmented transcriptions of books, manuscripts, papyri,
            stones, or any other objects with writing on them—collectively termed here
               <emphasis>scripta</emphasis> (sg. <emphasis>scriptum</emphasis>). Class-1 files are
            the foundation of any TAN project. No TAN-A-tok or TAN-A-lm file can be created without
            at least one class-1 file. </para>
         <para>There are two types of class-1 formats, identified by the root element. <code><link
                  linkend="element-TAN-T">&lt;TAN-T></link></code> is a simple, generic format, as
            close as one can get to plain text. <code>&lt;TEI></code> (also referred to in this
            manual as TAN-TEI), on the other hand, can be complex and highly expressive. Because
            the two formats function almost identically, the generic TAN-T format is described
            first, followed by supplemental comments on TAN-TEI.</para>
         <section xml:id="transcription_principles">
            <title>Principles and Assumptions</title>
            <section>
               <title>General</title>
               <para>(For more general principles and assumptions applying to all TAN files, not
                  just class 1, see <xref linkend="design_principles"/>.)</para>
               <para>Class-1 formats are designed for faithful but judiciously normalized digital
                  transcriptions. Each TAN-T(EI) file is devoted exclusively to a single version of
                  a single work found in a single scriptum (text-bearing object), segmented and
                  uniquely labeled with a (preferably familiar) reference system. </para>
               <para>Editors of TAN-T(EI) files should be able to read, write, and proofread texts
                  in the languages of the transcriptions. They should understand the texts well
                  enough to segment them and label them according to the conventions used for those
                  works. They should be able to distinguish the text of a primary source from its
                  editorial apparatus. They should be familiar with normalizing conventions for
                  texts from the period, language, and culture. They should know how the
                  transcription might be used in other contexts, especially translation studies or a
                  study of quotations.</para>
               <para>Editors need not understand everything about their texts, and they need not
                  have any specialized skill in grammar or lexicography. They need not know the
                  morphology of individual words, or how individual parts of the text have been
                  translated. Those skills are more profitably spent editing other TAN formats. </para>
               <para>TAN-T(EI) editors stand at the foundation level of the Text Alignment Network.
                  Because other files will depend upon them, careful proofreading is important.
                  Eliminating as many typographical errors as possible before publication will
                  maximize the utility of a TAN-T(EI) file. On the other hand, TAN has been designed
                  with the assumption that most files in circulation have typographical errors that
                  can and should be corrected as they are found. If you are aware that a text needs
                  proofreading, but you still want to make it available, simply leave a <code><link
                        linkend="element-comment">&lt;comment></link></code> in the <code><link
                        linkend="element-to-do">&lt;to-do&gt;</link></code> part of the <link
                     linkend="element-head"><code>&lt;head></code></link>.</para>
               <para>If you are creating a TAN-T(EI) file, you are doing so primarily to facilitate
                  alignment and annotation, which requires use of a suitable reference system (see
                     <link linkend="reference_system">reference systems</link>). Transcription files
                  should be segmented and labeled according to a reference system that is familiar
                  and can be easily applied to other versions of the same text in other languages.
                  If possible, semantic mileposts (clauses, sentences, paragraphs, chapters) should
                  be prioritized over visual ones (lines, columns, pages, volumes). Any transcription can
                  be furnished with multiple reference systems, but it is advisable to do so on the basis
                  of separate files, linked by <code><link linkend="element-redivision"
                        >&lt;redivision&gt;</link></code>s in the <code><link linkend="element-head"
                        >&lt;head></link></code>. See <xref linkend="reference_system"/>.</para>
            </section>
            <section xml:id="domain_model">
               <title>Domain model</title>
               <para>Contributors and users of TAN files must sharply distinguish between a scriptum
                  (text-bearing object) and a conceptual work, e.g., between a specific printed copy
                  of the <emphasis>Iliad</emphasis> and the <emphasis>Iliad</emphasis> conceived
                  generally. The former has materiality (digital files are treated here as being
                  material) and the latter does not. Even though both are constitutively necessary
                  for any transcription, the two are always differentiated in the TAN format:
                        <code><link linkend="element-source">&lt;source&gt;</link></code> and
                        <code><link linkend="attribute-src">@src</link></code> point to physical
                  exemplars; <code><link linkend="element-work">&lt;work&gt;</link></code>,
                        <code><link linkend="attribute-work">@work</link></code>, and <code><link
                        linkend="element-version">&lt;version></link></code> to the conceptual.
                  Adherence to this distinction is quite important.</para>
               <para>Some readers may be reminded at this point of the domain model defined by the
                  Functional Requirements for Bibliographic Records (FRBR), which identifies in
                  its Group 1 (Products of intellectual &amp; artistic endeavor) four types of
                  entities: <emphasis>work</emphasis>, <emphasis>expression</emphasis>,
                     <emphasis>manifestation</emphasis>, and <emphasis>item</emphasis>. A work is "a
                  distinct intellectual or artistic creation" and an expression is the conceptual,
                  immaterial realization of a work. Both <emphasis>work</emphasis> and
                     <emphasis>expression</emphasis> are terms for conceptual, non-material
                  entities. A manifestation, on the other hand, is "the physical embodiment of an
                  expression" and an item is a single exemplar of a manifestation. <note>
                     <para>Quotations in this section come from International Federation of Library
                        Associations and Institutions, <emphasis>Functional Requirements for
                           Bibliographic Records: Final Report</emphasis>, amended and corrected
                        (February 2009), <link xlink:href="http://www.ifla.org/VII/s13/frbr/"
                        />.</para>
                  </note></para>
               <table frame="all">
                  <title>Examples of FRBR Group 1 Entities</title>
                  <tgroup cols="4">
                     <colspec colname="c2" colnum="1" colwidth="1*"/>
                     <colspec colname="c3" colnum="2" colwidth="1*"/>
                     <colspec colname="c4" colnum="3" colwidth="1*"/>
                     <colspec colname="c5" colnum="4" colwidth="1*"/>
                     <thead>
                        <row>
                           <entry>Work</entry>
                           <entry>Expression</entry>
                           <entry>Manifestation</entry>
                           <entry>Item</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry><emphasis>Iliad</emphasis></entry>
                           <entry>Caroline Alexander's English translation of the
                                 <emphasis>Iliad</emphasis>.</entry>
                           <entry>the print run identified with ISBN 978-0062046284</entry>
                           <entry>A specific copy</entry>
                        </row>
                        <row>
                           <entry>The Psalms</entry>
                           <entry>The (Hebrew) Masoretic Psalter</entry>
                           <entry>The 1820 printing of George Offor's edition of the Hebrew
                              Psalms</entry>
                           <entry>Biblioteca Palatina Cod. Parm. 1699</entry>
                        </row>
                        <row>
                           <entry><emphasis>A River Runs Through It</emphasis></entry>
                           <entry>
                              <para>Norman Maclean's original version</para>
                              <para>The 1992 film version</para>
                           </entry>
                           <entry>
                              <para>Print run ISBN 0226500608</para>
                              <para>Blu-ray disc UPC code 004339632533</para>
                           </entry>
                           <entry>
                              <para>Author's personal print copy</para>
                              <para>Reference print CGB 7432-7438 (deposited in the Library of
                                 Congress)</para>
                           </entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table>
               <para>TAN's domain model differs slightly. The most important difference is the
                  abandonment of FRBR's <emphasis>expression</emphasis>, which was considered
                  problematic in the development of sample TAN data. The term
                     <emphasis>expression</emphasis> was intended to describe a conceptual,
                  non-material entity, but the FRBR guidelines defined and explained it in vague or
                  material terms. <note>
                     <para>"<emphasis>Expression</emphasis> encompasses, for example, the <emphasis
                           role="bold">specific</emphasis> words, sentences, paragraphs, etc. that
                        result from the realization of a work <emphasis role="bold">in the form of a
                           text</emphasis>....defined, however, so as to exclude aspects of physical
                        form, such as typeface and page layout, <emphasis role="bold">that are not
                           integral</emphasis> to the intellectual or artistic <emphasis role="bold"
                           >realization</emphasis> of the work as such." (ibid., p. 19, emphasis
                        added) That is, <emphasis>expression</emphasis> includes integral aspects of
                        physical form (e.g., typeface that <emphasis>is</emphasis> integral to the
                        realization). "Inasmuch as <emphasis role="bold">the form of expression is
                           an inherent characteristic of the expression</emphasis>, any change in
                        form (e.g., from alpha-numeric notation to spoken word) results in a new
                        expression." (p. 20, emphasis added)</para>
                  </note>Even the very term <emphasis>expression</emphasis> and FRBR's preferred
                  synonym, <emphasis>realization</emphasis>, imply materiality (without which
                  nothing can be expressed or realized). Further, FRBR's
                     <emphasis>expression</emphasis> does not easily handle creative adaptations of
                  works that are themselves arguably works in their own right. For example,
                  Euripides' <emphasis>Medea</emphasis> was adapted several centuries later by
                  Seneca the Younger. Seneca's <emphasis>Medea</emphasis> is arguably merely an
                  expression, but has itself been subject to various editions and performances,
                  i.e., expressions. But FRBR does not accommodate expressions of expressions. If
                  Seneca's <emphasis>Medea</emphasis> is treated as a work in its own right, its
                  expression relationship to Euripides' original is lost, since FRBR does not
                  accommodate works that are expressions of other works.</para>
               <para>In the TAN domain model, <emphasis>expression</emphasis> is altogether dropped.
                  There is only one type of conceptual, non-material entity, namely, a work.</para>
               <para>The term <emphasis>version</emphasis> in TAN is applied to a work that
                  substantially follows but varies from another work, e.g., translations and
                  adaptations. But such versions are themselves still works. One work is indicated
                  to be the version of another in a class-1 file through the <code><link
                        linkend="element-work">&lt;work></link></code> and <code><link
                        linkend="element-version">&lt;version></link></code> declarations.</para>
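               <para>A pair of such declarations might be sketched as follows, assuming that both
                  elements take the IRI + name pattern. The tag URN is a hypothetical placeholder,
                  and the DBpedia IRI is given only as an illustration of a pre-existing
                  identifier for the work.</para>
               <programlisting><![CDATA[<!-- The conceptual work, conceived generally. -->
<work>
   <IRI>http://dbpedia.org/resource/Iliad</IRI>
   <name>Iliad</name>
</work>

<!-- The version: itself a work, one that follows but varies from the work above.
     The tag URN is invented for this sketch. -->
<version>
   <IRI>tag:example.com,2020:version:iliad:alexander-english</IRI>
   <name>Caroline Alexander's English translation of the Iliad</name>
</version>]]></programlisting>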
               <para>As for material entities, FRBR's <emphasis>manifestation</emphasis> and
                     <emphasis>item</emphasis> are combined in TAN through the term
                     <emphasis>scriptum</emphasis>. A scriptum is a text-bearing object, e.g., book,
                  manuscript, pamphlet, tombstone, traffic sign, digital file (digital media is
                  interpreted as being material). When <emphasis>scriptum</emphasis> is used in a
                  TAN file, it points either to a single physical item or to a set of physical items
                  that are for all intents and purposes indistinguishable (i.e., a scriptum
                  reproduced mechanically). A scriptum that points to a manuscript points only to
                  that one particular manuscript. But a scriptum that points to a printed book or a
                  digital file is understood as applying to all copies of that printed book or
                  digital file. </para>
               <para>There is at present no formal mechanism to specify whether a scriptum points to
                  one object or a set of objects. The distinction must be inferred from a scriptum's
                  IRI + name pattern. In cases of potential ambiguity, it is up to creators of a TAN
                  file to assign to the scriptum IRIs that avoid confusion. For example, to point to
                  Edward Gibbon's personally annotated copy of the 1763 edition of Herodotus (now
                  held by the Wren Library, Trinity College, Cambridge University), one should not
                  use <link xlink:href="https://lccn.loc.gov/92189906"/> or <link
                     xlink:href="http://www.worldcat.org/oclc/27188122"/>, which point to the set of
                  all copies. In this case, one may need to mint their own IRI, based on the Wren
                  Library's acquisition number, RW.50.15.</para>
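               <para>The contrast might be sketched as two alternative <code>&lt;source&gt;</code>
                  declarations, each belonging to a different file. The tag URN and the wording of
                  the names are hypothetical, coined only for illustration; the two catalogue IRIs
                  are those cited above.</para>
               <programlisting><![CDATA[<!-- Alternative 1: the scriptum is the set of all copies of the 1763 edition. -->
<source>
   <IRI>https://lccn.loc.gov/92189906</IRI>
   <IRI>http://www.worldcat.org/oclc/27188122</IRI>
   <name>Herodotus, 1763 edition</name>
</source>

<!-- Alternative 2: the scriptum is one particular copy, so a new IRI is minted.
     The tag URN is invented for this sketch. -->
<source>
   <IRI>tag:example.com,2020:scriptum:wren-library:RW.50.15</IRI>
   <name>Edward Gibbon's annotated copy of the 1763 edition of Herodotus (Wren
      Library, RW.50.15)</name>
</source>]]></programlisting>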
               <para>In summary, the TAN domain model defines two kinds of entities: works and
                  scripta. Works, which are immaterial, conceptual entities, may contain other
                  works, or they may be versions of other works (or work-versions). Scripta, which
                  are material entities, may contain other scripta, and they may refer either to a
                  single object or to a set of copies. A work may be instantiated in many scripta,
                  and similarly, any scriptum may contain many works. Most work-scriptum
                  relationships can be inferred from the <code><link linkend="element-head"
                        >&lt;head></link></code> of a class-1 file, and they may be expressed in a
                        <code><link linkend="element-TAN-A">&lt;TAN-A></link></code> file.</para>
               <table frame="all">
                  <title>Examples of TAN Entities</title>
                  <tgroup cols="2">
                     <colspec colname="c3" colnum="1" colwidth="1*"/>
                     <colspec colname="c4" colnum="2" colwidth="1*"/>
                     <thead>
                        <row>
                           <entry>Work</entry>
                           <entry>Scriptum</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry>
                              <para><emphasis>Iliad</emphasis></para>
                              <para>Caroline Alexander's English translation of the
                                    <emphasis>Iliad</emphasis>.</para>
                           </entry>
                           <entry>
                              <para>the print run identified with ISBN 978-0062046284</para>
                              <para>a specific copy</para>
                           </entry>
                        </row>
                        <row>
                           <entry>
                              <para>The Psalms</para>
                              <para>The (Hebrew) Masoretic Psalter</para>
                           </entry>
                           <entry>
                              <para>The 1820 printing of George Offor's edition of the Hebrew
                                 Psalms</para>
                              <para>Biblioteca Palatina Cod. Parm. 1699</para>
                           </entry>
                        </row>
                        <row>
                           <entry>
                              <para>Norman Maclean's <emphasis>A River Runs Through
                                 It</emphasis></para>
                              <para>The 1992 film <emphasis>A River Runs Through
                                 It</emphasis></para>
                           </entry>
                           <entry>
                              <para>Print run ISBN 0226500608</para>
                              <para>Author's personal print copy</para>
                              <para>Blu-ray disc UPC code 004339632533</para>
                              <para>Reference print CGB 7432-7438 (deposited in the Library of
                                 Congress)</para>
                           </entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table>
            </section>
            <section>
               <title>One Version, One Work, One Object, One Reference System</title>
               <para><emphasis>Every TAN-T(EI) file must be restricted to a transcription of a
                     single version of a single work found on a single scriptum, segmented and
                     labeled according to a single reference system</emphasis>. </para>
               <para>The principle above is critical to the success of the network. It reduces
                  the risk of confusion and simplifies the files. It follows the generally advisable
                  principle that master data should be disaggregated.</para>
               <section xml:id="textual_objects">
                  <title>One Scriptum</title>
                  <para>Each TAN-T(EI) file must transcribe one and only one text-bearing object or
                     scriptum. It may be a digital file, a book, a manuscript, a stone, a sign, or a
                     bottlecap. If the object you've chosen has been made mechanically and is
                     virtually indistinguishable from other objects created by the same process
                     (e.g., copies of a printed book or copies of a digital file), then the entire
                     set of copies (what some librarians call a <emphasis>manifestation</emphasis>)
                     is to be regarded as the scriptum. </para>
                  <para>Identifying and naming a scriptum might require an editor's discernment and
                     judgment. For example, some manuscripts have been split up, their parts now
                     residing in multiple libraries around the world; other manuscripts are
                     composites, made of several manuscripts. In such cases, you may need to define
                     your scriptum in a way that might not match the way others define it. But the
                     decision is your prerogative, not theirs. You have both the right and
                     responsibility to define your object in the way that you think will most
                     benefit users of your files.</para>
                  <para>The scriptum is declared via <code><link linkend="element-source"
                           >&lt;source></link></code>, which either takes the IRI + name pattern, or
                     points to a <code><link linkend="element-scriptum"
                        >&lt;scriptum&gt;</link></code> vocabulary item. It is a good idea to name
                     your scriptum with an <code><link linkend="element-IRI">&lt;IRI></link></code>
                     value in the form of an <code>http</code> URL that points to a detailed entry
                     in a library catalogue. Doing so allows users to retrieve extensive, structured
                     bibliographical information. You also save yourself the hassle of having to
                     write a detailed, structured bibliographical description. If a URL cannot be
                     found for <code><link linkend="element-IRI">&lt;IRI></link></code>, you may
                     simply coin a tag URN or a UUID. Alternatively, if you find another TAN file
                     that uses the same scriptum-source, incorporate its <code><link
                           linkend="element-name">&lt;name&gt;</link></code>s and <code><link
                           linkend="element-IRI">&lt;IRI></link></code>s with your own (multiple
                           <code><link linkend="element-name">&lt;name&gt;</link></code>s and
                           <code><link linkend="element-IRI">&lt;IRI></link></code>s are a
                     virtue).</para>
                  <para>If you need to specify exactly where on a scriptum a work-version appears
                     (e.g., page range), <code><link linkend="element-comment"
                        >&lt;comment></link></code> or <code><link linkend="element-desc"
                           >&lt;desc&gt;</link></code> should be used.</para>
               </section>
               <section xml:id="conceptual_works">
                  <title>One Work</title>
                  <para>The transcription must be restricted to a single creative work, identified
                     by <code><link linkend="element-work">&lt;work></link></code> (part of the
                     declarations section of <code><link linkend="element-head"
                        >&lt;head></link></code>). </para>
                  <para>Many scripta have more than one work. Identifying the creative work you
                     transcribe is, once again, your prerogative. Suppose the scriptum you have is a
                     Bible. You define the work. Perhaps you wish to encode the entire Bible and
                     treat it as a single work. Or maybe you wish to treat only the New Testament as
                     the work, or the Tetraevangelion, or the Gospel of Matthew, or a specific
                     episode in that gospel, or merely the Beatitudes. Use whichever work you like,
                     but make sure that the TAN-T(EI) file contains nothing but the work you have
                     declared. It should be a complete representation of what is found on the
                     object, even if only partially preserved, and respect as far as is practical
                     the order of the text in the scriptum.</para>
                  <para>The requirement to provide the entirety of the work-version on the scriptum
                     is a significant departure from the fourth principle of <xref
                        xlink:href="#assumptions_creating_data"/>. <emphasis>Users should be able to
                        assume that the transcription in a class-1 file covers the entirety of the
                        work-version chosen, within the particular scriptum</emphasis>. If you are
                     aware that the transcription is incomplete, leave a <code><link
                           linkend="element-comment">&lt;comment></link></code> to that effect in
                     the <code><link linkend="element-head">&lt;head></link></code>'s <code><link
                           linkend="element-to-do">&lt;to-do&gt;</link></code>, identifying which
                     portions are missing from the transcription.</para>
                  <para>Well-known works may have a suitable IRI already assigned to them, say by
                     means of a <link xlink:href="http://wiki.dbpedia.org/About">DBpedia</link>
                     entry. Most works have not been assigned IRIs or are named in IRI vocabularies
                     that are not well known. You may assign any work your own URN, through a UUID
                     or a tag URN. </para>
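                  <para>As a minimal sketch, a work identified by a self-coined tag URN might be
                     declared along the following lines. The domain <code>example.com</code>, the
                     date, and the name are hypothetical placeholders, and the exact children
                     permitted should be checked against the definition of <code><link
                           linkend="element-work">&lt;work></link></code>.</para>
                  <programlisting>&lt;work>
   &lt;!-- hypothetical tag URN, coined by the owner of example.com -->
   &lt;IRI>tag:example.com,2014:work:gospel-of-matthew&lt;/IRI>
   &lt;name>Gospel of Matthew&lt;/name>
&lt;/work></programlisting>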
               </section>
               <section xml:id="work-versions">
                  <title>One Version</title>
                  <para>The transcription must be restricted to a single version of the creative
                     work, identified perhaps by <code><link linkend="element-version"
                           >&lt;version></link></code> (part of the declarations section of
                           <code><link linkend="element-head">&lt;head></link></code>). In most
                     cases, <code><link linkend="element-version">&lt;version></link></code> is
                     unnecessary, because <code><link linkend="element-work">&lt;work></link></code>
                     in conjunction with <code><link linkend="element-source"
                        >&lt;source></link></code> is sufficient to identify a
                     particular work-version. But if the source carries multiple versions (e.g., a
                     bilingual edition of a text), then <code><link linkend="element-version"
                           >&lt;version></link></code> should be included, to specify which version
                     has been transcribed. <code><link linkend="element-version"
                        >&lt;version></link></code> can also be used to declare explicitly that the
                     work mentioned in <code><link linkend="element-version"
                        >&lt;version></link></code> is a version of the work mentioned in
                           <code><link linkend="element-work">&lt;work></link></code>.</para>
                  <para>If you have a scriptum with multiple versions of a work, and you wish to
                     transcribe them all, each version should be in its own separate TAN-T(EI) file. </para>
                  <para>There may be cases where individual textual divisions are repeated, not so
                     much because they represent a different version, but because they are variants
                     that are integral to the work-version chosen. Creating a separate file for such
                     individual cases would be both impractical and misleading. The standard TAN
                     vocabulary for div types includes <code>variant</code>,
                     which may be used to wrap every variant in its own <code><link
                           linkend="element-div">&lt;div></link></code>, e.g., </para>
                  <programlisting>. . . . .
&lt;div type="title" n="title">
   &lt;div type="variant" n="orig">The Place&lt;/div>
   &lt;div type="variant" n="subscript" xml:lang="grc">Ὁ Τόπος&lt;/div>
&lt;/div>
. . . . .</programlisting>
                  <para>Notes should be included only if they are an integral part of the primary
                     work (i.e., by the same author, not by a later editor). If you think the notes
                     to a work are important, and legitimately a work in their own right, consider
                     putting them in their own TAN-T(EI) file, or converting them to claims in a
                     TAN-A file.</para>
                  <para>Very few work-versions have IRIs. It is advisable to assign a tag URN or a
                     UUID. If the IRI you have used for <code><link linkend="element-work"
                           >&lt;work></link></code> is in a namespace that you own or control, then
                     you are entitled to modify it, and you may wish merely to add a suffix to the
                     work IRI. For example, you might have
                        <code>tag:example.com,2001:work:a</code> defined for the work; a 1987
                     German translation might be specified as
                        <code>tag:example.com,2001:work:a:ver:1987:deu</code>.</para>
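                  <para>To illustrate the suffixing approach, the work and version just described
                     might be declared roughly as follows. This is only a sketch: the names are
                     placeholders, the tag URNs assume you control <code>example.com</code>, and
                     the structure shown should be verified against the TAN schemas.</para>
                  <programlisting>&lt;work>
   &lt;IRI>tag:example.com,2001:work:a&lt;/IRI>
   &lt;name>Work A&lt;/name>
&lt;/work>
&lt;version>
   &lt;!-- the work IRI plus a suffix for the 1987 German translation -->
   &lt;IRI>tag:example.com,2001:work:a:ver:1987:deu&lt;/IRI>
   &lt;name>Work A, German translation of 1987&lt;/name>
&lt;/version></programlisting>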
               </section>
               <section xml:id="reference_system">
                  <title>One Reference System</title>
                  <para>Every TAN transcription must be segmented into a hierarchy of labeled
                     divisions, defined in the <code><link linkend="element-body"
                        >&lt;body></link></code> through <code><link linkend="element-div"
                           >&lt;div></link></code>s and their <code><link linkend="attribute-n"
                           >@n</link></code> values. </para>
                  <para>Those divisions, whenever possible, should align with the reference system
                     that prevails for the work across different versions or translations, in what
                     is sometimes called a canonical reference system. Because even the most
                     familiar reference system admits degrees and dispute, the term
                        <emphasis>canonical</emphasis> is problematic. It is avoided in these
                     guidelines; we refer simply to a work's <emphasis role="italic">reference
                        system</emphasis>. </para>
                  <para>If you have your choice, preference should be given to reference systems
                     that follow the semantic contours of the work, not the physical features of a
                     particular scriptum. Chapter, paragraph, and sentence numbers are preferable to
                     volume, page, and line numbers, because other versions of the work (e.g.,
                     translations, paraphrases) will only roughly, if at all, follow a reference
                     system based on features found in a particular scriptum. </para>
                  <para>Sometimes a scriptum-based reference system is inescapable, or is the most
                     common reference system for a work (e.g., Porphyry's commentary on the
                        <emphasis>Categories</emphasis>). It is perfectly acceptable to adopt that
                     system, but it may entail more labor during the alignment process. </para>
                  <para>If a given work has more than one common reference system (e.g., the works
                     of Plato and Aristotle, which have two reference systems—logical and
                     scriptum-oriented—both of which are standard and important), then the
                     recommended practice is to create two class-1 files with identical
                     transcriptions, each one structured by its own reference system. Place in each
                     file a <code><link linkend="element-redivision"
                        >&lt;redivision&gt;</link></code> pointing to the other. Under verbose
                     validation, you will be notified if there are textual discrepancies between the
                     transcriptions, and Schematron Quick Fixes will allow you to automatically
                     update one text to match the other. </para>
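                  <para>A rough sketch of such a pointer, placed in the file that follows the
                     scriptum-oriented system and pointing to its logically divided twin, might
                     look like this. The IRI, name, file name, and date are hypothetical, and the
                     children and attributes shown are assumptions to be checked against the
                     definition of <code><link linkend="element-redivision"
                           >&lt;redivision&gt;</link></code>.</para>
                  <programlisting>&lt;redivision>
   &lt;!-- hypothetical identifier and location of the alternatively divided file -->
   &lt;IRI>tag:example.com,2014:tan-t:work-a:logical&lt;/IRI>
   &lt;name>Work A, divided according to the logical reference system&lt;/name>
   &lt;location href="work-a.logical.tan-t.xml" accessed-when="2020-01-01"/>
&lt;/redivision></programlisting>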
                  <para>Having two or more alternatively divided editions can be quite useful. They
                     could serve as the basis for reference cross-indexes, or to help convert other
                     versions of the work from one reference system to the other.</para>
                  <para>If there is a good reference system, but the divisions are overly lengthy,
                     you may introduce subdivisions. But there is no guarantee that the provisional
                     subdivisions you introduce will be adopted by other editors who create or edit
                     TAN versions of the same work. Editors who independently subdivide the same
                     text will likely produce discordant schemes. Class-2 formats
                     provide a mechanism via <code><link linkend="element-adjustments"
                           >&lt;adjustments></link></code> to reconcile some basic differences. But
                     a discordant scheme might be best handled simply by creating a copy, and
                     restructuring it according to the preferred system, making sure related files
                     refer to each other through <code><link linkend="element-redivision"
                           >&lt;redivision&gt;</link></code>.</para>
                  <para>If a work does not have a reference system, or if you think that the ones
                     that exist are inadequate or misguided, create one of your own. If you develop
                     your own reference system, be sure to design it so that it can be easily
                     applied to any version of the work, including translations. Prefer logical
                     divisions of text over scriptum-based divisions.</para>
                  <para xml:id="numeration-systems">TAN supports five major methods of numeration in
                     reference systems:<orderedlist>
                        <listitem>
                           <para><emphasis role="bold">Arabic numerals</emphasis>. 1, 2, 3,
                              etc.</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Roman numerals</emphasis>. Values up to 5000,
                              utilizing i, v, x, l, c, d, and m, uppercase or lowercase, with
                              liberal syntactic rules (within a roman numeral, any digit preceding
                              one of a higher value will be deducted from the total value; all
                              others are added).</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Alphabetic sequences</emphasis>. The
                              26-letter Latin alphabet, with numbers higher than 26 (or any multiple
                              of 26) beginning with the letter a incrementally repeated, e.g., y
                              (25), z (26), aa (27), bb (28), … aaa (53). Uppercase or lowercase
                              allowed. (Note, this is not the hexavigesimal (base 26) system, where
                              a is 0, b is 1, z is 25, aa is 00, ab is 01, etc.) </para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Arabic numerals + alphabetic
                                 sequences</emphasis>. Arabic numerals followed immediately by an
                              alphabetic sequence. The second item is to be calculated as a
                              subsequence of the first item, with the lack of a second item taking
                              highest priority. E.g., 4, 4a, 4b, 4c....</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Alphabetic sequences + Arabic
                                 numerals</emphasis>: As above, but with alphabetic sequence
                              preceding Arabic numerals.</para>
                        </listitem>
                     </orderedlist></para>
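                  <para>By way of illustration, the five methods might appear in <code><link
                           linkend="attribute-n">@n</link></code> as sketched below. The values of
                        <code><link linkend="attribute-type">@type</link></code> are placeholders
                     chosen only for the example.</para>
                  <programlisting>&lt;div type="chapter" n="12">...&lt;/div>  &lt;!-- Arabic numeral: 12 -->
&lt;div type="chapter" n="xiv">...&lt;/div> &lt;!-- Roman numeral: 14 -->
&lt;div type="chapter" n="bb">...&lt;/div>  &lt;!-- alphabetic sequence: 28 -->
&lt;div type="section" n="4b">...&lt;/div>  &lt;!-- Arabic numeral + alphabetic sequence: follows 4 and 4a -->
&lt;div type="section" n="b4">...&lt;/div>  &lt;!-- alphabetic sequence + Arabic numeral -->
</programlisting>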
                  <para>See <code><link linkend="function-letter-to-number"
                           >tan:letter-to-number()</link></code> and references there to TAN
                     functions for converting numbering systems.</para>
                  <para>The TAN validation process attempts to convert all values of <code><link
                           linkend="attribute-n">@n</link></code> to Arabic numerals. Some values
                     are ambiguously Roman numerals or alphabetic sequences. For example,
                        <code>c</code> could mean 3 (alphabetic sequence) or 100 (Roman numeral).
                     Such numerals are assumed to be Roman, unless you supply a <code><link
                           linkend="element-numerals">&lt;numerals&gt;</link></code> and assign
                           <code><link linkend="attribute-priority">@priority</link></code> to
                     specify <code>letters</code> (or <code>roman</code>).</para>
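                  <para>For instance, a file whose sections are labeled a, b, c, and so on could
                     override the default as sketched here. Where the element belongs within the
                     file is not shown and should be checked against its definition; the
                        <code><link linkend="attribute-type">@type</link></code> value is a
                     placeholder.</para>
                  <programlisting>&lt;numerals priority="letters"/>
. . . . .
&lt;div type="section" n="c">...&lt;/div>  &lt;!-- now read as 3, not as Roman 100 -->
</programlisting>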
                  <section xml:id="extra_n_vocabulary">
                     <title>Extra <code><link linkend="attribute-n">@n</link></code>
                        vocabulary</title>
                     <para>If you are using <code><link linkend="attribute-n">@n</link></code> to
                        label the names of books of the Bible or Surahs of the Qur'an, you will run
                        into the issue of different conventions for <code><link
                              linkend="attribute-n">@n</link></code>. To avoid this long-standing
                        problem, you may want to use extra TAN vocabulary for <code><link
                              linkend="attribute-n">@n</link></code>. If you include in the head of
                        your TAN file <code>&lt;vocabulary which="bible eng"/></code>, then any
                        non-numeric values of <code><link linkend="attribute-n">@n</link></code>
                        will be checked against the corresponding TAN-voc file (in this case, the
                        TAN-voc file at <code>/vocabularies/extra/n.bible.eng.tan-voc.xml</code>).
                        This, in turn, will allow other files to refer to that <code><link
                              linkend="element-div">&lt;div></link></code> by any other <code><link
                              linkend="element-name">&lt;name&gt;</link></code> that is a synonym.
                        For example, in a class-1 file pointing to the TAN English Bible vocabulary
                        above, a <code>&lt;div type="book" n="matt">...&lt;/div></code> would be
                        regarded as containing the work the Gospel of Matthew. Any class-2 file that
                        refers to that class-1 file as a source may use any synonym listed in the
                        extra vocabulary file <code>n.bible.eng.tan-voc.xml</code>, i.e.,
                           <code>Mt</code>, <code>Mat</code>, <code>Matt</code>, or
                           <code>Matthew</code> (or their lowercase equivalents). An extra benefit
                        of this method is that such <code><link linkend="element-div"
                              >&lt;div></link></code>s are also marked as the works, identified by
                        the <code><link linkend="element-IRI">&lt;IRI></link></code>s of the target
                        TAN vocabulary items.</para>
                     <para>If you use extra TAN vocabulary, it is recommended you include in the
                        declarations section of your <code><link linkend="element-head"
                              >&lt;head></link></code> an <code><link linkend="element-n-alias"
                              >&lt;n-alias&gt;</link></code>. This element, along with its
                              <code><link linkend="attribute-div-type">@div-type</link></code>,
                        specifies exactly which types of <code><link linkend="element-div"
                              >&lt;div></link></code>s are eligible for this kind of aliasing on
                              <code><link linkend="attribute-n">@n</link></code>. Supplying this
                        element considerably speeds the validation process on long files.</para>
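                     <para>Taken together, the relevant parts of a class-1 file using the English
                        Bible vocabulary might be sketched as follows. The <code>book</code> div
                        type follows the example above; the exact position of each element within
                        the <code><link linkend="element-head">&lt;head></link></code> is not shown
                        and should be checked against the schema.</para>
                     <programlisting>&lt;vocabulary which="bible eng"/>
. . . . .
&lt;!-- restrict aliasing of @n to divs of type "book" -->
&lt;n-alias div-type="book"/>
. . . . .
&lt;div type="book" n="matt">...&lt;/div>
</programlisting>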
                     <para>The goal behind the extra vocabularies is to eliminate the need to worry
                        about what abbreviations are used to name well-known, unnumbered <code><link
                              linkend="element-div">&lt;div></link></code>s. It is hoped that in
                        future releases of TAN these extra vocabularies will grow in number and
                        quality.</para>
                     <para>Extra TAN <code><link linkend="attribute-n">@n</link></code> vocabularies:<itemizedlist>
                           <listitem>
                              <para><xref linkend="vocabularies-n-bible-eng"/></para>
                           </listitem>
                           <listitem>
                              <para><xref linkend="vocabularies-n-bible-spa"/></para>
                           </listitem>
                           <listitem>
                              <para><xref linkend="vocabularies-n-quran-eng-ara"/></para>
                           </listitem>
                           <listitem>
                              <para><xref linkend="vocabularies-n-unlabeled-divs-1-eng"/></para>
                           </listitem>
                        </itemizedlist></para>
                  </section>
               </section>
            </section>
            <section xml:id="normalizing_transcriptions">
               <title>Normalizing Transcriptions</title>
               <para>You should declare how you have normalized the transcription via <code><link
                        linkend="element-adjustments">&lt;adjustments></link></code> and its
                  children, e.g., <code><link linkend="element-normalization"
                        >&lt;normalization></link></code> or <code><link linkend="element-replace"
                        >&lt;replace&gt;</link></code>. (For suggestions on values of <code><link
                        linkend="element-IRI">&lt;IRI></link></code> for <code><link
                        linkend="element-normalization">&lt;normalization></link></code> see <xref
                     linkend="vocabularies-normalizations"/>.)</para>
               <para>Generally speaking, normalization entails the suppression of things extraneous
                  to or separable from the work-version you have chosen. You are encouraged to omit
                  parenthetical editorial insertions (especially quotation references), stray
                  handwritten remarks, discretionary word-breaking hyphens, editorial comments,
                  inserted cross-references, and reference numerals (page numbers, section numbers,
                  etc.). If chapter 4 of a text begins "4." or "IV" then leave out that labeling
                  numeral—you've already indicated it in <code><link linkend="attribute-n"
                     >@n</link></code>, so there's no need to clutter the transcription with it.
                  Remember, scholars who use your file will be concerned with things like
                  word-for-word alignments and lexico-morphological analysis, and putting in a
                  modern editor's "4" might contaminate research results. For the same reason, you
                  should resolve ligatures and correct unintended typographical errors. </para>
               <para>The goal is a transcription whose text is free of the interpretive voice of
                  later editors. You should remove from the text anything that is not part of the
                  work proper and would interfere with detailed word-for-word alignment, or would
                  require extra preprocessing or postprocessing work for other users. If you are
                  breaking a transcription into individual lines, and you are required to break a
                  word, do so with either the soft hyphen (<code>&amp;#xad;</code>) or the
                  zero-width joiner
                     (<code>&amp;#x200d;</code>). TAN processors that handle the text within a leaf
                        <code><link linkend="element-div">&lt;div></link></code> will automatically
                  normalize its space. If either of those two characters is found at the end, then
                  it will be deleted and the text from the next leaf <code><link
                        linkend="element-div">&lt;div></link></code> (if there is one) will
                  immediately follow without intervening space; if those two characters do not occur
                  at the end, then a space, <code>&amp;#x20;</code>, will be added, and all other
                  space will be normalized. For more details, see <xref linkend="whitespace"
                  />.</para>
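               <para>As a small sketch of the behavior just described, consider a word broken
                  across two line-based leaf divisions. The <code>line</code> div type and the
                  sample text are placeholders.</para>
               <programlisting>&lt;div type="line" n="1">Blessed are the peace&amp;#xad;&lt;/div>
&lt;div type="line" n="2">makers&lt;/div>
&lt;!-- With the soft hyphen at the end of line 1, the two leaf divs join as
     "Blessed are the peacemakers"; without it, a space would be inserted,
     yielding "Blessed are the peace makers". --></programlisting>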
               <para>In a digital source, variable lengths of special spacing marks (e.g., General
                  Punctuation U+2000..U+200B) should be converted to ordinary spaces (see <xref
                     xlink:href="#deprecated-unicode-points"/>), and superscript combining Roman
                  letters (U+0363..U+036F) should probably be converted to their non-combining
                  counterparts. All Unicode must be normalized to NFC forms (see <xref
                     linkend="normalization"/>). </para>
               <para>Variant readings should not be transcribed. For example, a manuscript may have
                  correctors' marks. Or a set of footnotes (or apparatus criticus) might provide an
                  alternative reading. In those cases, each set of corrections should be moved to a
                  separate TAN-T file, or rewritten as <code><link linkend="element-claim"
                        >&lt;claim></link></code>s of a TAN-A file.</para>
               <para>In some ambiguous areas, you can use TAN-TEI both to normalize and to preserve
                  what is in the scriptum. Suppose, for example, a manuscript has reference numerals
                  that are sui generis. That is, these reference numbers do not correspond to the
                  "canonical" reference scheme, and are scribal adjustments to the text's structure
                  (sometimes mistaken). On the one hand, such reference numerals are metadata, and
                  should arguably be deleted; on the other, they are part of the text, and witness
                  to how a text was read and changed over time. A middle-ground approach would move
                  these references to TAN-TEI's <code>&lt;milestone rend="[TEXT]"></code>,
                  replacing <code>[TEXT]</code> with the reference text. In that way, the numerals
                  are properly removed from the main text, but the information is retained.
                  Generally speaking, TEI's <code>@rend</code> is an excellent way to remove
                  something from a transcription while keeping it in the file.</para>
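               <para>A sketch of this middle-ground approach, assuming a manuscript that labels as
                  "4" a section numbered 3 in the prevailing reference system. The div type, the
                  sample text, and the <code>@unit</code> value (an attribute TEI expects on
                     <code>&lt;milestone></code>) are illustrative assumptions.</para>
               <programlisting>&lt;div type="section" n="3">
   &lt;!-- the scribe's numeral "4" is kept in @rend, outside the transcribed text -->
   &lt;p>&lt;milestone unit="section" rend="4"/>The text of the section begins here . . .&lt;/p>
&lt;/div></programlisting>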
               <para>Overall, normalization is a difficult, understudied topic. Scholars are not in
                  the habit of documenting everything they normalize, and sometimes have so
                  internalized a set of normalizations that they are unaware of them. Not all
                  decisions will be clear-cut. You may justly hesitate before normalizing
                  orthography, punctuation, accentuation, or capitalization. Some aspects of Unicode
                  that permit different conventions may need special consideration. You may need to
                  deliberate on whether an unusual or rarely used Unicode character might be
                  misinterpreted or hinder searches. Document any decisions in the <code><link
                        linkend="element-adjustments">&lt;adjustments></link></code>. Whether you
                  use <code><link linkend="element-normalization">&lt;normalization></link></code>
                  or <code><link linkend="element-replace">&lt;replace&gt;</link></code> is up to
                  you. The former can be used to apply a class of changes to a vocabulary item. The
                  latter provides a precise, regular-expression-based method of describing exactly
                  what has been changed, and the order in which those changes took place. Note, a
                        <code><link linkend="element-replace">&lt;replace&gt;</link></code> might
                  help one to reconstruct the path that led from the input to the output, but not
                  the reverse. If it is important to document exactly what the pre-normalized
                  version of a text was like, use <code><link linkend="element-predecessor"
                        >&lt;predecessor&gt;</link></code> or a similar element available in the key
                  links section of the <code><link linkend="element-head">&lt;head></link></code>
                  (see <xref xlink:href="#other_related_files"/>) to point to the original.</para>
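               <para>As a hedged illustration of the regular-expression approach, the removal of
                  bracketed page numbers might be documented along these lines. The attribute names
                     <code>@pattern</code> and <code>@replacement</code> are assumed here by analogy
                  with the XPath <code>replace()</code> function, and should be checked against the
                  definition of <code><link linkend="element-replace"
                     >&lt;replace&gt;</link></code>.</para>
               <programlisting>&lt;adjustments>
   &lt;!-- delete editorial page numbers such as [12], along with any preceding space -->
   &lt;replace pattern="\s*\[\d+\]" replacement=""/>
&lt;/adjustments></programlisting>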
               <para>If you find it very difficult to bring yourself to normalize to the depth
                  advised above, try first making a (non-TAN) TEI file, and create the transcription
                  you have in mind as the ideal. Once that is finished, create a second, TAN
                  version, and be more aggressive in your normalization, with <code><link
                        linkend="element-see-also">&lt;see-also></link></code> pointing to the first
                  approach. Users of your TAN transcription will be more interested in your TAN
                  version than the TEI version, but you will have at least satisfied your craving to
                  avoid normalizing.</para>
               <section>
                  <title xml:id="normalizing-annotations">Normalizing Annotations</title>
                  <para>The footnotes or endnotes in a scriptum should be normalized. Many, most, or
                     all should likely be deleted. Before deciding, distinguish those that
                     are an intrinsic part of the work you're transcribing from those that aren't.
                     Those that aren't can be removed, or they can be put into a separate TAN-T(EI)
                     file, perhaps linking the two through <code><link linkend="element-see-also"
                           >&lt;see-also></link></code>, and hopefully structuring both files with
                     the same reference system, to facilitate alignment. Another way to approach the
                     task is to convert some or all of the notes you're removing into <code><link
                           linkend="element-TAN-A">&lt;TAN-A></link></code>
                     <code><link linkend="element-claim">&lt;claim></link></code>s.</para>
                  <para>Footnotes, endnotes, glosses, or marginalia that are intrinsic parts of the
                     work present special challenges for encoding in general, and normalization in
                     particular. </para>
                  <para>First is the issue of connecting an annotation to the text annotated. When
                     we encounter a superscript number—a note signal—while reading the text of a
                     printed book, we infer that we are being invited to find a companion footnote,
                     and that footnote comments on the text we have just read. But specifically what
                     text? Is it only the preceding word? Is it a word or phrase that occurs earlier
                     in the sentence? Does the annotation cover earlier sentences, the entire
                     paragraph, or even prior paragraphs? For some notes, identifying the text being
                     annotated requires interpretation.</para>
                  <para>In a digital file, connecting an annotation to its text cannot be so vague;
                     it requires a decision and a commitment. Here are three possible ways to
                     approach annotations in a TAN file:</para>
                  <para><orderedlist>
                        <listitem>
                           <para>Use the <code>&lt;note></code> feature of TAN-TEI (see related
                                 <link
                                 xlink:href="http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-note.html"
                                 >TEI documentation</link>). This will allow you to connect the
                              annotation to merely an anchor in the text, i.e., to no text
                              whatsoever. </para>
                           <programlisting>&lt;div n="1" type="p">
   &lt;p>The process occurred in New York, among other places.&lt;ref rend="1"/>
      &lt;note>&lt;p>&lt;ref rend="1"/>On New York, see: X.&lt;/p>&lt;/note>
   &lt;/p>
&lt;/div></programlisting>
                        </listitem>
                        <listitem>
                           <para>Move each annotation into a <code><link linkend="element-div"
                                    >&lt;div></link></code> with a <code><link
                                    linkend="attribute-type">@type</link></code> that implies that
                              it is an annotation (e.g., <code>scholium</code>) and place it
                              immediately after the <code><link linkend="element-div"
                                    >&lt;div></link></code> it annotates.</para>
                           <programlisting>&lt;div n="1" type="p">The process occurred in New York, among other places.&lt;/div>
&lt;div n="n1" type="footnote">On New York, see: X.&lt;/div></programlisting>
                           <para>Note in the example above that <code>n1</code> is used to make sure
                              that <code>1</code> unambiguously points to only one <code><link
                                    linkend="element-div">&lt;div></link></code>.</para>
                        </listitem>
                        <listitem>
                           <para>As #2, but also write a <code><link linkend="element-TAN-A"
                                    >&lt;TAN-A></link></code> file that more precisely connects each
                              annotation to the text it
                              annotates.<programlisting>&lt;claim verb="annotates">
   &lt;subject src="text" ref="n1"/>
   &lt;object src="text">
      &lt;from-tok ref="1" val="The"/>
      &lt;through-tok ref="1" val="York"/>
   &lt;/object>
&lt;/claim></programlisting></para>
                        </listitem>
                     </orderedlist>The first option is expeditious, and will allow you to be as
                     precise or imprecise as you like. Validation is not affected, but you should be
                     aware that the <code>&lt;note></code> will be treated as a constituent part of
                     its parent <code><link linkend="element-div">&lt;div></link></code>. The second
                     option is also relatively easy, but it entails a decrease in precision. The
                     third option provides immense precision, permits multiple annotations on the
                     same text range, and allows notes to target overlapping ranges of text. But the
                     task could be time-consuming, if only because you will need to determine the
                     range of text targeted by each annotation, and the targeted text might be quite
                     messy or vague. You will need to take stock of how precise and comprehensive
                     you choose to make your connections. (See also <link
                        linkend="accuracy-precision-comprehensiveness">accuracy, precision, and
                        comprehensiveness</link>.)</para>
                  <para>Remember that the note signals in the main text and in the footnote area are
                     metadata meant to help readers link corresponding passages of texts, and in the
                     spirit of normalizing should be deleted. In a TAN-TEI file you can replace a
                     note signal with <code>&lt;ref></code> (see above). </para>
               </section>
            </section>
         </section>
         <section>
            <title>Class 1 Metadata</title>
            <para>The <code><link linkend="element-head">&lt;head></link></code> of a class-1 file
               is much like that of other formats, with some extra options. </para>
            <para>In the key declarations area (see <xref xlink:href="#key_declarations"/>), class-1
               files allow <code><link linkend="element-n-alias">&lt;n-alias&gt;</link></code>.
               See <xref xlink:href="#reference_system"/> for context on how to use this
               element.</para>
            <para>In the section devoted to links to other digital resources (see <xref
                  xlink:href="#inclusions-and-vocabularies"/>), class-1 files allow several extra
               types of files.</para>
            <para>One <code><link linkend="element-model">&lt;model&gt;</link></code> is allowed, to
               point to another class-1 file whose reference system serves as the model. The model
               should be a version of the same work. It may be in a different language, or come from
               a different source/scriptum. During verbose validation, any differences between a
               class-1 file and its model will be presented as warnings, not errors, since small
               differences are nearly always inevitable.</para>
            <para>Zero or more <code><link linkend="element-redivision"
                  >&lt;redivision&gt;</link></code>s are allowed, to point to an alternative
               transcription that follows a different reference system. A class-1 file and any
               redivisions must have identical text in the <code><link linkend="element-body"
                     >&lt;body></link></code>, and draw from the same version of the same
               source/scriptum. <code><link linkend="element-redivision"
                  >&lt;redivision&gt;</link></code> is an important answer to the knotty,
               longstanding problem that besets texts that admit multiple reference systems. In a
               traditional TEI file, one must adopt a primary reference system, and add other
               reference systems through milestone-like anchors. This can result in transcriptions
               that are difficult to read. TEI anchors do not have the semantic underpinnings needed
               to switch the primary reference system from one set of milestones to another. TAN's
               design principles call for simplicity and disaggregation, hence the stand-off
               annotation model. So the ideal TAN approach is to encode the same transcription in
               multiple files, one per reference system, linked through <code><link
                     linkend="element-redivision">&lt;redivision&gt;</link></code>. This may appear
               to contradict another principle, that one should not repeat oneself. But repetition
               is the easier problem to remedy. During verbose validation, <code><link
                     linkend="element-redivision">&lt;redivision&gt;</link></code> transcriptions
               will be checked against the host, and specific areas that differ will be flagged.
               Should users wish, a Schematron Quick Fix will provide an automatic update of a text
               to a <code><link linkend="element-redivision">&lt;redivision&gt;</link></code>,
               without changing the reference structure.</para>
            <para>Zero or more <code><link linkend="element-annotation"
                  >&lt;annotation&gt;</link></code>s point to class-2 files that use the file as a
                     <code><link linkend="element-source">&lt;source&gt;</link></code>. This type of
               linked resource is helpful for keeping track of key alignments and
               annotations.</para>
            <para>Zero or more <code><link linkend="element-companion-version"
                     >&lt;companion-version&gt;</link></code>s point to different versions of the
               same work in the same scriptum. This feature is useful for correlating multiple
               versions of a work that appear in a single scriptum, e.g., the original text and a
               facing translation in a bilingual edition.</para>
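            <para>The following is a minimal sketch of how the four kinds of links described above
               might sit together in the <code><link linkend="element-head">&lt;head></link></code>
               of a class-1 file, here imagined as an English translation taken from a bilingual
               edition. It assumes that each link takes the usual IRI, name, and location pattern of
               linked digital resources. Every IRI, name, file path, and date is hypothetical, and
               the order of elements should be checked against the schema.</para>
            <programlisting>&lt;!-- All IRIs, names, paths, and dates below are hypothetical -->
&lt;!-- Another version of the same work, whose reference system this file follows -->
&lt;model>
   &lt;IRI>tag:example.com,2020:tan-t:work1-grc-critical&lt;/IRI>
   &lt;name>Work 1, Greek text, critical edition&lt;/name>
   &lt;location href="work1.grc.critical.xml" accessed-when="2020-01-15"/>
&lt;/model>
&lt;!-- The same transcription, divided by page and line instead -->
&lt;redivision>
   &lt;IRI>tag:example.com,2020:tan-t:work1-eng-by-page&lt;/IRI>
   &lt;name>Work 1, English translation, by page and line&lt;/name>
   &lt;location href="work1.eng.page.xml" accessed-when="2020-01-15"/>
&lt;/redivision>
&lt;!-- A class-2 file that uses this file as a source -->
&lt;annotation>
   &lt;IRI>tag:example.com,2020:tan-a:work1-alignment&lt;/IRI>
   &lt;name>Alignment of versions of Work 1&lt;/name>
   &lt;location href="work1.tan-a.xml" accessed-when="2020-01-15"/>
&lt;/annotation>
&lt;!-- The facing Greek text printed in the same bilingual edition -->
&lt;companion-version>
   &lt;IRI>tag:example.com,2020:tan-t:work1-grc-facing&lt;/IRI>
   &lt;name>Work 1, Greek text facing the translation&lt;/name>
   &lt;location href="work1.grc.facing.xml" accessed-when="2020-01-15"/>
&lt;/companion-version></programlisting>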
            <para>The adjustment section of the <code><link linkend="element-head"
                  >&lt;head></link></code> (see <xref xlink:href="#adjustments"/>) allows zero or
               more <code><link linkend="element-normalization">&lt;normalization></link></code>s
               and <code><link linkend="element-replace">&lt;replace&gt;</link></code>. See <xref
                  xlink:href="#normalizing_transcriptions"/>.</para>
         </section>
         <section xml:id="tan-t_data">
            <title>Class 1 Data</title>
            <para>The sole purpose of the <code><link linkend="element-body">&lt;body></link></code>
               of a class-1 file is to contain an ordered, segmented transcription of a single
               version of a single work from a scriptum. <code><link linkend="element-body"
                     >&lt;body></link></code> must take <code><link linkend="attribute-xmllang"
                     >@xml:lang</link></code>, specifying the predominant language of the text. If a
               change in language occurs in a descendant <code><link linkend="element-div"
                     >&lt;div></link></code>, ensure that its <code><link
                     linkend="attribute-xmllang">@xml:lang</link></code> also changes.</para>
            <para><code><link linkend="element-body">&lt;body></link></code> takes one or more
                     <code><link linkend="element-div">&lt;div></link></code>s, each of which govern
               either other <code><link linkend="element-div">&lt;div></link></code>s, or text (or
               TEI elements), but never both. TAN files adopt a non-mixed content model (see <xref
                  linkend="non-mixed_content"/>).</para>
            <para>The term <emphasis>leaf div</emphasis> refers to those <code><link
                     linkend="element-div">&lt;div></link></code>s that contain only text, and not
               other <code><link linkend="element-div">&lt;div></link></code>s.</para>
            <para>Within this treelike structure of <code><link linkend="element-div"
                     >&lt;div></link></code>s, the concatenation of <code><link
                     linkend="attribute-n">@n</link></code> values, starting from the most rootward
                     <code><link linkend="element-div">&lt;div></link></code>, provides the
               reference system used by class-2 files to refer to parts of TAN-T(EI) files. A given
                     <code><link linkend="element-div">&lt;div></link></code> may have more than one
               reference, if its <code><link linkend="attribute-n">@n</link></code> or any
                     <code><link linkend="attribute-n">@n</link></code> it inherits has multiple
               values. Every permutation is calculated, and they are treated as synonymous ways to
               refer to that <code><link linkend="element-div">&lt;div></link></code>.</para>
            <para>In previous versions of TAN, there was a requirement that each leaf <code><link
                     linkend="element-div">&lt;div></link></code> should have a unique reference.
               That requirement has been downgraded to a warning, because there are cases where
               non-unique leaf <code><link linkend="element-div">&lt;div></link></code>s are required.<note>
                  <para>Some scripta are encoded such that leaf divs are broken up (see Bodéüs's
                     edition of Aristotle's <emphasis>Categories</emphasis>, at 2a35, 2b5, and
                     2b6b). And some translations must be encoded so that leaf divs interleave.
                     Further, one TAN-T's leaf divs might easily become another TAN-T's non-leaf
                     divs, and vice versa. The distinction between leaf and non-leaf div is
                     arbitrary, so both types should be expected to adhere to the same kind of rules
                     for the reference system.</para>
               </note>For any two <code><link linkend="element-div">&lt;div></link></code>s that
               share the same reference, one may not be a leaf <code><link
                     linkend="element-div">&lt;div></link></code> while the other is not (to do otherwise
               would entail a mixed content model). It is further assumed that all <code><link
                     linkend="element-div">&lt;div></link></code>s that share the same reference are
               consecutive, constituent parts of the same <code><link linkend="element-div"
                     >&lt;div></link></code>. That is, any two <code><link linkend="element-div"
                     >&lt;div></link></code>s with the same <code><link linkend="attribute-n"
                     >@n</link></code> are not alternatives to each other, but are rather disjoint
               parts. For true alternatives, see discussion above on using <code>variant</code> in
                     <code><link linkend="attribute-type">@type</link></code>.</para>
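            <para>A minimal, hypothetical sketch of leaf divs that share a reference: a translation
               whose word order forces verse 1 to wrap around verse 2. The <code>@type</code> values
               and the text are invented for illustration.</para>
            <programlisting>&lt;div type="chapter" n="1">
   &lt;!-- First part of verse 1 -->
   &lt;div type="verse" n="1">In that year,&lt;/div>
   &lt;!-- Verse 2, interleaved -->
   &lt;div type="verse" n="2">after the harvest had been gathered,&lt;/div>
   &lt;!-- Second part of verse 1: same reference, a further part, not an alternative -->
   &lt;div type="verse" n="1">the king set out for the coast.&lt;/div>
&lt;/div></programlisting>
            <para>Under the rules described above, the two divs that share a reference would be read
               as successive parts of a single verse, and validation should report the duplicate
               reference as a warning rather than an error.</para>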
         </section>
         <section xml:id="tan-tei">
            <title>Transcriptions Using the Text Encoding Initiative (<code>&lt;TEI></code>)</title>
            <para>
               <note>
                  <para>This section is to be read in conjunction with <xref linkend="class_1"/> and
                        <xref linkend="TEI"/>, which address related technical issues.</para>
               </note>
            </para>
            <para>Some creators and editors of transcriptions will find the rather stripped-down
               TAN-T format inadequate. Some may wish to mark up the text further. Some may already
               have a library of transcriptions whose annotations are worth keeping, even if they
               do not interest every user. In these cases, you should use TAN-TEI, a customization
               of the Text Encoding Initiative (TEI) format, which is well known for its
               expressiveness, its stability, its flexibility, and its widespread use in textual
               scholarship.</para>
            <para>TEI was designed to be maximally expressive and flexible, to serve the detailed
               needs of scholars in the humanities. In serving this mission, TEI has come to define
               more than five hundred different elements, and more than two hundred attributes
               (roughly six times more than are defined in TAN). Of course, any given TEI file uses
               only a small subset of those elements and attributes, and TEI itself comes in
               different flavors, from TEI Lite, which uses only 75 attributes and 140 elements, to
               TEI All, which opens up almost the entire library. </para>
            <para>Although TEI XML is oftentimes described as a standard, it lacks characteristics one
               normally expects of a standard. It is very flexible, admits flavors and
               interpretation, and is best used when it is customized. Individuals and projects may
               define their own subset of TEI elements, to constrict or expand the allowable rules
               as they see fit. TAN-TEI is one of those customizations, based on TEI All. The major
               difference between TEI All and TAN-TEI is that the latter imposes extra strictures,
               to ensure that transcriptions are maximally likely to be interchangeable with other
               TAN-TEI files.</para>
            <para>All TEI files are validated against a <link
                  xlink:href="https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ST.html#STIN"
                  >TEI-conformant schema</link>, normally expressed as an XML DTD, RELAX NG, or W3C
               Schema. TAN's TEI-conformant schema is based upon the <code>TAN-TEI.odd</code> file in
               the <code>schemas</code> directory, converted to RELAX NG files,
                  <code>TEI.rnc</code> and <code>TEI.rng</code>, which define the structural rules of
               TAN-TEI files. There is
               an additional layer of validation, through the related Schematron process
                  (<code>TEI.sch</code>), which performs detailed validation not possible in a
               TEI-conformant schema. In the discussion below, it is important to distinguish
               between structural validation and Schematron validation. See <xref
                  linkend="validating_tan_files"/>.</para>
            <para>TAN's customization of the TEI can be summarized as follows (the default namespace
               in this section is the TEI namespace,
               <code>http://www.tei-c.org/ns/1.0</code>):</para>
            <para>
               <table frame="all">
                  <title>Synopsis of TAN-TEI customization</title>
                  <tgroup cols="2">
                     <colspec colname="c1" colnum="1" colwidth="1*"/>
                     <colspec colname="c3" colnum="2" colwidth="3.21*"/>
                     <thead>
                        <row>
                           <entry>TEI element</entry>
                           <entry>Strictures</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry><code>&lt;TEI></code></entry>
                           <entry>
                              <itemizedlist>
                                 <listitem>
                                    <para>must have <code><link linkend="attribute-id"
                                          >@id</link></code> with tag URN</para>
                                 </listitem>
                                 <listitem>
                                    <para>must have <code><link linkend="attribute-TAN-version"
                                             >@TAN-version</link></code></para>
                                 </listitem>
                                 <listitem>
                                    <para>takes a new child element, <code><link
                                             linkend="element-head">&lt;head></link></code>, placed
                                       between <code>&lt;teiHeader></code> and
                                          <code>&lt;text></code>; it and its descendants must be in
                                       the TAN namespace,
                                          <code>xmlns:tan="tag:textalign.net,2015:ns"</code>
                                    </para>
                                 </listitem>
                              </itemizedlist>
                           </entry>
                        </row>
                        <row>
                           <entry><code>&lt;text></code></entry>
                           <entry>
                              <itemizedlist>
                                 <listitem>
                                    <para>There are no extra strictures, but during Schematron
                                       validation (not RELAX-NG), this element and any children
                                          <code>&lt;front></code> and <code>&lt;back></code> will be
                                       ignored. Of its children, only <code><link
                                             linkend="element-body">&lt;body></link></code> will be
                                       Schematron validated. </para>
                                 </listitem>
                              </itemizedlist>
                           </entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-body">&lt;body></link></code></entry>
                           <entry>
                              <itemizedlist>
                                 <listitem>
                                    <para>must take <code><link linkend="attribute-xmllang"
                                             >@xml:lang</link></code></para>
                                 </listitem>
                                 <listitem>
                                    <para>any non-<code><link linkend="element-div"
                                          >&lt;div></link></code> children will be ignored during
                                       Schematron validation; most often only <code><link
                                             linkend="element-div">&lt;div></link></code> should be
                                       children</para>
                                 </listitem>
                                 <listitem>
                                    <para>contents must be restricted to a single version of a
                                       single work</para>
                                 </listitem>
                                 <listitem>
                                    <para>any and all text nodes will be treated as part of the
                                       transcription</para>
                                 </listitem>
                              </itemizedlist>
                           </entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-div">&lt;div></link></code></entry>
                           <entry>
                              <itemizedlist>
                                 <listitem>
                                    <para>may encompass a textual division of whatever size you like
                                       (TEI defines <code><link linkend="element-div"
                                             >&lt;div></link></code> as being larger than block-like
                                       or paragraph-like textual divisions; TAN's <code><link
                                             linkend="element-div">&lt;div></link></code> is much
                                       more like HTML's).</para>
                                 </listitem>
                                 <listitem>
                                    <para>must take elements; either they all are <code><link
                                             linkend="element-div">&lt;div></link></code>s (perhaps
                                       interleaved with anchors such as <code>&lt;pb></code>) or
                                       none of them are <code><link linkend="element-div"
                                             >&lt;div></link></code>s (non-mixed model)</para>
                                 </listitem>
                                 <listitem>
                                    <para>must take <code><link linkend="attribute-type"
                                             >@type</link></code> and <code><link
                                             linkend="attribute-n">@n</link></code> (or only <link
                                          linkend="attribute-include"
                                       ><code>@include</code></link>)</para>
                                 </listitem>
                                 <listitem>
                                    <para><code><link linkend="attribute-type">@type</link></code>
                                       may take multiple values, space delimited, pointing via IDref
                                       to a vocabulary item</para>
                                 </listitem>
                                 <listitem>
                                    <para><code><link linkend="attribute-n">@n</link></code> must
                                       consist of word characters, periods, or underscores, conforming to
                                       the following regular expression: <code>[\w\._]+([\-
                                          ,]+[\w\._]+)*</code>. If <code><link linkend="attribute-n"
                                             >@n</link></code> is to be given more than one value,
                                       those items must be separated by a space or a comma. A
                                       hyphen-minus, - (U+002D, the most common form of hyphen),
                                       always has special meaning in <code><link
                                             linkend="attribute-n">@n</link></code>, specifying a
                                       range. This feature is useful for cases where a <code><link
                                             linkend="element-div">&lt;div></link></code> straddles
                                       more than one standard reference number (e.g., a translation
                                       of Aristotle that cannot be easily tied to Bekker numbers).
                                       If you need to use a hyphen-like character in an <code><link
                                             linkend="attribute-n">@n</link></code> that does not
                                       specify a range of numbers, consider ‐ (U+2010 HYPHEN), ‑
                                       (U+2011 NON-BREAKING HYPHEN), ‒ (U+2012 FIGURE DASH), –
                                       (U+2013 EN DASH), or − (U+2212 MINUS SIGN).</para>
                                 </listitem>
                              </itemizedlist>
                           </entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table>
            </para>
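            <para>The strictures in the table can be pictured with a bare skeleton such as the
               following. This is only a sketch: the tag URN, the <code>@TAN-version</code> value,
               the <code>@type</code> values, and the text are assumptions made for illustration, and
               the contents of the TAN <code>&lt;head></code> are omitted, so the fragment would not
               validate as it stands.</para>
            <programlisting>&lt;!-- Hypothetical tag URN; TAN-version value assumed -->
&lt;TEI xmlns="http://www.tei-c.org/ns/1.0"
   id="tag:example.com,2020:tan-tei:sample" TAN-version="2020">
   &lt;teiHeader>
      &lt;fileDesc>
         &lt;titleStmt>&lt;title/>&lt;/titleStmt>
         &lt;publicationStmt>&lt;p/>&lt;/publicationStmt>
         &lt;sourceDesc>&lt;p/>&lt;/sourceDesc>
      &lt;/fileDesc>
   &lt;/teiHeader>
   &lt;!-- TAN head, in the TAN namespace, between teiHeader and text -->
   &lt;head xmlns="tag:textalign.net,2015:ns">
      &lt;!-- TAN metadata goes here; it may not be left empty -->
   &lt;/head>
   &lt;text>
      &lt;body xml:lang="eng">
         &lt;!-- Non-leaf div: all of its child elements are divs -->
         &lt;div type="chapter" n="1">
            &lt;!-- Leaf divs: none of their child elements are divs -->
            &lt;div type="section" n="1">&lt;p>Text of the first section.&lt;/p>&lt;/div>
            &lt;!-- The hyphen-minus marks a range: this div straddles sections 2 and 3 -->
            &lt;div type="section" n="2-3">&lt;p>Text that spans two sections.&lt;/p>&lt;/div>
         &lt;/div>
      &lt;/body>
   &lt;/text>
&lt;/TEI></programlisting>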
            <para>TAN-TEI files have two heads, which may strike you as strange. Each head does
               something different, and was designed for different purposes. Whereas the TAN <link
                  linkend="element-head"><code>&lt;head></code></link> is meant to be brief and
               restricted to only those matters relevant to the transcription, the
                  <code>&lt;teiHeader></code> permits quite an expansive range of metadata, and may
               be used to encode a variety of things, including those that are tangential or
               irrelevant to the data. Unlike the TAN <link linkend="element-head"
                     ><code>&lt;head></code></link>, whose data is designed to be both computer- and
               human-readable, <code>&lt;teiHeader></code> was designed for data to be read
               principally by humans; although it can accommodate IRIs, it was not designed around
               them. Further, a TAN <link linkend="element-head"><code>&lt;head></code></link> can
               never be empty and valid; a bare-bones <code>&lt;teiHeader></code> with no actual
               text content, such as the following, is considered
               valid:<programlisting>&lt;teiHeader>
   &lt;fileDesc>
      &lt;titleStmt>&lt;title/>&lt;/titleStmt>
      &lt;publicationStmt>&lt;p/>&lt;/publicationStmt>
      &lt;sourceDesc>&lt;p/>&lt;/sourceDesc>
   &lt;/fileDesc>
&lt;/teiHeader></programlisting></para>
            <para>TAN's Schematron validation process ignores the contents of
                  <code>&lt;teiHeader></code>, since its contents are unpredictable and therefore
               not reliably parsable. If your <code>&lt;teiHeader></code> has any kind of metadata
               that needs to appear in the TAN <link linkend="element-head"
                  ><code>&lt;head></code></link> (see <xref linkend="metadata_head"/> and <xref
                  linkend="transcription_principles"/>), the conversion needs to be performed
               manually, since (as mentioned above) the two headers are incommensurate, and writing
               each one requires a different mentality.</para>
            <para>In a TAN-TEI file, the TAN <code><link linkend="element-head"
                  >&lt;head></link></code> must be in the TAN namespace, i.e., <code>&lt;head
                  xmlns="tag:textalign.net,2015:ns"></code> (or <code>&lt;tan:head
                  xmlns:tan="tag:textalign.net,2015:ns"></code>, but this would require all
               descendant elements to be prefixed <code>tan:</code>).</para>
            <para>Within any leaf <code><link linkend="element-div">&lt;div></link></code>, you may
               use whatever TEI markup you wish, to whatever level of depth or complexity. Most
               users of your TAN-TEI file will be interested in the text; only a subset will care
               about any markup within leaf <code><link linkend="element-div"
               >&lt;div></link></code>s. </para>
            <para>TEI files are flexible, permitting different approaches to markup. A TAN-TEI file
               should not be scriptum-oriented, i.e., it should not try to replicate how the text
               appears or looks on the object. </para>
            <para>You may have a TEI file that you wish to convert to TAN-TEI. As a matter of
               practicality, it is helpful to envision the conversion process as falling into three
               steps:</para>
            <para>
               <orderedlist>
                  <listitem>
                     <para>Structure: insert new processing instructions (pointing to files to
                        perform TAN-TEI structural and Schematron validation); adjust root element
                        by supplying a tag URN for <code><link linkend="attribute-id"
                           >@id</link></code> and <code><link linkend="attribute-TAN-version"
                              >@TAN-version</link></code>.</para>
                  </listitem>
                  <listitem>
                     <para>Metadata: create new <code><link linkend="element-head">&lt;head
                              xmlns="tag:textalign.net,2015:ns"></link></code> and populate
                        it.</para>
                  </listitem>
                  <listitem>
                     <para>Data: edit <code><link linkend="element-body">&lt;body></link></code> to
                        make sure all text nodes are restricted to the content of a single version
                        of a single work; restructure <code><link linkend="element-body"
                              >&lt;body></link></code> content into nesting <code><link
                              linkend="element-div">&lt;div></link></code>s with correct <code><link
                              linkend="attribute-type">@type</link></code> and <code><link
                              linkend="attribute-n">@n</link></code> values.</para>
                  </listitem>
               </orderedlist>
            </para>
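            <para>For the third step, a before-and-after sketch may help. It assumes a small,
               invented TEI fragment whose chapter heading is an editorial addition rather than part
               of the work; the <code>@type</code> and <code>@n</code> values are likewise
               hypothetical.</para>
            <programlisting>&lt;!-- Before: ordinary TEI, with no @type, @n, or @xml:lang -->
&lt;body>
   &lt;div>
      &lt;head>Chapter 1&lt;/head>
      &lt;p>First paragraph of the chapter.&lt;/p>
      &lt;p>Second paragraph of the chapter.&lt;/p>
   &lt;/div>
&lt;/body>

&lt;!-- After: TAN-TEI, with nesting divs that carry the reference system -->
&lt;body xml:lang="eng">
   &lt;div type="chapter" n="1">
      &lt;!-- The editorial heading has been dropped; its number now lives in @n -->
      &lt;div type="p" n="1">&lt;p>First paragraph of the chapter.&lt;/p>&lt;/div>
      &lt;div type="p" n="2">&lt;p>Second paragraph of the chapter.&lt;/p>&lt;/div>
   &lt;/div>
&lt;/body></programlisting>
            <para>If a heading does belong to the work, one option would be to keep it in a leaf div
               of its own rather than delete it.</para>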
            <para>It has been the experience of those who have made TEI to TAN-TEI conversions that
               step 2 is the most time-consuming, particularly in finding suitable IRIs. But step 3
               should not be underestimated, either. Many people write TEI files with a focus on the
               original textual object, and they do not normalize to the level expected in a TAN
               file. Some TEI files have been written with little attention paid to space and space
               normalization. Some TEI files are so laden with annotations that the text is
               impossible to read. In general, the simpler the TEI file, the better, with
               annotations pushed to external files.</para>
            <para>Some TEI markup is already implicit, or is easily calculable (e.g.,
                  <code>&lt;w></code> to mark words, which should already comport with the
               tokenization declared in the <code><link linkend="element-head"
                  >&lt;head></link></code>; users of <code>&lt;w></code> easily lose track of where
               space is and isn't). Some TEI markup can be expressed in a class-2 file (e.g.,
               lexico-morphological data, which should be expressed in a TAN-A-lm file).</para>
         </section>
      </chapter>
      <chapter xml:id="class_2">
         <title>Class-2 TAN Files, Annotations of Texts</title>
         <para>This chapter provides general background to class-2 TAN files. For detailed
            discussion of individual elements and attributes see <xref
               linkend="elements-attributes-and-patterns"/>.</para>
         <para>There are three types of class-2 files:<orderedlist>
               <listitem>
                  <para><emphasis role="bold">TAN-A</emphasis> files provide broad, macroscopic
                     alignment of multiple versions of any number of works. They also allow
                     annotations of texts, in the form of claims.</para>
               </listitem>
               <listitem>
                  <para><emphasis role="bold">TAN-A-tok</emphasis> files provide narrow, microscopic
                     alignment of any two class-1 files, annotating word-for-word or
                     character-for-character correspondences between the two texts.</para>
               </listitem>
               <listitem>
                  <para><emphasis role="bold">TAN-A-lm</emphasis> files express annotations
                     pertaining to lexico-morphology (part-of-speech), for either a single class-1
                     file or a language in general.</para>
               </listitem>
            </orderedlist></para>
         <para>In translation studies, it is common to use the term <emphasis>source</emphasis> (or
               <emphasis>sources</emphasis>) to refer to the text being translated and the term
               <emphasis>target</emphasis> to refer to the translation. TAN, however, has been
            designed for situations where it may not be clear which text is the target and which is
            the source. Further, there is a more generic use of <emphasis>source</emphasis> and
               <emphasis>target</emphasis> that prevails in many other contexts. In these
            guidelines, therefore, the term <emphasis role="italic">target</emphasis> never refers
            to a text as such (rather, it normally refers to a file that is being pointed to), and
            when we use the word <emphasis role="italic">source</emphasis>, we are referring only to
            one of the class-1 files upon which a class 2 alignment depends.</para>
         <section xml:id="class_2_common">
            <title>Common Elements</title>
            <section xml:id="class_2_metadata">
               <title>Class 2 Metadata (<code><link linkend="element-head"
                  >&lt;head></link></code>)</title>
               <para>Class-2 files share a few common features in their metadata, mostly to
                  facilitate the human-friendly reference system discussed below.</para>
               <para>All class-2 files have as their sources nothing other than class-1 files.
                  Therefore each <code><link linkend="element-source">&lt;source></link></code> must
                  take the <xref xlink:href="#digital_entity_metadata"/>. </para>
               <para>Editors of class-2 files must be able to name or number word-tokens in a
                  transcription, and to determine an appropriate definition of "token," via an
                  optional <code><link linkend="element-token-definition"
                        >&lt;token-definition></link></code>. See <xref linkend="defining_tokens"
                  />.</para>
               <para>The declaration <code><link linkend="element-numerals"
                     >&lt;numerals&gt;</link></code> at present does not allow you to customize a
                  numeration system for sources. A future release of TAN may support such a
                  feature.</para>
               <para>Inevitably, some class 1 sources for the same work will differ from each other.
                  Perhaps works or div types were not defined with the same IRIs, or perhaps one
                  version follows an idiosyncratic reference system. If sources need to be
                  reconciled, alterations may be specified in <code><link
                        linkend="element-adjustments">&lt;adjustments&gt;</link></code>, which
                  stipulates a set of actions that should be applied to the sources that have been
                  named. Adjustment actions:</para>
               <orderedlist>
                  <listitem>
                     <para><code><link linkend="element-skip">&lt;skip&gt;</link></code>, to allow
                        you to ignore specific <code><link linkend="element-div"
                           >&lt;div&gt;</link></code>s, deeply or shallowly.</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="element-rename">&lt;rename&gt;</link></code>, to
                        allow you to rename specific <code><link linkend="element-div"
                              >&lt;div&gt;</link></code>s.</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="element-equate">&lt;equate&gt;</link></code>, to
                        allow you to provide synonyms for <code><link linkend="attribute-n"
                              >@n</link></code> values.</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="element-reassign">&lt;reassign&gt;</link></code>, to
                        allow you to split leaf <code><link linkend="element-div"
                           >&lt;div&gt;</link></code>s and move their parts elsewhere in the
                        structure. </para>
                  </listitem>
               </orderedlist>
               <para>These adjustment actions allow you to reconcile discordant sources without
                  changing them directly. </para>
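               <para>As a rough illustration, the first three actions might be combined as in the
                  sketch below. The source IDref <code>grc</code>, the reference values, the
                  attribute names <code>src</code> and <code>new</code>, and the use of
                  <code>n</code> on a skip are assumptions made for this example; consult <xref
                     linkend="elements-attributes-and-patterns"/> for what each element actually
                  permits.</para>
               <para>
                  <programlisting>&lt;adjustments src="grc">
   &lt;!-- hypothetical: ignore any div whose @n is "title" -->
   &lt;skip n="title"/>
   &lt;!-- hypothetical: rename one div, picked out by its @ref -->
   &lt;rename ref="9.22" new="10.1"/>
   &lt;!-- hypothetical: treat two @n values as synonyms -->
   &lt;equate n="ep epistle"/>
&lt;/adjustments></programlisting>
               </para>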
               <para>Skips, renames, and equates are first applied to the source as received. If a
                  particular source <code><link linkend="element-div">&lt;div&gt;</link></code> is
                  the target of more than one adjustment action, only the first one will be applied
                  according to action priority: <code><link linkend="element-skip"
                        >&lt;skip&gt;</link></code>, <code><link linkend="element-rename"
                        >&lt;rename&gt;</link></code> based on <code><link linkend="attribute-ref"
                        >@ref</link></code>, <code><link linkend="element-rename"
                        >&lt;rename&gt;</link></code> based on <code><link linkend="attribute-n"
                        >@n</link></code>, then <code><link linkend="element-equate"
                        >&lt;equate&gt;</link></code>. Because of this priority order, some actions
                  might not be performed. For example, if you deeply skip a <code><link
                        linkend="element-div">&lt;div></link></code>, no renaming adjustments will
                  be made to its children. If you have renamed a div, then want to reassign it, you
                  must do so based on the new name, not the original. You should be aware of the
                  consequences of your adjustments.</para>
               <para>After skips, renames, and equates are applied, <code><link
                        linkend="element-reassign">&lt;reassign&gt;</link></code>s are applied to
                  the newly adjusted source.</para>
               <para>Each adjustment action adds time to the validation routines. On lengthy texts
                  these can become quite time-consuming. Take, for example, the Tanakh / Old
                  Testament in Hebrew, Greek Septuagint, and English (King James Version). Each of
                  these differs from the others in the names of books, and the numeration of some
                  chapters and verses (primarily the books of Psalms, Jeremiah, Joel, and Hosea). To
                  reconcile these three versions, one might write 267 <code><link
                        linkend="element-rename">&lt;rename&gt;</link></code>s and 6 <code><link
                        linkend="element-equate">&lt;equate&gt;</link></code>s. Applying these
                  actions to all three versions can take about a minute (tested on a computer with an
                  Intel i5-8250U and 12 GB of RAM), before any other significant validation checks on the
                        <code><link linkend="element-body">&lt;body></link></code> of the class-2
                  file. Normal validation takes about a minute and a half. If such processing times
                  are unacceptable for your needs, you are advised to keep <code><link
                        linkend="element-adjustments">&lt;adjustments&gt;</link></code>s to a
                  minimum or to apply them to relatively small texts. </para>
               <para>Further, adjustment actions were intended primarily to support the alignment
                  process, and so were designed to apply select changes to sources. If a source must
                  be changed in numerous places to reconcile it with other sources, it might be
                  better to create a new version of the source organized according to the target
                  reference system. Then in both the new and original versions of the class-1 files
                  insert <code><link linkend="element-redivision">&lt;redivision&gt;</link></code>,
                        <code><link linkend="element-predecessor">&lt;predecessor&gt;</link></code>,
                        <code><link linkend="element-successor">&lt;successor&gt;</link></code>, or
                        <code><link linkend="element-see-also">&lt;see-also></link></code> to link
                  the two versions.</para>
               <para>There is a TAN application that remodels one text in the image of another. See
                     <code>applications/remodel/remodel via TAN-T.xsl</code>. The output of that
                  application requires editing, but it can reduce the amount of work required. TAN
                  tools for oXygen's author mode can also be used to correct that newly segmented
                  text. These and related applications are under development, and may not function
                  as expected. Improvement of these tools is scheduled for future releases of
                  TAN.</para>
            </section>
            <section>
               <title>Class 2 Data (<code><link linkend="element-body"
                  >&lt;body></link></code>)</title>
               <para>Data differs greatly between the class 2 formats. However, they all share one
                  thing in common: the <code><link linkend="element-body">&lt;body></link></code>
                  consists of a series of claims, and responsibility for those claims should be
                  attributed to the persons, organizations, or algorithms making the claims.
                  Therefore, each <code><link linkend="element-body">&lt;body></link></code> may
                  take <code><link linkend="attribute-claimant">@claimant</link></code> and perhaps
                        <code><link linkend="attribute-claim-when">@claim-when</link></code>,
                  specifying by IDref who should be credited with, or blamed for, the material. If either
                  attribute is missing, it is assumed that the claims are the responsibility of the
                  persons listed in <code><link linkend="element-file-resp"
                     >&lt;file-resp&gt;</link></code> at the time of the latest date or date-time.
                  The values of <code><link linkend="attribute-claimant">@claimant</link></code> and
                        <code><link linkend="attribute-claim-when">@claim-when</link></code> are
                  weakly inheritable.</para>
            </section>
            <section xml:id="pointer-syntax">
               <title>Class 2 Pointer Syntax: Referencing Texts</title>
               <para>The class 2 formats have been designed to be human readable, particularly text
                  references. In ordinary conversation, when referring to specific parts of a work,
                  we prefer to use the numbers or names of pages, paragraphs, sentences, lines,
                  words, letters, and so forth, and sometimes relational words (e.g., "first"). We
                  might say, for example, "See page 4, second paragraph, the last four words."
                  Sometimes we quote the very text itself: "See page 4, second paragraph, first
                  sentence, second occurrence of 'pull'." </para>
               <para>Those familiar conventions are the basis for the TAN pointer syntax, which
                  differs from other pointer systems (e.g., URLs, XPath, and XPointer). TAN pointers
                  apply common reference terminology to four strata of a text: works, divisions,
                  word tokens, and characters. <emphasis>Works</emphasis>, defined above (see <xref
                     linkend="conceptual_works"/>), are declared by the <emphasis>source</emphasis>
                  (which may not have more than one work). <emphasis>Divisions</emphasis> are
                  defined by the <code><link linkend="element-div">&lt;div></link></code> structure
                  of each source. <emphasis>Tokens</emphasis> are words of the text in those
                  divisions, defined according to one or more <code><link
                        linkend="element-token-definition">&lt;token-definition&gt;</link></code>s
                  declared in the class-2 file. And <emphasis>characters</emphasis> are defined as
                  individual base letters in a word token (modifier characters are grouped with the
                  preceding base character; see <xref linkend="combining_characters"/>).</para>
               <para>This approach not only makes the syntax human readable but mitigates the effect
                  of changes to the sources. For example, if a <code><link linkend="element-div"
                        >&lt;div></link></code> is deleted, moved, or changed, the alteration
                  affects only references specific to that section; the rest of the reference system
                  remains intact.</para>
               <para>The four parts of TAN's reference system are explained below, but you should
                  consult other parts of the guidelines, or TAN examples, to see how they are used
                  in practice.</para>
               <section>
                  <title>Referencing Works: <code><link linkend="attribute-work"
                     >@work</link></code></title>
                  <para>Class-2 files refer to works via meaningful IDrefs that point to the class-1
                     sources that transcribe the work or work-version, e.g., <code><link
                           linkend="attribute-work">work</link>="hamlet"</code>. The reference is
                     understood to apply not merely to that particular source, but to any TAN-T file
                     that claims to transcribe that work or work-version. (On the relationship
                     between works and work-versions see <xref linkend="domain_model"/>.) Thus, the
                     id of the source-scriptum becomes a proxy or alias for the work. A vocabulary
                     item <code><link linkend="element-work">&lt;work&gt;</link></code> may also be
                     used; its <code><link linkend="attribute-xmlid">@xml:id</link></code> provides
                     a way to refer to a work without requiring a corresponding source.</para>
                  <para>Because TAN-A-tok and TAN-A-lm files deal with source-specific claims, the
                     data for those formats do not refer to works. Only TAN-A <code><link
                           linkend="element-claim">&lt;claim&gt;</link></code>s refer to
                     works.</para>
               </section>
               <section xml:id="referencing-divisions">
                  <title>Referencing Divisions: <code><link linkend="attribute-ref"
                        >@ref</link></code></title>
                  <para>Portions of text, i.e., <code><link linkend="element-div"
                        >&lt;div></link></code>s, perhaps altered if <code><link
                           linkend="element-adjustments">&lt;adjustments&gt;</link></code>s have
                     been invoked (see <xref linkend="metadata_head"/>), are pointed to via
                           <code><link linkend="attribute-ref">@ref</link></code>. A <code><link
                           linkend="attribute-ref">@ref</link></code> is constructed by taking the
                     values of <code><link linkend="attribute-n">@n</link></code> in the <code><link
                           linkend="element-div">&lt;div></link></code> in question along with its
                     ancestor <code><link linkend="element-div">&lt;div></link></code>s, and joining
                     them with non-word characters. For example, <code><link linkend="attribute-ref"
                           >ref</link>="I.1.1"</code> might point to the following:</para>
                  <para>
                     <programlisting>&lt;div type="act" n="1">
   &lt;div type="scene" n="1">
      <emphasis role="bold">&lt;div type="line" n="1">
         . . . . . .
      &lt;/div></emphasis>
      . . . . . .
   &lt;/div>
   . . . . . .
&lt;/div></programlisting>
                  </para>
                  <para>A <code><link linkend="attribute-ref">@ref</link></code> can express
                     sequences and ranges of <code><link linkend="element-div"
                        >&lt;div></link></code>s. In the example <code><link linkend="attribute-ref"
                           >ref</link>="1.2-4, 1.5"</code>, the hyphen and the comma are reserved
                     characters that signify ranges and series, respectively. <emphasis role="bold">A hyphen
                        always means "from...through" and a comma always means "and"</emphasis>.
                     Take note of this if you are accustomed to editing conventions that use the comma as a
                     subordinating punctuation mark. In the TAN format, commas are always
                     paratactic, not hypotactic. For example, if referring to Hamlet, <code><link
                           linkend="attribute-ref">ref</link>="I,2,3"</code> is not a reference to a
                     single <code><link linkend="element-div">&lt;div></link></code> (act I,
                     scene 2, line 3), but rather to three of them: act I, act 2, and act 3 (notice how
                     the commas in the attribute value behave like the commas in the written
                     phrase).</para>
                  <para>The periods (full stops) in <code><link linkend="attribute-ref"
                        >ref</link>="I.1.1"</code> are hypotactic markers, but they are arbitrary,
                     and could be replaced with any mix of non-word characters you like (except the
                     hyphen or comma), including spaces, e.g., <code><link linkend="attribute-ref"
                           >ref</link>="I:1 1"</code>. The numeral system is also arbitrary. You may
                     use any supported numeral system (see <link xlink:href="#numeration-systems"
                        >section on numeration systems</link>), even if the source uses a different
                     one. Semantic equivalents to the preceding example are <code><link
                           linkend="attribute-ref">ref</link>="A I i"</code> and <code><link
                           linkend="attribute-ref">ref</link>="1:a:I"</code>. Just remember, if you
                     use either the Roman numeral system or alphabetic sequences, include a
                           <code><link linkend="element-numerals">&lt;numerals&gt;</link></code> in
                     the <code><link linkend="element-head">&lt;head></link></code> to specify which
                     system should prevail in case of ambiguities (e.g., whether <code>c</code>
                     means 3 or 100). Roman numerals are the default, but it is a good idea to be
                     explicit.</para>
               </section>
               <section xml:id="attr_pos_and_val">
                  <title>Referencing Tokens: <code><link linkend="attribute-pos">@pos</link></code>
                     and <code><link linkend="attribute-val">@val</link></code></title>
                  <para>To point to a token one normally uses <code><link linkend="element-tok"
                           >&lt;tok></link></code>, with one or more attributes, in three possible
                     configurations:</para>
                  <para>
                     <orderedlist>
                        <listitem>
                           <para><emphasis role="italic"><code><link linkend="attribute-val"
                                       >@val</link></code> or <code><link linkend="attribute-rgx"
                                       >@rgx</link></code> alone</emphasis>: one or more tokens are
                              pointed to by value. For example, <code><link linkend="attribute-val"
                                    >val</link> = "bird"</code> points to every occurrence of the
                              token <code>bird</code>; <code><link linkend="attribute-rgx"
                                    >rgx</link> = "b.+d"</code> finds every word that begins with a
                              b, ends with a d, and has one or more characters in between. Every value of
                                    <code><link linkend="attribute-rgx">@rgx</link></code> is
                              implicitly bound to the beginning and end of the string (see
                              below).</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="italic"><code><link linkend="attribute-pos"
                                       >@pos</link></code> alone</emphasis>: one or more tokens are
                              pointed to by numerical position, via one or more integers, or the
                              keyword <code>last</code> or <code>last-</code> plus an integer, joined by
                              hyphens or commas. For example, <code>2, 4-6, last-2 - last</code>
                              refers to the second, fourth, fifth, sixth, antepenult, penult, and
                              final tokens in a passage. The numerical value to which the keyword
                                 <code>last</code> resolves depends upon the context length.</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="italic"><code><link
                                          linkend="attribute-val">@val</link></code> or <code><link
                                          linkend="attribute-rgx">@rgx</link></code> combined with
                                          <code><link linkend="attribute-pos"
                                    >@pos</link></code></emphasis>: a combination of the
                              previous two methods. For example, <code><link linkend="attribute-val"
                                    >val</link>="bird" <link linkend="attribute-pos"
                                 >pos</link>="2, 4"</code> picks the second and fourth occurrences
                              of the token <code>bird</code>.</para>
                        </listitem>
                     </orderedlist>
                  </para>
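                  <para>The three configurations might look like the following sketch. The passage
                        <code>ref="1.1"</code> is hypothetical, its placement on
                        <code>&lt;tok></code> is assumed, and depending on the format a
                        <code>&lt;tok></code> may need further attributes not shown here (for
                     example, one that identifies the source).</para>
                  <para>
                     <programlisting>&lt;!-- 1. by value or regular expression alone -->
&lt;tok ref="1.1" val="bird"/>
&lt;tok ref="1.1" rgx="b.+d"/>
&lt;!-- 2. by position alone -->
&lt;tok ref="1.1" pos="2, 4-6, last-2 - last"/>
&lt;!-- 3. by value combined with position -->
&lt;tok ref="1.1" val="bird" pos="2, 4"/></programlisting>
                  </para>
                  <para>If the passage had ten tokens, the <code>pos</code> value in the second
                     configuration would resolve to tokens 2, 4, 5, 6, 8, 9, and 10.</para>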
                  <para>During Schematron validation, if <code><link linkend="attribute-pos"
                           >@pos</link></code> is missing, it is assumed to mean <code>*</code> or
                        <code>1 - last</code>; if neither <code><link linkend="attribute-val"
                           >@val</link></code> nor <code><link linkend="attribute-rgx"
                        >@rgx</link></code> appears, the assumption is <code><link
                           linkend="attribute-rgx">@rgx</link></code> with value <code>.+</code>
                     (any characters). That is, <code><link linkend="attribute-pos"
                        >@pos</link></code> by default points to every instance and <code><link
                           linkend="attribute-val">@val</link></code>/<code><link
                           linkend="attribute-rgx">@rgx</link></code> by default points to any
                     string.</para>
                  <para>When using <code><link linkend="attribute-pos">@pos</link></code> make sure
                     you know the context. For example, the attribute combination <code>val="bird"
                        pos="last-1"</code> will produce an error if the token <code>bird</code>
                     does not occur at least two times in the given context.</para>
                  <para>It is advisable to use <code><link linkend="attribute-val"
                        >@val</link></code>, and not merely <code><link linkend="attribute-pos"
                           >@pos</link></code>. If your source's text changes, and there is no
                           <code><link linkend="attribute-val">@val</link></code>, it may be
                     difficult to determine the original intent of a claim, or to decide whether
                     changes need to be made. Furthermore, <code><link linkend="attribute-val"
                           >@val</link></code> is generally speaking more efficient to process than
                     is <code><link linkend="attribute-rgx">@rgx</link></code>. A <code><link
                           linkend="attribute-rgx">@rgx</link></code> is more efficient to process
                     only if it replaces numerous instances of <code><link linkend="attribute-val"
                           >@val</link></code>.</para>
                  <para><code><link linkend="attribute-rgx">@rgx</link></code> is a regular
                     expression that must match an entire word-token. For example, <code><link
                           linkend="attribute-rgx">rgx</link>="re.d"</code> will match the tokens
                     "rend" and "read" but will not match "already", "rends", or "bread". If you
                     wish to allow for characters at the beginning or end, use
                        <code>".*re.d.*"</code>. For more on regular expressions, see <xref
                        linkend="regular_expressions"/>.</para>
               </section>
               <section>
                  <title>Referencing Characters: <code><link linkend="attribute-chars"
                        >@chars</link></code></title>
                  <para>Individual letters are always specified by <code><link
                           linkend="attribute-chars">@chars</link></code>, which points to a
                     specific position, e.g., <code><link linkend="attribute-chars">chars</link>="2,
                        7, last"</code>. Combining characters are excluded from these counts; see
                        <xref xlink:href="#combining_characters"/>.</para>
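                  <para>A minimal sketch, assuming that <code>chars</code> is placed on a
                        <code>&lt;tok></code> and that the hypothetical passage
                        <code>ref="1.1"</code> contains the token <code>bird</code>:</para>
                  <para>
                     <programlisting>&lt;!-- second and last characters of the first "bird": i and d -->
&lt;tok ref="1.1" val="bird" pos="1" chars="2, last"/></programlisting>
                  </para>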
               </section>
            </section>
         </section>
         <section xml:id="TAN-A">
            <title>Division-Based Annotations and Alignments (<code><link linkend="element-TAN-A"
                     >&lt;TAN-A></link></code>)</title>
            <para>TAN-A is the format for macroscopic, division-based alignment and annotations. It
               is dedicated to aligning any number of versions of any number of works on the basis
               of <code><link linkend="element-div">&lt;div></link></code>s in its sources. The A
               also stands for annotations, because the TAN-A format allows you to make general
               assertions, usually but not necessarily about texts. TAN-A is a type of advanced RDF
               for textual scholarship (see <xref xlink:href="#rdf_and_lod"/>).</para>
            <section>
               <title>Root Element and Header</title>
               <para>The root element of a TAN division-based alignment file is <code><link
                        linkend="element-TAN-A">&lt;TAN-A></link></code>.</para>
               <para>TAN-A's <code><link linkend="element-head">&lt;head></link></code> has zero or
                  more <code><link linkend="element-source">&lt;source></link></code>s.</para>
               <para>Any concepts that will be mentioned in the <code><link linkend="element-claim"
                        >&lt;claim&gt;</link></code>s (the only children of <code><link
                        linkend="element-body">&lt;body></link></code>) need to be supplied in
                        <code><link linkend="element-vocabulary-key"
                     >&lt;vocabulary-key></link></code>.</para>
            </section>
            <section xml:id="tan-a_body">
               <title>Data (<code><link linkend="element-body">&lt;body></link></code>)</title>
               <para>The <code><link linkend="element-body">&lt;body></link></code> of a TAN-A file
                  takes, in addition to the customary optional attributes (see <xref
                     linkend="edit_stamp"/>), <code><link linkend="attribute-claimant"
                        >@claimant</link></code>, <code><link linkend="attribute-object"
                        >@object</link></code>, <code><link linkend="attribute-subject"
                        >@subject</link></code>, or <code><link linkend="attribute-verb"
                        >@verb</link></code>, stipulating the default values for the enclosed
                  claims.</para>
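               <para>For instance, defaults set on the body might be inherited or overridden as in
                  this sketch. The IDrefs (<code>ed1</code>, <code>work_a</code>, and so on) and
                  the verb names are hypothetical and would have to be defined in the head; the
                  placement of these attributes directly on <code>&lt;claim></code> is likewise
                  an assumption.</para>
               <para>
                  <programlisting>&lt;body claimant="ed1" verb="quotes">
   &lt;!-- takes claimant and verb from the body defaults -->
   &lt;claim subject="work_a" object="work_b"/>
   &lt;!-- supplies its own verb in place of the default -->
   &lt;claim subject="work_a" verb="paraphrases" object="work_c"/>
&lt;/body></programlisting>
               </para>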
               <para>The rest of the body consists of zero or more <code><link
                        linkend="element-claim">&lt;claim&gt;</link></code>s, each of which
                  represents one or more claims. Claims can be used for a variety of purposes,
                  for example:</para>
               <para>
                  <itemizedlist>
                     <listitem>
                        <para>to list quotations and allusions;</para>
                     </listitem>
                     <listitem>
                        <para>to indicate which passages deal with what general subjects and
                           topics;</para>
                     </listitem>
                     <listitem>
                        <para>to connect commentary or notes from one source to another;</para>
                     </listitem>
                     <listitem>
                        <para>to indicate where other scripta have different readings (apparatus
                           criticus).</para>
                     </listitem>
                  </itemizedlist>
               </para>
               <para><code><link linkend="element-claim">&lt;claim&gt;</link></code>'s data model is
                  inspired by the Resource Description Framework (RDF; see <xref
                     linkend="rdf_and_lod"/>), where each statement consists of three items termed a
                  subject, a predicate, and an object. The first and third are thought of as nodes,
                  and the second as a connector (or edge) between the nodes. RDF follows a graph
                  model, where the connector (edge) always links exactly two nodes.</para>
               <para>RDF is adequate for but a limited range of scholarly assertions. An RDF
                  statement lacks context or qualifiers. No RDF statement can indicate who made the
                  assertion, or when, or if it was uttered with any doubt or nuance. Sometimes we
                  wish to claim a bare negation, e.g., "Aristotle was not the author of <emphasis>De
                     mundo</emphasis>"—which cannot be expressed in RDF.</para>
               <para>TAN's <code><link linkend="element-claim">&lt;claim&gt;</link></code> extends
                  the graph RDF model into a hypergraph, where the connector (edge) links two or
                  more nodes. The following adjustments are made:<orderedlist>
                     <listitem>
                        <para>Every claim <emphasis>must</emphasis> have at least one <emphasis
                              role="bold">claimant</emphasis>, some person, organization, or
                           algorithm to be credited/blamed for the assertion.</para>
                     </listitem>
                     <listitem>
                        <para>Every claim <emphasis>must</emphasis> have at least one <emphasis
                              role="bold">subject</emphasis>, the topic of the claim.</para>
                     </listitem>
                     <listitem>
                        <para>Every claim <emphasis>must</emphasis> have at least one <emphasis
                              role="bold">verb</emphasis> (in RDF called
                              <emphasis>predicate</emphasis>), specifying something about the
                           subject.</para>
                     </listitem>
                     <listitem>
                        <para>Every claim may have one or more <emphasis role="bold"
                              >adverb</emphasis>, qualifying the verb.</para>
                     </listitem>
                     <listitem>
                        <para>Every claim may assert a level or range of <emphasis role="bold"
                              >certainty</emphasis>, between zero and one, reflecting how certain
                           the claimant is of the claim.</para>
                     </listitem>
                     <listitem>
                        <para>Every claim may have one or more <emphasis role="bold"
                              >object</emphasis>, an entity or value expected by the verb.</para>
                     </listitem>
                     <listitem>
                        <para>Every claim may have one or more <emphasis role="bold">temporal
                              qualifier</emphasis>, restricting the claim to a specific time.</para>
                     </listitem>
                     <listitem>
                        <para>Every claim may have one or more <emphasis role="bold">locative
                              qualifier</emphasis>, restricting the claim to a specific geographical
                           region.</para>
                     </listitem>
                     <listitem>
                        <para>Every claim may have other components, if so defined by the verb.
                           Currently, this entails for select verbs a language qualifier
                                 (<code><link linkend="attribute-in-lang">@in-lang</link></code>,
                                 <code><link linkend="element-in-lang"
                           >&lt;in-lang&gt;</link></code>) and a reference qualifier (<code><link
                                 linkend="element-at-ref">&lt;at-ref&gt;</link></code>).</para>
                     </listitem>
                  </orderedlist>Items 1-3 above are required parts of any claim. Items 4-9 may be
                  rendered as being required, optional, or disallowed by a <code><link
                        linkend="element-verb">&lt;verb&gt;</link></code>'s definition. For example,
                  a <code><link linkend="element-verb">&lt;verb&gt;</link></code> representing an
                  idea that in normal discourse is intransitive (e.g., sleep) can be defined such
                  that <code><link linkend="element-object">&lt;object></link></code> is not
                  allowed. </para>
               <para>Furthermore, a <code><link linkend="element-verb">&lt;verb&gt;</link></code>
                  may be defined to restrict what kinds of objects or subjects are allowed. For
                  example, the standard TAN verb <code>lacks_text_at</code> (see
                     <code>vocabularies/verbs.TAN-voc.xml</code>) is defined to allow only scripta
                  as a subject. An object is not allowed. A <code><link linkend="element-claim"
                        >&lt;claim></link></code> with this verb expects one or more <code><link
                        linkend="element-at-ref">&lt;at-ref&gt;</link></code>s, which restricts the
                  claim to a particular passage in a TAN-T file. A <code><link
                        linkend="element-verb">&lt;verb&gt;</link></code> can specify that an object
                  must be data, and it can also define the type of data allowed and its permitted
                  lexical form. </para>
               <para>Claims may refer to other claims. That is, <code><link linkend="element-claim"
                        >&lt;claim></link></code>s can nest inside each other (e.g., X claims that Y
                  claims that Z claims that...). Or a <code><link linkend="element-claim"
                        >&lt;claim></link></code> may take an <code><link linkend="attribute-xmlid"
                        >@xml:id</link></code>, whose value can then be cited as the object or
                  subject of any other <code><link linkend="element-claim"
                  >&lt;claim></link></code>.</para>
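               <para>As a minimal sketch of the second approach, the following fragment gives one
                  claim an <code><link linkend="attribute-xmlid">@xml:id</link></code> and cites
                  that value as the object of another claim. The idrefs (<code>smith</code>,
                  <code>jones</code>, <code>aristotle</code>, <code>de-mundo</code>), the verbs
                     <code>wrote</code> and <code>disputes</code>, and the use of
                     <code>@claimant</code> as an attribute are assumptions made for illustration
                  only; check them against the schema and your own vocabulary.</para>
               <programlisting language="xml">&lt;!-- Hypothetical idrefs and verbs; illustration only --&gt;
&lt;claim xml:id="c1" claimant="smith" subject="aristotle" verb="wrote" object="de-mundo"/&gt;
&lt;!-- The second claim takes the first claim, by its id, as its object --&gt;
&lt;claim claimant="jones" subject="jones" verb="disputes" object="c1"/&gt;</programlisting>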
               <para>If a <code><link linkend="element-claim">&lt;claim></link></code> is about a
                  work or source in general, as a whole, one or more IDrefs may be placed in
                        <code><link linkend="attribute-subject">@subject</link></code> or
                        <code><link linkend="attribute-object">@object</link></code>. But if the
                  claim is about a specific part of the textual object, then more information is
                  needed, so the attributes cannot be used.</para>
               <para>Such textual references come in three flavors: assertions pertaining to a work,
                  assertions pertaining to a work in only some versions, and assertions pertaining
                  to scripta. In the first case, <code><link linkend="element-subject"
                        >&lt;subject></link></code> or <code><link linkend="element-object"
                        >&lt;object></link></code> must take <code><link linkend="attribute-work"
                        >@work</link></code>, with IDrefs pointing to vocabulary items for
                        <code><link linkend="element-work">&lt;work&gt;</link></code>s. In the
                  second case, <code><link linkend="attribute-src">@src</link></code> is used,
                  pointing by IDref to the applicable <code><link linkend="element-source"
                        >&lt;source></link></code>s. In the third case <code><link
                        linkend="attribute-scriptum">@scriptum</link></code> is used, pointing to
                  vocabulary items for <code><link linkend="element-scriptum"
                        >&lt;scriptum&gt;</link></code>. Remember, you may combine commonly grouped
                  IDrefs in an <code><link linkend="element-alias"
                  >&lt;alias&gt;</link></code>.</para>
               <para>A <code><link linkend="attribute-work">@work</link></code> means that the claim
                  applies to any version of the work, whether a source or not; a <code><link
                        linkend="attribute-src">@src</link></code> specifies that the claim applies
                  only to the specific <code><link linkend="element-source"
                     >&lt;source></link></code>. In each case, <code><link linkend="element-subject"
                        >&lt;subject></link></code> or <code><link linkend="element-object"
                        >&lt;object></link></code> may be given more attributes and elements to
                  restrict the claim to specific parts of the work or source, with <code><link
                        linkend="attribute-ref">@ref</link></code>, <code><link
                        linkend="element-tok">&lt;tok></link></code>, <code><link
                        linkend="attribute-val">@val</link></code>, <code><link
                        linkend="attribute-pos">@pos</link></code>, and <code><link
                        linkend="attribute-chars">@chars</link></code>, following the conventions
                  used in pointing to parts of texts (see <xref xlink:href="#pointer-syntax"
                  />).</para>
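               <para>The difference between the two kinds of pointer might be sketched as follows.
                  The idrefs (<code>smith</code>, <code>w-commentary</code>,
                  <code>src-lxx</code>), the verb <code>quotes</code>, the
                  <code>@claimant</code> attribute, and the reference values are all hypothetical;
                  the form of <code><link linkend="attribute-ref">@ref</link></code> in practice
                  depends on the reference system of the texts concerned.</para>
               <programlisting language="xml">&lt;!-- Hypothetical idrefs, verb, and references; illustration only --&gt;
&lt;claim claimant="smith" verb="quotes"&gt;
   &lt;!-- @work: any version of the work, restricted to one passage --&gt;
   &lt;subject work="w-commentary" ref="2 5"/&gt;
   &lt;!-- @src: only the declared source, restricted to one passage --&gt;
   &lt;object src="src-lxx" ref="1 1"/&gt;
&lt;/claim&gt;</programlisting>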
               <para>If a <code><link linkend="element-subject">&lt;subject></link></code> or
                        <code><link linkend="element-object">&lt;object></link></code> points via
                        <code><link linkend="attribute-scriptum">@scriptum</link></code> to a
                  scriptum, specifying the claim necessarily takes a different approach than that
                  used for <code><link linkend="attribute-work">@work</link></code> or <code><link
                        linkend="attribute-src">@src</link></code>. Bear in mind that these
                  guidelines encourage you to avoid scriptum-oriented methods of dividing class 1
                  files. Clarifying a portion of a scriptum (e.g., a particular manuscript folio
                  number) therefore requires an apparatus that likely does not correspond to a TAN
                  file. Accordingly, a <code><link linkend="element-subject">&lt;subject></link></code> or
                        <code><link linkend="element-object">&lt;object></link></code> with a
                        <code><link linkend="attribute-scriptum">@scriptum</link></code> can be
                  restricted through descendant <code><link linkend="element-div"
                     >&lt;div></link></code>s that specify via <code><link linkend="attribute-n"
                        >@n</link></code> and <code><link linkend="attribute-type"
                     >@type</link></code> a specific region on the scriptum. These scriptum filters,
                  unlike TAN-T <code><link linkend="element-div">&lt;div></link></code>s, are always
                  empty; their sole purpose is to point in native terms to a specific region on a
                  scriptum.</para>
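               <para>A scriptum filter of this kind might look like the following sketch. The idrefs
                     <code>smith</code> and <code>ms-a</code>, the verb
                  <code>is_water_damaged</code>, the <code>@claimant</code> attribute, and the
                  values of <code><link linkend="attribute-type">@type</link></code> and
                        <code><link linkend="attribute-n">@n</link></code> are hypothetical, chosen
                  only to show the shape of the markup.</para>
               <programlisting language="xml">&lt;!-- Hypothetical idrefs, verb, and values; illustration only --&gt;
&lt;claim claimant="smith" verb="is_water_damaged"&gt;
   &lt;subject scriptum="ms-a"&gt;
      &lt;!-- An empty div: it names a region of the scriptum, it holds no text --&gt;
      &lt;div type="folio" n="12r"/&gt;
   &lt;/subject&gt;
&lt;/claim&gt;</programlisting>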
               <para>Multiple values in any component of a <code><link linkend="element-claim"
                        >&lt;claim></link></code> are distributed, which means that one <code><link
                        linkend="element-claim">&lt;claim></link></code> might contain multiple
                  assertions. For example, <code>&lt;claim subject="A B" verb="taught promoted"
                     object="X Y Z"/></code> contains twelve claims (2 subjects × 2 verbs × 3 objects:
                  every combination of the three attributes' individual values). The exception to this
                  general rule is <code><link linkend="attribute-adverb">@adverb</link></code>,
                  whose multiple values are taken as ampliative and restrictive. For example,
                     <code>&lt;claim subject="A" adverb="probably not" verb="taught"
                     object="X"/></code> is a single claim, not two, even though <code><link
                        linkend="attribute-adverb">@adverb</link></code> has two values.</para>
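               <para>Written out in full, the first example above is equivalent to the following
                  twelve single-valued claims. This is only a sketch of the expansion; as in the
                  inline example, the claimant is left out for brevity.</para>
               <programlisting language="xml">&lt;!-- subject A --&gt;
&lt;claim subject="A" verb="taught" object="X"/&gt;
&lt;claim subject="A" verb="taught" object="Y"/&gt;
&lt;claim subject="A" verb="taught" object="Z"/&gt;
&lt;claim subject="A" verb="promoted" object="X"/&gt;
&lt;claim subject="A" verb="promoted" object="Y"/&gt;
&lt;claim subject="A" verb="promoted" object="Z"/&gt;
&lt;!-- subject B --&gt;
&lt;claim subject="B" verb="taught" object="X"/&gt;
&lt;claim subject="B" verb="taught" object="Y"/&gt;
&lt;claim subject="B" verb="taught" object="Z"/&gt;
&lt;claim subject="B" verb="promoted" object="X"/&gt;
&lt;claim subject="B" verb="promoted" object="Y"/&gt;
&lt;claim subject="B" verb="promoted" object="Z"/&gt;</programlisting>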
               <para>A limited set of verbs have been defined in standard TAN vocabulary; see <xref
                     xlink:href="#vocabularies-verbs"/>. The strictures defined in these verbs are
                  checked during Schematron validation. For a brief discussion on defining your own
                  verbs in a TAN-voc file see <xref linkend="tan-voc-data"/>.</para>
            </section>
         </section>
         <section xml:id="tan-a-tok">
            <title>Token-Based Annotations and Alignments (<code><link linkend="element-TAN-A-tok"
                     >&lt;TAN-A-tok></link></code>)</title>
            <para>TAN-A-tok files facilitate the microscopic alignment of two related sources. The
               format is intended to allow you to specify exactly where, how, and why two
               transcriptions align, and to do so on the most granular level possible. TAN-A-tok
               files also allow you to express levels of confidence or alternative opinions. The two
               class-1 sources should be two different versions of the same work. Most often, one
               will be a translation of the other, but the format can be used for two versions of
               the text in the same language, e.g., paraphrase, revision.</para>
            <para>Creators and editors of TAN-A-tok files should be able to read the languages of
               their sources and to explain as precisely as possible the relationship between the
               two sources. They should be prepared to think about and specify types of textual
               reuse. TAN-A-tok files tend to be more demanding to create and edit than are TAN-A
               files because of the level of detail involved.</para>
            <para>To simplify the file, token alignment is restricted to two texts, referred to
               jointly as a <emphasis role="italic">bitext</emphasis>. Each half of the bitext must
               be a TAN-T(EI) file. It is assumed that those two sources share some special
               relationship, direct or indirect, and relate through one or more types of textual
               reuse: translation, paraphrase, commentary, and so forth. Some of these bitexts, such
               as literal translations, may line up quite nicely word for word. Others, such as
               paraphrases, may line up sporadically, vaguely, ambiguously, or, in places, not at
               all. Annotating a bitext is oftentimes not easy, and requires you to consider and
               declare assumptions you have made in two key areas: the relationship that holds
               between two scripta and the types of reuse that were involved in turning one version
               into the other (or a common ancestor into both).</para>
            <para><emphasis role="bold">Relationship of sources' scripta</emphasis>. What is the
               physical relationship or history that connects the two sources' scripta? Is one a
               direct descendant (copy) of the other? If not, what common ancestor do they share?
               Here you should consider the material aspect of the bitext, because you are trying to
               answer how object A's text relates to object B's. See <xref
                  xlink:href="#vocabularies-bitext-relations"/>.</para>
            <para><emphasis role="bold">Types of reuse</emphasis>. What categories of text reuse do
               you consider operative? Users of your data should be informed of the paradigm you
               bring to your analysis. You may wish to keep your categories nondescript and somewhat
               vague, using generic terms such as <emphasis>translation</emphasis>,
                  <emphasis>paraphrase</emphasis>, <emphasis>quotation</emphasis>, without much
               specificity. On the other hand, you may subscribe to a detailed view of text reuse.
               Perhaps you have adopted field-specific categories such as <emphasis>obligatory
                  explicitation</emphasis>, <emphasis>optional explicitation</emphasis>,
                  <emphasis>pragmatic explicitation</emphasis>, or <emphasis>translation-inherent
                  explicitation</emphasis>. You may also wish to declare secondary types of reuse
               that may have intervened, such as <emphasis role="italic">scribal omission</emphasis>
               or <emphasis role="italic">dittography</emphasis>. You must declare at least one type
               of reuse, whether one you define yourself or one built into the TAN format. See
                  <xref xlink:href="#vocabularies-reuse-types"/>.</para>
            <section>
               <title>Root Element and Header</title>
               <para>The root element of a token-based alignment file is <code><link
                        linkend="element-TAN-A-tok">&lt;TAN-A-tok></link></code>.</para>
               <para>The TAN-A-tok header builds upon the core and class 2 headers (see <xref
                     linkend="metadata_head"/> and <xref linkend="class_2_metadata"/>).</para>
               <para>TAN-A-tok files take exactly two <code><link linkend="element-source"
                        >&lt;source></link></code>s. The sequence is arbitrary. Each <code><link
                        linkend="element-source">&lt;source></link></code> must take an <code><link
                        linkend="attribute-xmlid">@xml:id</link></code>.</para>
               <para><code><link linkend="element-vocabulary-key">&lt;vocabulary-key></link></code>
                  takes, in addition to all the elements allowed in class-2 files (see <xref
                     linkend="class_2_metadata"/>), two elements unique to TAN-A-tok: <code><link
                        linkend="element-bitext-relation">&lt;bitext-relation></link></code> and
                        <code><link linkend="element-reuse-type">&lt;reuse-type></link></code>. The
                  former describes the genealogical relationship between each source's scripta. The
                  second attends to the qualitative aspect of the bitext relationship. See
                  above.</para>
            </section>
            <section>
               <title>Data (<code><link linkend="element-body">&lt;body></link></code>)</title>
               <para>The <code><link linkend="element-body">&lt;body></link></code> of a TAN-A-tok
                  file takes, in addition to the customary optional attributes (see <xref
                     linkend="edit_stamp"/>), required <code><link
                        linkend="attribute-bitext-relation">@bitext-relation</link></code> and
                        <code><link linkend="attribute-reuse-type">@reuse-type</link></code>, which
                  take one or more IDrefs from <code><link linkend="element-bitext-relation"
                        >&lt;bitext-relation></link></code> and <code><link
                        linkend="element-reuse-type">&lt;reuse-type></link></code>, indicating the
                  default values that govern the alignment. </para>
               <para><code><link linkend="element-body">&lt;body></link></code> has only one type of
                  child: one or more <code><link linkend="element-align">&lt;align></link></code>s,
                  each of which collects sets of <code><link linkend="element-tok"
                     >&lt;tok></link></code>s from one or both sources, known collectively as a
                     <emphasis role="italic">token cluster</emphasis>. Clusters may overlap, to
                  handle translations in which words fall in one-to-one, one-to-many, many-to-one,
                  and many-to-many relationships. The independence of token clusters allows you to
                  register differences of opinion about the same set of tokens. An <code><link
                        linkend="element-align">&lt;align></link></code> may take an <code><link
                        linkend="attribute-xmlid">@xml:id</link></code>, in case you or someone else
                  wishes to refer to a particular <code><link linkend="element-align"
                        >&lt;align></link></code>.</para>
               <para>Nothing should be inferred from silence in a TAN-A-tok file. There is no
                  requirement that everything in a source <emphasis>must</emphasis> be encoded or
                  described. In writing and editing a TAN-A-tok file you do not commit yourself to
                  saying everything possible about the bitext. You might choose to encode only a few
                  token clusters. Tokens that are not referred to should not be interpreted as gaps
                  in a translation. All that can be inferred is that the creators and editors of the
                  TAN-A-tok file have said nothing about the tokens. (See discussion on <link
                     linkend="accuracy-precision-comprehensiveness">comprehensiveness</link>.) In
                  fact it is oftentimes preferable to have a TAN-A-tok file that points to only a
                  selection of tokens; a file with tens of thousands of <code><link
                        linkend="element-align">&lt;align></link></code>s could take a very long
                  time to validate.</para>
               <para>Any token may be the object of as many <code><link linkend="element-align"
                        >&lt;align></link></code>s as you like. In fact, this is preferred if you
                  wish to register competing claims or alternatives.</para>
               <para>If you wish to declare that one or more words in a source were omitted from a
                  translation or inserted into one—that is, words in one source have no match in the
                  other—you must do so through a <emphasis role="italic">one-sided
                     alignment</emphasis>, i.e., a token cluster that has tokens from only one
                  source. A one-sided alignment implies insertions or omissions.</para>
               <para>If there are multiple values in <code><link linkend="attribute-reuse-type"
                        >@reuse-type</link></code> or <code><link
                        linkend="attribute-bitext-relation">@bitext-relation</link></code>, the
                  intersection, not the union, of those values is to be understood. For example,
                     <code>reuse-type="translation paraphrase"</code> would indicate that the token
                  cluster results from an activity that is both translation and paraphrase. </para>
               <para>Commonly, <code><link linkend="element-tok">&lt;tok></link></code>s include
                        <code><link linkend="attribute-ref">@ref</link></code>, pointing to a leaf
                        <code><link linkend="element-div">&lt;div></link></code>. But this is not
                  required. The <code><link linkend="attribute-ref">@ref</link></code> may point to
                  a <code><link linkend="element-div">&lt;div></link></code> that takes other
                        <code><link linkend="element-div">&lt;div></link></code>s, or <code><link
                        linkend="attribute-ref">@ref</link></code> may be altogether absent.</para>
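               <para>The points above might be combined as in the following sketch of a
                  TAN-A-tok <code><link linkend="element-body">&lt;body></link></code>. It assumes
                  two sources declared with the ids <code>a</code> and <code>b</code>, a
                  bitext relation with the id <code>b1</code>, and a reuse type with the id
                     <code>translation</code>. The use of <code>@src</code> on <code><link
                        linkend="element-tok">&lt;tok></link></code> and the reference and position
                  values are likewise assumptions made for illustration.</para>
               <programlisting language="xml">&lt;!-- Hypothetical idrefs, references, and positions; illustration only --&gt;
&lt;body bitext-relation="b1" reuse-type="translation"&gt;
   &lt;!-- One-to-many cluster: one token in source a, two in source b --&gt;
   &lt;align xml:id="align-1"&gt;
      &lt;tok src="a" ref="1 1" pos="3"/&gt;
      &lt;tok src="b" ref="1 1" pos="4"/&gt;
      &lt;tok src="b" ref="1 1" pos="5"/&gt;
   &lt;/align&gt;
   &lt;!-- Competing opinion about the same token in source a --&gt;
   &lt;align&gt;
      &lt;tok src="a" ref="1 1" pos="3"/&gt;
      &lt;tok src="b" ref="1 1" pos="5"/&gt;
   &lt;/align&gt;
   &lt;!-- One-sided alignment: a token in source b with no match in source a --&gt;
   &lt;align&gt;
      &lt;tok src="b" ref="1 2" pos="1"/&gt;
   &lt;/align&gt;
&lt;/body&gt;</programlisting>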
            </section>
         </section>
         <section xml:id="tan-a-lm">
            <title>Lexico-Morphology (<code><link linkend="element-TAN-A-lm"
                  >&lt;TAN-A-lm&gt;</link></code>)</title>
            <para>TAN-A-lm files are used to annotate the lexical and morphological character of
               individual tokens or morphemes. </para>
            <para>These files have two kinds of dependencies: a class 1 source (optional) and the
               grammatical rules defined in one or more TAN-mor files. This section therefore should
               be read in close conjunction with <xref linkend="TAN-mor"/>.</para>
            <para>TAN-A-lm files are either <emphasis>source-specific</emphasis> or
                  <emphasis>language-specific</emphasis>. Source-specific TAN-A-lm files depend
               exclusively upon one class-1 source. Language-specific TAN-A-lm files depend upon an
               unknown number of sources. Some language-specific TAN-A-lm files might be based upon
               a small, specific corpus, others upon a vast, general one. Source-specific TAN-A-lm
               files are useful for analyzing closely one particular text. Language-specific ones
               are useful for building language resources for computer applications.</para>
            <section>
               <title>Principles and Assumptions</title>
               <para>Editors of TAN-A-lm files should understand the vocabulary and grammar of the
                  languages of their sources. They should have a good sense of the rules established
                  by the lexical and grammatical authorities adopted. They should be familiar with
                  the conventions and assumptions of the TAN-mor files being used.</para>
               <para>Although you must assume the point of view of a particular grammar and lexicon,
                  you need not hold to a single one. In addition, you may bring to the analysis your
                  own expertise and supply lexical headwords unattested in published
                  authorities.</para>
               <para>Although TAN-A-lm files are simple, they can be laborious to write and edit,
                  more than any other type of TAN file. They can also be hard to read if the
                  morphological codes are cryptic. It is customary for an editor of a TAN-A-lm file
                  to use tools to create and edit the data.</para>
            </section>
            <section>
               <title>Root Element and Header</title>
               <para>The root element of a lexico-morphological file is <code><link
                        linkend="element-TAN-A-lm">&lt;TAN-A-lm&gt;</link></code>.</para>
               <para>If the file is source-specific, <code><link linkend="element-source"
                        >&lt;source></link></code> points to the one and only TAN-T(EI) file that is
                  the object of analysis. If the file is language-specific, <code><link
                        linkend="element-for-lang">&lt;for-lang></link></code> is used in the
                  declarations section of the <code><link linkend="element-head"
                     >&lt;head></link></code> to indicate the languages that are covered. For
                  language-specific TAN-A-lm files, this part of the <code><link
                        linkend="element-head">&lt;head></link></code> may also include <code><link
                        linkend="element-tok-starts-with">&lt;tok-starts-with&gt;</link></code> and
                        <code><link linkend="element-tok-is">&lt;tok-is&gt;</link></code>, which
                  improve performance when validating and processing numerous or large files.</para>
               <para>There is at present no mechanism for automatically reconstructing the corpus
                  that underlies a language-specific TAN-A-lm file. Such a mechanism may be provided
                  in a future version of TAN.</para>
               <para><code><link linkend="element-vocabulary-key">&lt;vocabulary-key></link></code>
                  takes the elements other class-2 files take (see <xref linkend="class_2_metadata"
                  />). It also permits two elements unique to TAN-A-lm: <code><link
                        linkend="element-lexicon">&lt;lexicon></link></code> (optional) and
                        <code><link linkend="element-morphology">&lt;morphology></link></code>
                  (mandatory). Any number of lexica and morphologies may be declared; the order is
                  inconsequential. </para>
               <para>There is, at present, no TAN format for lexica and dictionaries. So even if a
                  digital form of a dictionary is identified through the <xref
                     linkend="digital_entity_metadata"/>, the Schematron validation routine will not
                  attempt to check the TAN-A-lm data against the lexical authorities cited. </para>
               <para>Because you or other TAN-A-lm editors are likely to be authorities in your own
                  right, <code><link linkend="element-person">&lt;person&gt;</link></code> can be
                  treated as if a <code><link linkend="element-lexicon">&lt;lexicon></link></code>,
                  and be referred to by <code><link linkend="attribute-lexicon"
                     >@lexicon</link></code>.</para>
            </section>
            <section>
               <title>Data (<code><link linkend="element-body">&lt;body></link></code>)</title>
               <para>The <code><link linkend="element-body">&lt;body></link></code> of a TAN-A-lm
                  file takes, in addition to the customary optional attributes found in other TAN
                  files (see <xref linkend="edit_stamp"/>), <code><link linkend="attribute-lexicon"
                        >@lexicon</link></code> and <code><link linkend="attribute-morphology"
                        >@morphology</link></code>, to specify the default lexicon and
                  grammar.</para>
               <para><code><link linkend="element-body">&lt;body></link></code> has only one type of
                  child: one or more <code><link linkend="element-ana">&lt;ana></link></code>s
                  (short for analysis), each of which matches one or more tokens (<code><link
                        linkend="element-tok">&lt;tok&gt;</link></code>) to one or more lexemes or
                  morphological assertions (<code><link linkend="element-lm"
                     >&lt;lm&gt;</link></code>, which takes <code><link linkend="element-l"
                        >&lt;l&gt;</link></code>s and <code><link linkend="element-m"
                        >&lt;m&gt;</link></code>s). </para>
               <para>An <code><link linkend="element-ana">&lt;ana></link></code> may take a
                        <code><link linkend="attribute-tok-pop">@tok-pop</link></code>, to specify
                  the number of tokens that the assertion applies to. This is particularly helpful
                  for language-specific files based upon a limited corpus of texts, where the
                  underlying data for the assertion might be difficult or impossible to retrieve.
                  The token population can be used to assign levels of certainty, or to compare
                  statistical profiles of one TAN-A-lm file against another.</para>
               <para>If you wish to point to a linguistic token that straddles more than one token,
                  you should use multiple <code><link linkend="element-tok">&lt;tok></link></code>s,
                  wrapping them in a <code><link linkend="element-group">&lt;group></link></code>. </para>
               <para>Any token may be the object of as many <code><link linkend="element-ana"
                        >&lt;ana&gt;</link></code>s as you like. In fact, this is preferred if you
                  wish to register competing claims or alternatives.</para>
               <para>Claims within an <code><link linkend="element-ana">&lt;ana&gt;</link></code>
                  are distributed. That is, every combination of <code><link linkend="element-l"
                        >&lt;l&gt;</link></code> and <code><link linkend="element-m"
                        >&lt;m&gt;</link></code> (governed by <code><link linkend="element-lm"
                        >&lt;lm&gt;</link></code>) is asserted to be true for every <code><link
                        linkend="element-tok">&lt;tok></link></code> or <code><link
                        linkend="element-group">&lt;group></link></code>. </para>
               <para>Many TAN-A-lm files will be generated by an algorithm that automatically lists
                  all possible morphological values of each token. It is advised that such automatic
                  calculations always include in their output <code><link linkend="attribute-cert"
                        >@cert</link></code>, with weighted values. That is, if an algorithm
                  identifies two possible lexico-morphological profiles for a word, but one occurs
                  nine times more than the other, then it is advised that this be reflected in the
                  two resultant elements, e.g.: <code>&lt;lm cert="0.9">...&lt;/lm></code> and
                     <code>&lt;lm cert="0.1">...&lt;/lm></code>. If an algorithm is written with a
                  more sophisticated way to weigh possibilities, then adjust the value of
                        <code><link linkend="attribute-cert">@cert</link></code> accordingly. Be
                  certain that the <code><link linkend="element-algorithm"
                     >&lt;algorithm&gt;</link></code> is credited in the <code><link
                        linkend="element-vocabulary-key">&lt;vocabulary-key></link></code> and in a
                        <code><link linkend="element-resp">&lt;resp></link></code>.</para>
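               <para>A minimal sketch of such weighted output follows. The token, lemma, and
                  morphological codes are hypothetical, and the use of <code>@val</code> in
                     <code><link linkend="element-tok">&lt;tok></link></code> is an assumption made
                  for illustration. Only the <code><link linkend="attribute-cert">@cert</link></code>
                  values reflect the nine-to-one ratio described above.</para>
               <programlisting language="xml"><![CDATA[<ana>
   <!-- hypothetical token value -->
   <tok val="bears"/>
   <!-- profile found nine times out of ten -->
   <lm cert="0.9">
      <l>bear</l>
      <m>NNS</m>
   </lm>
   <!-- profile found one time out of ten -->
   <lm cert="0.1">
      <l>bear</l>
      <m>VBZ</m>
   </lm>
</ana>]]></programlisting>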
               <para>As with TAN-A-tok files, not every word needs to be explained or described. In
                  fact, this is oftentimes undesirable, to avoid files that are overly long and
                  time-consuming to validate or process.</para>
               <para>A TAN-A-lm file is rendered more efficient when claims can be grouped. If a
                  particular token always has a particular lexico-morphological profile, this can be
                  declared once, in a <code><link linkend="element-tok">&lt;tok></link></code> that
                  does not have <code><link linkend="attribute-ref">@ref</link></code>, or it can be
                  specified through a compound <code><link linkend="attribute-ref"
                     >@ref</link></code>. You do not need to provide a <code><link
                        linkend="element-tok">&lt;tok></link></code> for every leaf div. In fact,
                  such an approach can result in inefficient validation and processing. </para>
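               <para>The difference might be sketched as follows. The token values and codes are
                  hypothetical, <code>@val</code> is assumed for illustration, and the reference
                     <code>1</code> merely stands in for a top-level division of some source.</para>
               <programlisting language="xml"><![CDATA[<!-- one profile throughout the text: declared once, with no @ref -->
<ana>
   <tok val="καί"/>
   <lm>
      <l>καί</l>
      <m>conj</m>
   </lm>
</ana>
<!-- a profile that holds only within one division, hypothetically numbered 1 -->
<ana>
   <tok ref="1" val="ἄρα"/>
   <lm>
      <l>ἄρα</l>
      <m>particle</m>
   </lm>
</ana>]]></programlisting>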
               <para>For example, in early versions of TAN, the lexico-morphological values of the
                  Greek Septuagint (8.3 MB) were converted to a TAN-A-lm file of 407,811 <code><link
                        linkend="element-tok">&lt;tok></link></code>s grouped in 52,703 <code><link
                        linkend="element-ana">&lt;ana&gt;</link></code>s (25.8 MB). Early 2020
                  validation routines took about 25 minutes (2018 validation routines took hours).
                  That particular TAN-A-lm file itemized every single token in the text. It was
                  revised to be more declarative along the lines advocated above. If a particular
                  token had only one lexico-morphological profile throughout the text, then every
                  instance was reduced to a single <code><link linkend="element-ana"
                        >&lt;ana&gt;</link></code>, with no <code><link linkend="attribute-ref"
                        >@ref</link></code> in <code><link linkend="element-tok"
                     >&lt;tok></link></code>. When a particular token value had different
                  lexico-morphological profiles, <code><link linkend="attribute-ref"
                     >@ref</link></code> targeted the rootmost <code><link linkend="element-div"
                        >&lt;div></link></code>. This revision resulted in a smaller file (15.8 MB;
                  158,376 <code><link linkend="element-tok">&lt;tok></link></code>s in 54,335
                        <code><link linkend="element-ana">&lt;ana&gt;</link></code>s) that validated
                  in about a third of the time (8.5 minutes).</para>
               <para>In general, there is always a trade-off between convenience and efficiency. If
                  your priority is speed, you should break a large file into several smaller ones,
                  perhaps recombining them in a master file via <code><link
                        linkend="element-inclusion">&lt;inclusion></link></code> (see <xref
                     linkend="inclusions-and-vocabularies"/>).</para>
            </section>
         </section>
      </chapter>
      <chapter xml:id="class_3">
         <title>Class-3 TAN Files, Varia</title>
         <para>This chapter provides general background to class-3 TAN files, which are devoted to
            formats that do not fit the other two classes. For detailed discussion of specific
            elements and attributes, see <xref linkend="elements-attributes-and-patterns"/>.</para>
         <section xml:id="TAN-voc">
            <title>Vocabulary (<code>TAN-voc</code>)</title>
            <para>All too often, a project has a set of vocabulary it draws from time and again. To
               repeat the <xref xlink:href="#pattern-iri_and_name"/> can be both tedious and
               treacherous. If a project with hundreds of TAN files decides to change or augment its
               vocabulary it could take a long time to find and make all the changes, everywhere and
               consistently.</para>
            <para>The TAN-voc format addresses that problem. It is intended to allow a project to
               define, edit, and augment the IRI + name patterns for recurrent vocabulary. TAN
               supplies several standard TAN-voc files under the subdirectory
                  <code>vocabularies</code>, supporting commonly used concepts such as token
               definitions, div types, licenses, and many more. For a complete list of predefined
               TAN keywords, see <xref linkend="vocabularies-master-list"/>.</para>
            <para>It is quite common for a person or team to build vocabulary items in the course of
               developing a corpus, which means that TAN-voc files tend to change as the project
               progresses. You can organize your vocabulary in whatever manner makes sense. You
               might create one large TAN-voc file for all vocabulary or one file per type of
               vocabulary, each independent of the other. Each approach has strengths and
               weaknesses. The latter, one TAN-voc file per type of vocabulary, can create quite a
               bit of extra work. Every TAN file that draws from the vocabulary must insert one
                     <code><link linkend="element-vocabulary">&lt;vocabulary&gt;</link></code> for
               each relevant TAN-voc file. The best approach we have found is to have one relatively
               small master TAN-voc file, which includes other TAN-voc files via <code><link
                     linkend="element-inclusion">&lt;inclusion></link></code>s (along with
                  <code>&lt;</code><code><link linkend="element-group">group</link></code><code>
                  include="[IDREFS]"/></code> or <code>&lt;</code><code><link linkend="element-item"
                     >item</link></code><code> include="[IDREFS]"/></code>, pointing to the IDrefs
               of the included TAN-voc files).</para>
            <para>For more details on how this format relates to other TAN formats, see <xref
                  linkend="inclusions-and-vocabularies"/>.</para>
            <section>
               <title>Root Element and Head</title>
               <para>A TAN-voc file has <code><link linkend="element-TAN-voc"
                     >&lt;TAN-voc&gt;</link></code> as the root element.</para>
               <para>The <code><link linkend="element-vocabulary-key"
                     >&lt;vocabulary-key></link></code> of a TAN-voc file takes, in addition to core
                  vocabulary items, any number of <code><link linkend="element-group-type"
                        >&lt;group-type&gt;</link></code>s. </para>
               <para>A TAN-voc file may draw directly from the vocabulary in its body, as if it were
                  referring to itself via <code><link linkend="element-vocabulary"
                        >&lt;vocabulary&gt;</link></code>.</para>
            </section>
            <section xml:id="tan-voc-data">
               <title>Data (<code><link linkend="element-body">&lt;body></link></code>)</title>
               <para>The <code><link linkend="element-body">&lt;body></link></code> of a TAN-voc
                  file consists simply of <code><link linkend="element-item"
                     >&lt;item&gt;</link></code>s or <code><link linkend="element-verb"
                        >&lt;verb&gt;</link></code>s, perhaps gathered into groups via <code><link
                        linkend="element-group">&lt;group&gt;</link></code> or <code><link
                        linkend="attribute-group">@group</link></code>. These groups have, at
                  present, no effect upon other TAN files that use them, but they have been valuable
                  in certain applications. For example, the standard TAN-voc file for <code><link
                        linkend="element-div-type">&lt;div-type&gt;</link></code>
                     (<code>vocabularies/div-types.TAN-voc.xml</code>) groups textual division types
                  into a rudimentary typology that allows applications to decide programmatically
                  whether a particular division should be treated as a block or inline element, or
                  whether it should be indented.</para>
               <para>The <code><link linkend="attribute-affects-attribute"
                     >@affects-attribute</link></code> or <code><link
                        linkend="attribute-affects-element">@affects-element</link></code>, both
                  weakly inheritable, defines the scope of the vocabulary items, i.e., the elements
                  or attributes for which the items may legitimately be used. A vocabulary item is
                  eligible only for the specified attributes or elements.</para>
               <para>Nearly all <code><link linkend="element-item">&lt;item&gt;</link></code>s in a
                  TAN-voc file contain the IRI + name pattern. The only exceptions are <code><link
                        linkend="element-item">&lt;item&gt;</link></code>s pertaining to token
                  definitions, which instead of <code><link linkend="element-IRI"
                     >&lt;IRI></link></code>s take <code><link linkend="element-token-definition"
                        >&lt;token-definition&gt;</link></code>s. See <xref
                     linkend="defining_tokens"/>.</para>
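               <para>A minimal sketch of a TAN-voc body follows. The IRIs and names are
                  hypothetical. The example assumes that the value of <code><link
                        linkend="attribute-affects-element">@affects-element</link></code> is the name
                  of the element the items may serve, and that the <code><link
                        linkend="element-item">&lt;item&gt;</link></code>s inherit it from the
                  enclosing <code><link linkend="element-group">&lt;group&gt;</link></code>.</para>
               <programlisting language="xml"><![CDATA[<body>
   <!-- scope declared once on the group and inherited by its items -->
   <group affects-element="person">
      <item>
         <IRI>tag:example.com,2014:person:jane-doe</IRI>
         <name>Jane Doe</name>
      </item>
      <item>
         <IRI>tag:example.com,2014:person:john-roe</IRI>
         <name>John Roe</name>
      </item>
   </group>
</body>]]></programlisting>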
               <para><code><link linkend="element-verb">&lt;verb&gt;</link></code> includes, in
                  addition to the IRI + name pattern, the option to have <code><link
                        linkend="element-constraints">&lt;constraints&gt;</link></code> added. Those
                  constraints define what components are permitted in any <code><link
                        linkend="element-claim">&lt;claim&gt;</link></code> that uses the
                        <code><link linkend="element-verb">&lt;verb&gt;</link></code>. At this time,
                  verb constraints are at an early phase of development. Only those constraints that
                  mirror standard TAN vocabulary for verbs,
                     <code>vocabularies/verbs.TAN-voc.xml</code>, will be supported during
                  validation. Study that file for examples of how to build a <code><link
                        linkend="element-verb">&lt;verb&gt;</link></code>. See <xref
                     linkend="tan-a_body"/> on the use of verbs in a TAN-A file.</para>
            </section>
         </section>
         <section xml:id="TAN-mor">
            <title>Morphological Concepts and Patterns (<code>TAN-mor</code>)</title>
            <para>TAN-mor files are used to delineate the morphological characteristics or features
               of a given language, to assign codes to those features, and to define rules governing
               the application of those codes. The format is a kind of Schematron for the grammar of human
               languages. </para>
            <para>The format allows specificity, flexibility, and responsiveness. Grammatical rules
               may be constructed to return warnings and error messages to users who use a code or
               pattern incorrectly, or not in accordance with best practices. Such rules may be
               qualified, or made contingent upon certain conditions.</para>
            <para>This chapter should be read in close conjunction with <xref linkend="tan-a-lm"
               />.</para>
            <section>
               <title>Principles and Assumptions</title>
               <para>Certain assumptions and recommendations are made regarding morphology files,
                  complementing the more general ones; see <xref linkend="design_principles"
                  />.</para>
               <para>TAN-mor files are restricted exclusively to describing the categories and rules
                  for the grammar of a natural language. Editors of these files should be well
                  versed in the grammar of the languages they are describing, and generally
                  acquainted with how the grammars of comparable languages work.</para>
               <para>The TAN-mor format has been designed with the assumption that patterns of word
                  inflection and formation can be categorized, classified, named, and described. It
                  has also been assumed that scholars may reasonably differ, perhaps radically, on
                  how categories should be defined and applied. TAN-mor allows scholars to declare
                  clearly their operative assumptions and views. It is up to other users to decide
                  whether or not to adopt them.</para>
               <para>The TAN-mor format has also been designed to cater to two different approaches
                  to morphological codes: categorized or uncategorized. </para>
               <para>Categorized codes are interpreted according to position. <code>a b c</code>
                  would mean something different than <code>c b a</code>. For example, Perseus
                     (<link xlink:href="http://www.perseus.tufts.edu/hopper/"/>) adopts categorized
                  codes for morphological analysis of Greek, Latin, and other highly inflected
                  languages. Every code has ten positions, each one corresponding to a major
                  grammatical category, with the first two being the major and minor parts of
                  speech, and the subsequent categories devoted to person, number, tense, and so
                  forth. Each word that is analyzed must have a value, even if a hyphen or null. A
                     <code>d</code> in one position means something different from a <code>d</code>
                  in another.</para>
               <para>Uncategorized codes, on the other hand, assign one unique code to each
                  grammatical feature. In this approach, codes may be combined and arranged at will.
                     <code>a b c</code> would be identical to <code>c b a</code>. This approach is
                  viable for any language (including highly inflected ones such as Greek or Latin),
                  but it is in practice most often found serving languages that are not highly
                  inflected, e.g., the Brown and Penn sets for English.</para>
               <para>TAN-mor morphological codes may not include either the space or the hyphen, and
                  unlike IDrefs, they are case insensitive. The codes <code>NOUN</code> and
                     <code>noun</code> are interchangeable.</para>
            </section>
            <section>
               <title>Root Element and Header</title>
               <para>The root element of a morphological rule file is <code><link
                        linkend="element-TAN-mor">&lt;TAN-mor></link></code>.</para>
               <para>Zero or more <code><link linkend="element-source">&lt;source></link></code>s
                  describe the grammars or related works that account for the morphological rules.
                  If the categories, codes, and rules are not based upon any published work, then
                        <code><link linkend="element-source">&lt;source></link></code> may be
                  omitted. Any TAN-mor file without a source may be inferred to be based upon the
                  personal knowledge of the persons or organizations identified in <code><link
                        linkend="element-file-resp">&lt;file-resp&gt;</link></code>.</para>
               <para><code><link linkend="element-vocabulary-key">&lt;vocabulary-key></link></code>
                  is populated with the grammatical <link linkend="element-feature"
                        ><code>&lt;feature></code></link>s that are allowed grammatical concepts in
                  the language, and they are assigned codes via <code><link linkend="attribute-xmlid"
                        >@xml:id</link></code>. Because a grammatical feature is not allowed in a
                  TAN-mor file until it is explicitly declared in a <link linkend="element-feature"
                        ><code>&lt;feature></code></link>, <code><link linkend="attribute-xmlid"
                        >@xml:id</link></code> might simply repeat the value of <link
                     linkend="attribute-which"><code>@which</code></link>.</para>
               <para>TAN has a standard vocabulary file for grammatical features:
                     <code>vocabularies/features.TAN-voc.xml</code>. This vocabulary file encodes
                  746 vocabulary items corresponding to core grammatical features declared in the
                  OLiA Reference Model for Morphology, Morphosyntax and Syntax (<link
                     xlink:href="http://purl.org/olia/olia.owl"/>). See <xref
                     linkend="vocabularies-features"/>.</para>
               <para>If you wish to incorporate into your codes characters that are not allowed in
                        <code><link linkend="attribute-xmlid">@xml:id</link></code>, e.g.,
                     <code>$</code> or <code>:</code>, you should create an <code><link
                        linkend="element-alias">&lt;alias&gt;</link></code>, whose <code><link
                        linkend="attribute-id">@id</link></code> allows such values. <code><link
                        linkend="element-alias">&lt;alias&gt;</link></code> of course can be used to
                  assign multiple grammatical features to a single id.</para>
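               <para>A sketch of such declarations follows. The feature names are hypothetical, and
                  the attribute <code>@idrefs</code> on <code><link linkend="element-alias"
                        >&lt;alias&gt;</link></code> is an assumption made for illustration. Consult
                  the element documentation for the exact form.</para>
               <programlisting language="xml"><![CDATA[<vocabulary-key>
   <!-- @xml:id supplies the code; here it simply repeats the value of @which -->
   <feature xml:id="noun" which="noun"/>
   <feature xml:id="singular" which="singular"/>
   <!-- "$" is not allowed in @xml:id, so the code is supplied through an alias,
        which here also combines two features under a single id -->
   <alias id="N$" idrefs="noun singular"/>
</vocabulary-key>]]></programlisting>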
            </section>
            <section>
               <title>Data (<code><link linkend="element-body">&lt;body></link></code>)</title>
               <para>The <code><link linkend="element-body">&lt;body></link></code> of a TAN-mor
                  file takes the customary optional attributes found in other TAN files (see <xref
                     linkend="edit_stamp"/>). </para>
               <para>Within <code><link linkend="element-body">&lt;body></link></code>, you begin
                  with a language declaration: one or more <code><link linkend="element-for-lang"
                        >&lt;for-lang></link></code>s. </para>
               <para>After the language declaration come rules: zero or more <code><link
                        linkend="element-where">&lt;where&gt;</link></code>s declaring rules to be
                  followed for the feature codes.  <code><link linkend="element-where"
                        >&lt;where&gt;</link></code> has attributes that establish the context under
                  which its enclosed rules are operative. Those rules are found in the enclosed
                        <code><link linkend="element-assert">&lt;assert></link></code>s or
                        <code><link linkend="element-report">&lt;report></link></code>s, which
                  declare rules that must be followed, or must never be followed, by any dependent
                  TAN-A-lm file. </para>
               <para>An <code><link linkend="element-assert">&lt;assert></link></code> and
                        <code><link linkend="element-report">&lt;report></link></code> will be
                  checked only if the conditions declared by the attributes in the enclosing
                        <code><link linkend="element-where">&lt;where&gt;</link></code> are met in
                  the context of a given <code><link linkend="element-m"
                  >&lt;m></link></code>:</para>
               <para>
                  <itemizedlist>
                     <listitem>
                        <para><code><link linkend="attribute-m-matches">@m-matches</link></code>
                           (regular expression): <code><link linkend="element-m"
                              >&lt;m></link></code> matches the pattern. </para>
                     </listitem>
                     <listitem>
                        <para><code><link linkend="attribute-tok-matches">@tok-matches</link></code>
                           (regular expression): one of the values of <code><link
                                 linkend="element-tok">&lt;tok></link></code> in the given
                                 <code><link linkend="element-ana">&lt;ana&gt;</link></code> matches
                           the pattern.</para>
                     </listitem>
                     <listitem>
                        <para><code><link linkend="attribute-m-has-features"
                              >@m-has-features</link></code> (space-delimited strings): <code><link
                                 linkend="element-m">&lt;m></link></code> has the specified
                           features.</para>
                     </listitem>
                     <listitem>
                        <para><code><link linkend="attribute-m-has-how-many-features"
                                 >@m-has-how-many-features</link></code> (integer): <code><link
                                 linkend="element-m">&lt;m></link></code> has the given number of
                           features.</para>
                     </listitem>
                  </itemizedlist>
               </para>
               <para>An <code><link linkend="element-assert">&lt;assert></link></code> also has one
                  or more of the truth conditions above. If the test proves false in a given
                        <code><link linkend="element-m">&lt;m></link></code> then the <code><link
                        linkend="element-m">&lt;m></link></code> will be marked as erroneous and the
                  message included by the <code><link linkend="element-assert"
                     >&lt;assert></link></code> should be returned.</para>
               <para><code><link linkend="element-report">&lt;report></link></code> has the same
                  effect, but the test looks for the opposite boolean value: the error and message
                  will be returned only if the test proves true.</para>
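               <para>The interplay of context and tests might be sketched as follows. The feature
                  codes <code>noun</code> and <code>tense</code>, and the rules themselves, are
                  hypothetical. They illustrate only how the attributes listed above combine.</para>
               <programlisting language="xml"><![CDATA[<!-- the enclosed rules are checked only for an <m> that has the feature "noun" -->
<where m-has-features="noun">
   <!-- error if the <m> does NOT have exactly three features -->
   <assert m-has-how-many-features="3">A noun should be tagged for part of speech,
      number, and case.</assert>
   <!-- error if the <m> DOES have the feature "tense" -->
   <report m-has-features="tense">Nouns do not take tense.</report>
</where>]]></programlisting>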
               <para>After the rules comes a structure declaration (if relying upon structured
                  codes): zero or more <code><link linkend="element-category"
                     >&lt;category></link></code>s. Each one sorts <link linkend="element-feature"
                        ><code>&lt;feature></code></link>s into groups, assigning them <code><link
                        linkend="attribute-code">@code</link></code> values that are unique within
                  the <code><link linkend="element-category">&lt;category></link></code>. Sequence
                  is important. The first <code><link linkend="element-category"
                        >&lt;category></link></code> defines the features allowed in the first code
                  position, the second in the second, and so forth.</para>
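               <para>A positional scheme of two categories might be sketched as follows. The feature
                  names and codes are hypothetical, and the attribute <code>@type</code>, used here
                  to point to a declared feature, is an assumption made for illustration.</para>
               <programlisting language="xml"><![CDATA[<!-- first code position: part of speech -->
<category>
   <feature type="noun" code="n"/>
   <feature type="verb" code="v"/>
</category>
<!-- second code position: number; a code need be unique only within its category -->
<category>
   <feature type="singular" code="s"/>
   <feature type="plural" code="p"/>
</category>]]></programlisting>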
               <para>See sample TAN-mor files in the <code>examples</code> directory.</para>
            </section>
         </section>
         <section xml:id="catalog-files">
            <title>TAN Catalog Files (<code>collection</code>)</title>
            <para>TAN catalog files are used to locate relevant TAN files and to support the XSLT
               function <code>collection()</code>. They catalog or index any TAN files within a
               local directory and perhaps its subdirectories. </para>
            <para>These catalog files must always be named <code>catalog.tan.xml</code>. They depart
               from all other TAN files in their structure. They have no namespace. They have
               neither body nor head. Rather, they are patterned off the catalog.xml description
               provided by Saxonica (<link xlink:href="https://www.saxonica.com"/>).</para>
            <para>Any XML file passed to the stylesheet <code>applications/create/create TAN catalog
                  file.xsl</code> will automatically generate one of these files, cataloging all the
               files in the local directory.</para>
            <para>The root element of a catalog file is <code><link linkend="element-collection"
                     >&lt;collection></link></code>, with children <code><link linkend="element-doc"
                     >&lt;doc></link></code>s that hold simple metadata about the TAN files that are
               in a directory and its subdirectories. Only TAN files may be registered in a
                     <code><link linkend="element-doc">&lt;doc></link></code>. A <code><link
                     linkend="element-doc">&lt;doc></link></code> may include other material such as
               each file's resolved <code><link linkend="element-head">&lt;head></link></code>, but
               this is not mandated.</para>
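            <para>A bare-bones catalog file might look like the following sketch. The filenames are
               hypothetical, and the attribute <code>@href</code> is assumed here on the model of
               Saxon collection catalogs. A generated catalog may carry further metadata.</para>
            <programlisting language="xml"><![CDATA[<collection>
   <!-- a TAN file in the local directory -->
   <doc href="ar.cat.grc.1949.minio-paluello.ref-logical.xml"/>
   <!-- a TAN file in a subdirectory -->
   <doc href="vocabularies/ar.cat.general.TAN-voc.xml"/>
</collection>]]></programlisting>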
         </section>
      </chapter>
   </part>
   <part xml:id="working_with_tan">
      <title>Working with the Text Alignment Network</title>
      <chapter>
         <title>Working with TAN Files</title>
         <para>This chapter presents ways to manage, create, edit, and share TAN files. The material
            discussed here is non-normative. That is, these are suggestions based upon the
            experience of TAN users. </para>
         <para>Descriptions in this chapter are both brief and general. To understand better the
            underlying framework, study the files in the subdirectory <code>functions</code>, or
            their reformatted versions in the chapter <xref
               xlink:href="#variables-keys-functions-and-templates"/>. </para>
         <section xml:id="local_setup">
            <title>Local Setup</title>
            <para>TAN can be downloaded from a master data repository listed at <link
                  xlink:href="http://textalign.net/"/>. The project has been developed using the
               version-control software Git. Whether you download the files directly or use Git,
               place the TAN code wherever is most convenient on your computer. </para>
            <para>The TAN files you create may be set up in whatever structure you want. Because TAN
               files are meant to be shared and interlinked, it is beneficial to develop predictable
               directory structures. In the 2018 version of TAN, advice was given on how to organize
               directories and files. But experience with a variety of projects, each with their own
               needs and preferences, has shown that such advice is shortsighted. One point does
               still seem valid, however: keep your TAN libraries separate from the core TAN
               files.</para>
            <para>Many TAN projects will find it necessary to work with dozens of versions of a
               particular work, and it is easy to get confused as to what file does what. In
               projects with many text versions, it is recommended that your names for class-1 files
               (the filename, not the <code><link linkend="attribute-id">@id</link></code>; see
                  <xref xlink:href="#tan-file-id"/>) start with an acronym or short abbreviation for
               the author and work, followed by the language code, the last name of the
               editor/author of the scriptum, and the date when the scriptum was created or published.
               If you have multiple TAN files that refer to each other via <code><link
                     linkend="element-redivision">&lt;redivision&gt;</link></code>, because each has
               a different reference system, you may need to include that in the filename. Some examples:<itemizedlist>
                  <listitem>
                     <para><code>ar.cat.grc.1949.minio-paluello.ref-logical.xml</code> (Aristotle's
                        Categories, in Greek, 1949, edition by Minio Paluello, following a reference
                        system based on semantic units [paragraphs, sentences, independent
                        clauses]).</para>
                  </listitem>
                  <listitem>
                     <para><code>apocr.eng.kjv.1760.xml</code> (apocrypha, English, King James
                        Version, 1760 edition)</para>
                  </listitem>
                  <listitem>
                     <para><code>tlg0059.tlg031.perseus-grc1-Pl.Ti.xml</code> (Plato's
                           <emphasis>Timaeus</emphasis> in Greek). This filename has some
                        duplication in that <code>tlg0059</code> already implies <code>Pl</code> and
                           <code>tlg031</code>, <code>Ti</code>, but only die-hard users of the
                        Thesaurus Linguae Graecae know the meaning of the numerical codes.</para>
                  </listitem>
                  <listitem>
                     <para><code>pl.ti.grc.1905.burnet.stephanus.xml</code> (Plato's
                           <emphasis>Timaeus</emphasis> in Greek). This filename is an alternative
                        way to construct the previous example.</para>
                  </listitem>
               </itemizedlist></para>
            <para>Class-2 files are tougher. They bring together multiple files and concepts, so filenames
               could become very long or unpredictably structured, especially if trying to express
               which class-1 sources they use. At this time, the best recommendation is to make sure
               that each class-2 file is put into its own subdirectory, separate from class-1 files,
               and given a brief but meaningful name that points to the research question that
               motivated its creation. Some examples:<itemizedlist>
                  <listitem>
                     <para><code>ar.cat.grc.1949.minio-paluello-sem-TAN-LM-sample.xml</code> (a
                        sample of lexico-morphological data for Aristotle's
                           <emphasis>Categories</emphasis>, in Greek)</para>
                  </listitem>
                  <listitem>
                     <para><code>nt.grc-syr.selections.TAN-A-tok.xml</code> (a selection of
                        word-for-word correspondences between the Syriac and Greek New
                        Testaments)</para>
                  </listitem>
                  <listitem>
                     <para><code>plato.general.TAN-A.xml</code> (a general alignment and annotation
                        file on Plato's works)</para>
                  </listitem>
               </itemizedlist></para>
            <para>Class-3 filenames are a bit easier. It is recommended that TAN-mor files begin
               with the language code then an acronym for the person or group responsible for
               creating the features. TAN-voc files are written generally to serve a specific
               project or collection, so the collection name and the type of vocabulary should
               suffice. Examples:<itemizedlist>
                  <listitem>
                     <para><code>eng.example.com,2014.1.xml</code> (tagging scheme #1 for English,
                        by the owner of the domain <code>example.com</code> in 2014)</para>
                  </listitem>
                  <listitem>
                     <para><code>ar.cat.general.TAN-voc.xml</code> (general vocabulary items for a
                        project for Aristotle's <emphasis>Categories</emphasis>)</para>
                  </listitem>
               </itemizedlist></para>
            <para>If you have a local copy of someone else's TAN collection, and you wish to create
               TAN files that depend on it, you will in all likelihood use relative URLs
               pointing to copies of the files stored on your local drive. It is recommended that
               you also point to the master versions through absolute URLs in extra <code><link
                     linkend="element-location">&lt;location></link></code>s. The validation routine
               checks only the first document available. From time to time, you might comment out
               the first <code><link linkend="element-location">&lt;location></link></code> and run
               the validation process again. This will tell you if there have been any updates since
               you last accessed the file. Alternatively, you can occasionally validate the other TAN
               files you have downloaded. If the <code><link linkend="element-master-location"
                     >&lt;master-location></link></code> is intact, you will be notified of any
               updates.</para>
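            <para>The arrangement described above might look something like the following sketch. It
               is illustrative only: the <code>&lt;source></code> wrapper, the IRI, the name, the
               file paths, the dates, and the attribute names <code>@href</code> and
               <code>@accessed-when</code> are assumptions that should be checked against the
               element reference. The first <code>&lt;location></code> is a relative URL to a local
               copy; the second is an absolute URL to the master version.</para>
            <programlisting language="xml">&lt;source xml:id="grc">
   &lt;IRI>tag:example.com,2014:tan-t:ar.cat.grc</IRI>
   &lt;name>Greek text of Aristotle's Categories (hypothetical entry)&lt;/name>
   &lt;!-- Local copy: the first available location is the one validation checks -->
   &lt;location href="../tan-t/ar.cat.grc.xml" accessed-when="2020-01-15"/>
   &lt;!-- Master version: comment out the line above to validate against this one -->
   &lt;location href="http://example.com/tan/ar.cat.grc.xml" accessed-when="2020-01-15"/>
&lt;/source></programlisting>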
            <para>In a given project, you are likely to repeat basic information, particularly
                     <code><link linkend="element-person">&lt;person></link></code>, <code><link
                     linkend="element-role">&lt;role></link></code>, <code><link
                     linkend="element-work">&lt;work&gt;</link></code>, and other elements that
               follow the <xref linkend="pattern-iri_and_name"/>. Consider moving those to a project
               TAN-voc file. It is almost always preferable to develop TAN-vocs before resorting to
                     <code><link linkend="element-inclusion">&lt;inclusion></link></code>s.
                     <code><link linkend="element-inclusion">&lt;inclusion></link></code>s are
               powerful, but they can quickly become complex and confusing to navigate.</para>
            <section>
               <title>Using TAN with Oxygen XML Editor</title>
               <para>If you use an advanced XML editor such as oXygen, you can set up a project so
                  that TAN validation files can be easily located and validation can be
                  automatically applied. A sample oXygen project file is included within the TAN
                  library to get you started. You may wish to create a copy of that project file for
                  yourself before developing it.</para>
               <para>TAN also includes select oXygen frameworks files, which provide editing tools
                  for oXygen's Author mode. The Author mode includes a variety of editing tools,
                  primarily for class-1 files. After opening the supplied oXygen project file,
                     <code>tan.xpr</code>, use Author mode to view a sample TAN file and look at
                  the options in the menu, the toolbars, and the context-click menu, to see what is
                  possible.</para>
               <para>Both the project file and the frameworks files are in their early infancy, and
                  are therefore incomplete and imperfect. They have tremendous potential for
                  development, slated for future versions of TAN.</para>
            </section>
         </section>
         <section>
            <title>Creating and populating TAN files</title>
            <para>TAN is a representational format. Every TAN file models some source.</para>
            <para>If those sources are non-digital, it is a relatively straightforward task to
               create and populate a TAN file. You just start editing everything by hand. In some
               cases, you might get a head start with an algorithm. For example, optical character
               recognition (OCR) on an edition might give you a dirty but useful start for a TAN-T
               file. Applying OCR to a printed index of quotations might get you the basic start to
               a TAN-A file. Despite the computer's assistance, most of your time will be
               spent correcting the converted material. Thoughtful, scholarly attention is critical to
               making these files suitable for use.</para>
            <para>In many other cases, you are trying to take something that already exists
               digitally and convert it into a TAN format. If you find a Word file, a web page, or a
               plain text file that can serve as the basis for a TAN file, a common first impulse is
               to copy the desired content, paste it into the body of an empty TAN file, then
               manually correct the material. That solution is quick and easy, but short-sighted.
               You may find that you made a major mistake, and you have done so much work, you
               cannot backtrack. Perhaps you have accidentally deleted all punctuation when you
               didn't mean to. Or you eliminated line breaks that you didn't realize at the time
               were useful signals about where <code><link linkend="element-div"
                  >&lt;div></link></code>s should be separated. Even if all goes well, after all
               that hard work you might find out that the pre-TAN data sources you started out
               with have been updated, with other things corrected. If any significant time has
               elapsed, you may have forgotten what procedure you followed to convert the data. And
               even if you remember, you will have to repeat the steps again, and dread the day when
               those pre-TAN sources are updated yet again. </para>
            <para>In these cases, it is advised to think not about fixing the files, but rather
               about developing a system to fix the files. Your goal is to create a digital
               pipeline or workflow that can be applied when needed, so that changes to those pre-TAN
               versions can be channeled into your TAN library. If you or a project member has
               experience in XSLT, it is a good idea to develop stylesheets to convert the data to
               TAN. When you find mistakes such as those described above, no harm is done. You can
               simply adjust and re-run your process, each time getting better and better results.
               An XSLT-based approach requires extra work, initially. Establishing a stable
               transformation process can be time-consuming, since it requires repeated sequences of
               trial, error, and diagnosis. But the investment pays off in the long run, especially
               if you are dealing with dozens, hundreds, or thousands of files. The routines you
               write for one set of files might be useful for the next.</para>
            <para>Here is one approach. Create a template skeleton TAN file that resembles your
               desired output. Develop an XSLT stylesheet that does the following:</para>
            <orderedlist>
               <listitem>
                  <para>Fetches the pre-TAN file (main input).</para>
               </listitem>
               <listitem>
                  <para>Puts the main input in an XML tree, then applies select alterations.</para>
               </listitem>
               <listitem>
                  <para>Fetches the template TAN file.</para>
               </listitem>
               <listitem>
                  <para>Pushes the altered pre-TAN content into the template file.</para>
               </listitem>
               <listitem>
                  <para>Saves the infused template, either as the primary output, or as a result
                     document.</para>
               </listitem>
            </orderedlist>
            <para>One of the challenges to this method is that the pre-TAN input might not be XML,
               in which case it cannot be the initial, catalyzing input to the XSLT file. But that
               is fine. For such conversions, you can make your XSLT file a MIRU (main input
               resolved uris) stylesheet. A MIRU stylesheet has as its catalyzing input any XML
               file, including itself. That initial, catalyzing input is unimportant, because a MIRU
               stylesheet, through global parameters and variables that point to resolved uris,
               fetches the main input. For an example of a MIRU stylesheet, see
                  <code>applications/compare/compare TAN class 1 files.xsl</code>.</para>
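            <para>The five steps and the MIRU idea can be sketched roughly as follows. This is a
               minimal illustration, not code from the TAN library. The parameter and variable
               names, the file paths, the mode name <code>infuse</code>, and the choice of one
               <code>&lt;div type="line"></code> per non-empty line are all hypothetical. The
               sketch assumes a plain-text main input, a template whose
               <code>&lt;body></code> is empty, and the TAN namespace
               <code>tag:textalign.net,2015:ns</code>.</para>
            <programlisting language="xml">&lt;xsl:stylesheet version="3.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema"
   xmlns:tan="tag:textalign.net,2015:ns"
   xmlns="tag:textalign.net,2015:ns"
   exclude-result-prefixes="#all">

   &lt;!-- Hypothetical parameters: where the pre-TAN text and the template live,
        relative to this stylesheet. -->
   &lt;xsl:param name="pre-tan-uri-relative" as="xs:string" select="'input/source.txt'"/>
   &lt;xsl:param name="template-uri-relative" as="xs:string" select="'templates/template.xml'"/>

   &lt;!-- Resolve both against the stylesheet, not against the catalyzing input. -->
   &lt;xsl:variable name="pre-tan-uri-resolved"
      select="resolve-uri($pre-tan-uri-relative, static-base-uri())"/>
   &lt;xsl:variable name="template-uri-resolved"
      select="resolve-uri($template-uri-relative, static-base-uri())"/>

   &lt;!-- Steps 1 and 2: fetch the main input and build a tree from it,
        here one div per non-empty line. -->
   &lt;xsl:variable name="main-input-as-tree" as="element()*">
      &lt;xsl:for-each select="unparsed-text-lines($pre-tan-uri-resolved)[normalize-space(.)]">
         &lt;div type="line" n="{position()}">
            &lt;xsl:value-of select="normalize-space(.)"/>
         &lt;/div>
      &lt;/xsl:for-each>
   &lt;/xsl:variable>

   &lt;!-- Step 3: fetch the template TAN file. -->
   &lt;xsl:variable name="template" select="doc($template-uri-resolved)"/>

   &lt;!-- The catalyzing input is ignored; processing starts from the template. -->
   &lt;xsl:template match="/">
      &lt;xsl:apply-templates select="$template" mode="infuse"/>
   &lt;/xsl:template>

   &lt;!-- Step 4: copy the template as is, except for the body, which receives
        the altered pre-TAN content. -->
   &lt;xsl:mode name="infuse" on-no-match="shallow-copy"/>
   &lt;xsl:template match="tan:body" mode="infuse">
      &lt;xsl:copy>
         &lt;xsl:copy-of select="@*"/>
         &lt;xsl:copy-of select="$main-input-as-tree"/>
      &lt;/xsl:copy>
   &lt;/xsl:template>

   &lt;!-- Step 5: the infused template is returned as the primary output. -->
&lt;/xsl:stylesheet></programlisting>
            <para>Because the stylesheet never reads its catalyzing input, it can be run against any
               XML file, including itself. When the pre-TAN source changes, re-running the
               stylesheet regenerates the TAN file.</para>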
            <para>The method described above has been used successfully to handle several different
               kinds of conversion, including ones where the source files are updated very
               frequently. In such scenarios, the traditional cut-paste-and-edit method is not only
               unproductive; it is foolish.</para>
            <para>Writing transformations can be laborious at first. Finding the best way to handle
               and manipulate a pre-TAN file is an intellectual challenge with multiple solutions.
               But there is a good chance that some of the labor you have in mind has already been
               done for you in a TAN function (see <xref
                  xlink:href="#variables-keys-functions-and-templates"/>) or application (see the
               subdirectory <code>applications</code>).</para>
         </section>
         <section xml:id="validating_tan_files">
            <title>The TAN Validation Process</title>
            <para>A TAN file is validated when it, along with its associated TAN schemas, is
               passed to a validation engine. Validation can be set up either by pointing explicitly
               to the schemas within a TAN file (via <code>&lt;?xml-model ?></code> statements in
               the prolog), or by setting up an oXygen project or framework to automatically apply
               the schemas to TAN files (see <xref xlink:href="#local_setup"/>). There are two types
               of TAN validation.</para>
            <para>Structural validation is conducted through RELAX-NG files that define the
               attributes, elements, and patterns that are allowed or required in a given TAN
               format. These files are kept in the <code>schemas</code> project subdirectory. If you
               are editing a TAN-T file, for example, its RELAX-NG schema is
                  <code>schemas/TAN-T.rnc</code>. The RELAX-NG files are written principally in the
               compact syntax (<code>.rnc</code>), then converted to XML syntax (<code>.rng</code>).
               The TAN-TEI format is an exception. The schema begins with
                  <code>schemas/TAN-TEI.odd</code>. This file, linked as it is with the other
               RELAX-NG files, is processed by TEI stylesheets to generate the master
                  <code>TAN-TEI.rnc</code> and <code>TAN-TEI.rng</code> files that validate TAN-TEI
               files. The ODD file is processed against TEI All, the largest of the TEI formats, in
               the version available at the time of the release of a given TAN version.</para>
            <para>The second type of validation uses Schematron to check rules that cannot be
               expressed in RELAX-NG, e.g., no <code><link linkend="attribute-when"
                  >@when</link></code> should have a date in the future. More than one hundred types
               of errors are checked during Schematron validation. For a comprehensive list see
                  <link xlink:href="../functions/errors/TAN-errors.xml"/>. Some of these errors can
               be quite time-consuming for a computer to check. For example, if a class-1 file has a
                     <code><link linkend="element-redivision">&lt;redivision&gt;</link></code>, the
               text of the two files should be identical. On short texts, the test can be made in
               seconds; on longer texts it might take minutes. Therefore Schematron validation is
               offered in three phases: terse, normal, and verbose. The names reflect not only how
               long each phase takes but how much feedback is provided.</para>
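            <para>As an illustration, the prolog of a TAN-T file that points explicitly to both kinds
               of schema might resemble the following. The relative paths and the Schematron
               filename <code>TAN-T.sch</code> are assumptions; adjust them to your own copy of the
               TAN library. The <code>phase</code> pseudo-attribute comes from the
               <code>xml-model</code> specification, and whether it is honored depends on your
               validation engine.</para>
            <programlisting language="xml">&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;!-- Structural validation (RELAX-NG, compact syntax) -->
&lt;?xml-model href="../schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
&lt;!-- Schematron validation, here requested in the terse phase -->
&lt;?xml-model href="../schemas/TAN-T.sch" type="application/xml"
   schematypens="http://purl.oclc.org/dsdl/schematron" phase="terse"?></programlisting>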
            <para>The Schematron files themselves are rather small. The majority of the work is done
               by a large library of XSLT code that takes the file, resolves it, and expands it,
               inserting errors and help messages along the way. A greatly reduced version of the
               expanded file is then passed back to the Schematron processor as a global variable.
               The Schematron processor returns as messages any errors or warnings found in the
               generated file, and for any suggested corrections (also embedded as children), it
               returns a Schematron Quick Fix.</para>
            <para>TAN's Schematron validation is more computationally intensive than its
               RELAX-NG validation. The longer and more complex your file and its dependencies, the longer its
               validation will be. Files such as the Ring-a-roses examples in the
                  <code>examples</code> subdirectory will take a split second to validate, but a
               TAN-T file of the Old Testament of the King James Version has been known to take
               about 33 seconds to validate in the normal phase (the whole Bible about a minute). A
               TAN-A-lm file with a full morphological analysis of that long TAN-T file will take a
               long time to validate. </para>
            <para>Tests were performed on a TAN-A file that had three very large TAN-T sources (each
               about 1.6 MB and 8,100 elements). When the TAN-A file had 125 claims, Schematron
               validation under the normal phase took about 13 seconds (run on oXygen 22.1 on a
               Windows 10 laptop with an Intel i5-8250U @ 1.60GHz). When the number of claims was
               expanded to 546, the same process took 63 seconds. When the file had 5,421 claims,
               the file took 78 minutes, 45 seconds to validate.<note>
                  <para>Much of the increase is due to the Schematron process itself. The XSLT
                     component of the three tests above took 8.3 seconds, 27.1 seconds, and 23
                     minutes 57 seconds, respectively. The time required by the Schematron
                     component grows faster than that required by the XSLT component.</para>
               </note></para>
            <para>In future versions of TAN this process will be further optimized (the figures
               above are a very significant improvement over 2018 figures). For now, you must make
               decisions that pit speed against convenience. If you wish to have validation happen
               quickly, break files into smaller ones, perhaps to be joined later in a single TAN
               file via <code><link linkend="element-inclusion">&lt;inclusion></link></code>s.
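               Validating ten component files each with ten thousand elements will take less time
               in aggregate than validating one long file with one hundred thousand elements. Had the
               example TAN-A file mentioned above been split into 43 different files, the entire
               collection would have been validated in less than 12% of the time: 43 files of
               about 126 claims each, assuming each validates about as quickly as the 125-claim
               test (roughly 13 seconds), come to a little over 9 minutes, against 78 minutes 45
               seconds for the single file.</para>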
            <para>The process behind Schematron validation can be used not only for validation but
               also for other applications, so it deserves explanation. Any TAN file that is processed by the
               TAN XSLT library goes through two major transformations.</para>
            <para>The first transformation <emphasis>resolves</emphasis> the file. The goal is to
               get the file into a state where it can be evaluated without having to consult any
                     <code><link linkend="element-vocabulary">&lt;vocabulary&gt;</link></code> or
                     <code><link linkend="element-inclusion">&lt;inclusion></link></code>
               dependencies. (See <xref xlink:href="#inclusions-and-vocabularies"/> for background
               on TAN's approach to inclusion.) This process also does some basic file-specific
               normalization; it will: <orderedlist>
                  <listitem>
                     <para>Prepare the file. This includes evaluating <code><link
                              linkend="element-alias">&lt;alias&gt;</link></code>, stamping the root
                        element with a base URI (the path location of the file), and every element
                        with a <code>@q</code> (an arbitrary name), which contains a unique
                        identifier. This identifier is used by the Schematron file to match an element
                        with any error messages in the corresponding element in the XSLT
                        output.</para>
                  </listitem>
                  <listitem>
                     <para>Identify those nodes that need to be changed by <code><link
                              linkend="element-vocabulary">&lt;vocabulary&gt;</link></code> or
                              <code><link linkend="element-inclusion">&lt;inclusion></link></code>
                        dependencies.</para>
                  </listitem>
                  <listitem>
                     <para>Insert required components from <code><link linkend="element-vocabulary"
                              >&lt;vocabulary&gt;</link></code>s or <code><link
                              linkend="element-inclusion">&lt;inclusion></link></code>s through the
                        following method:<orderedlist numeration="loweralpha">
                           <listitem>
                              <para>Relevant external vocabulary items are inserted into the
                                       <code><link linkend="element-head"
                                 >&lt;head&gt;</link></code>, either as descendants of the
                                 appropriate <code><link linkend="element-vocabulary"
                                       >&lt;vocabulary&gt;</link></code> or if derived from TAN
                                 standard vocabulary as new <code>&lt;tan-vocabulary></code>
                                 elements immediately following the <code><link
                                       linkend="element-vocabulary-key"
                                    >&lt;vocabulary-key></link></code>. All vocabulary items are
                                 imprinted with an <code>&lt;id></code>, to facilitate rapid
                                 retrieval of vocabulary. Any vocabulary <code><link
                                       linkend="element-name">&lt;name&gt;</link></code> that is not
                                 normalized is given a copy that is name-normalized (signaled by
                                    <code>@norm</code>): lower-case, hyphens and underscores changed
                                 to spaces, and space-normalized.</para>
                           </listitem>
                           <listitem>
                              <para>Any element with <code><link linkend="attribute-include"
                                       >@include</link></code> is replaced by the elements of the
                                 same name found in the target inclusion document. In addition,
                                       <code><link linkend="element-inclusion"
                                    >&lt;inclusion></link></code> is populated with any vocabulary
                                 items required to resolve the newly included material (recursively,
                                 if that inclusion requires other inclusions). This last point is
                                 important, because all IDrefs must be interpreted in light of the
                                 original context. IDrefs are brought into the host document, so
                                 when you use <code><link linkend="element-inclusion"
                                       >&lt;inclusion></link></code> you must ensure there are no id
                                 conflicts.</para>
                           </listitem>
                        </orderedlist></para>
                  </listitem>
                  <listitem>
                     <para>Normalize all numbers in original components (i.e., excluding included
                        elements or vocabulary items) as Arabic numerals.</para>
                  </listitem>
               </orderedlist></para>
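            <para>The name normalization mentioned in step 3a (lower-case, hyphens and underscores
               changed to spaces, and space-normalized) can be approximated as follows. The
               function <code>ex:normalize-name</code> and its namespace are hypothetical and are
               not part of the TAN library; consult the function library for the actual
               routine.</para>
            <programlisting language="xml">&lt;xsl:stylesheet version="3.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema"
   xmlns:ex="tag:example.com,2014:ns"
   exclude-result-prefixes="#all">

   &lt;xsl:output method="text"/>

   &lt;!-- Hypothetical function, not part of the TAN library. -->
   &lt;xsl:function name="ex:normalize-name" as="xs:string">
      &lt;xsl:param name="name" as="xs:string"/>
      &lt;!-- Lower-case, then hyphens and underscores to spaces, then space-normalize -->
      &lt;xsl:sequence select="normalize-space(replace(lower-case($name), '[\-_]', ' '))"/>
   &lt;/xsl:function>

   &lt;!-- Run with the initial template; no source document is needed. -->
   &lt;xsl:template name="xsl:initial-template">
      &lt;xsl:value-of select="ex:normalize-name('Old_Testament--KJV ')"/>
   &lt;/xsl:template>
&lt;/xsl:stylesheet></programlisting>
            <para>Under this sketch, <code>Old_Testament--KJV</code> followed by a trailing space
               yields <code>old testament kjv</code>.</para>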
            <para>Files are resolved recursively. That is, no <code><link
                     linkend="element-vocabulary">&lt;vocabulary&gt;</link></code> or <code><link
                     linkend="element-inclusion">&lt;inclusion></link></code> components are
               imported until the files pointed to are themselves first resolved.</para>
            <para>Numerals fall at the end of the process because they might need to be resolved in
               light of resolved vocabulary and inclusions. </para>
            <para>The description above is necessarily generalized. For details consult the function
               library, particularly <link
                  xlink:href="../functions/incl/TAN-core-resolve-functions.xsl"/>. In cases where
               there is a conflict between the code and the description above, the code is to be
               interpreted as more current and authoritative.</para>
            <para>The second transformation <emphasis>expands</emphasis> the file. The goal is to
               unpack the components of a resolved document and identify any errors along the way
               (see the <link xlink:href="../functions/errors/TAN-errors.xml">master list of
                  errors</link>). There are three levels of expansion, corresponding to the three
               levels of Schematron validation: terse, normal, and verbose.</para>
            <para>In terse expansion, for each value of an attribute, an element with the
               attribute's name is placed within the parent (e.g., <code><link
                     linkend="attribute-type">@type</link>="a b"</code> produces
                  <code>&lt;type>a&lt;/type></code> and <code>&lt;type>b&lt;/type></code>). If the
               value is an IDref, and it points to an alias, a copy is made for the IDref of each
               target vocabulary item. If an id reference does not point to a vocabulary item of the
               expected type, an error message is also copied in the parent. Any values that are
               ranges are expanded, if need be. Select networked files are checked for basic
               validity. Class-2 files include a special set of rounds during terse validation,
               where their sources are adjusted, and then checked against specific references made
               in the class-2 file. (See <xref xlink:href="#pointer-syntax"/>.) In terse expansion,
               all pointing mechanisms are checked, to make sure they point to a valid location.
               Because of this basic requirement, some terse expansion can take a long time on
               lengthy files, or ones with complex <code><link linkend="element-adjustments"
                     >&lt;adjustments&gt;</link></code>.</para>
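            <para>The first operation of terse expansion, turning each value of an attribute into a
               child element, can be illustrated with the sketch below. It is not the TAN library's
               implementation: the mode name <code>attr-to-elements</code> is hypothetical,
               retaining the original attribute alongside the new elements is an assumption, and
               the sketch ignores aliases, IDrefs, ranges, and error reporting.</para>
            <programlisting language="xml">&lt;xsl:stylesheet version="3.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   exclude-result-prefixes="#all">

   &lt;!-- Copy anything that has no more specific template. -->
   &lt;xsl:mode on-no-match="shallow-copy"/>

   &lt;xsl:template match="*">
      &lt;xsl:copy>
         &lt;xsl:copy-of select="@*"/>
         &lt;!-- One new child element per attribute value -->
         &lt;xsl:apply-templates select="@*" mode="attr-to-elements"/>
         &lt;xsl:apply-templates select="node()"/>
      &lt;/xsl:copy>
   &lt;/xsl:template>

   &lt;xsl:template match="@*" mode="attr-to-elements">
      &lt;!-- Capture these before the context item becomes a string -->
      &lt;xsl:variable name="attr-name" select="local-name(.)"/>
      &lt;xsl:variable name="parent-namespace" select="namespace-uri(..)"/>
      &lt;xsl:for-each select="tokenize(normalize-space(.), ' ')">
         &lt;xsl:element name="{$attr-name}" namespace="{$parent-namespace}">
            &lt;xsl:value-of select="."/>
         &lt;/xsl:element>
      &lt;/xsl:for-each>
   &lt;/xsl:template>
&lt;/xsl:stylesheet></programlisting>
            <para>Applied to an element such as <code>&lt;item type="a b"/></code>, the sketch
               returns the element with its <code>@type</code> intact and with two new children,
               <code>&lt;type>a&lt;/type></code> and <code>&lt;type>b&lt;/type></code>.</para>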
            <para>Normal expansion builds on terse expansion by interrogating networked files more
               closely. Any errors that were reported during the terse stage but were suppressed to
               avoid clutter are enabled. </para>
            <para>Verbose expansion generally attends to procedures that are complex, or are not
               critical to validation. For example, a <code><link linkend="element-model"
                     >&lt;model&gt;</link></code> of a class-1 file will be checked, to find
               references that one has but the other lacks. A class-1 <code><link
                     linkend="element-redivision">&lt;redivision&gt;</link></code> will be analyzed,
               to make sure that the two transcriptions are identical. A catalog file in the same
               directory will be checked, to see if it has faulty entries.</para>
            <para>Many errors lend themselves to solutions that can be recommended by the TAN
               function library. Some solutions are returned to the Schematron validation method as
               Schematron Quick Fixes (SQFs). XML editors that are equipped to handle SQFs (e.g.
               oXygen XML Editor) can then prompt users to fix an errant section with a quick
               replacement. For example, if text has not been NFC Unicode-normalized, an SQF will
               allow a user to make the change in two clicks. Thus, TAN validation does not merely
               tell you what the problems are; it tries to help fix them.</para>
            <para>The term "expansion" describes the process but possibly not the output. If the
               global parameter <code>$is-validation</code> is true, then in the course of expanding
               the file the TAN templates will abandon any parts that are no longer needed. The
               output is normally much smaller than the input file, restricted as it is to the root
               element and elements that have been marked with errors, warnings, or fixes. So
               although during validation the file is really being expanded, at the end only a small
               portion of the expanded file is returned to the Schematron processor, to expedite
               validation. But if <code>$is-validation</code> is false (the default value, if the
               file is not being validated), the entire expanded file and its dependencies are
               returned. Such output can be very useful in applications.</para>
            <para>The description above of file expansion is necessarily generalized. For details
               consult the function library, particularly <link
                  xlink:href="../functions/incl/TAN-core-expand-functions.xsl"/>. </para>
            <para>The validation rules have been tested not only on the files in the
                  <code>examples</code> subdirectory, but more importantly upon the files in
                  <code>functions/errors</code>. The files there attempt to provide at least one
               example of every error, and they are validated in reverse: a file is valid if and
               only if every error has a corresponding comment signaling the error.</para>
         </section>
         <section>
            <title>Sharing TAN files</title>
            <para>TAN files have been designed to be shared and linked, just like any network of
               files. Most often, TAN files will be created and distributed as collections, not
               single files.</para>
            <para>One way to distribute a collection is by making it available as a repository via
               Git or some other version control software (VCS). This approach has many advantages.
               The files become available to anyone who wants them, and the editorial history is
               preserved. VCS features and tools are extremely fast and useful, and they allow users
               to modify TAN collections without impacting the original source.</para>
            <para>Collections may also be distributed through shared syncing services (e.g., Drive,
               Box, or Dropbox), or put on a Web server. On a Web server, it may be difficult for
               users to browse or download the collection wholesale, so you may wish to expose the
               collection as a compressed ZIP archive. This saves on your server's bandwidth, and it
               still exposes the files for XML processing. But a ZIP archive is not suitable for
               linking from one TAN file to another, nor is it appropriate as a <code><link
                     linkend="element-master-location">&lt;master-location></link></code>. Unpacking
               a compressed file requires writing to the disk, which is considered to be a security
               risk during validation, and so is disallowed. Such zipped archives are good ways to
               distribute a collection, but they should not be used as a primary repository or a
               master location.</para>
         </section>
         <section xml:id="tan-stylesheets-and-function-library">
            <title>Doing things with TAN files</title>
            <para>TAN files are suited for dozens of types of applications, many of which at this
               point are only imagined or being written. The subdirectory <code>applications</code>
               is populated with folders named with actions you might want to perform on a TAN file,
               and they contain XSLT stylesheets that give you but a taste of what is possible.
               Because the applications in that directory are still under development, this section
               is devoted not to the specifics of those applications but to the theoretical
               background behind practical applications of TAN files. It is aimed particularly at
               those readers who are comfortable working with XSLT or related XML technologies, and
               want to do something important and useful with their TAN files.</para>
            <para>The Schematron validation process was designed with a view to the next steps in
               practical applications. The extensive function library upon which validation is based
               provides a foundation for a variety of applications. When developing an application,
               the first order of business is normally to find an entry point in the
                  <code>functions</code> subdirectory to the TAN function library. In that
               directory, each XSLT file is named after one of the TAN formats. Point via
                  <code>&lt;xsl:include></code> to the file that most resembles your main input.<note>
                  <para>You could also try to fetch the TAN library via
                     <code>&lt;xsl:import></code>, but results may be erratic, particularly if you
                     have not put the import command in the right order, or if templates in your
                     master stylesheet override templates in the TAN library.
                        <code>&lt;xsl:include></code> is always a more certain option.</para>
               </note> If you point to <link xlink:href="../functions/TAN-A-functions.xsl"/>, you
               will have most of the functions and templates used for both class-1 and class-2
               files. It tends to be a good default entry point if you are uncertain which master
               function file to use.</para>
            <para>It is also common to include <link
                  xlink:href="../functions/TAN-extra-functions.xsl"/>, which is the entry point for
               all TAN functions, global variables, and templates that do not play a role in
               Schematron validation. Those extra functions include many global variables that are
               excluded from the core TAN library, so as not to encumber Schematron
               validation.</para>
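            <para>A starting point for an application stylesheet might therefore look like the
               following sketch. The relative paths assume that your stylesheet sits in a directory
               that is a sibling of <code>functions</code>, and the <code>version</code> value is
               an assumption; adjust both to your setup.</para>
            <programlisting language="xml">&lt;xsl:stylesheet version="3.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:tan="tag:textalign.net,2015:ns"
   exclude-result-prefixes="#all">

   &lt;!-- Entry point to the core library: functions and templates for class-1 and class-2 files -->
   &lt;xsl:include href="../functions/TAN-A-functions.xsl"/>

   &lt;!-- Functions, global variables, and templates that play no role in Schematron validation -->
   &lt;xsl:include href="../functions/TAN-extra-functions.xsl"/>

   &lt;!-- Your own parameters, variables, and templates follow here. -->

&lt;/xsl:stylesheet></programlisting>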
            <para>You should also pay attention to the files in the subdirectory
                  <code>parameters</code>. Some of the global parameters there can be used
               profitably to change the way an application runs.</para>
            <para>All XSLT transformations require at least four components:<orderedlist>
                  <listitem>
                     <para>an input XML file</para>
                  </listitem>
                  <listitem>
                     <para>an XSLT file</para>
                  </listitem>
                  <listitem>
                     <para>a URL for the output</para>
                  </listitem>
                  <listitem>
                     <para>an XSLT engine (e.g., Saxon HE) to process #1 against #2 and send the
                        output to #3.</para>
                  </listitem>
               </orderedlist></para>
            <para>Although #1 is the principal or catalyzing input, it need not be the main input.
               Sometimes an XSLT application is written with an eye toward non-XML as the main
               input. In such cases, it is impossible for the main input to be the catalyzing input.
               Furthermore, although there is only one principal output document, an application may
               need to create many other output documents. Those are normally created through
                  <code>&lt;xsl:result-document></code>. So in any XSLT operation, there are really
               two possible types of input and two types of output. We use the terms
                  <emphasis>catalyzing input</emphasis> for #1 and <emphasis>secondary
                  input</emphasis> for input that is added during the process. We use the term
                  <emphasis>primary output</emphasis> for #3 and <emphasis>secondary
                  output</emphasis> for any other output created along the way. The terms
                  <emphasis>primary</emphasis> and <emphasis>secondary</emphasis> refer only to
               their position in the process, not their importance. Indeed, there are XSLT
               applications where the secondary input and secondary output are far more important
               than the catalyzing input or primary output. In its documentation, an XSLT file
               should indicate whether the <emphasis>main input</emphasis> is the catalyzing input,
               the secondary input, or both, and whether the <emphasis>main output</emphasis> is the
               primary output, the secondary output, or both.</para>
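            <para>The four roles can be illustrated with a short sketch that uses only standard
               XSLT instructions. The parameter names, file names, and output element names are
               hypothetical and serve only to mark where each kind of input and output enters or
               leaves the process.</para>
            <programlisting language="xml"><![CDATA[<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema"
   exclude-result-prefixes="#all" version="3.0">

   <!-- Hypothetical parameters: location of secondary input and of secondary output. -->
   <xsl:param name="secondary-input-uri" as="xs:string" select="'extra.xml'"/>
   <xsl:param name="secondary-output-uri" as="xs:string" select="'extra-report.xml'"/>

   <!-- The document matched here is the catalyzing input (#1). -->
   <xsl:template match="/">
      <!-- Secondary input: fetched during the process; a relative URI is
           resolved against the location of this stylesheet. -->
      <xsl:variable name="secondary-input" select="doc($secondary-input-uri)"/>

      <!-- Primary output: sent by the XSLT engine to #3. -->
      <primary-output catalyst="{name(*)}"/>

      <!-- Secondary output: written to its own URI along the way. -->
      <xsl:result-document href="{$secondary-output-uri}">
         <secondary-output source="{name($secondary-input/*)}"/>
      </xsl:result-document>
   </xsl:template>

</xsl:stylesheet>]]></programlisting>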
            <para>When developing an application where the main input is a TAN file, it is often
               best to start with it in its resolved or expanded state. (See <xref
                  xlink:href="#validating_tan_files"/> on resolving and expanding TAN files.) If
               that TAN file is the catalyzing input, use the global variables <code><link
                     linkend="variable-self-resolved">$self-resolved</link></code> and <code><link
                     linkend="variable-self-expanded">$self-expanded</link></code>. If it is
               secondary input, use <code><link linkend="function-resolve-doc"
                     >tan:resolve-doc</link>()</code> and <code><link linkend="function-expand-doc"
                     >tan:expand-doc</link>()</code>.</para>
            <para>For a class-2 file, <code><link linkend="variable-self-expanded"
                     >$self-expanded</link></code> or the output of <code><link
                     linkend="function-expand-doc">tan:expand-doc</link>()</code> is a sequence of
               documents, starting with an expansion of the class-2 file itself, followed by
               expansions of its dependencies (TAN-T or TAN-mor). Its expanded class-1 sources will
               be tokenized where required, and marked with anchors for each reference in the
               class-2 file. If a token straddles leaf <code><link linkend="element-div"
                     >&lt;div&gt;</link></code>s, the token will be reconstituted by moving the tail
               of the token up. These expanded sources are excellent candidates for other types of
               transformation. For example, HTML pages can be created to integrate class-2
               annotations and their class-1 sources, in a variety of ways.</para>
            <para>At the verbose level, an expanded TAN-A file will conclude its <code><link
                     linkend="variable-self-expanded">$self-expanded</link></code> sequence with one
               or more documents with a root element <code>&lt;TAN-T_merge></code>, one file per
               detected work. A TAN-T_merge file has one <code><link linkend="element-head"
                     >&lt;head&gt;</link></code> per class-1 source that has been merged, and the
                     <code><link linkend="element-body">&lt;body></link></code> contains a master
               set of <code><link linkend="element-div">&lt;div&gt;</link></code>s that merge all
               the other sources' <code><link linkend="element-div">&lt;div&gt;</link></code>s that
               share the same reference, after all <code><link linkend="element-adjustments"
                     >&lt;adjustments&gt;</link></code> have been made. Each leaf <code><link
                     linkend="element-div">&lt;div&gt;</link></code> in each source appears in the
               appropriate place, but as a child of a common <code><link linkend="element-div"
                     >&lt;div&gt;</link></code> that encompasses all other leaf <code><link
                     linkend="element-div">&lt;div&gt;</link></code>s with the same reference. For
               each version's leaf div, <code><link linkend="attribute-type">@type</link></code> is
               changed to <code>#version</code>, and other markers signify which source it
               corresponds to. A TAN-T_merge file is a good basis for building parallel displays or
               statistical analyses. These merge files can be created on an ad hoc basis through the
               function <code><link linkend="function-merge-expanded-docs"
                     >tan:merge-expanded-docs</link>()</code>, applied to individual class-1 files,
               after expansion.</para>
            <para>If you are fetching other TAN files as secondary input, and you want to work with
               them, use <code><link linkend="function-resolve-doc">tan:resolve-doc</link>()</code>
               and <code><link linkend="function-expand-doc">tan:expand-doc</link>()</code>, which
               will put the files in their resolved and expanded states. You must resolve a TAN file
               before you try to expand it.</para>
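            <para>The order of operations can be sketched as follows. This is an illustration
               under assumptions, not a definitive recipe: the parameter, the variable names, and
               the include path are hypothetical, and the one-argument forms of both functions are
               assumed. Consult the function documentation for the signatures available in your
               version of the library.</para>
            <programlisting language="xml"><![CDATA[<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema"
   xmlns:tan="tag:textalign.net,2015:ns"
   exclude-result-prefixes="#all" version="3.0">

   <!-- ASSUMPTION: hypothetical relative path to the TAN function library. -->
   <xsl:include href="../functions/TAN-A-functions.xsl"/>

   <!-- Hypothetical parameter: a TAN file to be used as secondary input. -->
   <xsl:param name="my-tan-uri" as="xs:string" select="'my-tan-file.xml'"/>

   <xsl:variable name="my-tan-raw" select="doc($my-tan-uri)"/>
   <!-- Resolve first... (ASSUMPTION: one-argument form) -->
   <xsl:variable name="my-tan-resolved" select="tan:resolve-doc($my-tan-raw)"/>
   <!-- ...then expand the resolved file. (ASSUMPTION: one-argument form) -->
   <xsl:variable name="my-tan-expanded" select="tan:expand-doc($my-tan-resolved)"/>

   <xsl:template match="/">
      <expansion-report>
         <!-- An expansion may be a sequence of documents; list each root element. -->
         <xsl:for-each select="$my-tan-expanded">
            <doc root="{name(*)}"/>
         </xsl:for-each>
      </expansion-report>
   </xsl:template>

</xsl:stylesheet>]]></programlisting>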
            <para>If you wish to create a TAN file as output (whether primary or secondary), we
               recommend that you prepare a skeleton TAN file ahead of time, introduce that skeleton
               as secondary input, infuse it with the new content, and let it become the primary or
               secondary output. Because the application you are using to create a TAN file is
               responsible for creating that file, and because responsibility for TAN files should
               be documented, the algorithm used to create that new TAN file should be declared in
               the <code><link linkend="element-vocabulary-key">&lt;vocabulary-key></link></code>
               and credited with a <code><link linkend="element-resp">&lt;resp></link></code>, and a
                     <code><link linkend="element-change">&lt;change></link></code> should be
               entered in the change log. Users of the file will be warned, during Schematron
               validation, that the last change was made by an algorithm.</para>
            <para>If you are working with a TAN file as catalyzing input, you may want to take
               advantage of some other global variables derived from its key files (see <xref
                  linkend="inclusions-and-vocabularies"/>):</para>
            <para>
               <table frame="all">
                  <title>Global variables for networked files</title>
                  <tgroup cols="4">
                     <colspec colname="c1" colnum="1" colwidth="1.0*"/>
                     <colspec colname="c2" colnum="2" colwidth="1.0*"/>
                     <colspec colname="c3" colnum="3" colwidth="1.0*"/>
                     <colspec colname="newCol4" colnum="4" colwidth="1*"/>
                     <thead>
                        <row>
                           <entry/>
                           <entry>Raw (first document available)</entry>
                           <entry>Resolved</entry>
                           <entry>Expanded</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry><code><link linkend="element-inclusion"
                                 >&lt;inclusion></link></code></entry>
                           <entry>—</entry>
                           <entry><code><link linkend="variable-inclusions-resolved"
                                    >$inclusions-resolved</link></code></entry>
                           <entry>—</entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-vocabulary"
                                 >&lt;vocabulary></link></code></entry>
                           <entry>—</entry>
                           <entry><code><link linkend="variable-vocabularies-resolved"
                                    >$vocabularies-resolved</link></code></entry>
                           <entry>—</entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-source"
                              >&lt;source></link></code></entry>
                           <entry>—</entry>
                           <entry><code><link linkend="variable-sources-resolved"
                                    >$sources-resolved</link></code></entry>
                           <entry><code><link linkend="variable-self-expanded"
                                 >$self-expanded</link>[tan:TAN-T]</code></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-see-also"
                              >&lt;see-also></link></code></entry>
                           <entry><code><link linkend="variable-see-alsos-1st-da"
                                    >$see-alsos-1st-da</link></code></entry>
                           <entry><code><link linkend="variable-see-alsos-resolved"
                                    >$see-alsos-resolved</link></code></entry>
                           <entry>—</entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table>
            </para>
            <para>The column labeled "raw" lists variables that hold the first documents available,
               without alteration. Variables in the next column hold the resolved form, following
               the same process described above for <code><link linkend="variable-self-resolved"
                     >$self-resolved</link></code>. The resolved forms of <code><link
                     linkend="element-inclusion">&lt;inclusion></link></code> and <code><link
                     linkend="element-vocabulary">&lt;vocabulary></link></code> are sufficient for
               validation, so they do not have expanded versions. Expanded sources are always
               bundled with their class-2's <code><link linkend="variable-self-expanded"
                     >$self-expanded</link></code>.</para>
            <para>For relatively simple applications, a resolved file is sufficient. But even then,
               there will be places where you will want to fetch the vocabulary bound to a
               particular attribute or element. One of the more important functions to familiarize
               yourself with is <code><link linkend="function-vocabulary"
                  >tan:vocabulary()</link></code>, which can be used to get the IRI + name pattern
               of a specific node, or to get all the vocabulary available for a given type.</para>
            <para>Some developers will find even <code><link linkend="function-vocabulary"
                     >tan:vocabulary()</link></code> a hassle to use. Consider setting the global
               parameter <code>$distribute-vocabulary</code> (default <code>false</code>) to
                  <code>true</code>. When it is true, every IDref that appears will be
               imprinted with the corresponding IRI + name pattern of the vocabulary item it refers to.
               Exercise this option with care: such repetition will result in a document
               considerably larger than the original.</para>
         </section>
         <section xml:id="tan-applications">
            <title>Using TAN outside the Network</title>
            <para>The function library behind TAN is quite powerful, and it can be used in non-TAN
               applications. Below is a list of functions that have proved especially helpful.
               Some of the functions are not central to validation, so they must be retrieved through
                  <link xlink:href="../functions/TAN-extra-functions.xsl"/>. For a complete list of
               all functions, see <xref xlink:href="#variables-keys-functions-and-templates"
               />.</para>
            <para>
               <orderedlist>
                  <listitem>
                     <para><code><link linkend="function-batch-replace"
                           >tan:batch-replace()</link></code>: runs a sequence of regular expression
                        replacements on any string. The sequence is prepared by constructing a
                        series of <code>&lt;replace pattern="" replacement="" [flags=""]></code> elements
                        whose attributes follow the rules of <code><link linkend="function-replace"
                              >tan:replace()</link></code> or <code>fn:replace()</code>.</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="function-chop-string"
                           >tan:chop-string()</link></code>: changes a string into a sequence of
                        characters, as defined in TAN (i.e., combining characters are always kept
                        with the base character). It is roughly equivalent to the XPath expression
                           <code>for $i in fn:string-to-codepoints(.) return
                           fn:codepoints-to-string($i)</code>.</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="function-collate">tan:collate()</link></code>: like
                              <code><link linkend="function-diff">tan:diff()</link></code>, but
                        applied to any number of strings. The results are treated much like a
                        collation of manuscript readings, with the output XML fragment tethered to
                        sigla corresponding to the input strings. The function can be used to
                        optimize the order of the input strings, and to compute pairwise similarity
                        of each string.</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="function-copy-indentation"
                              >tan:copy-indentation</link>()</code>: applies the white-space
                        indentation of an element to any other XML fragment. Useful for when you
                        want to insert items in an XML file and preserve/imitate its
                        indentation.</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="function-diff">tan:diff()</link></code>: compares any
                        two strings for differences. Includes an option to mark the changes
                        letter-for-letter, or merely word-for-word (easier to read in some
                        contexts). This function, which was written under the assumption that the
                        input strings would have some resemblance, has been used successfully on
                        pairs of strings as long as 5M characters.</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="function-duplicate-items"
                              >tan:duplicate-items()</link></code>: like <code><link
                              linkend="function-duplicate-values"
                           >tan:duplicate-values()</link></code>, but applied to any item. If a
                        node, duplication is determined based on whether it is deeply equal to any
                        other node.</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="function-duplicate-values"
                              >tan:duplicate-values()</link></code>: finds distinct items in a
                        sequence whose values are repeated in the sequence. This function
                        complements <code>fn:distinct-values()</code>.</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="function-fill">tan:fill()</link></code>: repeats a
                        string a given number of times. Helpful for formatting plain-text
                        output.</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="function-get-chars-by-name"
                              >tan:get-chars-by-name()</link></code>: retrieves Unicode characters
                        based upon words in their name.</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="function-glob-to-regex"
                           >tan:glob-to-regex()</link></code>: changes a glob-like expression
                        (normally used for filenames) into a regular expression (e.g.,
                           <code>*.*</code> becomes <code>.*\..*</code>).</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="function-lang-code">tan:lang-code()</link></code>:
                        retrieves an ISO 639-3 code for a language of a given name.</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="function-lang-name">tan:lang-name()</link></code>:
                        finds the name of a language, given its ISO 639-3 code.</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="function-median">tan:median()</link></code>:
                        retrieves the median from a sequence of numbers.</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="function-most-common-item"
                              >tan:most-common-item()</link></code>: from a sequence of items,
                        returns the one that occurs most frequently.</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="function-most-common-item-count"
                              >tan:most-common-item-count</link>()</code>: returns the number of
                        times the most common item appears in a sequence.</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="function-no-outliers"
                           >tan:no-outliers()</link></code>: removes outliers from a sequence of
                        numbers.</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="function-outliers">tan:outliers()</link></code>:
                        returns only outliers from a sequence of numbers.</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="function-search-morpheus"
                              >tan:search-morpheus()</link></code>: retrieves lexico-morphological
                        data for Greek and Latin from the Morpheus service.</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="function-search-wikipedia"
                              >tan:search-wikipedia()</link></code>: retrieves a set number of
                        records from Wikipedia.</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="function-shallow-copy"
                           >tan:shallow-copy()</link></code>: returns a copy of a node to a set
                        depth. Useful for messages, to provide feedback on a particular element and
                        its attributes, without any descendants (which would make the message hard
                        to read).</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="function-uri-relative-to"
                              >tan:uri-relative-to()</link></code>: converts an absolute URI to a
                        relative one, based on some context URI.</para>
                  </listitem>
               </orderedlist>
            </para>
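            <para>To see why the XPath expression cited for <code>tan:chop-string()</code> is only
               a rough equivalent, consider the following sketch, which uses standard XPath
               functions alone and needs no TAN library. The sample string is hypothetical: the
               letter e, followed by U+0301 (combining acute accent), followed by the letter a. The
               plain expression yields three items, one per codepoint, and so separates the
               combining accent from its base character, which is the behavior the TAN function is
               described as avoiding.</para>
            <programlisting language="xml"><![CDATA[<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema"
   exclude-result-prefixes="#all" version="3.0">

   <!-- Hypothetical sample: "e" + U+0301 COMBINING ACUTE ACCENT + "a". -->
   <xsl:variable name="sample" as="xs:string" select="'e&#x301;a'"/>

   <xsl:template match="/">
      <chars>
         <!-- Split the string codepoint by codepoint, ignoring combining behavior. -->
         <xsl:for-each
            select="for $i in string-to-codepoints($sample) return codepoints-to-string($i)">
            <!-- Each item holds exactly one codepoint: 101, then 769, then 97. -->
            <char codepoint="{string-to-codepoints(.)}"/>
         </xsl:for-each>
      </chars>
   </xsl:template>

</xsl:stylesheet>]]></programlisting>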
            <para>Some numeral functions might prove useful:<orderedlist>
                  <listitem>
                     <para>Letter numerals ↔ integers: <code><link linkend="function-aaa-to-int"
                              >tan:aaa-to-int()</link></code>, <code><link
                              linkend="function-int-to-aaa">tan:int-to-aaa()</link></code></para>
                  </listitem>
                  <listitem>
                     <para>Roman numerals → integers: <code><link linkend="function-rom-to-int"
                              >tan:rom-to-int()</link></code> (reverse not available)</para>
                  </listitem>
                  <listitem>
                     <para>Greek numerals ↔ integers: <code><link linkend="function-grc-to-int"
                              >tan:grc-to-int()</link></code>, <code><link
                              linkend="function-int-to-grc">tan:int-to-grc()</link></code></para>
                  </listitem>
                  <listitem>
                     <para>Syriac numerals → integers: <code><link linkend="function-syr-to-int"
                              >tan:syr-to-int()</link></code> (reverse not available)</para>
                  </listitem>
                  <listitem>
                     <para>Hexadecimal ↔ decimal: <code><link linkend="function-hex-to-dec"
                              >tan:hex-to-dec()</link></code>, <code><link
                              linkend="function-dec-to-hex">tan:dec-to-hex()</link></code></para>
                  </listitem>
                  <listitem>
                     <para>String range ↔ integers: <code><link
                              linkend="function-expand-numerical-sequence"
                              >tan:expand-numerical-sequence()</link></code>, <code><link
                              linkend="function-integers-to-sequence"
                              >tan:integers-to-sequence()</link></code></para>
                  </listitem>
               </orderedlist></para>
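            <para>A call to two of the numeral functions might look like the sketch below. It
               assumes that each function accepts a single string and returns an integer, and that
               both are available through the hypothetical include path shown. Verify the
               signatures against the function documentation before relying on them.</para>
            <programlisting language="xml"><![CDATA[<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:tan="tag:textalign.net,2015:ns"
   exclude-result-prefixes="#all" version="3.0">

   <!-- ASSUMPTION: hypothetical relative path to the TAN function library. -->
   <xsl:include href="../functions/TAN-A-functions.xsl"/>

   <xsl:template match="/">
      <numerals>
         <!-- ASSUMPTION: single string in, integer out. -->
         <roman value="{tan:rom-to-int('xiv')}"/>
         <letter value="{tan:aaa-to-int('c')}"/>
      </numerals>
   </xsl:template>

</xsl:stylesheet>]]></programlisting>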
         </section>
      </chapter>
   </part>
   <part xml:id="appendixes">
      <title>Appendixes</title>
      <xi:include href="inclusions/elements-attributes-and-patterns.xml"/>
      <xi:include href="inclusions/vocabularies.xml"/>
      <xi:include href="inclusions/variables-keys-functions-and-templates.xml"/>
      <xi:include href="inclusions/errors.xml" xml:id="errors"/>
   </part>
</book>
