<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://docbook.org/xml/5.0/rng/docbook.rng" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="http://docbook.org/xml/5.0/rng/docbook.rng" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<book xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink"
   xmlns:xi="http://www.w3.org/2001/XInclude" version="5.0">
   <info>
      <title>Guide to the Text Alignment Network, Version 2021</title>
      <legalnotice>
         <info>
            <title>Text Alignment Network: Official Guidelines</title>
            <copyright>
               <year>2015-present</year>
               <holder>Joel Kalvesmaki</holder>
            </copyright>
            <author>
               <personname>Joel Kalvesmaki</personname>
               <email>kalvesmaki@gmail.com</email>
            </author>
         </info>
         <remark>All software, code, and dependencies (e.g., applications, functions, schemas,
            utilities, vocabularies) are released under a GNU General Public License, <link
               xlink:href="https://opensource.org/licenses/GPL-3.0"
               >https://opensource.org/licenses/GPL-3.0</link>.</remark>
         <remark>All other materials (such as this document), unless otherwise specified, are
            licensed under a Creative Commons Attribution 4.0 International License: <link
               xlink:href="http://creativecommons.org/licenses/by/4.0/"
               >http://creativecommons.org/licenses/by/4.0/</link>.</remark>
      </legalnotice>
      <revhistory>
         <info>
            <releaseinfo>Latest stable version: <link
                  xlink:href="http://textalign.net/release/TAN-2021/guidelines/"
                  >http://textalign.net/release/TAN-2021/guidelines/</link>.</releaseinfo>
            <releaseinfo>Development version: <link
                  xlink:href="https://github.com/textalign/TAN-2021/tree/dev"
                  >https://github.com/textalign/TAN-2021/tree/dev</link></releaseinfo>
         </info>
         <revision>
            <revnumber>Version 2021 (alpha)</revnumber>
            <date>2021-09-07</date>
            <revdescription>
               <para>Formats: <link
                     xlink:href="http://textalign.net/release/TAN-2020/guidelines/xhtml/index.xhtml"
                     >HTML</link> • <link
                     xlink:href="http://textalign.net/release/TAN-2020/guidelines/pdf/TAN-2020-guidelines.pdf"
                     >PDF</link> • <link
                     xlink:href="http://textalign.net/release/TAN-2020/guidelines/main.xml"
                     >Docbook</link> (master)</para>
               <warning>
                  <para>In case of contradictions, apparent or not, between these guidelines and the
                     core TAN files, priority should be given first to the RELAX-NG schemas (compact
                     syntax), then to the functions, and finally to these guidelines.</para>
               </warning>
            </revdescription>
         </revision>
      </revhistory>
   </info>
   <part xml:id="general_overview">
      <title>General overview</title>
      <chapter>
         <title>Introduction</title>
         <section xml:id="tan_definition">
            <title>Overview</title>
            <para>The Text Alignment Network (TAN) is a framework that allows users, working
               independently and collaboratively, to share, find, create, edit, and explore digital
               texts and annotations. </para>
            <para>A customized extension of <link xlink:href="http://tei-c.org">Text Encoding
                  Initiative (TEI)</link> XML, TAN is particularly suited for organizing and
               aligning texts with multiple versions (copies, translations, paraphrases), and for
               creating and editing text annotations such as quotations, translation clusters
               (word-to-word), and linguistic features. </para>
            <para>The foundation of TAN is a suite of XML formats, each designed for a specific
               task. The extensive validation routines maximize the syntactic and semantic
               interoperability of texts, annotations, and language resources. TAN comes with
               applications and utilities that open new frontiers in scholarly publishing, research,
               and teaching. </para>
            <para>Why use TAN? </para>
            <para><emphasis role="bold">Extensive error checking</emphasis>. Built-in TAN validation
               rules go well beyond the customary error-checking performed by other formats. Files
               linked in the network "talk" to each other, to let users know about changes and
               updates. More than one hundred types of content-based errors are checked. Through
               Schematron Quick Fixes, many of the problems can be corrected in a matter of
               seconds.</para>
            <para><emphasis role="bold">Time-saving utilities</emphasis>. Enjoy enhanced editing
               functions in Oxygen XML Editor's Author mode. Highly customizable TAN utilities help
               you create, edit, and maintain TEI and TAN files. For example: </para>
            <para><itemizedlist>
                  <listitem>
                     <para><emphasis>Body Builder</emphasis>: write rules to convert plain text or
                        Word docx files into a preferred TAN/TEI structure and markup.</para>
                  </listitem>
                  <listitem>
                     <para><emphasis>Body Remodeler</emphasis>: incrementally restructure a text to
                        imitate an existing TAN/TEI file. In conjunction with Oxygen Author tools,
                        this utility can save hours of labor in creating a collection of many
                        versions of the same work.</para>
                  </listitem>
                  <listitem>
                     <para><emphasis>Body Sync</emphasis>: update a TAN/TEI file so its
                        transcription exactly matches that of another TAN/TEI file.</para>
                  </listitem>
                  <listitem>
                     <para><emphasis>TAN-A-lm Builder</emphasis>: generate lexico-morphological data
                        for a TAN/TEI file.</para>
                  </listitem>
               </itemizedlist></para>
            <para><emphasis role="bold">Pathbreaking applications</emphasis>. Core TAN applications,
               written in XSLT, provide cutting-edge tools for textual research and analysis. For
               example: </para>
            <para><itemizedlist>
                  <listitem>
                     <para><emphasis>Diff+</emphasis>: identify, analyze, and visualize text
                        differences between any number of versions of a text.</para>
                  </listitem>
                  <listitem>
                     <para><emphasis>Parabola</emphasis>: juxtapose in a single interactive HTML
                        page all the versions of a work, along with annotations. </para>
                  </listitem>
                  <listitem>
                     <para><emphasis>Tangram</emphasis>: identify quotations, paraphrases, and
                        common text between two groups of texts.</para>
                  </listitem>
               </itemizedlist></para>
            <para><emphasis role="bold">Intuitive text referencing</emphasis>. Unlike TEI, HTML, or
               other markup systems that rely heavily upon arbitrary identifiers that can be
               difficult to navigate and maintain, TAN points to text portions using familiar
               reference systems, or user-customized tokenization rules.</para>
            <para><emphasis role="bold">Application development</emphasis>. TAN is built upon an
               extensive and robust XSLT function library, one of the few of its kind. Do you
               already use <link xlink:href="https://www.nltk.org/">Natural Language Toolkit</link>,
                  <link xlink:href="http://cltk.org/">Classical Language Toolkit</link>, or
               comparable packages in programming languages to develop tools for textual and
               linguistic research? Do you have to process, analyze, and transform texts that are in
               tree structures? With more than 250 public functions, covering a range of tasks, from
               numerics to maps, checksums to tree manipulation, the TAN function library might have
               everything you need, and more, and help you stay within an XML environment. Many TAN
               functions are extremely useful, even outside of TEI or TAN.</para>
            <para><emphasis role="bold">Semantic Web</emphasis>. TAN was designed at the outset to
               ensure that texts and their annotations would be rooted in the practices of the
               Semantic Web. Unlike many other formats, whose attribute values are almost always
               only human-readable, most TAN file components are tied to URIs, making them
               suitable for use in Semantic Web applications.</para>
         </section>
         <section>
            <title>Rationale and purpose</title>
            <para>Scholars frequently work with numerous versions of texts. Sometimes the original
               version has been lost, or survives only fragmentarily, and can be studied only
               through later translations, paraphrases, or quotations. Even when an original
               survives, its later versions are often worth study, revealing as they do something of
               how words, concepts, and works were preserved, altered, or combined by generations
               and cultures who created, read, and circulated the versions.</para>
            <para>Such textual comparison requires texts whose words, sentences, paragraphs, and
               other segments are aligned. Such alignment can be challenging. Some versions might be
               defective, or follow an idiosyncratic sequence. One editor may have divided the text
               according to a system not easily applied to other versions. Identifying which words
               or phrases in a translation and its original correspond might result in complex,
               overlapping spans. And even larger segments such as sentences and paragraphs may not
               line up well. Further, every version of a text is part of a much larger, complex
               history of text reuse, and a complete study of that context requires engagement with
               other works and other languages, and collaboration across projects and fields of
               study.</para>
            <para>Text Alignment Network (TAN) XML facilitates the exchange of multiple versions of
               texts and annotations on those texts. TAN syntax is suitable for humans to read and
               edit, expressive enough to allow scholars to register doubt and nuance, and
               sufficiently structured to permit complex computer-based queries across independent
               datasets. TAN is not a single format, but rather a suite of formats, one task per
               format. Because nearly all TAN data must be expressed in a way that computers can
               parse, the information can be used in semantic web applications (see <xref
                  linkend="rdf_and_lod"/>).</para>
            <para>TAN has been designed to support two kinds of scholarly activity: <emphasis
                  role="bold">creation</emphasis> and <emphasis role="bold"
               >research</emphasis>.</para>
            <para>When we <emphasis role="bold">create</emphasis> our primary sources or analyze
               them, we normally want what we create to be useful to our colleagues. TAN was
               designed to assist scholarly creative activities such as:</para>
            <para>
               <itemizedlist>
                  <listitem>
                     <para>Creating and sharing a transcription of a particular version of a textual
                        work so that it is more likely to align with any other TAN version of that text
                        created by someone else;</para>
                  </listitem>
                  <listitem>
                     <para>Creating an index of quotations that is semantically rich and can be
                        applied to any other version of the quoting or quoted works;</para>
                  </listitem>
                  <listitem>
                     <para>Specifying exactly (e.g., word-for-word) where a source and its
                        translation correspond, even with overlapping or ambiguous relationships, or
                        where doubt or alternative possibilities of alignment need to be
                        expressed;</para>
                  </listitem>
                  <listitem>
                     <para>Listing the grammatical features of every word in a text or a language in
                        a way that allows it to be compared easily against other languages and
                        texts.</para>
                  </listitem>
               </itemizedlist>
            </para>
            <para>Shared TAN files form a decentralized, interoperable corpus of texts, a kind of
               Internet of primary sources and annotations. As this TAN-compliant corpus spreads
               into different linguistic, chronological, and geographical regions, third-party tools
               and applications can expand the repertoire of <emphasis role="bold"
                  >research</emphasis> questions beyond any single corpus, to help scholars
               fruitfully investigate broader, comparative questions such as:<itemizedlist>
                  <listitem>
                     <para>For classical Greek texts, how were words with the root -ιστημι ("stand")
                        translated into ancient Latin? In what specific ways did the vocabulary of
                        technical terms shift from pre-Christian translations into later, Christian
                        ones?</para>
                  </listitem>
                  <listitem>
                     <para>How does the reformed Chinese technique for translating Sanskrit Buddhist
                        texts, attested by Dao An (312-385 CE), compare to reforms in the seventh
                        and eighth centuries of Syriac translations of Greek texts?</para>
                  </listitem>
                  <listitem>
                     <para>How do Arabic translations of Greek texts from the Abbasid period differ
                        from contemporaneous translations from Sanskrit into Arabic?</para>
                  </listitem>
                  <listitem>
                     <para>Can an anonymous English translation of a modern French novel be
                        identified with known translators from that period?</para>
                  </listitem>
                  <listitem>
                     <para>How do present-day translations of official United Nations documents
                        differ across languages?</para>
                  </listitem>
               </itemizedlist></para>
            <para>Neither the TAN format nor its applications answer such questions. But they can be
               used to start to work on answers, because the TAN function library includes many
               cutting-edge algorithms that cannot be found in other programming libraries, whether
               XSLT or not. What the <link xlink:href="https://www.nltk.org/">Natural Language
                  Toolkit</link> (or the related <link xlink:href="http://cltk.org/">Classical
                  Language Toolkit</link>) is for digital humanists using Python, TAN aspires to be
               for those using XSLT. For more on the function library see <xref
                  xlink:href="#using-tan-functions"/>.</para>
         </section>
         <section>
            <title>About the format</title>
            <para>TAN differs from other text formats such as HTML, Microsoft Word, PDF, or Docbook.
               Each of those formats is interoperable only in the sense that any file can be
               reliably opened and displayed by the same software. Despite such software
               compatibility, the content, structured by each user, looks very different from one
               file to the next. If you receive from different people two versions of a particular
               literary work in the same file format (e.g., Word or PDF), there would be little
               likelihood that you could align them in a new document without a lot of extra work.
               These are presentation formats, designed to let the creator use his or her
               imagination to shape, structure, and present the material in highly stylized,
               creative ways. The formats are laissez-faire, concerned mainly to ensure that each
               component is rendered properly, without regard for the meaning of those components. </para>
            <para>Creating a text in TAN is like opening a word processor and telling it, "I don't
               care how the text looks. I want to ensure that it is in a meaningful structure that
               corresponds to any other version of that text. The appearance, which could take
               thousands of directions, can be worried about later." </para>
            <para>The closest analogue to the TAN formats is the XML format developed by the Text
               Encoding Initiative, whose design catalyzed and continues to inspire the development
               of TAN. <emphasis>TAN is, in fact, a customized extension of TEI</emphasis>. TAN
               takes a handful of TEI concepts and extends them via stand-off annotation, to allow
               for overlapping annotations, to engage with the Semantic Web, and to support
               cross-project interoperability. TAN reduces some of the repetition that tends to be
               necessary in TEI files. For more on comparisons between TAN and TEI see <xref
                  linkend="TEI"/>.</para>
            <para>Some other caveats:<itemizedlist>
                  <listitem>
                     <para>Although TAN comes with an extensive library of functions and templates,
                        it is not what most people think of as a tool or application. It is not
                        consumer, off-the-shelf software. It does not come with a graphical interface.
                        Rather, it is a package of XML resources, particularly in XSLT, that allows
                        programmers and developers to create customized applications and tools. If
                        you work with an XML editor like Oxygen, your editing experience will be
                        greatly enhanced by the TAN function library, which was designed in Oxygen,
                        and optimized for it.</para>
                  </listitem>
                  <listitem>
                     <para>The TAN formats are specialized. They are not meant to replace other
                        common text formats such as TEI, Docbook, and so forth, or other alignment
                        formats such as XLIFF or TMX. Converting a TAN file into these formats is
                        usually straightforward, but will usually entail loss. Conversely, most
                        conversions from one of these formats into TAN will not entail loss, but
                        will be imperfect or incomplete, because many of these formats lack the data
                        required by TAN. Conversion must be given careful thought, and can only be
                        semiautomated.</para>
                  </listitem>
                  <listitem>
                     <para>Each TAN format has a restricted field of inquiry, defined and explained
                        in these guidelines. TAN is not for everyone. For example, if you are
                        working on developing a transcription that imitates a particular print
                        edition, you are better off using only TEI, or a version of TEI that you
                        have customized. But once you want to bring that transcription into close
                        comparison with other versions and study it intertextually, then TAN might
                        be ideal.</para>
                  </listitem>
               </itemizedlist></para>
         </section>
         <section xml:id="tan_participation">
            <title>Participation</title>
            <para>Changes are made regularly to TAN, mainly in its <link
                  xlink:href="https://github.com/textalign/TAN-2021/tree/dev">development
                  branch</link>. If you have a TAN library, sharing it with other participants,
               particularly via Git, will help developers test any changes that have been made to
               the function library, and encourage others to contribute to your project.</para>
            <para>The TAN project is by no means finished. This version of TAN merely scratches the
               surface of what is possible. New participants to test, use, and develop TAN's
               schemas, functions, guidelines, and applications are welcome. Inquiries about
               participation should be sent to the project director, <link
                  xlink:href="http://kalvesmaki.com/">Joel Kalvesmaki</link>, by email:
                  <code>director</code> at <code>textalign.net</code>.</para>
            <para>Official announcements are made by <link
                  xlink:href="http://groups.google.com/group/textalign?hl=en">email (Google
                  Group)</link> and by <link xlink:href="https://twitter.com/textalign"
                  >Twitter</link>.</para>
         </section>
      </chapter>
      <chapter xml:id="gentle_guide">
         <title>Starting off with the TAN format</title>
         <para>If you think you are ready to jump in and get going, try <xref
               xlink:href="#local_setup"/>.  But if you are new to markup languages, or unfamiliar
            or uncomfortable with acronyms and technical terms such as <emphasis role="italic"
               >XML</emphasis>, <emphasis role="italic">RDF</emphasis>, <emphasis role="italic"
               >XPath</emphasis>, and <emphasis>Unicode</emphasis>, you should start with this
            chapter, which uses a simple example to illustrate the steps typically taken to create
            and edit TAN files, and to introduce new terminology. By the end of this chapter, you
            will have a sense of how to create and edit a small collection of TAN transcriptions and
               alignments.<footnote xml:id="transcription_and_transliteration">
               <para>In the TAN system, <termdef>a <firstterm>transcription</firstterm> is a plain
                     digital text that replicates a text found somewhere else, usually reproducing
                     its script and spelling</termdef>. The following—"In pluribus unum"—is a
                  (partial) transcription of a United States dollar. The term should be
                  distinguished from <termdef>a <firstterm>transliteration</firstterm>, which is a
                     transcription rendered in a script other than the original</termdef>. For
                  example, εν πλουριμπυς ουνεμ, would be a Greek transliteration of the previous
                  transcription.</para>
            </footnote></para>
         <para>The chapter touches on a number of general concepts that are discussed only briefly.
            If you find a particular term new or confusing, follow the prompts for further reading.
            If you are already familiar with basic markup concepts, you should at least skim through
            the chapter, because TAN approaches some old problems in new ways.</para>
         <section>
            <title>Creating TAN transcription and alignment data</title>
            <para>Let us take a simple example, that of aligning two English versions of the nursery
               rhyme <emphasis role="italic">Ring-a-ring-a-roses</emphasis>, sometimes known as
                  <emphasis role="italic">Ring around the Rosie</emphasis>. Our goal here is to
               publish two versions of the nursery rhyme in the TAN format so that they are most
               likely alignable with any other TAN version of the poem that might appear.<footnote>
                  <para>Although the TAN examples below look much like files in the
                        <code>examples</code> subdirectory of the TAN library, they have been
                     adjusted, to explain the formats better.</para>
               </footnote></para>
            <para>We begin by finding previously published versions that haven't been digitized. In
               this case we have taken an interest in the versions published in <link
                  xlink:href="http://lccn.loc.gov/12032709">1881</link> and <link
                  xlink:href="http://lccn.loc.gov/87042504">1987</link> (one published in the U.K.
               and the other in the U.S.). Each of these books has other rhymes, but we've decided to
               focus upon one nursery rhyme, so we type up (transcribe) that poem and nothing
                  else:<table frame="all">
                  <title>Ring around the Rosie</title>
                  <tgroup cols="2">
                     <colspec colname="c1" colnum="1" colwidth="1.0*"/>
                     <colspec colname="c2" colnum="2" colwidth="1.0*"/>
                     <thead>
                        <row>
                           <entry>1881 (U.K.) version</entry>
                           <entry>1987 (U.S.) version</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry>
                              <para>Ring-a-ring-a-roses,</para>
                              <para>A pocket full of posies;</para>
                              <para>Hush! Hush! Hush! Hush!</para>
                              <para>We're all tumbled down.</para>
                           </entry>
                           <entry>
                              <para>Ring-a-round the rosie,</para>
                              <para>A pocket full of posies,</para>
                              <para>Ashes! Ashes!</para>
                              <para>We all fall down.</para>
                           </entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table></para>
            <para>We must be sure to save each of the two transcriptions as plain text. Do not
               bother with a word processor (Word, OpenOffice, Google Docs, and so forth), which is
               too fancy for our needs. Word processors sometimes generate erroneous data, even when
               you export to plain text. And we are not concerned with italics, colors, fonts,
               margins, and so forth. We would be better off with a <link
                  xlink:href="http://en.wikipedia.org/wiki/Text_editor">text editor</link>, which
               opens and saves only text. But even those do not check to see if the rules of the TAN
               format have been followed. So the best tool is an <link
                  xlink:href="http://en.wikipedia.org/wiki/XML_editor">XML editor</link>, which, like
               a text editor, takes and creates only text. An XML editor is designed to follow the
               rules of XML, and so saves a lot of typing, and prevents many errors. More important,
               an XML editor will tell us when our TAN file is invalid, and will provide important
               help as we edit.<footnote>
                  <para>Software suitable for your needs comes in many styles and prices. In
                     addition to the links in the paragraph above, you may wish to visit the
                     comparative lists published on Wikipedia for both <link
                        xlink:href="http://en.wikipedia.org/wiki/Comparison_of_text_editors">text
                        editors</link> and <link
                        xlink:href="http://en.wikipedia.org/wiki/Comparison_of_XML_editors">XML
                        editors</link>. TAN was developed using <link
                        xlink:href="https://www.oxygenxml.com">Oxygen</link>, which is very
                     powerful. If you are a new user, you are likely to find it overwhelming. Take
                     advantage of tutorials and documentation associated with the XML editor you
                     have chosen. </para>
               </footnote></para>
            <para>Our first task is to get these two versions into separate files with the
               appropriate markup. Each TAN transcription file has two major parts: a head and a
               body. For now, we focus on only the second part, the body, as well as a few of the
               necessary preliminary lines that stand at the opening of the file, before both the
               head and the body. First, the 1881 (U.K.) version:
               <programlisting><emphasis role="bold">&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.rnc" 
    type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.sch" 
    type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2021" 
    id="tag:parkj@textalign.net,2015:ring01">
    &lt;head>
    . . . . . . .
    &lt;/head>
    &lt;body xml:lang="eng">
        &lt;div type="line" n="1"></emphasis>Ring-a-ring-a-roses,<emphasis role="bold">&lt;/div>
        &lt;div type="line" n="2"></emphasis>A pocket full of posies;<emphasis role="bold">&lt;/div>
        &lt;div type="line" n="3"></emphasis>Hush! Hush! Hush! Hush!<emphasis role="bold">&lt;/div>
        &lt;div type="line" n="4"></emphasis>We're all tumbled down.<emphasis role="bold">&lt;/div>
    &lt;/body>
&lt;/TAN-T></emphasis></programlisting> And now the 1987 (U.S.) version:
               <programlisting><emphasis role="bold">&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.rnc" 
   type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN.sch" 
   type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2021" 
   id="tag:parkj@textalign.net,2015:ring02">
   &lt;head>
   . . . . . . .
   &lt;/head>
   &lt;body xml:lang="eng">
      &lt;div type="l" n="1"></emphasis>Ring-a-round the rosie,<emphasis role="bold">&lt;/div>
      &lt;div type="l" n="2"></emphasis>A pocket full of posies,<emphasis role="bold">&lt;/div>
      &lt;div type="l" n="3"></emphasis>Ashes! Ashes!<emphasis role="bold">&lt;/div>
      &lt;div type="l" n="4"></emphasis>We all fall down.<emphasis role="bold">&lt;/div>
   &lt;/body>
&lt;/TAN-T></emphasis></programlisting>
            </para>
            <para>The examples above are in <emphasis role="bold">eXtensible Markup
                  Language</emphasis> (<emphasis role="bold">XML</emphasis>). XML lets you take a
               text or a collection of data and structure it with angle brackets, <code>&lt;</code>
               and <code>></code>. In the examples above, such markup is in boldface.</para>
            <para>Each file begins with a <emphasis role="bold">prolog</emphasis>, the first few
               lines that begin with <code>&lt;?</code>. The first line simply states that what
               follows is an XML document. The next two lines in each example are <emphasis
                  role="bold">processing instructions</emphasis> that point to the <emphasis
                  role="bold">schemas</emphasis>: files that will be used to check to see whether or
               not our XML follows TAN rules, a process called <emphasis role="bold"
                  >validation</emphasis>. We will skip the details of those first five lines. They
               will be identical, or nearly so, from one TAN file to the next. We can simply cut and
               paste them when we want to start a new TAN file.</para>
            <para>After the prolog comes an <emphasis role="bold">opening tag</emphasis>, signified
               by an angle bracket followed by a letter, here <code><link linkend="element-TAN-T"
                     >&lt;TAN-T></link></code>. That opening tag, <code>&lt;TAN-T...></code>, is
               answered by a <emphasis role="bold">closing tag</emphasis>, <code>&lt;/TAN-T></code>,
               the last line. An opening tag and a closing tag mark the beginning and the end of one
               of the most important parts of an XML document, the <emphasis role="bold"
                  >element</emphasis>. For now, you can think of an element as a chunk of data.
               Every element is marked by a pair of tags. In this example, <code><link
                     linkend="element-head">&lt;head></link></code> is answered by
                  <code>&lt;/head></code>, <code><link linkend="element-body"
                  >&lt;body></link></code> by <code>&lt;/body></code> and each
                  <code>&lt;div...></code> by <code>&lt;/div></code>. Any element that has an
               opening tag must have a closing tag. If an element doesn't have anything between its
               opening and closing tags, the two of them can be collapsed into a single tag. That
               is, <code>&lt;a>&lt;/a></code> can be simplified to <code>&lt;a/></code> (such empty
               elements are illustrated below).</para>
            <para>Elements and processing instructions are two of the seven basic XML ingredients,
               called <emphasis role="bold">nodes</emphasis>. The other five node types are text,
               comment, attribute, namespace, and document, some of which we will meet below. The
               element node is arguably the most important type. You will see it most often, and it
               is absolutely required for anything to be well-formed XML. Every XML file must have
               at least one element. (But it does not have to have attributes, text, comments, or
               processing instructions.)</para>
            <para>Elements nest within or beside each other, but they never overlap or interlock.
               That is, you <emphasis>cannot</emphasis> have
                  <code>&lt;a>&lt;b>overlap&lt;/a>&lt;/b></code>. The prohibition on overlapping
               elements is one of the cardinal rules of XML. The no-overlap rule keeps XML files
               tidy, and makes it easier for developers to write efficient applications. </para>
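            <para>As a minimal illustration (the element names <code>a</code> and <code>b</code>
               are placeholders, not TAN elements), the overlapping fragment can be repaired
               either by nesting one element inside the other or by placing the two side by
               side:<programlisting>&lt;!-- Not well-formed: a closes before b does -->
&lt;a>&lt;b>overlap&lt;/a>&lt;/b>

&lt;!-- Well-formed: b nests inside a -->
&lt;a>&lt;b>nested&lt;/b>&lt;/a>

&lt;!-- Well-formed: a and b are adjacent -->
&lt;a>first&lt;/a>&lt;b>second&lt;/b></programlisting></para>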
            <para>Any two nearby elements normally relate to each other either by one nesting inside
               the other or by one being adjacent to the other. Because of these different close
               relationships, every XML file can be thought of as a tree, with the root at the trunk
               and the nested elements as branches, terminating in metaphorical leaves—those
               elements that do not contain any other elements. It is helpful to use the tree
               metaphor when we describe the path we take, toward either the leaves or the root. In
               these guidelines, we may use the terms <emphasis role="italic">rootward</emphasis>
               and <emphasis role="italic">leafward</emphasis> when we want to trace movement up and
               down the levels of hierarchy in an XML document. You may also encounter the
               corresponding terms <emphasis>outermost</emphasis> and
               <emphasis>innermost</emphasis>. The metaphor is strengthened by the XML rule that
               there can be only one <emphasis role="bold">root element</emphasis>, i.e., the
               element that contains all other elements and is contained by none. In our examples
               above the root element is named <code>TAN-T</code>.</para>
            <para>An XML document tree can also be profitably thought of as a family. Family names
               provide the most common terminology to describe the relationship between elements. In
               our examples above, <code><link linkend="element-TAN-T">&lt;TAN-T></link></code> is
               the <emphasis role="bold">parent</emphasis> of <code><link linkend="element-body"
                     >&lt;body></link></code>, and <code><link linkend="element-body"
                     >&lt;body></link></code> is the parent of the four <code><link
                     linkend="element-div">&lt;div></link></code> elements. Likewise, each
                     <code><link linkend="element-div">&lt;div></link></code> is the <emphasis
                  role="bold">child</emphasis> of <code><link linkend="element-body"
                     >&lt;body></link></code>, and <code><link linkend="element-body"
                     >&lt;body></link></code> is the child of <code><link linkend="element-TAN-T"
                     >&lt;TAN-T></link></code>. Distant parental relationships can be described with
               the terms <emphasis role="bold">ancestor</emphasis> and <emphasis role="bold"
                  >descendant</emphasis>. <code><link linkend="element-TAN-T"
                  >&lt;TAN-T></link></code> is the ancestor of every element it encompasses, and
               every element encompassed by <code><link linkend="element-TAN-T"
                  >&lt;TAN-T></link></code> is its descendant. Paratactic relationships are also
               important. <code><link linkend="element-head">&lt;head></link></code> and <code><link
                     linkend="element-body">&lt;body></link></code> are <emphasis role="bold"
                  >siblings</emphasis> to each other, and every <code><link linkend="element-div"
                     >&lt;div></link></code> is a sibling to every other <code><link
                     linkend="element-div">&lt;div></link></code>. The terms "following" and
               "preceding" are the most common ways to describe the relationship of one sibling to
               another.</para>
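            <para>As a rough sketch, the family relationships in our first example can be laid out
               as follows. Attributes and text are omitted, and the labels on the right are
               explanatory annotations of ours, not part of any TAN
               file:<programlisting>&lt;TAN-T>                   root element; parent of head and body
    &lt;head>...&lt;/head>      child of TAN-T; preceding sibling of body
    &lt;body>                child of TAN-T; following sibling of head; parent of each div
        &lt;div>...&lt;/div>    child of body; descendant of TAN-T
        &lt;div>...&lt;/div>    following sibling of the first div
        &lt;div>...&lt;/div>
        &lt;div>...&lt;/div>
    &lt;/body>
&lt;/TAN-T></programlisting></para>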
            <para>You may notice that some characters are inside opening tags, but not closing ones.
               In the opening tags for the <code><link linkend="element-TAN-T"
                  >&lt;TAN-T></link></code>, <code><link linkend="element-body"
                  >&lt;body></link></code>, and <code><link linkend="element-div"
                  >&lt;div></link></code> elements there appear sets of pairs: a word and something
               within quotation marks, each of them separated by an equals sign. These stretches of
               text are called <emphasis role="bold">attributes</emphasis>. On the left side of the
               equals sign is the attribute name, and on the right side, within the quotation marks,
               is the attribute value. In the example above <code><link linkend="element-TAN-T"
                     >&lt;TAN-T></link></code> has three attributes, <code>@xmlns</code>,
                     <code><link linkend="attribute-TAN-version">@TAN-version</link></code>, and
                     <code><link linkend="attribute-id">@id</link></code> (it is customary to signal
               attributes by writing <code>@</code>). We will skip <code>@xmlns</code> for now. It
               looks like an attribute, but it's really a pseudo-attribute, because it specifies the
                  <emphasis role="bold">namespace</emphasis> of the XML file. Namespaces are an
               important but advanced topic, not discussed in this chapter. (See <xref
                  xlink:href="#namespace"/>.)</para>
            <para>The value of <code><link linkend="attribute-TAN-version"
                  >@TAN-version</link></code> indicates that the 2021 version of TAN is being used. </para>
            <para><code><link linkend="attribute-id">@id</link></code> is quite important. Every TAN
               file has an <code><link linkend="attribute-id">@id</link></code> that uniquely names
               and permanently identifies the document itself. It should not be changed, even if we
               make edits. If you change the filename or a copy of it winds up being incorporated
               into another project, a stable <code><link linkend="attribute-id">@id</link></code>
               will be quite important for finding it. An <code><link linkend="attribute-id"
                     >@id</link></code> should be unique. The only time the value should be repeated
               in a file is when you are pointing to another version of the same file.</para>
            <para>In the <code><link linkend="element-TAN-T">&lt;TAN-T></link></code>, the value of
                     <code><link linkend="attribute-id">@id</link></code> must always be what is
               called a tag uniform resource name (tag URN). A tag URN begins with
               <code>tag:</code>, followed by an email address or domain name that we own or owned.
               It is okay to use an obsolete address or domain; its purpose is to allow users to
               identify you, perhaps centuries from now, not to contact you. But you can use a
               current email address if you want to be contacted by those who use your file. After
               that email address or domain name comes a comma (no spaces) and a date on which we
               owned it, in the form of numbers for the year, year + month, or year + month + day,
               each item joined by hyphens, e.g., 2014-12-31. If we leave off a day value, it is
               assumed to be 01, the first of the month; if we leave off the month value it is
               assumed to be 01, January. </para>
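            <para>For illustration, here are three <code><link linkend="attribute-id"
                     >@id</link></code> values that differ only in how the date is written. The
               final name, <code>example</code>, is hypothetical. Under the defaults just
               described, each date points to midnight at the start of January 1,
               2015:<programlisting>id="tag:parkj@textalign.net,2015:example"
id="tag:parkj@textalign.net,2015-01:example"
id="tag:parkj@textalign.net,2015-01-01:example"</programlisting>Because tag URNs are generally
               compared as plain strings, it is probably safest to pick one form of the date and
               use it consistently.</para>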
            <para>In the examples above, <code>parkj@textalign.net,2015</code> points to our fictive
               self, Jenny Park, who owned that particular email address on the stroke of midnight
               (Coordinated Universal Time) January 1, 2015. After that comes a colon, and then any
               name we wish to assign to the file. </para>
            <para>We have anticipated a simple collection of texts, so we've called the files
                  <code>ring01</code> and <code>ring02</code>. If we run out of names, or want to
               restart, we can simply use a new email-date preface, e.g.,
                  <code>parkj@textalign.net,2015-01-02</code>. Or we could change the way we build
               our tag URNs.</para>
            <para>Tag URNs are very useful. You do not need permission to create one. You don't need
               to register them. You are in control. You also signal who is responsible for the
               file. Hundreds of years from now, when that email will be defunct or perhaps owned by
               someone else, users might still be able to identify who was responsible.</para>
            <para>The element <code><link linkend="element-body">&lt;body></link></code> contains
               our transcription. <code><link linkend="attribute-xmllang">@xml:lang</link></code>,
               required, specifies the principal language of the transcribed text. We use the
               standard 3-letter abbreviation for English. We could have used <code>en</code>, but
               the 2-letter convention covers far fewer languages. (See <xref
                  xlink:href="#language"/> for more.) </para>
            <para>Our transcription has been divided into four <code><link linkend="element-div"
                     >&lt;div></link></code> elements. How we divide up the work is entirely up to
               us. But we must make sure that every bit of text is enclosed by a leaf <code><link
                     linkend="element-div">&lt;div></link></code> (i.e., one that contains no other
                     <code><link linkend="element-div">&lt;div></link></code>). Every <code><link
                     linkend="element-div">&lt;div></link></code> must be the parent of only other
                     <code><link linkend="element-div">&lt;div></link></code>s, or none at all. No
                     <code><link linkend="element-div">&lt;div></link></code> may mix text and other
               elements. An exception is made for text that is nothing but space (the space bar, the
               tab, or the new line). Space-only text can be mixed with elements as needed, which
               means that a TAN file can be indented however you like, without changing its meaning. </para>
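            <para>The rule can be sketched with two fragments. The division type
               <code>stanza</code> is a hypothetical value introduced only for this illustration;
               it does not appear in our example
               files:<programlisting>&lt;!-- Permitted: the outer div holds only divs, and each leaf div holds only text -->
&lt;div type="stanza" n="1">
    &lt;div type="line" n="1">Ring-a-ring-a-roses,&lt;/div>
    &lt;div type="line" n="2">A pocket full of posies;&lt;/div>
&lt;/div>

&lt;!-- Not permitted: the outer div mixes text with a child div -->
&lt;div type="stanza" n="1">Ring-a-ring-a-roses,
    &lt;div type="line" n="2">A pocket full of posies;&lt;/div>
&lt;/div></programlisting>In the first fragment, the line breaks and indentation between the
               <code>&lt;div></code> elements are space-only text, so they are allowed.</para>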
            <para>The values of <code><link linkend="attribute-type">@type</link></code> and
                     <code><link linkend="attribute-n">@n</link></code> indicate, respectively, the
               type of division and the name of the division. We have used <code>line</code> in the
               first example, but we could easily have also used <code>l</code> (as we did in the
               second) or <code>ln</code> or any other phrase that we think will be intuitive to
               other users. The value is arbitrary, but gets explained by what is in the header (we
               will see how below). We have used Arabic numerals for the values of <code><link
                     linkend="attribute-n">@n</link></code>, but the value, once again, could have
               been anything. Here we've opted for a reference system that seems intuitive and will
               most likely apply to multiple versions of the work. But the Arabic numerals are not
               required. We could have used Roman numerals, or some other numbering or naming scheme
               that is standard in the field. The idea is to use the term that is most like what
               other people encoding a different version of the same text might use.</para>
            <para>Aside from the <code><link linkend="element-head">&lt;head></link></code> element
               (discussed below), that's all we need in the TAN-T transcription. We can now move to
               alignment and annotation.</para>
            <para>We now turn to a second TAN format, TAN-A. Whereas the first two examples, TAN-T,
               had to do with texts and transcriptions, TAN-A has to do with alignment and
               annotation. The TAN-A format allows us to align and annotate as many transcriptions
               as we wish, and to make claims about them. Let's begin, once again temporarily
               skipping <code><link linkend="element-head">&lt;head></link></code>. Significant
               differences from the previous two TAN-T files are
               emphasized:<programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2021/schemas/<emphasis role="bold">TAN-A.rnc</emphasis>" 
    type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2021/schemas/TAN.sch" 
    type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;<emphasis role="bold">TAN-A</emphasis> xmlns="tag:textalign.net,2015:ns" TAN-version="2021" 
    id="tag:parkj@textalign.net,2015:<emphasis role="bold">ring-alignment</emphasis>">
    &lt;head>
    . . . . . . .
    &lt;/head>
    <emphasis role="bold">&lt;body/>
</emphasis>&lt;/TAN-A></programlisting></para>
            <para>In the prolog, the first line is identical to the first line of our transcription
               files. The second and third lines, the processing instructions, are identical, except
               that <code>href</code> of the first of these points to the validation file specific
               to the TAN-A format. Even the fourth line looks like the two TAN-T files, other than
               the new name for the root element, <code><link linkend="element-TAN-A"
                     >&lt;TAN-A></link></code>, and the new value for <code><link
                     linkend="attribute-id">@id</link></code>.</para>
            <para>The penultimate line, <code>&lt;body/></code>, is an empty element, and is
               equivalent to an opening tag immediately followed by a closing tag, i.e., <code><link
                     linkend="element-body">&lt;body></link>&lt;/body></code>. The alternative form,
                  <code>&lt;body/></code>, is a more succinct way to say that an element contains
               nothing. It will become apparent, when we discuss <code><link linkend="element-head"
                     >&lt;head></link></code> below, why our <code><link linkend="element-body"
                     >&lt;body></link></code> can be empty.</para>
            <para>Let's look at a third TAN format, TAN-A-tok. This particular alignment file allows
               you to state precisely which words in one text correspond with the words in another.
               Because of this precision, such files can take more time to create. But before we even start, we
               need to decide what kind of relationship holds between the two texts. Let us pretend,
               for the sake of example, that the 1987 version is a direct descendant (and therefore
               variation) of the 1881 one. So our task is to show exactly what words or phrases in
               the older version correspond to those of the newer one. We will simplify here, and
               exclude punctuation (some linguists legitimately treat punctuation as words in their
               own right). The term word is notoriously difficult to define, so we will call them
                  <emphasis>tokens</emphasis>, to avoid false connotations (hence the name of the
               file, TAN-A-tok, to refer to alignment of tokens).</para>
            <para>We now create a TAN-A-tok
               file:<programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2021/schemas/<emphasis role="bold">TAN-A-tok.rnc</emphasis>" 
    type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2021/schemas/TAN.sch" 
    type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;<emphasis role="bold">TAN-A-tok</emphasis> xmlns="tag:textalign.net,2015:ns" TAN-version="2021" 
    id="tag:parkj@textalign.net,2015:<emphasis role="bold">TAN-A-tok,ring01+ring02</emphasis>">
    &lt;head>
    . . . . . . .
    &lt;/head>
    &lt;body <emphasis role="bold">reuse-type="general_adaptation" bitext-relation="B-descends-from-A"</emphasis>>
        <emphasis role="bold">&lt;!-- Examples of picking tokens by number -->
        &lt;align>
            &lt;tok src="ring1881" ref="1" pos="1"/>
            &lt;tok src="ring1987" ref="1" pos="1"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="1" pos="2"/>
            &lt;tok src="ring1987" ref="1" pos="2"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="1" pos="3"/>
            &lt;tok src="ring1987" ref="1" pos="3"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="1" pos="4"/>
            &lt;tok src="ring1987" ref="1" pos="4"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="1" pos="5"/>
            &lt;tok src="ring1987" ref="1" pos="5"/>
        &lt;/align>
        &lt;!-- Examples of picking tokens by value -->
        &lt;align>
            &lt;tok src="ring1881" ref="2" val="A"/>
            &lt;tok src="ring1987" ref="2" val="A"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="2" val="pocket"/>
            &lt;tok src="ring1987" ref="2" val="pocket"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="2" val="full"/>
            &lt;tok src="ring1987" ref="2" val="full"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="2" val="of"/>
            &lt;tok src="ring1987" ref="2" val="of"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="2" val="posies"/>
            &lt;tok src="ring1987" ref="2" val="posies"/>
        &lt;/align>
        &lt;!-- Examples of picking ranges of tokens -->
        &lt;align>
            &lt;tok src="ring1881" ref="3" pos="1, 2"/>
            &lt;tok src="ring1987" ref="3" pos="1"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="3" pos="3 - 4"/>
            &lt;tok src="ring1987" ref="3" pos="2"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="4" pos="1"/>
            &lt;tok src="ring1987" ref="4" pos="1"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="4" pos="2"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="4" pos="3"/>
            &lt;tok src="ring1987" ref="4" pos="2"/>
        &lt;/align>
        &lt;!-- examples of using "last" -->
        &lt;align>
            &lt;tok src="ring1881" ref="4" pos="last-1"/>
            &lt;tok src="ring1987" ref="4" pos="last-1"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="4" pos="last"/>
            &lt;tok src="ring1987" ref="4" pos="last"/>
        &lt;/align></emphasis>
    &lt;/body>
&lt;/TAN-A-tok></programlisting></para>
            <para>Once again, the first four lines, the prolog and root element, should look
               familiar, with the only significant changes being the names of the validation files,
               the name of the root element (<code><link linkend="element-TAN-A-tok"
                     >&lt;TAN-A-tok></link></code>), and the value of <code><link
                     linkend="attribute-id">@id</link></code>.</para>
            <para>The heart of the data is <code><link linkend="element-body"
                  >&lt;body></link></code>, which has two key attributes, <code><link
                     linkend="attribute-reuse-type">@reuse-type</link></code>, which describes the
               activity that was performed to change one version into the other, and <code><link
                     linkend="attribute-bitext-relation">@bitext-relation</link></code>, which
               specifies how one book relates to the other. Our two values,
                  <code>general_adaptation</code> and <code>B-descends-from-A</code>, are arbitrary
               names that we define in the <code><link linkend="element-head"
                  >&lt;head></link></code> (discussed later). (To understand the concepts behind
               reuse types and bitext relations, see <xref linkend="tan-a-tok"/>).</para>
            <para>You will also notice some lines that begin <code>&lt;!--</code> and end
                  <code>--></code>. These are <emphasis role="bold">comments</emphasis>, and can be
               placed within or beside any element, and can enclose any text we like, including line
               breaks. You may put a comment anywhere you like, as long as it is not inside a tag or
               attribute.</para>
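            <para>As a brief sketch of where comments may and may not go, consider the following
               fragments, which reuse the element and attribute names from the example above. The
               comment texts are our own
               illustrations:<programlisting>&lt;!-- Permitted: beside an element -->
&lt;align>
    &lt;!-- Permitted: within an element, between its children -->
    &lt;tok src="ring1881" ref="1" pos="1"/>
    &lt;tok src="ring1987" ref="1" pos="1"/>
&lt;/align>

Not well-formed, because the comment sits inside a tag:
&lt;tok src="ring1881" &lt;!-- misplaced --> ref="1" pos="1"/></programlisting></para>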
            <para><code><link linkend="element-body">&lt;body></link></code> is the parent of one or
               more <code><link linkend="element-align">&lt;align></link></code> elements, each of
               which correlates a set of tokens in each of the two texts, pointed to by its
                     <code><link linkend="element-tok">&lt;tok></link></code> children. Each
                     <code><link linkend="element-tok">&lt;tok></link></code> has, in this example,
               three attributes. <code><link linkend="attribute-src">@src</link></code> takes a
               nickname (an <code><link linkend="attribute-id">@id</link></code> reference) that
               points to one of the two transcriptions; we have used <code>ring1881</code> and
                  <code>ring1987</code> for our two texts, but we could have just as easily used
               anything else such as <code>a</code> and <code>b</code>, or <code>uk</code> and
                  <code>us</code>. <code><link linkend="attribute-ref">@ref</link></code> has a
               value that points to a specific <code><link linkend="element-div"
                  >&lt;div></link></code> in the source TAN-T transcription; and <code><link
                     linkend="attribute-pos">@pos</link></code> or <code><link
                     linkend="attribute-val">@val</link></code> specify which token is intended,
               either by word number (<code><link linkend="attribute-pos">@pos</link></code>) or
               text of the actual word (<code><link linkend="attribute-val">@val</link></code>).
               Either technique is fine, and <code><link linkend="attribute-pos">@pos</link></code>
               and <code><link linkend="attribute-val">@val</link></code> can be mixed, as in the
               example. It is generally a good idea to use <code><link linkend="attribute-val"
                     >@val</link></code>, because if you later fix a typo in the underlying
               transcription and thereby change the number of tokens, a reference by <code><link
                     linkend="attribute-val">@val</link></code> might not be affected, whereas a
               reference by <code><link linkend="attribute-pos">@pos</link></code> alone would
               likely point to the wrong token. You may also
               notice that the comma and hyphen can be used in <code><link linkend="attribute-pos"
                     >@pos</link></code> to point to multiple words within the same <code><link
                     linkend="element-div">&lt;div></link></code>, and that <code>last</code> and
                  <code>last-X</code> (where <code>X</code> is a digit) can be used to point to a
               token by position counting from the end of a <code><link linkend="element-div"
                     >&lt;div></link></code>.</para>
            <para>Each <code><link linkend="element-align">&lt;align></link></code> can establish
               one-to-one, one-to-many, many-to-one, or many-to-many relationships between tokens
               from the two texts. A token may feature in multiple <code><link
                     linkend="element-align">&lt;align></link></code> elements. And if an
                     <code><link linkend="element-align">&lt;align></link></code> has <code><link
                     linkend="element-tok">&lt;tok></link></code> elements belonging to only one
               source, such as in the fourth-to-last <code><link linkend="element-align"
                     >&lt;align></link></code> above, we have what is called, in these guidelines, a
                  <emphasis>one-sided alignment</emphasis>. This one-sided alignment indicates that
               the second word of line four of the 1881 version is excluded from the act that we
               have called <code>general_adaptation</code>. If this were a translation, it would be as if we
               were saying that this word was excluded from the translation. (A one-sided alignment
               containing tokens only of the later source might point to words that the translator
               added, i.e., what in translation studies is called
               <emphasis>explicitation</emphasis>.) </para>
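            <para>For comparison, a one-sided alignment on the later source would take the same
               form, with only <code>ring1987</code> tokens inside the <code>&lt;align></code>.
               The following fragment is purely hypothetical and is not a claim made in our file;
               it shows only the shape such an assertion would
               take:<programlisting>&lt;align>
    &lt;tok src="ring1987" ref="1" val="the"/>
&lt;/align></programlisting></para>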
            <para>If in our TAN-A-tok file we say nothing about a particular word in one of the
               sources, that silence should not be interpreted to mean that it has no counterpart in
               the other source. As creators of this file, we make no claim to providing an
               exhaustive account, and we are under no obligation to indicate every word-for-word
               correspondence. If we fail to mention certain words, all that can be implied is that
               we opted not to say anything about them.</para>
            <para>We could have aligned the two texts in different ways. Perhaps further study will
               reveal that we were in error to associate the second "ring" with "round" in line 1.
               We can make corrections, even after publication, and notify other users of our data
               about the change. There are also ways to express doubt or alternative opinions, and to
               credit (or blame) the person making the assertion. We can even correlate fragments of
               tokens (letters, prefixes, infixes, or suffixes). All these more advanced uses are
               discussed at <xref xlink:href="#tan-a-tok"/>.</para>
         </section>
         <section>
            <title>TAN metadata (<code><link linkend="element-head">&lt;head></link></code>)</title>
            <para>At this point, we have finished four TAN files: two transcriptions (TAN-T), one
               macro-alignment file (TAN-A), and one micro-alignment file (TAN-A-tok). We've avoided
               discussing the <code><link linkend="element-head">&lt;head></link></code> in each of
               them until now. Before getting into details, we need to cover some important
               concepts.</para>
            <para>Unlike <code><link linkend="element-body">&lt;body></link></code>, which carries
               the raw data, <code><link linkend="element-head">&lt;head></link></code> contains
               what is oftentimes called <emphasis role="bold">metadata</emphasis>. That is,
                     <code><link linkend="element-head">&lt;head></link></code> contains data about
               the data that is in <code><link linkend="element-body">&lt;body></link></code>.
               Because the TAN format is intended primarily to serve scholars, and because the
               format is heavily regulated (that is, there are numerous validation rules that
               supplement the standard XML ones), the metadata requirements are stricter than they
               are for Word documents, HTML, or other formats you might know better. They are
               stricter even than TEI rules. (But you'll be offered help that the TEI rules do not provide.)
               Scholars who find our file expect to know some things about it before they can
               responsibly use it. For example, what are the sources we have used? Who produced the
               data? When? What changes or adjustments have been made? What licenses govern the use
               of the data? The questions are not difficult to answer, but they require thought,
               care, and some time to answer.</para>
            <para>Some metadata questions apply only to one TAN format. For example, in a TAN-A-tok
               file, we ask what relationship holds between the two sources. But that question makes
               no sense for a TAN-T file, which is merely a transcription. Some questions apply
               universally across all TAN files, no matter what kind of data. TAN has been designed
               so that each <code><link linkend="element-head">&lt;head></link></code>, no matter
               the format, handles metadata consistently. This reduces potential confusion, and
               helps other people using our data to find the information they want. More important,
               what we write in one file can be referenced by another, without duplication, and so
               will reduce the chance of errors. There is an old programmer's adage, <emphasis>Don't
                  repeat yourself</emphasis>, oftentimes abbreviated <emphasis role="bold"
                  >DRY</emphasis>. The TAN format encourages you to be as DRY as possible.<footnote>
                  <para>The opposite of DRY is WET: <emphasis>write everything twice</emphasis> or
                        <emphasis>we enjoy typing</emphasis>.</para>
               </footnote></para>
            <para>Another TAN principle is that each <code><link linkend="element-head"
                     >&lt;head></link></code> should focus exclusively upon the scope of the data in
                     <code><link linkend="element-body">&lt;body></link></code>, and not on other
               things. For example, in a TAN-T file, we are concerned only about the transcription,
               so our metadata too should be concerned only with the transcription. We should
               indicate its source, but because our file is not about the source itself, we don't
               need to describe it further. We are not library catalogers, nor should we be. A TAN-T
               file is for transcribing, not for curating bibliographical data. Our obligation is
               merely to point a reader to complete and authoritative information, found
               elsewhere.</para>
            <para>TAN was also designed under the principle that all metadata should be useful to
               both humans and computers. For our example above, we must describe the work we have
               chosen (<emphasis role="italic">Ring around the Rosie</emphasis>) in a way that is
               comprehensible not just to the reader but to the computer. Computers have a difficult
               time with ordinary human names, so we have to approach the task in a special
               way.</para>
            <para>Take for example the 1881 book we have used for our first transcription. For the
               human reader we can write something like "Kate Greenaway, <emphasis>Mother
                  Goose</emphasis>, New York, G. Routledge and sons [1881]". But this human-readable
               string is too complex and syntactically opaque for computers and algorithms. A more
               computer-friendly identifier would be international standard book numbers (ISBNs),
               which distinguish the 1984 version of <emphasis>Mother Goose</emphasis> illustrated
               by Kayoko Okumura from the one of the same year illustrated by William Joyce. The
               ISBNs for the Okumura version, 0671493159, and for Joyce's, 0394865340, can be
               converted into machine-actionable strings called <emphasis role="bold">uniform
                  resource names</emphasis> (<emphasis role="bold">URNs</emphasis>). The tag URN we
               made earlier is just one of many types of URNs. In this case we can create ISBN
               URNs as follows: <code>urn:isbn:0-671493159</code> and
                  <code>urn:isbn:0-394865340</code>. (Our 1881 version was published before the ISBN
               program was introduced. We will see below another way to give it a different kind of
               URN.)</para>
            <para>There are different URNs for different things: journals (via ISSNs,
                  <code>urn:issn:...</code>), articles (DOIs, <code>urn:doi:...</code>), movies
               (ISANs, <code>urn:isan:...</code>), and so forth, which means that anyone can use
               them to refer unambiguously to a particular kind of thing. URN naming schemes must be
               registered with the Internet Assigned Numbers Authority (IANA) to ensure permanent,
               persistent, unique names for various types of things. It is okay for something to be
               assigned more than one URN, but never acceptable for one URN to be applied to more
               than one thing. (See <link
                  xlink:href="https://www.iana.org/assignments/urn-namespaces/urn-namespaces.xhtml"
                  >IANA's registry</link> and <xref xlink:href="#variable-official-urn-namespaces"/>
               for a complete list of official URN schemes.)</para>
            <para>All URNs are simply names. They don't tell you where an object is. To provide a
               unique <emphasis role="italic">location</emphasis> we have <emphasis role="bold"
                  >uniform resource locators</emphasis> (<emphasis role="bold">URLs</emphasis>),
               e.g., <code>http://academia.edu</code>. Like URNs, URLs are also centrally regulated,
               with individuals or organizations buying the rights to domain names from a central
               registry (usually through a third-party vendor).</para>
            <para>Both URNs and URLs can be thought of as the same type of thing, namely, a
                  <emphasis role="bold">uniform resource identifier</emphasis> (<emphasis
                  role="bold">URI</emphasis>), sometimes called an <emphasis role="bold"
                  >internationalized resource identifier</emphasis> (<emphasis role="bold"
                  >IRI</emphasis>). An IRI is a type of URI that allows any alphabet in Unicode, not
               just Latin. URIs/IRIs are, in essence, nothing more than the set of all URNs and
               URLs. These four acronyms are easily confused and conflated, even by veterans. URIs
               and IRIs are basically the same thing, and they encompass URNs and URLs, a
               relationship and function that can be remembered by the last letter in each acronym:
                  UR<emphasis role="bold">I</emphasis>s/IR<emphasis role="bold">I</emphasis>s
                  <emphasis role="bold">I</emphasis>ncorporate both <emphasis role="bold"
                  >L</emphasis>ocators (UR<emphasis role="bold">L</emphasis>) and <emphasis
                  role="bold">N</emphasis>ames (UR<emphasis role="bold">N</emphasis>).</para>
            <para>If those acronyms are confusing, don't worry. For our purposes, they are pretty
               much all the same, and from this point onward we'll stick with the term IRI (unless
               we really mean a location to find a file, which we'll call a URL).</para>
            <para>IRIs are essential to a system frequently called the <emphasis role="bold"
                  >semantic web</emphasis> or <emphasis role="bold">linked (open) data</emphasis>,
               which uses them as the basis for a simple universal data model. The semantic web
               allows one to make assertions that computers can "understand." If people, working
               independently, happen to use the same IRIs to describe the same things, then
               computers can be programmed to make associations between disparate, heterogeneous
               datasets. For example, if one scholar claims through IRIs that X is the mother of Y,
               and another claims in a different dataset that Y is the mother of Z, a computer can
               infer that X is the grandmother of Z, without the two scholars being aware of each
               other's work. The computer can also check for contradictions (e.g., someone claiming
               that Z is the mother of Y). When many scholars begin to use IRIs in their data, the
               result is a network that allows us or anyone else to discover connections across
               disciplines and projects, and to make discoveries that transcend any single
               project.</para>
            <para>TAN has been designed to be semantic-web friendly, and so requires in its
                     <code><link linkend="element-head">&lt;head></link></code> almost all data to
               be not just human-readable but also computer-readable, normally as an IRI. </para>
            <para>Our first task, then, in writing the <code><link linkend="element-head"
                     >&lt;head></link></code> sections of our four TAN files is to look for IRI
               vocabulary that will be familiar to those most likely to use our files. In trying to
               find suitable IRIs, we will find that the persons, things, and concepts we want to
               describe will range from the highly familiar to the unfamiliar.</para>
            <para><emphasis role="italic">Highly familiar</emphasis>: The two books that provide the
               basis of our transcription are catalogued and generally well known. A number of
               services provided by librarians provide controlled IRI vocabularies that can be used
               by anyone to unambiguously identify a particular version of a book. <link
                  xlink:href="http://www.worldcat.org">WorldCat</link> (run by OCLC) and the <link
                  xlink:href="http://catalog.loc.gov">Library of Congress</link> are good examples.
               In our case, we have found Library of Congress IRIs for both editions of
                  <emphasis>Mother Goose</emphasis>: <code>http://lccn.loc.gov/12032709</code> and
                  <code>http://lccn.loc.gov/87042504</code>. Observe that these two IRIs are also,
               perhaps confusingly, URLs (locations). If we paste these strings into our Web
               browser, we retrieve a record that describes the book. This locator does not lead us
               to the book itself, only to information <emphasis role="italic">about</emphasis> the
               book. Nevertheless, the Library of Congress has decided to make this URL also a name
               for the book, which means that it does double duty, both as a location for a Web page
               with information, and as a name for a book. This practice can easily confuse
               anyone new to the semantic web, because such URLs name in reality two types of
               things: an entity and a web resource to learn more about that entity. The idea is
               that hundreds of years from now, when the web page no longer exists, the name will
               still be valid. </para>
            <para>In the TAN system, you can apply as many IRIs to a concept as you like. In fact,
               it is a good practice to find and add as many IRIs as you think worthwhile, just in
               case someone can't figure out what you're trying to identify. Just make sure that any
               IRI you copy unambiguously points to the thing you have in mind.</para>
            <para>We now have IRIs for the sources. Let's now find an IRI for the work, <emphasis
                  role="italic">Ring around the Rosie</emphasis>. The work is widely known, and even
               has a <link xlink:href="http://en.wikipedia.org/wiki/Ring_a_Ring_o%27_Roses"
                  >Wikipedia entry</link>. That Wikipedia entry is a benefit. The Universities of
               Leipzig and Mannheim and Openlink Software have collaborated on a project called
                  <link xlink:href="http://wiki.dbpedia.org/About">DBPedia</link>, which provides a
               unique IRI for every Wikipedia entry in the major languages, and these can be used
               for naming. The DBPedia IRI in this case is
                  <code>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</code>. Once again, this
               is both a name and a locator. It names a specific, intangible, abstract work, namely,
               a nursery rhyme that we've called <emphasis>Ring around the Rosie</emphasis>, no
               matter what specific version. But if you put that IRI into your browser, you will get
               back more information about that named object.</para>
            <para><emphasis role="italic">Familiar to specialists</emphasis>: We will need to have
               IRIs for some of the people who edited the file. Here we're not interested in the
               authors of the books we transcribed. We are interested in identifying the people who
               helped make the TAN files themselves. Most people who write and edit TAN files will
               not be well-known, public figures. If they are, and if they are famous enough to have
               a Wikipedia entry, then a DBPedia IRI could be used. Or if some of the contributors
               are also published authors, there is a good chance that they are listed in the
               databases of either <link xlink:href="http://viaf.org">VIAF</link> or <link
                  xlink:href="http://isni.org">ISNI</link>, both of which publish unique IRIs for
               authors, editors, and other persons identified in online catalogues curated by
               libraries around the world. </para>
            <para>Most contributors to TAN files, however, will not be listed in these databases. In
               those cases, we can name these participants with an IRI that we "own." We have
               already done something like this by assigning tag URNs to our four TAN files (the
               value of <code><link linkend="attribute-id">@id</link></code> in the root element).
               Our editors can do the same thing. If a student Robin Smith has been helping with
               proofreading, Robin can take an email address (even one that doesn't work any more)
               and a date when the email address was used and construct a tag URN such as
                  <code>tag:smith.robin@example.com,2012:self</code>. This has a slight drawback in
               that we cannot type this string into our browser to find out more about this
               particular Robin, but it at least allows us to assign a name that will not be
               confused as another Robin Smith, for example the one identified by ISNI as
                  <code>http://isni.org/isni/0000000043306406</code>. (If we want to go a step
               further, Robin could mint a URN from a domain name that she owns, and set up a linked
               data service that offers more information, human- and computer-readable. But this is
               not required, and it can be a hassle to set up and maintain.)</para>
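            <para>As a minimal sketch of how such a tag URN might be recorded, the hypothetical
               vocabulary item below names Robin Smith. It assumes the <code><link
                     linkend="element-person">&lt;person></link></code> element described later in
               this chapter, and the value of <code><link linkend="attribute-xmlid"
                  >@xml:id</link></code>, <code>smith</code>, is an arbitrary choice made only for
               illustration.<programlisting>&lt;person xml:id="smith">
    &lt;IRI>tag:smith.robin@example.com,2012:self&lt;/IRI>
    &lt;name>Robin Smith&lt;/name>
&lt;/person></programlisting></para>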
            <para>Let's take a more difficult challenge for locating an IRI, that of describing the
                     <code><link linkend="attribute-bitext-relation">@bitext-relation</link></code>
               in our TAN-A-tok file. <code><link linkend="attribute-bitext-relation"
                     >@bitext-relation</link></code> draws from the discipline of stemmatology,
               which studies how manuscripts were copied, one to another, and tries to place these
               manuscripts in a chain of transmission, a kind of historical stemma (tree). We have
               to find an IRI that describes the relationship that we claim holds between two
               text-bearing objects. Making that clear is important, because our perspective about
               the relationship between the two books affects the decisions we make when we align
               words, and other scholars using our files will want to know what assumptions we had
               when we aligned the two texts. </para>
            <para>For the sake of illustration we posit that the version published in the 1987
                  <emphasis>Mother Goose</emphasis> is a direct but not immediate descendant of the
               1881 version. Because no suitable IRI vocabulary yet exists for the relationships
               between texts, TAN itself has coined an IRI that can be used by anyone wishing to
               declare that, given two ordered sources, the second descends from the first through
               an unknown number of intermediaries:
                  <code>tag:textalign.net,2015:bitext-relation:a/x+/b</code>. (The arbitrary symbol
                  <code>/</code> signifies a step from one version to the next, and the
                  <code>x+</code> represents one or more versions as intermediate steps.) We'll use
               that one for now.</para>
            <para>We face a similar issue when thinking about text reuse, <code><link
                     linkend="attribute-reuse-type">@reuse-type</link></code>. Here we are concerned
               with creative activities such as translation, paraphrase, adaptation, and so forth.
               We generally consider the 1987 version to be an adaptation of the 1881 version. And
               there are no stable, well-published IRI vocabularies for text reuse. So we adopt an
               IRI that is part of TAN's standard vocabulary,
                  <code>tag:textalign.net,2015:reuse-type:adaptation:general</code>.</para>
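            <para>To suggest how these two IRIs might be recorded, here is a hedged sketch of two
               vocabulary items. It assumes that each concept is declared with an IRI and a
               human-readable name, like other vocabulary items in a TAN file, and that elements
               named after the two attributes are used for the purpose. The <code><link
                     linkend="attribute-xmlid">@xml:id</link></code> values and the wording of the
               names are invented for illustration, so check the element documentation for the
               exact syntax before copying
               this.<programlisting>&lt;bitext-relation xml:id="descendant">
    &lt;IRI>tag:textalign.net,2015:bitext-relation:a/x+/b&lt;/IRI>
    &lt;name>second source descends from the first through one or more intermediaries&lt;/name>
&lt;/bitext-relation>
&lt;reuse-type xml:id="adaptation">
    &lt;IRI>tag:textalign.net,2015:reuse-type:adaptation:general&lt;/IRI>
    &lt;name>general adaptation&lt;/name>
&lt;/reuse-type></programlisting></para>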
            <para>In the previous two cases, we could have come up with our own vocabulary. But the
               idea behind the semantic web is to use common, familiar vocabulary whenever possible.
               That's the same principle that led us to structure and label the poem in four
               consecutively numbered lines. We adopt conventions we expect others will likely
               follow. The built-in TAN vocabulary simply gives us a convenient lingua franca for
               describing some important, abstract concepts. For other examples of IRIs coined by
               TAN, see <xref linkend="vocabularies-master-list"/>.</para>
            <para><emphasis role="italic">Generally unfamiliar</emphasis>: Some things or concepts
               will be known to very few people, perhaps only to us. If we plan to refer to that
               thing or concept often, it is preferable to coin a tag URN, as described above. But
               in some cases, we might find that a tag URN we minted for some concept or thing was,
               in hindsight, misleading or poorly constructed, because we had only superficially
               thought about the category. </para>
            <para>One other possibility is to assign a randomly generated IRI called a universally
               unique identifier (UUID), e.g.,
                  <code>urn:uuid:3fd9cece-b246-4556-b229-48f22a5ae2e0</code>. UUID URNs are very
               useful. It is astronomically improbable that a randomly generated UUID will be
               identical to any existing UUID, which makes UUIDs reliably unique names for
               anything (barring someone copying and reusing that UUID URN to name some other object
               or concept). Numerous free UUID generators can be found online.</para>
            <para>To humans, a UUID on its own is meaningless, unmemorable, and rather ugly. But it
               is a start. We always have the option, later, of supplementing it with other IRIs.
               It's perfectly fine to assign multiple IRIs to one object or concept. But the reverse
               is never true. One should never use one IRI to identify more than one object or concept.<footnote>
                  <para>There is an exception when the IRI names a single class that has multiple
                     objects or concepts. But even then, it should name only one class, not two or
                     more of them.</para>
               </footnote></para>
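            <para>The following sketch shows what supplementing a UUID might look like. It is
               purely illustrative: the concept (a refrain), the <code><link
                     linkend="attribute-xmlid">@xml:id</link></code> value, and the tag URN are all
               hypothetical, and the UUID is simply the sample value quoted above. The two
                     <code><link linkend="element-IRI">&lt;IRI></link></code>s are offered as
               synonyms that name one and the same
               concept.<programlisting>&lt;div-type xml:id="refrain">
    &lt;!-- first identifier, randomly generated -->
    &lt;IRI>urn:uuid:3fd9cece-b246-4556-b229-48f22a5ae2e0&lt;/IRI>
    &lt;!-- added later, once the category was better understood (hypothetical tag URN) -->
    &lt;IRI>tag:parkj@textalign.net,2015:div-type:refrain&lt;/IRI>
    &lt;name>refrain&lt;/name>
&lt;/div-type></programlisting></para>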
         </section>
         <section>
            <title>Creating TAN metadata (<code><link linkend="element-head"
               >&lt;head></link></code>)</title>
            <para>Now that we have explored various IRI vocabularies for concepts related to our
               files concerning <emphasis>Ring around the Rosie</emphasis>, we can complete the
               metadata in our four TAN files. Let us start with the TAN-T file of the 1881
               version:<programlisting>&lt;TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2021" 
    id="tag:parkj@textalign.net,2015:ring01">
    &lt;head>
        <emphasis role="bold">&lt;name>TAN transcription of Ring a Ring o' Roses&lt;/name>
        &lt;master-location 
            href="http://textalign.net/release/TAN-2020/examples/ring-o-roses.eng.1881.xml"/>
        &lt;license licensor="park">
            &lt;IRI>http://creativecommons.org/licenses/by/4.0/&lt;/IRI>
            &lt;name>Attribution 4.0 International&lt;/name>
        &lt;/license>
        &lt;work>
            &lt;IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses&lt;/IRI>
            &lt;name>"Ring a Ring o' Roses" or "Ring Around the Rosie"&lt;/name>
        &lt;/work>
        &lt;source>
            &lt;IRI>http://lccn.loc.gov/12032709&lt;/IRI>
            &lt;name>Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]&lt;/name>
        &lt;/source>
        &lt;vocabulary-key>
            &lt;person xml:id="park">
                &lt;IRI>tag:parkj@textalign.net,2015:self&lt;/IRI>
                &lt;name>Jenny Park&lt;/name>
            &lt;/person>
            &lt;div-type xml:id="line">
                &lt;IRI>http://dbpedia.org/resource/Line_(poetry)&lt;/IRI>
                &lt;name>line of poetry&lt;/name>
            &lt;/div-type>
            &lt;role xml:id="creator">
                &lt;IRI>http://schema.org/creator&lt;/IRI>
                &lt;name xml:lang="eng">creator&lt;/name>
            &lt;/role>
        &lt;/vocabulary-key>
        &lt;file-resp who="park"/>
        &lt;resp roles="creator" who="park"/>
        &lt;change when="2014-08-13" who="park">Started file&lt;/change>
        &lt;to-do/></emphasis>
    &lt;/head>
    . . . . . . .
&lt;/TAN-T></programlisting></para>
            <para><code><link linkend="element-name">&lt;name></link></code>, the human-readable
               counterpart to the <code><link linkend="attribute-id">@id</link></code> that is
               inside the root element, can be anything. And we can supply more than one <code><link
                     linkend="element-name">&lt;name></link></code>, in case we wish to provide
               alternative names of the file, or translations of them.</para>
            <para>One or more <code><link linkend="element-master-location"
                     >&lt;master-location></link></code>s provide URLs where master versions of the
               file are kept (and maintained). We provide this as a courtesy to others who might be
               using our data. Anyone who validates their local copy of the file will be warned if
               it does not match the master version, and they will be told of the most recent
               changes. With a couple of keystrokes, they can update their local copy to match the
               master. This one-way communication system lets us silently and conveniently notify
               other users of changes. We do not have to keep track of who is using our file, and
               users do not have to pester us with questions about what changed when.</para>
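            <para>A file kept in more than one place might, for instance, declare two master
               locations. In this sketch the first URL is the one from the example above, and the
               second, at <code>example.org</code>, is a made-up mirror included only to show the
               repetition.<programlisting>&lt;master-location 
    href="http://textalign.net/release/TAN-2020/examples/ring-o-roses.eng.1881.xml"/>
&lt;!-- hypothetical second copy, for illustration only -->
&lt;master-location 
    href="http://example.org/tan/ring-o-roses.eng.1881.xml"/></programlisting></para>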
            <para><code><link linkend="element-master-location">&lt;master-location></link></code>
               is mandatory only if we are finished with our to-do list, which is specified at
                     <code><link xlink:href="#xml">&lt;to-do></link></code>. If that element is
               empty, then we imply that we do not know anything further that should be done to the
               file. Conversely, any elements in <code><link xlink:href="#xml"
                  >&lt;to-do></link></code> specify what remains to be done, and details will be
               returned to other users. That way you can release data that is useful but not
               completely perfect, and let users know about its deficiencies. This approach is ideal
               for formats such as TAN-A-tok, where you might have released only some of the data,
               and you are working on the rest.</para>
            <para>One day the link in <code><link linkend="element-master-location"
                     >&lt;master-location></link></code> will be dead. But perhaps a copy of our
               file will be in circulation elsewhere. The document <code><link
                     linkend="attribute-id">@id</link></code> in the root element provides a way to
               identify files, independent of links, and perhaps locate them in unexpected
               places.</para>
            <para><code><link linkend="element-license">&lt;license></link></code> specifies the
               license under which we are releasing our data. This element has nothing to do with
               the copyright of the source we have used (although, having been published in 1881,
               the book is clearly in the public domain). That is, we are specifying what rights are
               attached to the data, not its source, i.e., whether we have placed additional
               strictures on the content in <code><link linkend="element-body">&lt;body></link></code>. In this
               example, we have released the data under a Creative Commons license. The child
               element <code><link linkend="element-IRI">&lt;IRI></link></code> specifies a Creative
               Commons IRI, and <code><link linkend="element-name">&lt;name></link></code> is the
               human-readable form.</para>
            <para><code><link xlink:href="#attribute-licensor">@licensor</link></code> specifies who
               has granted the license, in this case our fictive Jenny Park (see below).</para>
            <para>The conjunction of <code><link linkend="element-IRI">&lt;IRI></link></code> and
                     <code><link linkend="element-name">&lt;name></link></code>, the <emphasis
                  role="bold">IRI + name pattern</emphasis>, recurs throughout TAN files. They are
               used to provide identifiers for <emphasis role="bold">vocabulary items</emphasis>. In an
               element that takes the IRI + name pattern, we may include as many children
                     <code><link linkend="element-IRI">&lt;IRI></link></code>s or <code><link
                     linkend="element-name">&lt;name></link></code>s as we like. But if we do so, we
               are stating that they are synonymous, i.e., that they all name the same thing. (Once
               again, an IRI is unique, so it should never be used to identify more than one
               thing.)</para>
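            <para>As an illustration of synonymous names, the <code><link linkend="element-work"
                     >&lt;work></link></code> from the example above could be rewritten so that
               each title gets its own <code><link linkend="element-name"
                  >&lt;name></link></code>. This is only a sketch of one possible variation, not
               the form used in the example file. Both names, along with the IRI, are understood to
               identify the same
               work.<programlisting>&lt;work>
    &lt;IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses&lt;/IRI>
    &lt;name>Ring a Ring o' Roses&lt;/name>
    &lt;name>Ring Around the Rosie&lt;/name>
&lt;/work></programlisting></para>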
            <para><code><link linkend="element-work">&lt;work></link></code> uses the IRI + name
               pattern to name the work we have chosen to transcribe. <code><link
                     linkend="element-source">&lt;source></link></code> points, through its IRI +
               name pattern, to a computer- and human-readable description of the book we have
               chosen. </para>
            <para><code><link linkend="element-vocabulary-key">&lt;vocabulary-key></link></code>
               contains vocabulary that we are using in our file. Inside, we can place more
               vocabulary items, and attach locally unique ids. For example, an IRI + name pattern
               is used for <code><link linkend="element-person">&lt;person></link></code>, which
               identifies Jenny Park through a tag URN. The value of <code><link
                     linkend="attribute-xmlid">@xml:id</link></code> allows us to use
                  <code>park</code> any time we want to mention Jenny. In fact, we already have, at
                     <code><link xlink:href="#attribute-licensor">@licensor</link></code>. Any
               mention of <code>park</code> will point to the appropriate item in <code><link
                     linkend="element-vocabulary-key">&lt;vocabulary-key></link></code>.</para>
            <para>There are a few other parts of <code><link linkend="element-vocabulary-key"
                     >&lt;vocabulary-key></link></code>. <code><link linkend="element-div-type"
                     >&lt;div-type></link></code> specifies an IRI + name pattern for line
               divisions, and the value of <code><link linkend="attribute-xmlid"
                  >@xml:id</link></code> means that we can use <code>line</code> any time we want to
               invoke the concept. Similarly, we have a <code><link linkend="element-role"
                     >&lt;role></link></code>. The <code><link linkend="element-IRI"
                  >&lt;IRI></link></code> value of <code><link linkend="element-role"
                     >&lt;role></link></code> comes from the vocabulary of <link
                  xlink:href="http://schema.org">schema.org</link>, which is maintained by Bing,
               Google, and Yahoo! in conjunction with the W3C (the nonprofit organization dedicated
               to universal Internet standards), but we could have used Dublin Core or some other
               IRI vocabulary describing behaviors, responsibilities, and roles.</para>
            <para>After the <code><link linkend="element-vocabulary-key"
                  >&lt;vocabulary-key></link></code>, we get into parts of the file that specify who
               did what, when. First is a <code><link xlink:href="#element-file-resp"
                     >&lt;file-resp></link></code>, whose value of <code><link
                     linkend="attribute-who">@who</link></code>, <code>park</code>, indicates that
               Jenny Park is the one primarily responsible for the file. <code><link
                     linkend="element-resp">&lt;resp></link></code> specifies further who was
               responsible for doing what.<footnote>
                  <para>If you decide to modify someone else's TAN file, you should credit / blame
                     yourself for the changes. Your first point of order should be to add a
                           <code><link linkend="element-person">&lt;person></link></code> to the
                           <code><link linkend="element-vocabulary-key"
                        >&lt;vocabulary-key></link></code>, identifying yourself. You can then
                     either add a <code><link linkend="element-change">&lt;change></link></code>
                     (see below) or a <code><link linkend="element-resp">&lt;resp></link></code>
                     (you might need to specify a <code><link linkend="element-role"
                           >&lt;role></link></code> in the <code><link
                           linkend="element-vocabulary-key">&lt;vocabulary-key></link></code>). You
                     should not change the document's <code><link linkend="attribute-id"
                        >@id</link></code>, unless your changes are so significant that it becomes
                     altogether a new document, <emphasis>your</emphasis> document. TAN does not try
                     to broker the age-old problem of determining the point at which a thing becomes
                     something altogether different (e.g., the <link
                        xlink:href="https://en.wikipedia.org/wiki/Ship_of_Theseus">Ship of Theseus
                        problem</link>). Use your best intuition.</para>
               </footnote></para>
            <para>Remember that <code><link linkend="element-head">&lt;head></link></code> is
               focused on the data, not its sources, so the claim that Jenny Park is the creator
               pertains only to the data. No inference should be made about who was responsible for
               the printed source. If someone wants to know anything about the book, they should
               pursue the IRI identifier we have provided under <code><link linkend="element-source"
                     >&lt;source></link></code>.</para>
            <para><code><link linkend="element-change">&lt;change></link></code> has attributes
                     <code><link linkend="attribute-when">@when</link></code> and <code><link
                     linkend="attribute-who">@who</link></code> to specify who made the change and
               when. The value of <code><link linkend="attribute-when">@when</link></code> is always
               a date or a date + time, formatted according to ISO 8601 syntax:
                  <code>[YYYY]-[MM]-[DD]</code> or <code>[YYYY]-[MM]-[DD]T[HH]:[MM]:[SS]</code>.
                     <code><link linkend="attribute-who">@who</link></code> always carries an IDref
               that points to a person or organization. <code><link linkend="element-change"
                     >&lt;change></link></code> does not take the IRI + name pattern, or even any
               children at all. It takes simply a plain-text description of what changed.</para>
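            <para>As a minimal sketch of the two permitted forms of <code><link
                     linkend="attribute-when">@when</link></code>, a pair of <code><link
                     linkend="element-change">&lt;change></link></code> entries might look like the
               following. Both entries are illustrative: the dates, the time, and the descriptions
               are hypothetical, and <code>park</code> is assumed to be the IDref of a person
               declared in <code><link linkend="element-vocabulary-key"
                  >&lt;vocabulary-key></link></code>.</para>
            <para>
               <programlisting>&lt;!-- date only -->
&lt;change when="2014-08-13" who="park">Started file&lt;/change>
&lt;!-- date + time -->
&lt;change when="2014-08-14T09:30:00" who="park">Corrected typos in the second line&lt;/change></programlisting>
            </para>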
            <para>So now we have finished one transcription file's metadata. You may have found it
               to represent a lot of typing: many names, IRIs, and so forth. Is there any way to
               shorten that load? Yes, there is. TAN is a <emphasis>vocabulary-based</emphasis>
               format. That is, standard vocabulary items come with the TAN format, and you can
               design your own vocabulary as well. Either way, you shorten the work involved and
               adhere to the DRY ("don't repeat yourself") principle.</para>
            <para>Our second example will look similar to the first one, but notice some
               shortcuts:<programlisting>&lt;TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2021" 
    id="tag:parkj@textalign.net,2015:ring02">
    &lt;head>
      &lt;name>TAN transcription of <emphasis role="bold">Ring around the Rosie</emphasis>&lt;/name>
      &lt;master-location>ring-o-roses.eng.<emphasis role="bold">1987.xml</emphasis>&lt;/master-location>
      &lt;license <emphasis role="bold">which="by 4.0"</emphasis> licensor="park"/>
      &lt;work>
         &lt;IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses&lt;/IRI>
         &lt;name>Ring around the Rosie&lt;/name>
      &lt;/work>
      &lt;source>
         &lt;IRI><emphasis role="bold">http://lccn.loc.gov/87042504</emphasis>&lt;/IRI>
         &lt;name><emphasis role="bold">Mother Goose, from nursery to literature / by Gloria T. Delama, 1987.</emphasis>&lt;/name>
      &lt;/source>
      <emphasis role="bold">&lt;adjustments>
         &lt;normalization which="no hyphens"/>
      &lt;/adjustments></emphasis>
      &lt;vocabulary-key>
         <emphasis role="bold">&lt;div-type xml:id="l" which="line (verse)"/></emphasis>
         &lt;person xml:id="park" roles="creator">
            &lt;IRI>tag:parkj@textalign.net,2015:self&lt;/IRI>
            &lt;name xml:lang="eng">Jenny Park&lt;/name>
         &lt;/person>
      &lt;/vocabulary-key>
      &lt;resp roles="creator" who="park"/>
      &lt;change when="2014-10-24" who="park">Started file&lt;/change>
      <emphasis role="bold">&lt;comment when="2014-10-24" who="park">See p. 39 of source.&lt;/comment></emphasis>
      &lt;to-do/>
   &lt;/head>
   . . . . . .
&lt;/TAN-T></programlisting></para>
            <para>In this example, <code><link linkend="element-name">&lt;name></link></code>,
                     <code><link linkend="element-master-location"
                  >&lt;master-location></link></code>, and <code><link linkend="element-source"
                     >&lt;source></link></code> have been modified to describe this file. Note, we
               haven't had to change <code><link linkend="element-work"
               >&lt;work></link></code>.</para>
            <para><code><link linkend="element-license">&lt;license></link></code> looks different,
               but it means exactly what it did in our previous example. The IRI + name pattern
               has simply been replaced with <link linkend="attribute-which"
                     ><code>@which</code></link>. You may replace any IRI + name pattern with <link
                  linkend="attribute-which"><code>@which</code></link>; its value must match a
                     <code><link linkend="element-name">&lt;name></link></code> in customized or
               standard vocabulary (a TAN-voc file). In this case, <code>"by 4.0"</code> points to
               TAN's standard vocabulary for licenses (see <xref xlink:href="#vocabularies-licenses"
               />). Here is what that looks like under the hood:</para>
            <para>
               <programlisting>&lt;<emphasis role="bold">TAN-voc</emphasis> xmlns="tag:textalign.net,2015:ns" TAN-version="2021" 
   id="tag:textalign.net,2015:<emphasis role="bold">tan-voc:licenses</emphasis>">
    . . . . . . .
   &lt;body <emphasis role="bold">affects-element="license"</emphasis>>
      <emphasis role="bold">&lt;item>
         &lt;IRI>http://creativecommons.org/licenses/by/4.0/&lt;/IRI>
         &lt;IRI>tag:textalign.net,2015:license:by/4.0/&lt;/IRI>
         &lt;name>by 4.0&lt;/name>
         &lt;desc>attribution 4.0 international&lt;/desc>
      &lt;/item></emphasis>
    . . . . . . .
   &lt;/body>
&lt;/TAN-voc></programlisting>
            </para>
            <para>Because the validation rules for TAN-voc files require every <code><link
                     linkend="element-name">&lt;name></link></code> to be unique, that element can
               be treated as a unique identifier, similar to <code><link linkend="attribute-xmlid"
                     >@xml:id</link></code>. We could have repeated the <code><link
                     linkend="element-license">&lt;license></link></code> from the previous TAN-T
               file. But the <link linkend="attribute-which"><code>@which</code></link> method is
               much quicker, cleaner, and DRY.</para>
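            <para>To make the equivalence concrete, here is a sketch of the two forms side by side.
               The long form is assembled from the vocabulary item shown above, so its exact IRI and
               name may differ from what the first example file used. Only one of the two would
               appear in a real file.</para>
            <para>
               <programlisting>&lt;!-- long form: explicit IRI + name pattern -->
&lt;license licensor="park">
   &lt;IRI>http://creativecommons.org/licenses/by/4.0/&lt;/IRI>
   &lt;name>by 4.0&lt;/name>
&lt;/license>

&lt;!-- short form: @which is resolved against the standard license vocabulary -->
&lt;license which="by 4.0" licensor="park"/></programlisting>
            </para>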
            <para>Before <code><link linkend="element-vocabulary-key"
                  >&lt;vocabulary-key></link></code> comes a new element, <code><link
                     linkend="element-adjustments">&lt;adjustments></link></code>, which contains a
                     <code><link linkend="element-normalization">&lt;normalization></link></code>
               statement whose <link linkend="attribute-which"><code>@which</code></link> says
                  <code>no hyphens</code>. That too points to a standard TAN vocabulary for
               normalizations: an IRI + name pattern for eliminating discretionary hyphens (see
                  <xref xlink:href="#vocabularies-normalizations"/>). Here's what that vocabulary
               item looks like (invisible to you, but you can look at it any time you like in the
                  <code>vocabularies</code> subdirectory of the TAN files):</para>
            <para>
               <programlisting>&lt;<emphasis role="bold">TAN-voc</emphasis> xmlns="tag:textalign.net,2015:ns" TAN-version="2021" id="tag:textalign.net,2015:<emphasis role="bold">tan-voc:normalizations</emphasis>">
    . . . . . . .
   &lt;body <emphasis role="bold">affects-element="normalization"</emphasis>>
      <emphasis role="bold">&lt;item>
         &lt;IRI>tag:textalign.net,2015:normalization:hyphens-discretionary-removed&lt;/IRI>
         &lt;name>no hyphens&lt;/name>
         &lt;desc>Discretionary word-break line-end hyphens have been deleted.&lt;/desc>
      &lt;/item></emphasis>
    . . . . . . .
   &lt;/body>
&lt;/TAN-voc></programlisting>
            </para>
            <para>As you might have inferred, the element <code><link
                     linkend="element-normalization">&lt;normalization></link></code> specifies how
               we have changed the data, namely, that we have opted to remove word-break line-end
               hyphenation. In other transcriptions we could use <code><link
                     linkend="element-normalization">&lt;normalization></link></code> to declare
               other kinds of changes we felt compelled to make, such as removing editorial comments
               or footnote signals. A healthy list of <code><link linkend="element-normalization"
                     >&lt;normalization></link></code>s is a courtesy to users of our data, some of
               whom might passionately care about keeping or removing line-end hyphenation. </para>
            <para>Back to our example. <code><link linkend="element-div-type"
                  >&lt;div-type></link></code> has a new value for <code><link
                     linkend="attribute-xmlid">@xml:id</link></code>, the letter <code>l</code>, and
               in it too the IRI + name pattern has been replaced by <link linkend="attribute-which"
                     ><code>@which</code></link>, whose value, <code>line (verse)</code>, is a
               standard vocabulary item (see <xref xlink:href="#vocabularies-div-types"/>).<footnote>
                  <para>A line of poetry is to be contrasted with a physical line on the page. Some
                     lines of poetry take up two or more physical lines. For the physical line you
                     would specify: <code>which="line (physical)"</code>.</para>
               </footnote></para>
            <para>There is also a new <code><link linkend="element-comment"
                  >&lt;comment></link></code> element, which is built much the same as <code><link
                     linkend="element-change">&lt;change></link></code>. (A <code><link
                     linkend="element-change">&lt;change></link></code>, after all, is just a
               comment about what has been changed.)</para>
            <para>That seems to be all there is. But if you've been attentive, you will have noticed
               that <code><link linkend="element-role">&lt;role></link></code> from our first TAN-T
               file (inside <code><link linkend="element-vocabulary-key"
                  >&lt;vocabulary-key></link></code>) is missing. That's because we don't need it,
               based on the same principle that lets us resolve <link linkend="attribute-which"
                     ><code>@which</code></link>. A vocabulary <code><link linkend="element-name"
                     >&lt;name></link></code> can be invoked not only in <link
                  linkend="attribute-which"><code>@which</code></link>, but in any attribute that
               points to values of <code><link linkend="attribute-xmlid">@xml:id</link></code>, in
               this case <code><link linkend="attribute-roles">@roles</link></code>. There is
               already a standard TAN vocabulary item with the <code><link linkend="element-name"
                     >&lt;name></link></code>
               <code>creator</code>, so we can use it directly without having to declare an
               intermediate vocabulary item with an <code><link linkend="attribute-xmlid"
                     >@xml:id</link></code>. If we had defined something else in <code><link
                     linkend="element-vocabulary-key">&lt;vocabulary-key></link></code> with a
                     <code><link linkend="attribute-xmlid">@xml:id</link></code> of
                  <code>creator</code>, that item would take precedence and override the built-in
               TAN vocabulary. But we haven't, so the standard TAN vocabularies are the
               default.</para>
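            <para>The difference between the direct and the intermediate route can be sketched as
               follows. The first form is the one used in the example above. The second form is
               hypothetical: it assumes that a <code><link linkend="element-role"
                  >&lt;role></link></code> in <code><link linkend="element-vocabulary-key"
                     >&lt;vocabulary-key></link></code> may take <code><link
                     linkend="attribute-xmlid">@xml:id</link></code> and <link
                  linkend="attribute-which"><code>@which</code></link> in the same way that
                     <code><link linkend="element-div-type">&lt;div-type></link></code> does, and
               the id <code>cr</code> is invented for illustration.</para>
            <para>
               <programlisting>&lt;!-- direct: @roles matches the &lt;name> of a standard vocabulary item -->
&lt;resp roles="creator" who="park"/>

&lt;!-- indirect: an intermediate item with a hypothetical id, "cr" -->
&lt;vocabulary-key>
   &lt;role xml:id="cr" which="creator"/>
   . . . . . .
&lt;/vocabulary-key>
&lt;resp roles="cr" who="park"/></programlisting>
            </para>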
         </section>
         <section>
            <title>Building TAN vocabulary</title>
            <para>The first TAN-T transcription had a longer <code><link linkend="element-head"
                     >&lt;head></link></code> than the second one did. In the former we used an
               explicit method, specifying every IRI and name; in the latter we adopted shortcuts
               that take advantage of TAN vocabulary. TAN vocabularies are not merely a
               convenience; they are intended to avoid problems that beset projects that create
               many files with repeated data patterns. When (not if) we make changes to one file,
               we shouldn't have to remember all the other places where we might need to make the
               same changes. Move repeating data patterns to one master place, and let the other
               files point to that pattern. Then, when we need to make changes, we do so only
               once, at the master location. Stay DRY.</para>
            <para>The previous examples drew from standard TAN vocabulary, which is written in one
               of the other TAN formats, TAN-voc. There is a whole collection of standard TAN-voc
               files in the project subdirectory called <code>vocabularies</code>. We can write our
               own TAN-voc files, to collect the vocabulary items that we will use repeatedly from
               one file to the next. For example:</para>
            <para>
               <programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="../../schemas/<emphasis role="bold">TAN-voc.rnc</emphasis>" type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="../../schemas/TAN.sch" type="application/xml" 
    schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2021" 
    id="tag:parkj@textalign.net,2015:<emphasis role="bold">TAN-voc:standard</emphasis>">
    &lt;head>
        <emphasis role="bold">&lt;name>Keywords for TAN files edited by Jenny Park&lt;/name></emphasis>
        &lt;license licensor="park" which="by 4.0"/>
        &lt;vocabulary-key>
            <emphasis role="bold">&lt;person which="Jenny Park" xml:id="park"/></emphasis>
        &lt;/vocabulary-key>
        &lt;file-resp who="park"/>
        &lt;resp roles="creator" who="park"/>
        &lt;change when="2019-10-08" who="park">Started file&lt;/change>
        &lt;to-do>
            <emphasis role="bold">&lt;comment when="2020-01-04" who="park">Need to check files for new vocabulary items.&lt;/comment></emphasis>
        &lt;/to-do>
    &lt;/head>
    &lt;body>
        <emphasis role="bold">&lt;group affects-element="person">
            &lt;item>
                &lt;IRI>tag:parkj@textalign.net,2015:self&lt;/IRI>
                &lt;name xml:lang="eng">Jenny Park&lt;/name>
            &lt;/item>
        &lt;/group></emphasis>
        <emphasis role="bold">&lt;item affects-element="work">
            &lt;IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses&lt;/IRI>
            &lt;name>Ring a Ring o' Roses&lt;/name>
            &lt;name>Ring Around the Rosie&lt;/name>
        &lt;/item></emphasis>
    &lt;/body>
&lt;/TAN-voc></programlisting>
            </para>
            <para>In this example, updates have been made to <code><link linkend="attribute-id"
                     >@id</link></code> and <code><link linkend="element-name"
                  >&lt;name></link></code>, and a <code><link linkend="element-comment"
                     >&lt;comment></link></code> has been added to <code><link xlink:href="#xml"
                     >&lt;to-do></link></code>. The most significant difference is the <code><link
                     linkend="element-body">&lt;body></link></code>, which has two <code><link
                     linkend="element-item">&lt;item&gt;</link></code>s, one of which is wrapped in
               a <code><link linkend="element-group">&lt;group></link></code>. Each <code><link
                     linkend="attribute-affects-element">@affects-element</link></code> specifies
               one or more names of elements that the enclosed items affect, and the <code><link
                     linkend="element-item">&lt;item&gt;</link></code>s have the standard IRI + name
               pattern. <code><link linkend="element-group">&lt;group></link></code>s may nest as
               you like.</para>
            <para>The difference between a grouped and ungrouped <code><link linkend="element-item"
                     >&lt;item&gt;</link></code> is purely a matter of taste and convenience. The
               example above illustrates both methods. Whether you group your items or you do not,
               the practical effect does not change.</para>
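            <para>As a sketch of that equivalence, the person item from the example above could be
               written in either of the following ways. These are alternatives, not a fragment to
               be copied whole: because names must be unique, a real file would contain only one of
               them.</para>
            <para>
               <programlisting>&lt;!-- grouped, as in the example above -->
&lt;group affects-element="person">
   &lt;item>
      &lt;IRI>tag:parkj@textalign.net,2015:self&lt;/IRI>
      &lt;name xml:lang="eng">Jenny Park&lt;/name>
   &lt;/item>
&lt;/group>

&lt;!-- ungrouped, with @affects-element placed on the item itself -->
&lt;item affects-element="person">
   &lt;IRI>tag:parkj@textalign.net,2015:self&lt;/IRI>
   &lt;name xml:lang="eng">Jenny Park&lt;/name>
&lt;/item></programlisting>
            </para>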
            <para>The <code><link linkend="element-vocabulary-key">&lt;vocabulary-key></link></code>
               has a <code><link linkend="element-person">&lt;person></link></code> whose <link
                  linkend="attribute-which"><code>@which</code></link> matches the <code><link
                     linkend="element-name">&lt;name></link></code> of the first <code><link
                     linkend="element-item">&lt;item&gt;</link></code> in the body. That is, a
               TAN-voc file can draw from its own <code><link linkend="element-body"
                     >&lt;body></link></code> for vocabulary, without repeating it in <code><link
                     linkend="element-vocabulary-key">&lt;vocabulary-key></link></code>.</para>
            <para>Let's return to the <code><link linkend="element-head">&lt;head></link></code>s of
               our two TAN-T files, and see how to incorporate our new TAN-voc vocabulary
               file.</para>
            <para>
               <programlisting>&lt;TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2021" 
    id="tag:parkj@textalign.net,2015:ring01">
    &lt;head>
        &lt;name>TAN transcription of Ring a Ring o' Roses&lt;/name>
        &lt;master-location 
            href="http://textalign.net/release/TAN-2020/examples/ring-o-roses.eng.1881.xml"/>
<emphasis role="bold">        &lt;license which="by 4.0" licensor="park"/>
        &lt;work which="Ring around the Rosie"/>
</emphasis>        &lt;source>
            &lt;IRI>http://lccn.loc.gov/12032709&lt;/IRI>
            &lt;name>Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]&lt;/name>
        &lt;/source>
<emphasis role="bold">        &lt;vocabulary>
           &lt;IRI>tag:parkj@textalign.net,2015:TAN-voc:standard&lt;/IRI>
           &lt;name>Vocabulary for TAN files edited by Jenny Park&lt;/name>
           &lt;location href="TAN-voc/park-projects.TAN-voc.xml" accessed-when="2020-01-10"/>
        &lt;/vocabulary>
</emphasis>        &lt;vocabulary-key><emphasis role="bold">
            &lt;person xml:id="park" which="Jenny Park"/>
            &lt;div-type xml:id="line" which="line (verse)"/>
</emphasis>        &lt;/vocabulary-key>
        &lt;file-resp who="park"/>
        &lt;resp roles="creator" who="park"/>
        &lt;change when="2014-08-13" who="park">Started file&lt;/change>
        &lt;to-do/>
    &lt;/head>
    . . . . . . .
&lt;/TAN-T></programlisting>
            </para>
            <para>
               <programlisting>&lt;TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2021" 
    id="tag:parkj@textalign.net,2015:ring02">
    &lt;head>
      &lt;name>TAN transcription of Ring around the Rosie&lt;/name>
      &lt;master-location>ring-o-roses.eng.1987.xml&lt;/master-location>
      &lt;license which="by 4.0" licensor="park"/>
      <emphasis role="bold">&lt;work which="Ring around the Rosie"/></emphasis>
      &lt;source>
         &lt;IRI>http://lccn.loc.gov/87042504&lt;/IRI>
         &lt;name>Mother Goose, from nursery to literature / by Gloria T. Delama, 1987.&lt;/name>
      &lt;/source>
      <emphasis role="bold">&lt;vocabulary>
         &lt;IRI>tag:parkj@textalign.net,2015:TAN-voc:standard&lt;/IRI>
         &lt;name>Vocabulary for TAN files edited by Jenny Park&lt;/name>
         &lt;location href="TAN-voc/park-projects.TAN-voc.xml" accessed-when="2020-01-10"/>
      &lt;/vocabulary></emphasis>
      &lt;adjustments>
         &lt;normalization which="no hyphens"/>
      &lt;/adjustments>
      &lt;vocabulary-key>
         <emphasis role="bold">&lt;div-type xml:id="l" which="line (verse)"/></emphasis>
         <emphasis role="bold">&lt;person xml:id="park" which="Jenny Park"/></emphasis>
      &lt;/vocabulary-key>
      &lt;resp roles="creator" who="park"/>
      &lt;change when="2014-10-24" who="park">Started file&lt;/change>
      &lt;comment when="2014-10-24" who="park">See p. 39 of source.&lt;/comment>
      &lt;to-do/>
   &lt;/head>
   . . . . . .
&lt;/TAN-T></programlisting>
            </para>
            <para>In each TAN-T file, a new <code><link linkend="element-vocabulary"
                     >&lt;vocabulary&gt;</link></code> points to the project TAN-voc vocabulary file
               we have just created. Along with the customary IRI + name pattern is a new element,
                     <code><link linkend="element-location">&lt;location></link></code>, which
               specifies where the digital file was accessed and when (through <code><link
                     linkend="attribute-accessed-when">@accessed-when</link></code>). We may include
               as many of these <code><link linkend="element-location">&lt;location></link></code>
               elements as we wish, with the most preferred or reliable one at the top. The
               validation process will consult only the first one that leads to an available
               document. The <code><link linkend="attribute-accessed-when"
                  >@accessed-when</link></code> value is important, because TAN files talk to each
               other. The validator will look for changes in the file since we last accessed it, and
               if any changes are found, a warning with a summary of the changes will be returned.
               It is then up to us to determine if the alterations merit any action on our part. </para>
            <para>Similarly, anyone using or depending upon our file will be notified of any
               changes we make, through the same validation process.</para>
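            <para>A <code><link linkend="element-vocabulary">&lt;vocabulary&gt;</link></code> with
               more than one <code><link linkend="element-location">&lt;location></link></code>
               might be sketched as follows. The first location is the one used in the examples
               above; the second, a fallback, uses a hypothetical URL invented for
               illustration.</para>
            <para>
               <programlisting>&lt;vocabulary>
   &lt;IRI>tag:parkj@textalign.net,2015:TAN-voc:standard&lt;/IRI>
   &lt;name>Vocabulary for TAN files edited by Jenny Park&lt;/name>
   &lt;!-- preferred copy, listed first -->
   &lt;location href="TAN-voc/park-projects.TAN-voc.xml" accessed-when="2020-01-10"/>
   &lt;!-- fallback copy; this URL is hypothetical -->
   &lt;location href="http://example.org/TAN-voc/park-projects.TAN-voc.xml" accessed-when="2020-01-10"/>
&lt;/vocabulary></programlisting>
            </para>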
            <para>Once the <code><link linkend="element-vocabulary">&lt;vocabulary&gt;</link></code>
               is in place, we can draw from our predefined vocabulary. Our revised versions of the
                     <code><link linkend="element-head">&lt;head></link></code>s are a bit more DRY,
               and certainly more compact and easier to read. The longer the TAN file, the more
               noticeable the improvement. And when our library grows into dozens, hundreds, or
               thousands of files, we'll be grateful that a change that affects all the files needs
               to be made only once.</para>
            <para>In general, when you share your files with other people, you need to make sure
               that you share your vocabulary files too. There is an alternative method, that
               of sending what is called a resolved TAN file, which encapsulates all the vocabulary,
               but that is a slightly more advanced topic. See <xref
                  xlink:href="#validating_tan_files"/>.</para>
            <para>Now that we have created the metadata for our transcriptions, let's turn to the
               alignment files. Those <code><link linkend="element-head">&lt;head></link></code>s
               will look slightly different, because they are not concerned with transcriptions per
               se. We start with the TAN-A
               file:<programlisting>&lt;TAN-A xmlns="tag:textalign.net,2015:ns" TAN-version="2021" 
    id="tag:parkj@textalign.net,2015:<emphasis role="bold">ring-alignment</emphasis>">
    &lt;head>
       &lt;name><emphasis role="bold">div-based alignment of multiple versions of Ring o Roses</emphasis>&lt;/name>
       &lt;master-location href="<emphasis role="bold">http://textalign.net/release/TAN-2020/examples/TAN-A/ringoroses.div.1.xml</emphasis>"/>
       &lt;license which="by_4.0" licensor="park"/>
       <emphasis role="bold">&lt;source xml:id="eng-uk">
          &lt;IRI>tag:parkj@textalign.net,2015:ring01&lt;/IRI>
          &lt;name>Transcription of ring around the roses in English (UK)&lt;/name>
          &lt;location href="../ring-o-roses.eng.1881.xml" accessed-when="2015-03-10"/>
       &lt;/source>
       &lt;source xml:id="eng-us">
          &lt;IRI>tag:parkj@textalign.net,2015:ring02&lt;/IRI>
          &lt;name>Transcription of ring around the roses in English (US)&lt;/name>
          &lt;location href="../ring-o-roses.eng.1987.xml" accessed-when="2014-08-13"/>
       &lt;/source></emphasis>
       &lt;vocabulary-key>
          &lt;person xml:id="park" which="Jenny Park"/>
       &lt;/vocabulary-key>
       &lt;resp who="park" roles="creator"/>
       &lt;change when="2014-08-14" who="park">Started file&lt;/change>
       &lt;to-do>
          &lt;comment when="2018-08-09-04:00" who="park">Finish file.&lt;/comment>
       &lt;/to-do>
    &lt;/head>
    . . . . . .
&lt;/TAN-A></programlisting></para>
            <para>Much of the code above will look similar to the previous two examples. The file's
                     <code><link linkend="element-name">&lt;name></link></code> and <code><link
                     linkend="element-master-location">&lt;master-location></link></code> are
               updated. Just as TAN-T files have <code><link linkend="element-source"
                     >&lt;source></link></code>s, so do TAN-A files, except that those
               sources are always TAN-T transcription files, and they take the IRI + name + location
               pattern we saw above in <code><link linkend="element-vocabulary"
                     >&lt;vocabulary&gt;</link></code>. Because alignment files take only TAN
               transcription files as sources, each <code><link linkend="element-source"
                     >&lt;source></link></code>'s <code><link linkend="element-IRI"
                  >&lt;IRI></link></code> always takes the <code><link linkend="attribute-id"
                     >@id</link></code> value of the target TAN-T transcription file. <code><link
                     linkend="element-name">&lt;name></link></code> is arbitrary. It may replicate
               exactly the title found in the transcription file, or it may be modified, perhaps to
               harmonize better with the descriptions of the other source names, or the role of the
               source within the TAN-A file. Our TAN-A file could contain any number of <code><link
                     linkend="element-source">&lt;source></link></code>s, and not necessarily for
               the same work. The order in which we put the <code><link linkend="element-source"
                     >&lt;source></link></code>s does not necessarily mean anything. </para>
            <para>This <code><link linkend="element-head">&lt;head></link></code> explains why the
                     <code><link linkend="element-body">&lt;body></link></code> of our TAN-A file is
               allowed to be empty. We have already specified which sources are to be aligned and
               where they are to be found. Any user or processor of a TAN-A file may assume that
               every <code><link linkend="element-div">&lt;div></link></code> in every source should
               be automatically aligned on the basis of shared values of <code><link
                     linkend="attribute-n">@n</link></code>.</para>
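            <para>To picture what alignment on shared <code><link linkend="attribute-n"
                  >@n</link></code> values means, suppose the bodies of the two transcriptions
               contain divisions along the following lines. This is a sketch only: the markup is
               assumed from the <code><link linkend="element-div-type">&lt;div-type></link></code>
               declarations in the two heads, the number of divisions is arbitrary, and the text of
               each line is elided.</para>
            <para>
               <programlisting>&lt;!-- ring01 (source eng-uk), where the div-type has the id "line" -->
&lt;div type="line" n="1">. . . .&lt;/div>
&lt;div type="line" n="2">. . . .&lt;/div>

&lt;!-- ring02 (source eng-us), where the div-type has the id "l" -->
&lt;div type="l" n="1">. . . .&lt;/div>
&lt;div type="l" n="2">. . . .&lt;/div></programlisting>
            </para>
            <para>Under the assumption stated above, the two divisions numbered <code>1</code>
               would be treated as aligned with each other, as would the two numbered
                  <code>2</code>, without anything being said in the TAN-A body.</para>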
            <para>Meanwhile we turn to our fourth file, TAN-A-tok, whose <code><link
                     linkend="element-head">&lt;head></link></code> might look like
               this:<programlisting>&lt;TAN-A-tok xmlns="tag:textalign.net,2015:ns" TAN-version="2021" 
    id="tag:parkj@textalign.net,2015:TAN-A-tok,ring01+ring02">
    &lt;head>
        &lt;name><emphasis role="bold">token-based alignment of two versions of Ring o Roses</emphasis>&lt;/name>
        &lt;master-location 
          href="<emphasis role="bold">http://textalign.net/release/TAN-2020/examples/TAN-A-tok/ringoroses.01+02.token.1.xml</emphasis>"/>
        &lt;license which="<emphasis role="bold">by-nc-nd_4.0</emphasis>" licensor="park"/>
        <emphasis role="bold">&lt;token-definition src="eng-uk eng-us" which="letters"/></emphasis>
        &lt;source xml:id="eng-uk">
            &lt;IRI>tag:parkj@textalign.net,2015:ring01&lt;/IRI>
            &lt;name>Transcription of ring around the roses in English (UK)&lt;/name>
            &lt;location href="../ring-o-roses.eng.1881.xml" accessed-when="2015-03-10"/>
        &lt;/source>
        &lt;source xml:id="eng-us">
            &lt;IRI>tag:parkj@textalign.net,2015:ring02&lt;/IRI>
            &lt;name>Transcription of ring around the roses in English (US)&lt;/name>
            &lt;location href="../ring-o-roses.eng.1987.xml" accessed-when="2014-08-13"/>
        &lt;/source>
        &lt;vocabulary-key>
            <emphasis role="bold">&lt;bitext-relation xml:id="B-descends-from-A" which="a/x+/b"/></emphasis>
            &lt;person xml:id="park" which="Jenny Park"/>
        &lt;/vocabulary-key>
        &lt;change when="2015-01-20" who="park">Started file&lt;/change>
    &lt;/head>
    . . . . . .
&lt;/TAN-A-tok></programlisting></para>
            <para>The TAN-A-tok <code><link linkend="element-head">&lt;head></link></code> looks
               similar to the previous examples, except that it has a new <code><link
                     linkend="element-token-definition">&lt;token-definition></link></code>, and
                     <code><link linkend="element-vocabulary-key">&lt;vocabulary-key></link></code>
               has some new content.</para>
            <para><code><link linkend="element-bitext-relation">&lt;bitext-relation></link></code>
               states through <link linkend="attribute-which"><code>@which</code></link> or an IRI +
               name pattern the stemmatic relationship we think holds between the two sources. We
               have used <link linkend="attribute-which"><code>@which</code></link> and the value
                  <code>a/x+/b</code>, pointing to a standard TAN vocabulary item for <link
                  xlink:href="#vocabularies-bitext-relations">bitext relations</link>:</para>
            <para>
               <programlisting>&lt;TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2021" 
   id="tag:textalign.net,2015:tan-voc:bitext-relation">
. . . . . .
        &lt;item>
            &lt;IRI>tag:textalign.net,2015:bitext-relation:a/x+/b&lt;/IRI>
            <emphasis role="bold">&lt;name>a/x+/b&lt;/name></emphasis>
            &lt;desc>direct descent, B descends from A, one or more mediaries&lt;/desc>
        &lt;/item>
. . . . . .
&lt;/TAN-voc></programlisting>
            </para>
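            <para>For comparison, the same declaration written with the IRI + name pattern instead
               of <link linkend="attribute-which"><code>@which</code></link> might look like the
               sketch below. The IRI and name are copied from the vocabulary item above; the
               surrounding markup is assumed by analogy with the other IRI + name examples in this
               chapter.</para>
            <para>
               <programlisting>&lt;bitext-relation xml:id="B-descends-from-A">
   &lt;IRI>tag:textalign.net,2015:bitext-relation:a/x+/b&lt;/IRI>
   &lt;name>a/x+/b&lt;/name>
&lt;/bitext-relation></programlisting>
            </para>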
            <para><code><link linkend="element-token-definition">&lt;token-definition></link></code>
               specifies how we have defined our word tokens. <code><link linkend="attribute-src"
                     >@src</link></code> has more than one value, specifying that the same
               tokenization rule should be applied to both sources. <link linkend="attribute-which"
                     ><code>@which</code></link> points to this standard TAN vocabulary item:</para>
            <para>
               <programlisting>&lt;TAN-voc xmlns="tag:textalign.net,2015:ns" TAN-version="2021" 
   id="tag:textalign.net,2015:tan-voc:tokenizations">
. . . . . .
        &lt;item>
            <emphasis role="bold">&lt;token-definition pattern="[\w&amp;#xad;​&amp;#x200b;&amp;#x200d;]+"/></emphasis>
            &lt;name>letters&lt;/name>
            &lt;name>letters only&lt;/name>
            &lt;name>general word characters only&lt;/name>
            &lt;name>general ignore punctuation&lt;/name>
            &lt;name>gwo&lt;/name>
            &lt;desc>General tokenization pattern for any language, words only. Non-letters 
                such as punctuation are ignored.&lt;/desc>
        &lt;/item>
. . . . . .
&lt;/TAN-voc></programlisting>
            </para>
            <para>Up until now, all vocabulary items have taken the IRI + name pattern. The one
               above does not have an IRI, only a <code><link linkend="element-token-definition"
                     >&lt;token-definition></link></code> with a <code><link
                     linkend="attribute-pattern">@pattern</link></code>. The value of <code><link
                     linkend="attribute-pattern">@pattern</link></code>, which may look like
               gibberish, is a <emphasis role="bold">regular expression</emphasis>. "Regular" here
               does not mean ordinary; rather it relates to Latin <emphasis>regula</emphasis>, rule.
               Regular expressions are rule-based patterned text searches. This particular pattern
               says that a token is defined as any contiguous string of word characters
                  (<code>\w</code>), soft hyphens (<code>&amp;#xad;</code>), zero-width spaces
                  (<code>&amp;#x200b;</code>), or zero-width joiners (<code>&amp;#x200d;</code>).
               This is TAN's default tokenization pattern, and it will be assumed for any TAN-A-tok
               file that lacks a <code><link linkend="element-token-definition"
                     >&lt;token-definition></link></code>. TAN adopts this default to capture what
               we commonly mean in ordinary conversation by "word." When we refer someone to the nth
               word in a sentence, we most often ignore punctuation marks. For more on token
               definitions see <xref xlink:href="#defining_tokens"/> and <xref
                  xlink:href="#vocabularies-token-definitions"/>. See also <xref
                  xlink:href="#regular_expressions"/>.</para>
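            <para>To make the pattern concrete, here is a minimal sketch in Python that applies an
               equivalent pattern to a line of text. It is not part of TAN, and the function name
                  <code>tokenize</code> is hypothetical. Python's <code>\w</code> only approximates
                  <code>\w</code> as defined for XML Schema and XPath regular expressions, so
               results may differ for some characters (e.g., the underscore and symbols).</para>
            <para>
               <programlisting language="python">import re

# Assumed Python equivalent of the default pattern: word characters,
# soft hyphen (U+00AD), zero-width space (U+200B), zero-width joiner (U+200D).
TOKEN_PATTERN = re.compile(r"[\w\u00ad\u200b\u200d]+")


def tokenize(text):
    """Return the tokens in text, ignoring punctuation and spaces."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Ring-a-ring o' roses,"))</programlisting>
            </para>
            <para>In this sketch the hyphens, the apostrophe, and the comma fall outside the
               pattern, so the line yields five tokens: <code>Ring</code>, <code>a</code>,
                  <code>ring</code>, <code>o</code>, and <code>roses</code>.</para>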
            <para>In our <code><link linkend="element-vocabulary-key"
                  >&lt;vocabulary-key></link></code> we could have also included a <code><link
                     linkend="element-reuse-type">&lt;reuse-type></link></code>, but we have
               intentionally omitted it here, because we have <code>&lt;body
                  bitext-relation="B-descends-from-A" reuse-type="general_adaptation"></code>. The
               value for <code><link linkend="attribute-reuse-type">@reuse-type</link></code>,
                  <code>general_adaptation</code>, corresponds to a <code><link
                     linkend="element-name">&lt;name></link></code> in a standard TAN vocabulary
               item for reuse types. We don't need to invoke a <code><link
                     linkend="element-reuse-type">&lt;reuse-type></link></code> in the <code><link
                     linkend="element-vocabulary-key">&lt;vocabulary-key></link></code> because we
               are going directly to the name of the vocabulary item. Notice that
                  <code>general_adaptation</code> has an underscore instead of a space. That's
               because <code><link linkend="element-reuse-type">&lt;reuse-type></link></code> can
               take multiple values, which are signified by spaces. So spaces in names need to be
               replaced by an underscore, or a hyphen if we prefer. The values of <code><link
                     linkend="element-name">&lt;name></link></code> are never case-sensitive, and
               the space, hyphen, and underscore are treated as equivalent. (<code><link
                     linkend="attribute-id">@id</link></code> values, on the other hand, are always
               case-sensitive, and never have spaces.)</para>
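            <para>The matching rule for names can be sketched as follows. This is only an
               illustration in Python, not TAN's own code. The function names are hypothetical, and
               the spelling <code>General adaptation</code> is assumed for the sake of the
               example.</para>
            <para>
               <programlisting language="python">def normalize_name(name):
    """Fold case and treat space, hyphen, and underscore as the same character."""
    return name.lower().replace("_", " ").replace("-", " ")


def same_name(a, b):
    """True if two names are equivalent under the rule sketched above."""
    return normalize_name(a) == normalize_name(b)


print(same_name("general_adaptation", "General adaptation"))
print(same_name("general-adaptation", "general adaptation"))</programlisting>
            </para>
            <para>Both comparisons report a match, which is why an attribute value can use an
               underscore or a hyphen where the vocabulary item's name has a space.</para>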
         </section>
         <section>
            <title>Aligning across projects</title>
            <para>We now have a collection of five TAN files: two TAN-T transcriptions, a TAN-A
               alignment/annotation file, a TAN-A-tok word-for-word alignment file, and a TAN-voc
               file for vocabulary shared across the files. </para>
            <para>Let us imagine what it might be like to connect our TAN collection to a TAN file
               made by someone else. Let us assume that we have found elsewhere, in a German
               project, a TAN transcription of a work that looks quite similar to our
               own:<programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN-T.rnc" 
   type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="http://textalign.net/release/TAN-2020/schemas/TAN.sch" 
   type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:hans@beispiel.com,2014:ringel">
   &lt;head>
      &lt;name>TAN Transkription, Ringelreihen mit Niederfallen&lt;/name>
      &lt;master-location>http://beispiel.com/TAN-T/ringel.xml&lt;/master-location>
      &lt;license>
         &lt;IRI>http://creativecommons.org/licenses/by/4.0/&lt;/IRI>
         &lt;name>Creative Commons Namensnennung 4.0 International Lizenz&lt;/name>
         &lt;desc>Dieses Werk ist lizenziert unter einer Creative Commons Namensnennung 4.0
            International Lizenz.&lt;/desc>
      &lt;/license>
      &lt;licensor who="schmidt"/>
      &lt;work>
         &lt;IRI>tag:beispiel.com,2014:texte:holderbusch&lt;/IRI>
         &lt;name>"Die Kinder auf dem Holderbusch"&lt;/name>
      &lt;/work>
      &lt;version>
         &lt;IRI>urn:uuid:31648039-3dbb-49b9-b66e-9bd2cd11630e&lt;/IRI>
         &lt;name>zweite Version&lt;/name>
      &lt;/version>
      &lt;numerals priority="letters"/>
      &lt;source>
         &lt;IRI>http://www.worldcat.org/oclc/4574384&lt;/IRI>
         &lt;name>Franz Magnus Böhme, Deutsches Kinderlied und Kinderspiel: Volksüberlieferungen 
            aus allen Landen deutscher Zunge, gesammelt, geordnet und mit Angabe der Quellen. 
            Leipzig, 1897.&lt;/name>
      &lt;/source>
      &lt;adjustments>
         &lt;normalization>
            &lt;IRI>tag:kalvesmaki@gmail.com,2014:normalization:hyphens-discretionary-off&lt;/IRI>
            &lt;name>Keine Bindestriche&lt;/name>
         &lt;/normalization>
      &lt;/adjustments>
      &lt;vocabulary-key>
         &lt;div-type xml:id="Zeile">
            &lt;IRI>http://dbpedia.org/resource/Gedichtzeile&lt;/IRI>
            &lt;name>Gedichtzeile&lt;/name>
         &lt;/div-type>
         &lt;div-type which="poem" xml:id="Gedicht"/>
         &lt;person xml:id="schmidt" roles="Produzent">
            &lt;IRI>tag:hans@beispiel.com,2014:selbst&lt;/IRI>
            &lt;name xml:lang="eng">Hans Schmidt&lt;/name>
         &lt;/person>
         &lt;role xml:id="Produzent">
            &lt;IRI>http://schema.org/producer&lt;/IRI>
            &lt;name xml:lang="eng">Produzent&lt;/name>
         &lt;/role>
      &lt;/vocabulary-key>
      &lt;file-resp who="schmidt"/>
      &lt;resp who="schmidt" roles="Produzent"/>
      &lt;change when="2014-08-13" who="schmidt">Anfang&lt;/change>
      &lt;comment when="2014-08-13" who="schmidt">unten auf der Z. 438, recht&lt;/comment>
      &lt;to-do/>
   &lt;/head>
   &lt;body xml:lang="deu">
      &lt;div type="Gedicht" n="1">
         &lt;div type="Zeile" n="a">Ringel, Ringel, Reihe!&lt;/div>
         &lt;div type="Zeile" n="b">Sind der Kinder dreie,&lt;/div>
         &lt;div type="Zeile" n="c">Sitzen auf dem Holderbuch,&lt;/div>
         &lt;div type="Zeile" n="e">Schreien alle: husch, husch, husch!&lt;/div>
      &lt;/div>
   &lt;/body>
&lt;/TAN-T></programlisting></para>
            <para>It seems that this 19th-century German version is quite similar to our two English
               versions. We have some alignment options open to us. Two more sets of word-for-word
               alignments would be interesting, but remember, just because we find a text that
               nicely aligns with others does not mean that we <emphasis role="italic"
                  >must</emphasis> align them, or that for a given alignment we must align
                  <emphasis>everything</emphasis>. In this case, we choose not to worry about
               word-for-word alignments, and we focus here only on the TAN-A alignment, so that, for
               example, we can use the built-in TAN application to display the three versions in
               parallel, a reading tool to study more closely intertextual relationships.</para>
            <para>To that end, we first observe some differences between this transcription and our
               other two. First, the value of <code><link linkend="element-work"
                  >&lt;work></link></code> is not the one we have given our two versions. Second,
                     <code><link linkend="element-numerals">&lt;numerals></link></code> specifies by
               its value for <code><link linkend="attribute-priority">@priority</link></code> that
               any ambiguous numerals should be interpreted as letter numerals, not Roman (that's
               important, e.g., for a <link linkend="element-div"><code>&lt;div></code></link> with
               an <code><link linkend="attribute-n">@n</link></code> value <code>c</code>, which
               could mean 3 [a, b, c, ...] or the Roman numeral for 100). Next, the lines are
               wrapped in a <link linkend="element-div"><code>&lt;div></code></link> for the whole
               poem (<code>Gedicht</code>) and they have been lettered instead of numbered. And
               last, the editor seems to have made a typographical error, making the last line
                  <code>e</code> instead of the expected <code>d</code>. These five differences
               typify inconsistencies one commonly finds in digital texts from different projects of
               the same work.<footnote>
                  <para>There are a few other differences in this third transcription that do not
                     affect our alignment. <code><link linkend="element-version"
                        >&lt;version></link></code> is used to distinguish different versions of the
                     same work found on the same text-bearing object. That is, if we are
                     transcribing a bilingual edition, we can use <code><link
                           linkend="element-version">&lt;version></link></code> to specify which of
                     the two versions we are encoding. Notice that the <code><link
                           linkend="element-IRI">&lt;IRI></link></code> value is a UUID. In this
                     case the editor was not prepared to deploy a formal IRI naming scheme (perhaps
                     using a tag URN) that would be satisfactory for work-versions. Also, the
                           <code><link linkend="element-div-type">&lt;div-type></link></code> is
                     defined as <code>http://dbpedia.org/resource/Gedichtzeile</code> (Gedichtzeile
                     = line of poetry), so it doesn't intersect with our IRIs for the vocabulary
                     item <code>line</code>. But <code><link linkend="element-div-type"
                           >&lt;div-type></link></code> is not used to align versions, and
                     validation isn't affected, so we do not concern ourselves here with trying to
                     reconcile the different IRIs. </para>
               </footnote></para>
            <para>These are points we can easily reconcile in our TAN-A file, which we now expand to
               include the German version. We make the following adjustments
               (emphasized):<programlisting>&lt;TAN-A xmlns="tag:textalign.net,2015:ns" TAN-version="2021" 
    id="tag:parkj@textalign.net,2015:ring-alignment">
    &lt;head>
       &lt;name>div-based alignment of multiple versions of Ring o Roses&lt;/name>
       &lt;master-location href="http://textalign.net/release/TAN-2020/examples/TAN-A/ringoroses.div.1.xml"/>
       &lt;license which="by_4.0" licensor="park"/>
       &lt;source xml:id="eng-uk">
          &lt;IRI>tag:parkj@textalign.net,2015:ring01&lt;/IRI>
          &lt;name>Transcription of ring around the roses in English (UK)&lt;/name>
          &lt;location href="../ring-o-roses.eng.1881.xml" accessed-when="2015-03-10"/>
       &lt;/source>
       &lt;source xml:id="eng-us">
          &lt;IRI>tag:parkj@textalign.net,2015:ring02&lt;/IRI>
          &lt;name>Transcription of ring around the roses in English (US)&lt;/name>
          &lt;location href="../ring-o-roses.eng.1987.xml" accessed-when="2014-08-13"/>
       &lt;/source>
       <emphasis role="bold">&lt;source xml:id="ger">
          &lt;IRI>tag:hans@beispiel.com,2014:ringel&lt;/IRI>
          &lt;name>Transcription of an ancestor of Ring around the roses in German&lt;/name>
          &lt;location accessed-when="2014-08-22">http://beispiel.com/TAN-T/ringel.xml&lt;/location>
          &lt;location accessed-when="2014-08-22">../TAN-T/ring-o-roses.deu.1897.xml&lt;/location>
       &lt;/source>
       &lt;adjustments src="ger">
          &lt;skip div-type="Gedicht"/>
          &lt;rename n="e" by="-1"/>
       &lt;/adjustments></emphasis>
       &lt;vocabulary-key>
          &lt;person xml:id="park" which="Jenny Park"/>
          <emphasis role="bold">&lt;alias id="ring" idrefs="ger eng-us"/></emphasis>
       &lt;/vocabulary-key>
       &lt;resp who="park" roles="creator"/>
       &lt;change when="2014-08-14" who="park">Started file&lt;/change>
       <emphasis role="bold">&lt;change when="2014-08-22" who="park">Added German version.&lt;/change></emphasis>
       &lt;to-do>
          &lt;comment when="2018-08-09-04:00" who="park">Finish file.&lt;/comment>
       &lt;/to-do>
    &lt;/head>
    . . . . . .
&lt;/TAN-A></programlisting></para>
            <para>The first major change is the insertion of a third <code><link
                     linkend="element-source">&lt;source></link></code>, pointing to the new file
               and specifying its name and IRI. Note that two <code><link linkend="element-location"
                     >&lt;location></link></code>s have been provided, one for the original and
               another for a local copy we have saved. Validation will take into account only the
               first document available. If we wanted to work primarily off our local copy, we would
               have put that <code><link linkend="element-location">&lt;location></link></code>
               first. By placing it second, we allow the validation engine to work primarily off the
               master version and therefore look for updates and changes. If that version is
               unavailable, validation will be made against the second, local copy.</para>
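            <para>The fallback behavior just described amounts to taking the first location that
               can be retrieved. A minimal sketch in Python, with hypothetical function names and a
               stand-in availability check rather than real network access:</para>
            <para>
               <programlisting language="python">def first_available(locations, is_available):
    """Return the first location for which is_available() is true, else None."""
    for location in locations:
        if is_available(location):
            return location
    return None


locations = [
    "http://beispiel.com/TAN-T/ringel.xml",  # master copy, tried first
    "../TAN-T/ring-o-roses.deu.1897.xml",    # local copy, used as a fallback
]


def offline(location):
    """Stand-in check: pretend that only local (non-http) paths can be reached."""
    return not location.startswith("http")


print(first_available(locations, offline))</programlisting>
            </para>
            <para>With the master copy unreachable, the sketch settles on the local copy. Reversing
               the order of the list would make the local copy the preferred one.</para>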
            <para><code><link linkend="element-adjustments">&lt;adjustments></link></code> specifies
               through its <code><link linkend="attribute-src">@src</link></code> that only the
               German version should be adjusted by the contained instructions. The enclosed
                     <code><link linkend="element-skip">&lt;skip&gt;</link></code> says, in effect,
               to ignore the wrapping <link linkend="element-div"><code>&lt;div></code></link> for
               purposes of alignment. The <code><link linkend="element-rename"
                  >&lt;rename></link></code> takes care of the apparent typographical error, and
               anchors the German version to the U.S. one. Note that we have referred to the line as
                  <code>e</code>, just as the German version does. But we could have used
                  <code>5</code>, or even the Roman numeral <code>v</code>, had we wished to. Every
               TAN file's numeration system is evaluated locally, independent of any external files.
               We need not reconcile the <code>a</code>, <code>b</code>, and <code>c</code>
               <code><link linkend="attribute-n">@n</link></code> values in the German version,
               because these will be automatically treated as equivalent to <code>1</code>,
                  <code>2</code>, and <code>3</code>. The TAN format supports four numeration
               systems other than Arabic numerals: Roman numerals (uppercase or lowercase),
               alphabetic numerals (a, b, c, ..., z, aa, bb, ...), digit-alphabet combinations
               (e.g., 1a, 1e, 4g), and alphabet-digit combinations (e.g., a4, a5, b5). The last two
               systems are each interpreted as a two-tier numbering system.</para>
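            <para>The way ambiguous numerals are resolved can be illustrated with a rough Python
               sketch. It is not TAN's implementation. All function names are hypothetical, the
               rule that <code>aa</code> is 27 and <code>bb</code> is 28 is inferred from the series
               given above, Roman numerals are not checked for well-formedness, and the two-tier
               systems are left out.</para>
            <para>
               <programlisting language="python">import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def letters_to_int(value):
    """a=1 ... z=26, aa=27, bb=28 (assumed repeated-letter scheme)."""
    return (len(value) - 1) * 26 + (ord(value[0]) - ord("a") + 1)


def roman_to_int(value):
    """Convert a Roman numeral; no check is made that it is well formed."""
    digits = [ROMAN_VALUES[ch] for ch in value]
    total = 0
    for i, current in enumerate(digits):
        following = digits[i + 1] if i + 1 != len(digits) else 0
        # A smaller digit before a larger one is subtracted (iv = 4).
        total += -current if following > current else current
    return total


def interpret_n(value, priority):
    """Resolve an @n-like value to an integer; priority is 'letters' or 'roman'."""
    v = value.lower()
    if re.fullmatch(r"[0-9]+", v):
        return int(v)
    is_letters = re.fullmatch(r"([a-z])\1*", v) is not None
    is_roman = re.fullmatch(r"[ivxlcdm]+", v) is not None
    if is_letters and (priority == "letters" or not is_roman):
        return letters_to_int(v)
    if is_roman:
        return roman_to_int(v)
    raise ValueError("not a simple numeral: " + value)


print(interpret_n("c", "letters"), interpret_n("c", "roman"))
print(interpret_n("e", "letters") - 1)</programlisting>
            </para>
            <para>The first line of output shows the ambiguity of <code>c</code> (3 as a letter, 100
               as a Roman numeral). The second mirrors the <code>&lt;rename></code> above: the
               letter <code>e</code> resolves to 5, and shifting it by -1 gives 4, the expected
               fourth line.</para>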
            <para>The second major change, to address the German version's different value of
                     <code><link linkend="element-work">&lt;work></link></code>, is the addition of
               an <code><link linkend="element-alias">&lt;alias></link></code>, which allows us to
               assign one or more vocabulary items a common id. Wherever the value <code>ring</code>
               is used, it stands in for <code>ger</code> and <code>eng-us</code>, which point to
               the two TAN-T files. You may be familiar with this concept from critical editions,
               where a siglum, e.g., A, might stand for several other sigla, e.g., a, b, and c. So
               every time you see something said about A, you know that by implication it is true of
               a, b, and c.</para>
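            <para>In programming terms an alias behaves like a lookup that is expanded before a
               reference is used. A small Python sketch, with a hypothetical function name and a
               plain dictionary standing in for the <code>&lt;alias></code> declaration:</para>
            <para>
               <programlisting language="python">aliases = {"ring": ["ger", "eng-us"]}


def expand_idrefs(idrefs, aliases):
    """Expand a space-separated list of ids, replacing each alias by its members."""
    expanded = []
    for idref in idrefs.split():
        expanded.extend(aliases.get(idref, [idref]))
    return expanded


print(expand_idrefs("ring", aliases))
print(expand_idrefs("ring eng-uk", aliases))</programlisting>
            </para>
            <para>The first call yields <code>ger</code> and <code>eng-us</code>. The second adds
                  <code>eng-uk</code> unchanged, because it is not itself an alias.</para>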
            <para>Every TAN-T file has only one work and only one written source. So if you wish to
               make a claim about a particular work or source, you can use a TAN-T's id as a
               surrogate. That is, the <code><link linkend="attribute-id">@id</link></code> in
                     <code><link linkend="element-source">&lt;source></link></code> can stand it to
               represent either the work or the book or manuscript from which the text has been
               taken. So if we make claims in our TAN-A file about a written source or a work,
                  <code>ring</code> would assert the claim to be true for the works pointed to by
               the German and the U.S. version. (We do not need to specifically mention
                  <code>eng-uk</code> in the <code><link linkend="element-alias"
                  >&lt;alias></link></code>, since it has the same work IRI as the U.S. version
               does.) <footnote>
                  <para>Alternatively, instead of <code><link linkend="element-alias"
                           >&lt;alias></link></code>, we could simply have adjusted our TAN-voc
                     file, adding the German version's <code><link linkend="element-IRI"
                           >&lt;IRI></link></code> value to the appropriate vocabulary item, and use
                     that id.</para>
               </footnote></para>
            <para>The last major insertion is a new <code><link linkend="element-change"
                     >&lt;change></link></code>, documenting when we made the alterations. Its
                     <code><link linkend="attribute-when">@when</link></code> effectively updates
               the version of our TAN-A file.</para>
            <para>With these additions, the German version is now aligned with the other two. We
               could have made our work simpler just by directly modifying our local copy of the
               German version. But such a change would not have affected the master copy. What
               happens when the owner of the German file makes changes? At that point we would be
               faced with a version conflict: changes in the original, and our own changes in the copy. We
               would struggle to reconcile the differences. And we would have to repeat that
               exercise every time the German file was updated. By keeping our local copy of the
               German file unchanged, and making simple adjustments in our TAN-A file, we can keep
               our local copy synchronized with the master file and yet make the adjustments needed
               to coordinate with ours.</para>
            <para>The purpose statement in these guidelines says that TAN was "designed to <emphasis
                  role="bold">maximize</emphasis> the syntactic and semantic interoperability of
               texts, annotations, and language resources." Here we see the importance of the
               qualifier "maximize." In no world will there ever be (nor should there be, it seems)
               a single, indisputable way to divide a given work. The TAN format does not change
               that reality. Rather, it provides a convergent ecosystem in which different practices
               can be easily reconciled, to help editors and authors enhance cross-project
               interoperability without artificially forcing conformity, or suppressing legitimately
               different outlooks.</para>
            <para>Perhaps Hans Schmidt, the producer of the German version, can be contacted (e.g.,
               through his tag URN). We do so, and we suggest that he modify the version to make it
               align better. Perhaps he has reasons for labeling the lines with letters, and perhaps
               he is reluctant to explicitly identify this poem with <emphasis role="italic">Ring
                  around the Rosie</emphasis>. That is within his rights. But the conversation might
               lead to our pointing out that <code>n="e"</code> should probably be
                  <code>n="d"</code> and that there is an apparent typographic error in the last
               line. Or perhaps we're the ones in error. (The original, printed book has the poem
               twice on page 438, one with the spelling "Holderbuch" at line 3, the other,
               "Holderbusch".) If Schmidt chooses to correct his master file, he can add a new
                     <code><link linkend="element-change">&lt;change></link></code>, and thereby
               tacitly notify anyone else using the file that corrections have been made.</para>
            <para>At this point we have a network of six TAN files, five from our collection and one
               from outside. Although simple and small, this network could be extended to address
               some creative and complex research questions. Applications based on XSLT stylesheets
               could be used to automatically align the versions for reading and study, or to
               perform statistical analysis. </para>
            <para>What you've read so far is only a cursory introduction to TAN features. Study the
               rest of these guidelines, as well as example TAN libraries, and you will find
               numerous ways to develop TAN files, and to use them to enhance your research,
               teaching, and writing.</para>
         </section>
      </chapter>
   </part>
   <part xml:id="detailed_description">
      <title>Detailed description</title>
      <partintro>
         <para>This part of the guidelines provides a detailed description of the design and
            structure of the formats of the Text Alignment Network. The material follows the
            organization of the schema files (kept in the <code>schemas</code> subdirectory), so
            both can be studied in tandem.</para>
         <para><xref linkend="concepts_common"/> outlines, in a non-technical way, the principles
            and technical foundations of the TAN format.</para>
         <para><xref linkend="class_common"/>, <xref linkend="class_1"/>, <xref linkend="class_2"/>,
            and <xref linkend="class_3"/> describe each TAN format, class by class. Each chapter
            starts with theory or scholarly context before expanding on technical points. </para>
         <para>The chapters in this part have been written with the assumption that you have already
            read the previous part (<xref linkend="general_overview"/>) and that you have already
            started to create or edit a TAN collection.</para>
         <para>Because readers will come from different specialties, all acronyms, abbreviations,
            and concepts are defined and explained, albeit tersely, to explain how they affect the
            use of TAN. Suggestions for further reading are provided for those who want a more
            thorough introduction to a topic. </para>
      </partintro>
      <chapter xml:id="concepts_common">
         <title>General underpinnings</title>
         <para>This chapter retains something of the introductory spirit of the previous one by
            providing an overview of the fundamental principles and technologies behind TAN. The
            goal is to explain the design of the format. Although this chapter assumes on your part
            no prior knowledge of any particular technology, it is also not meant to be a tutorial.
            Links to further reading will take you to good introductory material.</para>
         <section xml:id="design_principles">
            <title>Design principles</title>
            <para>The TAN formats have been designed around a few basic principles:</para>
            <para><emphasis role="bold">Scholarly habits</emphasis>
               <itemizedlist>
                  <listitem>
                     <para>Be patient.</para>
                  </listitem>
                  <listitem>
                     <para>Simplify.</para>
                  </listitem>
                  <listitem>
                     <para>Stay focused.</para>
                  </listitem>
                  <listitem>
                     <para>Don't repeat yourself.</para>
                  </listitem>
                  <listitem>
                     <para>Don't state the obvious.</para>
                  </listitem>
                  <listitem>
                     <para>Use familiar conventions.</para>
                  </listitem>
               </itemizedlist></para>
            <para>
               <emphasis role="bold">Scholarly freedom</emphasis>
               <itemizedlist>
                  <listitem>
                     <para>Express doubt.</para>
                  </listitem>
                  <listitem>
                     <para>Offer alternatives.</para>
                  </listitem>
                  <listitem>
                     <para>Exercise independence.</para>
                  </listitem>
                  <listitem>
                     <para>Invite interdependence.</para>
                  </listitem>
               </itemizedlist></para>
            <para>
               <emphasis role="bold">Scholarly responsibility</emphasis>
               <itemizedlist>
                  <listitem>
                     <para>Declare your assumptions.</para>
                  </listitem>
                  <listitem>
                     <para>Make your work citable.</para>
                  </listitem>
                  <listitem>
                     <para>Satisfy scholars' expectations:</para>
                     <itemizedlist>
                        <listitem>
                           <para>Who did what when?</para>
                        </listitem>
                        <listitem>
                           <para>What are your sources?</para>
                        </listitem>
                        <listitem>
                           <para>How do you define your terms?</para>
                        </listitem>
                        <listitem>
                           <para>What alterations have you made to your sources?</para>
                        </listitem>
                        <listitem>
                           <para>What rights do I have to use your material?</para>
                        </listitem>
                     </itemizedlist>
                  </listitem>
               </itemizedlist></para>
            <para>
               <emphasis role="bold">General utility</emphasis>
               <itemizedlist>
                  <listitem>
                     <para>Use stable technology.</para>
                  </listitem>
                  <listitem>
                     <para>Keep design predictable, consistent.</para>
                  </listitem>
                  <listitem>
                     <para>Make each datum human readable.</para>
                  </listitem>
                  <listitem>
                     <para>Make each datum computer actionable.</para>
                  </listitem>
               </itemizedlist>
            </para>
         </section>
         <section>
            <title>Format organization</title>
            <para>The Text Alignment Network is a modular suite of XML encoding formats, each one
               designed for a specific type of textual data, divided into three classes: texts
               (class 1), text alignments and annotations (class 2), and everything else (class 3). </para>
            <para><emphasis role="bold">Class 1</emphasis>: representations of textual objects,
               i.e., transcriptions. (See <link xlink:href="#transcription_and_transliteration">note
                  on transcriptions versus transliterations</link>.) Each transcription file
               contains the text of a single work from a single text-bearing object (which we term
                  <emphasis>scriptum</emphasis>; see <xref xlink:href="#domain_model"/>), whether
               physical or digital. There are two types of transcription file: a standard generic
               format (TAN-T) and a gentle customization of TEI All (TAN-TEI). These two types are
               differentiated by the root element, <code><link linkend="element-TAN-T"
                     >&lt;TAN-T></link></code> and <code>&lt;TEI></code> respectively.</para>
            <para><emphasis role="bold">Class 2</emphasis>: annotations on class-1 texts, and
               alignment declarations. There are two types of alignment, one for broad, general
               alignments and another for granular, word-for-word alignments. The former, with
                     <code><link linkend="element-TAN-A">&lt;TAN-A></link></code> as the root
               element, aligns any number (one or more) of class-1 files, and allows one to annotate
               those files. The latter, <code><link linkend="element-TAN-A-tok"
                     >&lt;TAN-A-tok></link></code>, aligns only pairs of class-1 files, on a
               word-for-word basis. Lexico-morphology files, <code><link linkend="element-TAN-A-lm"
                     >&lt;TAN-A-lm></link></code>, are used to encode the lexical and morphological
               (or part-of-speech) forms of individual words from a single class-1 file, or of a
               language in general.</para>
            <para><emphasis role="bold">Class 3</emphasis>: everything else. <code><link
                     linkend="element-TAN-voc">&lt;TAN-voc></link></code> collects and labels
               vocabulary items used in other TAN files. TAN catalog files have the root element
                     <code><link linkend="element-collection">&lt;collection></link></code>, and
               they index locally available TAN files, and selected parts of their metadata.
                     <code><link linkend="element-TAN-mor">&lt;TAN-mor></link></code> is used to
               define the grammatical categories or features of a given language and to specify
               rules for lexico-morphological codes in dependent TAN-A-lm files.</para>
            <para>TAN adopts a <emphasis role="italic">stand-off</emphasis> approach to annotation
               or markup. In the alternative method, <emphasis role="italic">inline</emphasis>
               markup, which you may be familiar with from TEI or HTML, an annotation is applied
               directly to the base text, e.g., <code>&lt;p>He said
                  &lt;quote>"Jump!"&lt;/quote>&lt;/p></code>, where the inner element
                  <code>&lt;quote></code> annotates the third word. </para>
            <para>In stand-off annotation, however, <code>&lt;p>He said "Jump!"&lt;/p></code> would
               be left as-is, and somewhere else there would be an annotation that states that the
               third word is a quotation. If the stand-off annotation is in the same file, it is an
                  <emphasis>internal stand-off</emphasis> annotation. If the annotation is in a
               different file, it is an <emphasis>external stand-off</emphasis> annotation.</para>
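            <para>The difference can be sketched in a few lines of Python. This is only an
               illustration under simple assumptions: tokens are split on spaces, and the
               annotation structure (the keys <code>token</code> and <code>type</code>) is
               hypothetical, not TAN's own pointer syntax.</para>
            <programlisting language="python"><![CDATA[# Hypothetical illustration only; TAN's real pointer syntax is richer than this.
base_text = 'He said "Jump!"'

# Tokenize on spaces. The base text itself is never modified.
tokens = base_text.split()

# A stand-off annotation kept apart from the text: it points at a token
# by its 1-based position and states what that token is.
annotations = [{"token": 3, "type": "quote"}]

def annotated_tokens(tokens, annotations):
    """Pair each annotation's type with the token it points to."""
    return [(tokens[a["token"] - 1], a["type"]) for a in annotations]

print(annotated_tokens(tokens, annotations))
]]></programlisting>
            <para>Whether the list <code>annotations</code> lives in the same file as
                  <code>base_text</code> or in another one is what separates internal from external
               stand-off annotation.</para>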
            <para>For many common cases, inline annotation is simple, convenient, and
               straightforward. But as inline annotations are added, the benefits slowly diminish.
               When parts of a file attract multiple markup elements, the file can become difficult
               to read and navigate. </para>
            <para>Stand-off annotation provides several benefits: <itemizedlist>
                  <listitem>
                     <para>An editor can focus on a limited set of closely related questions.</para>
                  </listitem>
                  <listitem>
                     <para>A source text without inline annotations is less cluttered, and therefore
                        easier to read, than one with inline annotations. </para>
                  </listitem>
                  <listitem>
                     <para>Editors can work on separate annotation files based upon the same master
                        transcription file, even if they have very different research
                        interests.</para>
                  </listitem>
                  <listitem>
                     <para>A single annotation can refer to two or more texts (e.g., identification
                        of quotations) without having to prioritize, or be located in, any single
                        one.</para>
                  </listitem>
                  <listitem>
                     <para>Complementary or competing annotations can be made, and those annotations
                        may point to concurrent or overlapping spans of text (a major problem for
                        inline annotation, where according to XML rules no element may interlock or
                        overlap with another).</para>
                  </listitem>
                  <listitem>
                     <para>A corpus of stand-off external annotation files becomes, collectively, a
                        complex dataset, supporting lines of research that might not have been
                        anticipated by any single project.</para>
                  </listitem>
                  <listitem>
                     <para>Editorial labor can be conducted without central coordination, as
                        individuals work at their own pace, independently.</para>
                  </listitem>
                  <listitem>
                     <para>When an error is found in a transcription file, it can be corrected in a
                        single place, in the master. Anyone using a copy of that master file will be
                        notified in the validation process of changes that have been made and they
                        can deal with them accordingly. </para>
                  </listitem>
                  <listitem>
                     <para>Any data file can be updated independent of any other that points to it,
                        or to which it points.</para>
                  </listitem>
                  <listitem>
                     <para>The cross-file links required by stand-off annotation tie files into a
                        network, and those files can then be combined and transformed in any number
                        of ways to produce a wide variety of derivative documents (e.g., collated
                        versions, statistical analysis).</para>
                  </listitem>
               </itemizedlist></para>
            <para>The stand-off approach works toward a principle often valued in computer science,
               that of the disaggregation of data. That is, in a master format, data of a particular
               type should not be entangled with other types of data. It can later be reaggregated
               in all kinds of ways, but that is an end product, not the way master data should be
               stored and managed. It is analogous to the way any well-run kitchen keeps its
               ingredients separate, until it is time to cook or bake a variety of products. We keep
               separate our flour, eggs, sugar, and so forth, until we find out what a recipe calls
               for, at which point we combine those ingredients in a variety of ways. It would be
               terrible if you were asked to make muesli (or granola), and found that someone had
               already turned the ingredients you wanted into a cake!</para>
            <para>Stand-off annotation is not without problems and vulnerabilities. For example:<itemizedlist>
                  <listitem>
                     <para>When (not if) the base text changes, the editor is unaware of how that
                        change will affect any stand-off annotations.</para>
                  </listitem>
                  <listitem>
                     <para>Not having the annotated text and the annotation in the same reading
                        space can be an inconvenience. </para>
                  </listitem>
                  <listitem>
                     <para>When searching for, or querying, the base text, stand-off annotations can
                        be difficult or impossible to incorporate to refine a selection.</para>
                  </listitem>
                  <listitem>
                     <para>When using the material for other purposes, it can be cumbersome or
                        challenging to reintegrate annotations with the base text.</para>
                  </listitem>
                  <listitem>
                     <para>Linking an annotation to its base text requires extra work and
                        maintenance. Normally this involves building and administering a library of
                        identifiers. Adding and removing ids, or checking them for errors, can be
                        time-consuming and confusing.</para>
                  </listitem>
               </itemizedlist>These are important challenges, but TAN validation rules have been
               designed to mitigate such problems. The last problem listed above is perhaps the
               greatest barrier to stand-off annotation. TAN approaches pointing in a much different
               way that is closer to current scholarly habits. See <xref
                  xlink:href="#pointer-syntax"/>.</para>
            <para>Furthermore, TEI inline annotations are supported. In general, you are encouraged
               to use TEI inline annotations where they are simple and make sense. But when the
               markup accumulates, threatens to create overlapping structures, or poses other
               difficulties, TAN class 2 files can be an ideal way to build and curate
               annotations.</para>
         </section>
         <section xml:id="assumptions_creating_data">
            <title>Assumptions in the creation of TAN data</title>
            <para>All creators and users of TAN files are expected to share a few basic
               assumptions.</para>
            <para>First, all TAN-compliant data is to be understood as largely
                  <emphasis>derivative</emphasis>. That is, data files express no originality or
               creativity independent of their sources (but see below about interpretation). A TAN
               file should be created with the intent of adhering as closely as possible to some
               model or archetype. For example, a transcription is assumed to replicate faithfully
               some earlier digital edition or text-bearing material object (e.g., stone, papyrus,
               manuscript, printed book for written text; audiovisual media for oral or performative
               texts). Morphological files and alignment files should describe as clearly and as
               reliably as possible their source transcriptions. <emphasis>In creating and
                  publishing a TAN file you claim to have offered a good-faith representation or
                  description of something; in using a TAN file, you hold the creator to that
                  expectation.</emphasis></para>
            <para>Second, all core TAN files are <emphasis>interpretive</emphasis>. That is, they
               are permeated by editorial assumptions and opinions that might not be shared by
               everyone. If there is any semblance of originality or creativity in a TAN file it
               is in that interpretive outlook. For example, if you edit a transcription file you
               must decide how to handle unusual letterforms and other visible marks. Your decisions
               will be influenced by your perspective on the original text and its native writing
               system, and how you interpret and use Unicode. If you write an alignment file, you
               must make decisions about what factors caused one text to be transformed into
               another. Lexicomorphological files require you to commit to one or more grammars and
               dictionaries, which adopt certain perspectives on language, and you must discern how
               best to handle cases of vagueness and ambiguity. No TAN file ever stands completely
               outside the interpretive act. <emphasis>In creating and publishing a TAN file you
                  claim to have disclosed as best you can the assumptions behind your interpretive
                  outlook; in using a TAN file, you hold the creator to that
               expectation.</emphasis></para>
            <para>Third, all core TAN files are <emphasis>applicable</emphasis>. That is, the
               interpretive impulse is assumed to be coupled with an equally strong desire to make
               the data as useful to as many users as possible, even those who may not share your
               assumptions or interpretation. TAN files are intended for intertextual comparison, so
               idiosyncrasies of a particular text-bearing object will be regarded by some users as
               either uninteresting or an obstacle. A creator of a transcription file should
               normalize and segment texts, adopting the most widely used reference systems, so as
               to optimize the alignment process. Morphological files should depend whenever
               possible upon commonly accepted grammars and lexica. Alignment files should work with
               comprehensible categories of text reuse. No TAN file will be applicable to everyone,
               but it should suit as many users, and as many purposes, as possible.
                  <emphasis>In creating a TAN file you claim to use common, shared
                  conventions whenever possible, and to note any departures; in using a TAN file,
                  you hold the creator to that expectation.</emphasis></para>
            <para xml:id="accuracy-precision-comprehensiveness">Fourth, TAN data is to be considered
                  <emphasis>accurate, but not necessarily precise or complete</emphasis>. For
               example, if a TAN-A file claims that the opening of Plato's
                  <emphasis>Republic</emphasis> book 3 quotes from Homer's
                  <emphasis>Iliad</emphasis>, the claim is true and accurate, but is neither precise
               nor complete. Parts of the opening of book 3 are certainly not quotations, and the
               whole of the <emphasis>Iliad</emphasis> is not quoted in the
                  <emphasis>Republic</emphasis>. Or take a TAN-A-tok file. The token-for-token
               alignment of two texts might be selective, and focus only on the points of interest
               to the editor. Although the TAN formats permit a great deal of both precision and
               comprehensiveness, neither is mandated, except where explicitly noted by the TAN
               specifications. <emphasis>In creating a TAN file you claim to make accurate
                  assertions; in using a TAN file, you should hold the creator to that expectation,
                  but you must assess for yourself how precise and complete it is.</emphasis></para>
         </section>
         <section>
            <title>Core technology</title>
            <para>TAN depends upon a set of relatively stable technologies. Those technologies and
               the underlying terminology are briefly explained below, with attention paid to
               interpretive decisions that affect validation rules. </para>
            <section xml:id="unicode">
               <title>Unicode</title>
               <section>
                  <title>What is it?</title>
                  <para>Unicode is the worldwide standard for the encoding, representation, and
                     exchange of digital texts. The standard is maintained by a nonprofit consortium
                     whose goal is to represent all the world's writing systems, living and
                     historical. The Unicode standard allows us to share texts in any alphabet,
                     syllabary, or ideographic system reliably, regardless of how that text is
                     rendered (e.g., fonts, display).</para>
                  <para>With more than 128,000 characters, Unicode is almost as complex as human
                     writing itself. The entire sequence of characters is divided into blocks, each
                     one reserved, more or less, for a particular script or group of characters.
                     Within each block, characters may be grouped further. Each character is
                     assigned a single number called a codepoint.</para>
                  <para>Codepoints are numbered according to the hexadecimal system (base 16), which
                     uses the digits 0 through 9 and the letters A through F. (The decimal number 10
                     is hexadecimal A; decimal 11 = hex B; decimal 16 = hex 10; decimal 79 = hex
                     4F.) It is helpful to think of Unicode as a very long table of sixteen columns,
                     a glyph in each square; this is illustrated nicely <link
                        xlink:href="http://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF"
                        >in this article</link>.</para>
                  <para>It is common to refer to Unicode characters by their value and perhaps by
                     their name. The value customarily starts with "U+" and continues with the
                     hexadecimal value, usually at least four hexadecimal digits. When the
                     official Unicode name is given, it is normally in uppercase. Examples:</para>
                  <para>
                     <table frame="all">
                        <title>Unicode characters</title>
                        <tgroup cols="3">
                           <colspec colname="c1" colnum="1" colwidth="1.0*"/>
                           <colspec colname="c2" colnum="2" colwidth="1.0*"/>
                           <colspec colname="c3" colnum="3" colwidth="1.0*"/>
                           <thead>
                              <row>
                                 <entry>Character</entry>
                                 <entry>Unicode value</entry>
                                 <entry>Unicode name</entry>
                              </row>
                           </thead>
                           <tbody>
                              <row>
                                 <entry>" " (space)</entry>
                                 <entry>U+0020</entry>
                                 <entry>SPACE</entry>
                              </row>
                              <row>
                                 <entry>®</entry>
                                 <entry>U+00AE</entry>
                                 <entry>REGISTERED SIGN</entry>
                              </row>
                              <row>
                                 <entry>ю</entry>
                                 <entry>U+044E</entry>
                                 <entry>CYRILLIC SMALL LETTER YU</entry>
                              </row>
                           </tbody>
                        </tgroup>
                     </table>
                  </para>
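                  <para>The conventions above can be reproduced with Python's standard
                        <code>unicodedata</code> module. The following is a minimal sketch; the
                     helper name <code>describe</code> is an assumption made for this example.</para>
                  <programlisting language="python"><![CDATA[import unicodedata

def describe(char):
    """Return the customary 'U+XXXX NAME' label for a single character."""
    # :04X pads the hexadecimal codepoint to at least four uppercase digits.
    return f"U+{ord(char):04X} {unicodedata.name(char)}"

# Hexadecimal and decimal are two spellings of the same number.
assert int("4F", 16) == 79
assert hex(16) == "0x10"

# The three characters from the table: space, registered sign, Cyrillic yu.
for ch in (" ", "\u00ae", "\u044e"):
    print(describe(ch))
]]></programlisting>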
                  <para>In an XML file, nearly any Unicode codepoint may be used, either by typing
                     or pasting the character directly, or by using <emphasis role="bold">XML
                        entities</emphasis>. An XML entity is a proxy for some other text, marked by
                     an ampersand, a name or number, and then a semicolon. For example,
                        <code>&amp;amp;</code> represents the ampersand and <code>&amp;lt;</code>
                     stands for <code>&lt;</code>. To access a specific Unicode character, an entity
                     (strictly speaking, a numeric character reference) starts with
                        <code>&amp;#x</code> followed by the hexadecimal codepoint (if you prefer to
                     work with decimal codepoints, leave off the <code>x</code>). For example, the
                     XML hex entity <code>&amp;#x44E;</code> (or <code>&amp;#1102;</code> in
                     decimal) is a proxy for the Cyrillic small letter yu.</para>
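                  <para>That these forms are interchangeable can be checked with the XML parser in
                     Python's standard library. This is a small sketch; the element name
                        <code>w</code> is arbitrary and chosen only for the example.</para>
                  <programlisting language="python"><![CDATA[import xml.etree.ElementTree as ET

# The same content written three ways: hexadecimal reference,
# decimal reference, and the literal character itself.
forms = ["<w>&#x44E;</w>", "<w>&#1102;</w>", "<w>\u044e</w>"]
texts = [ET.fromstring(form).text for form in forms]

# After parsing, all three are the single character U+044E.
assert texts == ["\u044e", "\u044e", "\u044e"]

# The predefined entities resolve to the characters they stand for.
assert ET.fromstring("<w>&amp; &lt;</w>").text == "& <"
]]></programlisting>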
               </section>
               <section xml:id="normalization">
                  <title>Unicode normalization</title>
                  <para>Unicode rules provide guidance on how text should be normalized, to identify
                     equivalent variations. For example, the character o (U+006F: LATIN SMALL LETTER
                     O) followed by the combining accent ¨ (U+0308: COMBINING DIAERESIS) should be
                     treated as identical in meaning to the single character ö (U+00F6: LATIN SMALL
                     LETTER O WITH DIAERESIS). Likewise, two codepoints could be used for the Greek
                     question mark (;), U+037E GREEK QUESTION MARK and U+003B SEMICOLON, and
                     normalization converts the former to the latter.</para>
                  <para>TAN validation rules require all data to be normalized according to the
                     Unicode NFC algorithm (the most common of the four normalization methods). Any
                     text in a TAN file that is not NFC normalized will be marked as invalid. A
                     supplied Schematron Quick Fix will let users automatically normalize text (for
                     editing tools such as Oxygen that support Schematron Quick Fixes). This
                     enforcement of NFC normalization helps to make sure that texts are fairly
                     compared.</para>
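                  <para>For readers who prepare text outside an XML editor, the same check can be
                     approximated with Python's standard <code>unicodedata</code> module. This is a
                     sketch of NFC normalization in general, not of the TAN validation code; the
                     helper name <code>is_nfc</code> is an assumption.</para>
                  <programlisting language="python"><![CDATA[import unicodedata

def is_nfc(text):
    """True if the text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text

decomposed = "o\u0308"   # LATIN SMALL LETTER O + COMBINING DIAERESIS
composed = unicodedata.normalize("NFC", decomposed)

assert composed == "\u00f6"   # LATIN SMALL LETTER O WITH DIAERESIS
assert not is_nfc(decomposed)
assert is_nfc(composed)

# U+037E GREEK QUESTION MARK normalizes to U+003B SEMICOLON.
assert unicodedata.normalize("NFC", "\u037e") == ";"
]]></programlisting>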
               </section>
               <section xml:id="unicode-characters-with-special-interpretation">
                  <title>Unicode characters with special interpretation</title>
                  <para>The characters U+200B ZERO WIDTH SPACE, U+200D ZERO WIDTH JOINER, and U+00AD
                     SOFT HYPHEN placed at the end of a leaf <link linkend="element-div"
                           ><code>&lt;div></code></link>, perhaps followed by space that will be
                     ignored (see below), signal that the text is to be joined with any subsequent
                     text (i.e., the next leaf <link linkend="element-div"
                        ><code>&lt;div></code></link>). Accordingly, any TAN function that needs to
                     extract text from a leaf <link linkend="element-div"
                        ><code>&lt;div></code></link> structure will delete from the end of its text
                     the U+200B, U+200D, or U+00AD character and its trailing space. (By contrast,
                     text from a leaf <link linkend="element-div"><code>&lt;div></code></link> that
                     does not end this way will first be space-normalized, then a single space will
                     be appended.) Because these special line-end characters are difficult to
                     distinguish visually from spaces and hyphens, their XML entities,
                        <code>&amp;#x200b;</code>, <code>&amp;#x200d;</code>, and
                        <code>&amp;#xad;</code> should be preferred in any XML output.</para>
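                  <para>The joining rule can be sketched as follows. The function name
                        <code>leaf_div_text</code> is hypothetical, and the code only approximates
                     the behavior described above; it is not the TAN function library's
                     implementation.</para>
                  <programlisting language="python"><![CDATA[import re

# U+200B ZERO WIDTH SPACE, U+200D ZERO WIDTH JOINER, U+00AD SOFT HYPHEN
SPECIAL_END_CHARS = "\u200b\u200d\u00ad"

def leaf_div_text(raw):
    """Return the text of one leaf div, ready to be joined to the next."""
    # Collapse runs of XML space characters, then drop leading and trailing space.
    text = re.sub(r"[ \t\n\r]+", " ", raw).strip(" ")
    if text.endswith(tuple(SPECIAL_END_CHARS)):
        # Special ending: drop the character so the next div joins directly.
        return text[:-1]
    # Ordinary ending: exactly one space separates this div from the next.
    return text + " "

divs = ["  inter\u00ad \n", "national\n   news "]
joined = "".join(leaf_div_text(d) for d in divs)
print(repr(joined))
]]></programlisting>
                  <para>Here the soft hyphen at the end of the first leaf lets the two halves of the
                     word join without an intervening space.</para>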
                  <para>Much has been written about the different ways U+00AD SOFT HYPHEN has been
                     or should be used and interpreted. Debate will no doubt continue. TAN design
                     assumes that the soft hyphen marks a place in a word where a line break has
                     occurred, is allowed to occur, or both. In situations where the text is printed
                     or displayed, any soft hyphen that does not mark a word broken across lines should
                     not be displayed.</para>
               </section>
               <section xml:id="combining_characters">
                  <title>Combining characters</title>
                  <para>At the core level of conformance, Unicode does not dictate whether combining
                     characters (accents, modifying symbols) should be counted independently, or as
                     part of a base character, nor do core XML technologies. In most cases, this
                     point is negligible. But it can affect regular expressions and XPath
                     expressions (see below). </para>
                  <para>Two of the class-2 formats allow the counting of characters. Such counting
                     is assumed to be made exclusively of individual base (non-combining) characters
                     (each perhaps followed by one or more combining characters). Therefore one
                     character is defined as the regular expression <code>\P{M}\p{M}*</code>, bound
                     to global variable <xref xlink:href="#variable-char-regex"/>. Any numerical
                     reference made in a TAN file to an individual character, i.e., through
                           <code><link linkend="attribute-chars">@chars</link></code>, is
                     interpreted by counting only non-combining characters. When the nth character
                     is requested, TAN functions will return the nth base character along with any
                     combining characters that immediately follow. </para>
                  <para>For example, a̳b̈́c͠d consists of four base characters interleaved with
                     three combining characters, technically seven in total. But for <code><link
                           linkend="attribute-chars">@chars</link></code>, which counts only base
                     characters, there are a maximum of four characters. A value of 1 picks both
                     the first base character and its combining character, a̳.</para>
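                  <para>The counting rule can be approximated in Python, whose standard
                        <code>re</code> module does not support <code>\p{M}</code>, by testing each
                     character's Unicode general category instead. The function names
                        <code>tan_chars</code> and <code>pick_char</code> are assumptions made for
                     this sketch, and the sample string is built from explicit codepoints.</para>
                  <programlisting language="python"><![CDATA[import unicodedata

def tan_chars(text):
    """Split text into base characters, each with its trailing combining marks."""
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me are the ones a regex M class matches.
        if unicodedata.category(ch).startswith("M") and clusters:
            clusters[-1] += ch
        else:
            # A base character starts a new unit. (A mark with no preceding
            # base would also land here; the regular expression would skip it.)
            clusters.append(ch)
    return clusters

def pick_char(text, n):
    """Return the nth character (1-based), combining marks included."""
    return tan_chars(text)[n - 1]

sample = "a\u0333b\u0308c\u0360d"   # seven codepoints, four base characters
print(len(sample), len(tan_chars(sample)))
]]></programlisting>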
                  <para>TAN rules stipulate that combining characters must have a preceding base
                     character. Any <link linkend="element-div"><code>&lt;div></code></link> that,
                     after any initial space, starts with a combining character will be marked as
                     invalid. See also <xref linkend="reg_exp_and_comb_chars"/>.</para>
               </section>
               <section xml:id="deprecated-unicode-points">
                  <title>Unicode points not allowed</title>
                  <para>Because TAN files are not scriptum-oriented (see <xref
                        xlink:href="#domain_model"/>), the following characters will generate an
                     error if found in a TAN file:</para>
                  <para>
                     <itemizedlist>
                        <listitem>
                           <para>U+00A0 NO-BREAK SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2000 EN QUAD</para>
                        </listitem>
                        <listitem>
                           <para>U+2001 EM QUAD</para>
                        </listitem>
                        <listitem>
                           <para>U+2002 EN SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2003 EM SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2004 THREE-PER-EM SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2005 FOUR-PER-EM SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2006 SIX-PER-EM SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2007 FIGURE SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2008 PUNCTUATION SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2009 THIN SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+200A HAIR SPACE</para>
                        </listitem>
                     </itemizedlist>
                  </para>
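                  <para>A text can be screened for these characters before it is placed in a TAN
                     file. The following is a minimal sketch; the names <code>DISALLOWED</code> and
                        <code>disallowed_chars</code> are assumptions, and the set simply mirrors
                     the list above.</para>
                  <programlisting language="python"><![CDATA[# U+00A0 plus the contiguous run U+2000 through U+200A (twelve characters).
DISALLOWED = {"\u00a0"} | {chr(cp) for cp in range(0x2000, 0x200B)}

def disallowed_chars(text):
    """List (position, codepoint label) for every disallowed space character."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch in DISALLOWED
    ]

print(disallowed_chars("a\u00a0b\u2009c"))
]]></programlisting>
                  <para>U+200B ZERO WIDTH SPACE is deliberately outside the range, since it has a
                     special interpretation (see <xref
                        linkend="unicode-characters-with-special-interpretation"/>).</para>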
               </section>
               <section>
                  <title>Further reading</title>
                  <para>
                     <itemizedlist>
                        <listitem>
                           <para><link xlink:href="http://unicode.org">Unicode
                              Consortium</link></para>
                        </listitem>
                        <listitem>
                           <para><link xlink:href="http://en.wikipedia.org/wiki/Unicode"
                                 >Unicode</link> (Wikipedia)</para>
                        </listitem>
                     </itemizedlist>
                  </para>
               </section>
            </section>
            <section xml:id="xml">
               <title>eXtensible Markup Language (XML)</title>
               <section>
                  <title>What is it?</title>
                  <para>Defined by the W3C, the eXtensible Markup Language (XML) is a markup
                     language that can be extended to allow anyone to define the structure and
                     rules of a document type. For a quick, simple introduction to XML see <xref
                        linkend="gentle_guide"/>. XML is one of many formats that can be described
                     as tree-based formats. Others include JSON, HTML, YAML, and Markdown. All of
                     the preceding formats can be expressed in XML, but not the other way around.
                     This does not mean that XML is inherently superior. (For some purposes, it is
                     overkill.) But it does mean that XML is the lingua franca for treelike data
                     structures. For more on the relationship between XML and other treelike
                     formats, especially JSON, see the <link
                        xlink:href="https://www.w3.org/community/ixml/">Invisible Markup Community
                        Group</link>.</para>
               </section>
               <section>
                  <title>Schemas and validation</title>
                  <para>TAN validation files are found in the <code>schemas</code>
                     subdirectory.</para>
                  <para>Each TAN file is validated by two types of schema files, one dealing with
                     major rules concerning structure and data type, written in RELAX-NG, the other
                     with more complex, detailed rules, written in Schematron.</para>
                  <para>The RELAX-NG rules are written primarily in compact syntax
                        (<code>*.rnc</code>), and then converted to XML syntax (<code>*.rng</code>).
                     For TAN-TEI, the special format One Document Does it All
                        (<code>TAN-TEI.odd</code>) is used to adjust the rules for TEI All. The ODD
                     file is then processed by TEI stylesheets into compact and XML RELAX-NG
                     formats.</para>
                  <para>The Schematron files are generally quite short. The primary work is done by
                     an extensive function library written in XSLT. For the most part, the
                     Schematron files arbitrate between the file and the validation results
                     calculated by the TAN function library. For a detailed overview of this
                     process, see <xref xlink:href="#validating_tan_files"/>.</para>
                  <para>Some validation engines that process a valid TAN-compliant TEI file may
                     return an error such as <code>conflicting ID-types for attribute "who" of
                        element "comment" from namespace "tag:textalign.net,2015:ns"</code>. Such a
                     message alerts you to the fact that by mixing TEI and TAN namespaces, you open
                     yourself up to the possibility of conflicting <code>xml:id</code> values. It is
                     your responsibility to ensure that you have not assigned duplicate identifiers.
                     An XML editor may be configured to ignore this discrepancy. (In Oxygen XML
                     Editor, go to Options > Preferences... > XML > XML Parser > RELAX NG and uncheck
                     the box ID/IDREF.)</para>
               </section>
               <section xml:id="whitespace">
                  <title>Space characters and normalization</title>
                  <para>By default in XML, unless otherwise specified, consecutive space characters
                     (space, tab, newline, and carriage return) are considered equivalent to a
                     single space. This gives editors the freedom to format XML documents as they
                     like, balancing human readability against compactness. In XML, <emphasis
                        role="bold">space normalization</emphasis> is performed by stripping leading
                     and trailing whitespace and replacing sequences of one or more whitespace
                     characters with a single space, <code>&amp;#x20;</code>. </para>
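                  <para>Space normalization as just described can be sketched in Python. The function
                     name <code>normalize_space</code> is borrowed from the XPath function of the
                     same purpose, and the code is an illustration rather than part of TAN. Only the
                     four XML space characters are treated as space.</para>
                  <programlisting language="python"><![CDATA[import re

def normalize_space(text):
    """Collapse runs of space, tab, newline and carriage return; trim the ends."""
    collapsed = re.sub(r"[ \t\n\r]+", " ", text)
    return collapsed.strip(" ")

pretty_printed = '  He\n      said\t"Jump!" \r\n'
print(repr(normalize_space(pretty_printed)))
]]></programlisting>
                  <para>A plain <code>text.split()</code> would not be equivalent, because Python
                     also splits on other Unicode space characters such as U+00A0.</para>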
                  <para>All TAN formats assume space normalization, with an extra caveat for leaf
                           <code><link linkend="element-div">&lt;div></link></code>s. Initial space
                     is always stripped. If a leaf <code><link linkend="element-div"
                        >&lt;div></link></code> ends in the soft hyphen or the zero width joiner
                     (see <xref linkend="unicode-characters-with-special-interpretation"/>), the
                     character is suppressed along with any ending space; otherwise the text is
                     normalized to end in a single space character (whether or not there are space
                     characters in the leaf <code><link linkend="element-div">&lt;div></link></code>
                     itself).</para>
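                  <para>The rule for leaf divs can be pictured as follows. This is only a sketch
                     that applies the description above to two invented
                     <code>&lt;div></code>s; the values of <code>n</code> and the text are
                     hypothetical.</para>
                  <programlisting>&lt;!-- Ordinary leaf div: initial space is stripped, internal runs are -->
&lt;!-- collapsed, and the text is normalized to end in exactly one space. -->
&lt;!-- Normalized text: "In the beginning " -->
&lt;div n="1">
    In the
    beginning&lt;/div>

&lt;!-- Leaf div ending in a soft hyphen (U+00AD): the soft hyphen and any -->
&lt;!-- ending space are suppressed. -->
&lt;!-- Normalized text: "begin" -->
&lt;div n="2">begin&amp;#xAD;
&lt;/div></programlisting>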
                  <para>If retention of multiple spaces or spaces of specific sizes is important for
                     your files and research, then you should not be working with the TAN format,
                     which cannot be used to replicate the appearance of a scriptum (see <xref
                        xlink:href="#domain_model"/>). Pure TEI (and not TAN-TEI) is a better
                     alternative, since it allows for a literal use of space, and supports the
                     creation of scriptum-oriented XML files. Once you finish with that
                     scriptum-oriented transcription, you might be ready to prepare a second one
                     oriented toward intertextual analysis, at which point TAN would be
                     ideal.</para>
                  <para>For more on space see guidance in <link
                        xlink:href="https://www.w3.org/TR/REC-xml/#sec-white-space">the W3C
                        recommendation</link>.</para>
               </section>
               <section xml:id="non-mixed_content">
                  <title>Mixed, non-mixed, and semi-mixed content</title>
                  <para>In many popular XML formats such as TEI, XHTML, and DocBook, some elements
                     allow a mixture of elements and nonspace text as children, e.g.,
                        <code>&lt;div>Some &lt;span>text&lt;/span>&lt;/div></code>. These are called
                        <emphasis role="bold">mixed content</emphasis> models. The TAN formats,
                     aside from TAN-TEI, are committed to a <emphasis role="bold">non-mixed
                        content</emphasis> model, e.g., <code>&lt;div>&lt;span>Some
                        &lt;/span>&lt;span>text&lt;/span>&lt;/div></code>. Nonspace text nodes and
                     elements are never siblings. The practical effect of this decision is that TAN
                     files may be indented as you like, and whitespace text may be placed anywhere,
                     without altering the meaning. The exception is TAN-TEI files, which allow any
                     kind of TEI construction, including mixed content. Many projects do not
                     consider the implications of how they render space, however, and you should
                     study the topic closely.</para>
                  <para>An expanded TAN file (see <xref linkend="validating_tan_files"/>) may
                     include what we term a <emphasis role="bold">semi-mixed content</emphasis>
                     model, in which any element may have one and only one nonspace text node along
                     with any child elements. That nonspace text node may appear at the beginning
                     or the end of the child nodes. This applies only to the expansion of TAN
                     files, not to TAN files themselves.</para>
               </section>
            </section>
            <section xml:id="namespace">
               <title>Namespaces</title>
               <section>
                  <title>What are they?</title>
                  <para>XML allows users to create document types of whatever kind. One person may
                     wish to use the element <code>&lt;band></code> to refer to a musical group;
                     another might use this element to encode radio frequencies. Perhaps someone
                     wishes to mention a musical group and a radio frequency in the same document,
                     which would entail mixing two very different types of elements, each named
                        <code>band</code>. XML allows users to mix vocabularies, even when those
                     vocabularies use the same element names. Disambiguation is accomplished by
                     associating an element name with a kind of family name. That family name is an
                     IRI (see <xref linkend="IRIs_and_linked_data"/> below). The actual full name of
                     an element, then, is the local name plus the IRI that qualifies its meaning,
                     e.g., <code>band{http://music-example.com/terms/}</code> and
                        <code>band{http://frequency-example.com/terms/}</code>. </para>
                  <para>The IRI—the family name—is called the <emphasis>namespace</emphasis>, a term
                     that might seem vague or confusing. It has nothing to do with space. It is
                     merely a term of art to qualify a name. In the world there are many cities that
                     have the same name. We use the name of the state, region, or even country to
                     explain which city we mean. As region names are to city names, so namespaces
                     are to element (and some attribute) names.</para>
                  <para>Namespaces can be declared in an XML document. When they appear, they look a
                     lot like attributes. (They aren't.) They take the form
                        <code>xmlns="http://music-example.com/terms/"</code> (this defines the
                        <emphasis role="bold">default namespace</emphasis>) or
                        <code>xmlns:[PREFIX]="http://frequency-example.com/terms/"</code> (this
                     assigns a namespace to a prefix) placed inside an opening tag. For example,
                        <code>&lt;band xmlns="http://music-example.com/terms/">...&lt;/band></code>
                     declares <code>http://music-example.com/terms/</code> to be the default
                     namespace for <code>&lt;band></code> and all descendants, unless explicitly
                     overridden. </para>
                  <para>To return to our example, different <code>&lt;band></code>s can be combined
                     through namespaces:</para>
                  <programlisting>&lt;!-- 1. The inner element redeclares the default namespace -->
&lt;band xmlns="http://music-example.com/terms/">
    &lt;band xmlns="http://frequency-example.com/terms/">
        ...
    &lt;/band>
&lt;/band>

&lt;!-- 2. The outer element uses the default namespace, the inner one a prefix -->
&lt;band xmlns="http://music-example.com/terms/"
    xmlns:e2="http://frequency-example.com/terms/">
    &lt;e2:band>
        ...
    &lt;/e2:band>
&lt;/band>

&lt;!-- 3. Both elements use prefixes -->
&lt;e1:band xmlns:e1="http://music-example.com/terms/"
    xmlns:e2="http://frequency-example.com/terms/">
    &lt;e2:band>
        ...
    &lt;/e2:band>
&lt;/e1:band></programlisting>
                  <para>Namespaces allow us to mix elements as we like. But this also means that
                     when you point or refer to an element, you should always be aware of what its
                     namespace is.</para>
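                  <para>A minimal sketch of what this awareness means in practice, using XSLT. The
                     prefixes <code>m</code> and <code>f</code> are arbitrary choices made for this
                     illustration, and the sketch assumes that no
                        <code>xpath-default-namespace</code> has been set, so that an unprefixed
                     name refers to an element in no namespace.</para>
                  <programlisting>&lt;xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:m="http://music-example.com/terms/"
    xmlns:f="http://frequency-example.com/terms/">

    &lt;!-- Matches only a band in the music namespace -->
    &lt;xsl:template match="m:band">
        &lt;xsl:text>musical group&lt;/xsl:text>
    &lt;/xsl:template>

    &lt;!-- Matches only a band in the frequency namespace -->
    &lt;xsl:template match="f:band">
        &lt;xsl:text>radio frequency&lt;/xsl:text>
    &lt;/xsl:template>

    &lt;!-- Unprefixed: matches only a band in no namespace, so neither of the above -->
    &lt;xsl:template match="band">
        &lt;xsl:text>band in no namespace&lt;/xsl:text>
    &lt;/xsl:template>

&lt;/xsl:stylesheet></programlisting>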
               </section>
               <section>
                  <title>TAN namespace and prefix</title>
                  <para>The TAN namespace is <emphasis role="bold"
                           ><code>tag:textalign.net,2015:ns</code></emphasis>. The recommended
                     prefix is <emphasis role="bold"><code>tan</code></emphasis>. The namespace does
                     not change from one version of TAN to another.</para>
                  <para>The TAN-TEI format uses as its default the TEI namespace, <code><link
                           xlink:href="http://www.tei-c.org/ns/1.0"/></code>, normally given the
                     prefix <emphasis role="bold"><code>tei</code></emphasis>. But in a TAN-TEI
                     file, the <code>head</code> and its descendants are in the TAN
                     namespace.</para>
                  <para>All TAN functions and core global parameters and variables are set in the
                     TAN namespace.</para>
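                  <para>Schematically, the namespace arrangement just described looks like the
                     following. This is only a sketch of the namespace declarations: the root
                     element name <code>TEI</code> assumes an ordinary TEI document, all content is
                     elided, and any other attributes and elements required by the schemas are
                     omitted.</para>
                  <programlisting>&lt;!-- In a stylesheet, the recommended prefix is bound like this -->
&lt;xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tan="tag:textalign.net,2015:ns">
    ...
&lt;/xsl:stylesheet>

&lt;!-- In a TAN-TEI file, TEI is the default namespace, but the head resets it -->
&lt;TEI xmlns="http://www.tei-c.org/ns/1.0">
    &lt;head xmlns="tag:textalign.net,2015:ns">
        ...
    &lt;/head>
    ...
&lt;/TEI></programlisting>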
               </section>
            </section>
            <section xml:id="TEI">
               <title>The Text Encoding Initiative</title>
               <section>
                  <title>What is it?</title>
                  <para>The Text Encoding Initiative (TEI; <link
                        xlink:href="http://www.tei-c.org/index.xml"/>) is a consortium of scholars and
                     scholarly organizations that maintains the rules and documentation behind a
                     collection of XML formats intended for encoding texts. TEI files have been used
                     widely by libraries, museums, publishers, and individual scholars to prepare
                     and publish texts for online research, teaching, and preservation. In addition
                     to the guidelines themselves, the Consortium provides a variety of <link
                        xlink:href="http://www.tei-c.org/Support/Learn/">resources</link> and <link
                        xlink:href="http://members.tei-c.org/Events">training events</link> for
                     learning TEI, information on <link
                        xlink:href="http://www.tei-c.org/Activities/Projects/">projects using the
                        TEI</link>, a <link
                        xlink:href="http://www.tei-c.org/Activities/SIG/Education/tei_bibliography.xml"
                        >bibliography of TEI-related publications</link>, and <link
                        xlink:href="http://www.tei-c.org/Tools/">software</link>.</para>
                  <para>TEI provided the impetus for the creation of TAN, and continues to inspire
                     its development. TEI was designed to be highly customizable, to suit the needs
                     of individuals or communities of practice. One of the TAN formats, TAN-TEI, is
                     one such customization, based as it is on an ODD file that is in the same
                     directory as the rest of the schemas. TAN-TEI schemas are generated on the
                     basis of the official TEI All schema that is available at the time of release. </para>
                  <para>TAN-TEI files and standard, out-of-the-box TEI All files are not
                     automatically interchangeable. TAN-TEI expects all metadata to be human- and
                     computer-readable, whereas TEI metadata is geared primarily to human
                     readability. TAN-TEI tightly regulates the structure of the text, whereas TEI
                     allows for a variety of structures. In any conversion between TEI and
                     TAN-TEI, some human intervention may be required, and conversion in either
                     direction may entail loss.</para>
                  <para>For more about the strictures placed upon the TEI All schema see <xref
                        linkend="tan-tei"/>. See also <xref linkend="class_common"/> and <xref
                        linkend="class_1"/>.</para>
               </section>
               <section>
                  <title>Further reading</title>
                  <para>
                     <itemizedlist>
                        <listitem>
                           <para><link xlink:href="http://www.tei-c.org/">Text Encoding
                                 Initiative</link></para>
                        </listitem>
                     </itemizedlist>
                  </para>
               </section>
            </section>
            <section xml:id="data_types">
               <title>Data types</title>
               <para>Being written purely in XML technologies, TAN uses data types defined in the
                  W3C's <link xlink:href="https://www.w3.org/TR/xmlschema-2/">official
                     specifications</link>, e.g., strings, booleans, integers. The following data
                  types require some special comments.</para>
               <section xml:id="language">
                  <title>Languages</title>
                  <para>For language identification TAN adopts Best Current Practice (BCP) 47, which
                     standardizes identifiers for languages and scripts. For most users of TAN, this
                     will be a simple two- or three-letter abbreviation, sometimes supplemented with
                     a hyphen and a subtag designating a script or region. For
                     example, <code>eng</code>, <code>eng-GB</code>, and <code>eng-Cyrl-GB</code>
                     refer, respectively, to English (in general), English from the United Kingdom,
                     and English from the United Kingdom written in the Cyrillic script. As a
                     general rule, values of this type should begin with a three-letter language
                     code, preferably lowercase. (The two-letter codes cover only a few dozen
                     languages; the three-letter codes support thousands of them.)</para>
                  <para>ISO codes for human languages appear in <code><link
                           linkend="attribute-xmllang">@xml:lang</link></code> and <code><link
                           linkend="element-for-lang">&lt;for-lang></link></code>. The former states
                     what language the enclosed text is in. The latter is an empty element that
                     simply points to a specific language. For example, <code><link
                           linkend="element-for-lang">&lt;for-lang></link></code> in the context of
                     a TAN-mor file indicates which languages the file was written for.</para>
                  <para>TAN has several global variables and functions useful for working with
                     language codes. See <xref xlink:href="#tan-function-group-language"/>.</para>
               </section>
               <section xml:id="date_and_datetime">
                  <title>Dates and times</title>
                  <para>For dates and dates + times, TAN adopts the corresponding XML data types,
                     which follow ISO syntax. That syntax begins with years (the largest unit) and
                     ends with days, seconds, or fractions of seconds (the smallest).</para>
                  <para>The simplest date takes this form: <code>YYYY-MM-DD</code>. If a time is
                     included, it is specified by continuing the string, first with a <code>T</code>
                     (for time) then the form <code>hh:mm:ss.sss(Z|[-+]hh:mm)</code>. For example,
                     <code>2016-09-20T20:38:27.141-04:00</code> is an ISO date-time
                     for Tuesday, September 20, 2016 at 8:38 p.m., Eastern Time Zone.</para>
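                  <para>The following XPath expressions, shown here as XSLT variables, sketch how
                     such values behave as typed data. The variable names are hypothetical, and the
                        <code>xs</code> prefix is assumed to be bound to
                        <code>http://www.w3.org/2001/XMLSchema</code>.</para>
                  <programlisting>&lt;!-- A date-time with an explicit offset from UTC -->
&lt;xsl:variable name="stamp"
    select="xs:dateTime('2016-09-20T20:38:27.141-04:00')"/>

&lt;!-- The same instant expressed in UTC: 2016-09-21T00:38:27.141Z -->
&lt;xsl:variable name="stamp-utc"
    select="adjust-dateTime-to-timezone($stamp, xs:dayTimeDuration('PT0H'))"/>

&lt;!-- true: the month and the day each have two digits -->
&lt;xsl:variable name="well-formed" select="'2016-09-20' castable as xs:date"/>

&lt;!-- false: a single-digit month is not allowed -->
&lt;xsl:variable name="ill-formed" select="'2016-9-20' castable as xs:date"/></programlisting>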
               </section>
               <section>
                  <title>Further reading</title>
                  <para>
                     <itemizedlist>
                        <listitem>
                           <para>BCP 47 <link xlink:href="http://tools.ietf.org/rfc/bcp/bcp47"
                                 >official specifications</link></para>
                        </listitem>
                        <listitem>
                           <para>BCP 47 <link
                                 xlink:href="http://www.w3.org/TR/xmlschema11-2/#language">technical
                                 details</link></para>
                        </listitem>
                        <listitem>
                           <para><link xlink:href="https://www.w3.org/TR/xmlschema-2/#dateTime">W3C
                                 specification</link></para>
                        </listitem>
                        <listitem>
                           <para><link xlink:href="https://en.wikipedia.org/wiki/ISO_8601">Wikipedia
                                 entry on ISO 8601</link></para>
                        </listitem>
                     </itemizedlist>
                  </para>
               </section>
            </section>
            <section xml:id="IRIs_and_linked_data">
               <title>Identifiers and their use (IRIs, URIs, URLs, URNs, UUIDs)</title>
               <para>TAN makes extensive use of the following identifiers:</para>
               <para>
                  <itemizedlist>
                     <listitem>
                        <para><emphasis>IRI</emphasis>: Internationalized Resource Identifier, a
                           generalization of the URI system, allowing the use of Unicode; <link
                              xlink:href="http://www.ietf.org/rfc/rfc3987.txt">defined by RFC
                              3987</link></para>
                     </listitem>
                     <listitem>
                        <para><emphasis>URI</emphasis>: Uniform Resource Identifier, a string of
                           characters used to identify a name or a resource; <link
                              xlink:href="https://tools.ietf.org/html/rfc3986">defined by RFC
                              3986</link></para>
                     </listitem>
                     <listitem>
                        <para><emphasis>URL</emphasis>: Uniform Resource Locator, a URI that
                           identifies a Web resource and the communication protocol for retrieving
                           the resource.</para>
                     </listitem>
                     <listitem>
                        <para><emphasis>URN</emphasis>: Uniform Resource Name, a term that
                           originally referred to persistent names that used a bare
                              <code>urn:</code> scheme, but is now applied to a variety of systems
                           that have registered with the IANA. URNs are generally best thought of as
                           a subset of URIs.</para>
                     </listitem>
                     <listitem>
                        <para><emphasis>UUID</emphasis>: Universally Unique Identifier, a
                           computer-generated 128-bit number that may be attached as an identifier
                           to any entity. UUIDs can be built into a URN by prefixing them with
                              <code>urn:uuid:</code>.</para>
                     </listitem>
                  </itemizedlist>
               </para>
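               <para>For orientation, here is one illustrative string of each kind, wrapped in the
                     <code>&lt;IRI></code> element that TAN uses for identifiers. Apart from the
                  TEI and TAN namespaces, which appear elsewhere in these guidelines, the values
                  are generic examples and do not name anything in TAN.</para>
               <programlisting>&lt;!-- IRI that is not a URI, because it uses non-ASCII characters -->
&lt;IRI>http://example.org/έργα/1&lt;/IRI>

&lt;!-- URI that is also a URL -->
&lt;IRI>http://www.tei-c.org/ns/1.0&lt;/IRI>

&lt;!-- URN built from a UUID -->
&lt;IRI>urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6&lt;/IRI>

&lt;!-- Tag URN -->
&lt;IRI>tag:textalign.net,2015:ns&lt;/IRI></programlisting>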
               <para>See also <xref
                     xlink:href="#tag_urn"/>.</para>
               <section xml:id="rdf_and_lod">
                  <title>Resource Description Framework (RDF) and Linked Open Data</title>
                  <section>
                     <title>What are they?</title>
                     <para>Identifiers are used in many contexts for many purposes. One such purpose
                        is called Linked Open Data (LOD), also known as the Semantic Web, which aims
                        to allow cross-project interoperability of data. It relies upon a very
                        simple data model called Resource Description Framework (RDF), recommended
                        by the World Wide Web Consortium (W3C). The term "Resource"—the R in
                        RDF—refers to any person, place, concept—anything at all, whether you think
                        of it as a resource or not. "Description" is overly specific, too, since RDF
                        was designed to support general assertions, descriptive or not. Perhaps it
                        is easiest to think of RDF as a standardized way to make assertions, as if
                        the name were simply "Assertion Framework." It is a way to make claims about
                        things in the world.</para>
                     <para>The RDF data model rests upon the concept of a statement, made of three
                        parts: subject, predicate, and object. Subjects and predicates take
                        identifiers that name things. The object may take an identifier or just
                        data. As people independently identify concepts with the same URLs, they
                        create RDF datasets that can be combined, synthesized, and compared. RDF
                        statements found across the web allow inferences no individual project could
                        ever anticipate. </para>
                     <para>The Semantic Web recommends the use of URLs as identifiers. That way, if
                        a computer encounters a URL naming a concept, it can be programmed to go to the
                        web resource and retrieve other RDF statements, recursively. So URL
                        identifiers look like a web page address (e.g., <code>http://...</code>),
                        but they are first and foremost names for things. Ideally, those URLs will
                        still name those things after the domain name expires and the web resource
                        cannot be found. </para>
                     <para>Although RDF statements must be made of only three components, it is
                        possible in a roundabout way to create more complex assertions. In one
                        technique, the assertion itself is given a URL, and then RDF statements are
                        made about the assertion. Such assertions are in some cases not easily
                        integrated with other RDF statements. Users who query an RDF database will
                        not find relevant complex RDF statements unless they build their queries to
                        anticipate such situations (or the query engine has been customized).</para>
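                     <para>The technique just described is commonly called reification. The
                        following is a sketch in RDF/XML that uses the standard
                           <code>rdf:Statement</code> vocabulary. The <code>example.org</code> IRIs
                        and the <code>ex:assertedBy</code> property are hypothetical, invented only
                        for this illustration.</para>
                     <programlisting>&lt;rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ex="http://example.org/terms/">

    &lt;!-- The assertion itself receives an identifier, -->
    &lt;rdf:Statement rdf:about="http://example.org/claims/1">
        &lt;rdf:subject rdf:resource="http://example.org/texts/Y"/>
        &lt;rdf:predicate rdf:resource="http://purl.org/dc/terms/creator"/>
        &lt;rdf:object rdf:resource="http://example.org/people/X"/>
        &lt;!-- so that a further statement can be made about it. -->
        &lt;ex:assertedBy rdf:resource="http://example.org/people/Z"/>
    &lt;/rdf:Statement>

&lt;/rdf:RDF></programlisting>
                     <para>A query that looks only for triples whose predicate is
                           <code>http://purl.org/dc/terms/creator</code> will not find this claim,
                        because the claim is described rather than directly asserted. This is the
                        integration problem noted above.</para>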
                  </section>
                  <section>
                     <title>TAN claims and RDF</title>
                     <para>Much of TAN can be converted to RDF statements. In fact, TAN may be one
                        of the most human-friendly ways to read and write RDF. For example, consider
                        how one might express "Person X's name is 'Dave Smith'." Compare this
                        snippet (taken from <link
                           xlink:href="http://linkeddatabook.com/editions/1.0/"/>), written in
                        Turtle, the RDF syntax generally regarded as the most human-readable,
                        ...<programlisting>@prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#> . 
@prefix foaf: &lt;http://xmlns.com/foaf/0.1/> . 

&lt;http://biglynx.co.uk/people/dave-smith> 
rdf:type foaf:Person ; 
foaf:name "Dave Smith" .</programlisting></para>
                     <para>...with the TAN
                        equivalent:<programlisting>&lt;person>
   &lt;IRI>http://biglynx.co.uk/people/dave-smith&lt;/IRI>
   &lt;name>Dave Smith&lt;/name>
&lt;/person></programlisting></para>
                     <para>These TAN and RDF expressions are interchangeable. </para>
                     <para>But in more complex claims, it is, at this time, not clear whether all
                        assertions in TAN can be losslessly converted to the RDF model. Every
                        class-2 file makes a claim about the text, and every claim must be attached
                        to someone who can be blamed or credited for the assertion. TAN
                        also permits such claims to be modified through traditional adverbs. This is
                        best seen in the TAN-A <code><link linkend="element-claim"
                           >&lt;claim></link></code>, which allows a person to nuance a claim to a
                        degree that is difficult or impossible to express in traditional RDF. For
                        example, RDF does not allow one to say "Person X is not the author of text
                        Y," but TAN does. </para>
                     <para>TAN claims can also be quite complex. Whereas the standard RDF claim
                        consists of three components—subject, predicate, object—most TAN claims have
                        more. Every TAN claim must have at the minimum: a claimant (no RDF
                        counterpart; the person, organization, or algorithm that asserts the claim),
                        a subject (counterpart to RDF subject), and a verb (counterpart to RDF
                        predicate). Verbs can be defined to permit, require, or disallow other claim
                        components, such as adverbs or objects, many of which are permitted by
                        default. Most TAN claims involve more than three components, so converting a
                        TAN claim to RDF requires creating a complex RDF statement. In many cases,
                        this requires the use of RDF* instead of RDF (link below). </para>
                     <para>Many TAN claims involve textual subjects or objects. References to parts
                        of text can be quite complex, and they must be made with reference to other
                        entities. It is doubtful whether a given specific textual subject or object can
                        be satisfactorily reduced to an unambiguous IRI, because such an IRI would
                        need to include a mechanism to resolve the meaning of the syntax. Such an
                        IRI must not only explain the work's reference system, but also identify the
                        chosen version, scriptum, and perhaps token definition and numeration
                        system. Many texts have more than one "canonical" reference system, so an
                        IRI might point to two different textual passages, thereby breaking a
                        cardinal rule of IRIs: although an entity may be given multiple IRIs, it is
                           <emphasis>never</emphasis> acceptable for an IRI to be ambiguous. There
                        is, at present, no widely accepted solution to this problem, although
                        attempts have been made through CTS URNs and DTS URNs.</para>
                     <para>For more details see <xref linkend="TAN-A"/> and <code><link
                              linkend="element-claim">&lt;claim></link></code>.</para>
                  </section>
                  <section>
                     <title>Further reading</title>
                     <para>
                        <itemizedlist>
                           <listitem>
                              <para><link xlink:href="https://www.w3.org/RDF/">W3C
                                    recommendation</link></para>
                           </listitem>
                           <listitem>
                              <para><link xlink:href="http://linkeddata.org/">Linked
                                 Data</link></para>
                           </listitem>
                           <listitem>
                              <para><link xlink:href="http://lov.okfn.org/dataset/lov/">Linked Open
                                    Vocabularies</link></para>
                           </listitem>
                           <listitem>
                              <para><link
                                    xlink:href="https://w3c.github.io/rdf-star/cg-spec/2021-02-18.html"
                                    >RDF*</link></para>
                           </listitem>
                           <listitem>
                              <para><link
                                    xlink:href="https://www.homermultitext.org/hmt-docs/cite/cts-urn-overview.html"
                                    >CTS URNs</link></para>
                           </listitem>
                           <listitem>
                              <para><link
                                    xlink:href="https://distributed-text-services.github.io/specifications/"
                                    >DTS URNs</link></para>
                           </listitem>
                        </itemizedlist>
                     </para>
                  </section>
               </section>
               <section xml:id="tag_urn">
                  <title>Tag URNs</title>
                  <para>TAN files make extensive use of tag URNs (see <xref
                        xlink:href="#IRIs_and_linked_data"/>). In fact, TAN's namespace is itself a
                     tag URN (<xref linkend="namespace"/>). A <link
                        xlink:href="http://www.taguri.org">tag URN</link> has two parts:</para>
                  <para>
                     <orderedlist>
                        <listitem>
                           <para><emphasis role="bold">Namespace.</emphasis>
                              <code>tag:</code> + an e-mail address or domain name owned by the
                              person or organization that has authorized the creation of the TAN
                              file + <code>,</code> + an arbitrary day on which that address or
                              domain name was owned + <code>:</code>. The day is expressed in the
                              form <code>YYYY-MM-DD</code>, <code>YYYY-MM</code>, or
                                 <code>YYYY</code>. A missing <code>MM</code> or <code>DD</code> is
                              implicitly assigned the value of <code>01</code>.</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Name of the subject.</emphasis> An arbitrary
                              string (unique to the namespace chosen) chosen by the namespace owner
                              as a label for the subject (e.g., the file, a work, a scriptum). If you
                              are providing a tag URN for a TAN file, that name can be the same as
                              the filename, but it is a good practice not to do so, because
                              filenames often need to be changed. You should pick a name that is at least
                              somewhat intelligible to human readers. It is a good idea to build a
                              name via categories, from most general to most specific. For example
                                 <code>tag:pat@example.com,2014:work:aristotle-pseudo:secreta-secretorum</code>
                              might be used as an IRI to name the work the <emphasis>Secret of
                                 Secrets</emphasis> attributed to Aristotle. A TAN file that
                              transcribes a particular version of this text might look like this:
                                 <code>tag:pat@example.com,2014:transcription:scriptum:badawi-1954:work:secrets</code>.</para>
                        </listitem>
                     </orderedlist>
                  </para>
                  <para>Although you may use any tag URN coined by someone else, when you create a
                     tag URN, you may use only namespaces you own or owned.</para>
                  <para>Care should be taken in choosing the name, because you are the sole
                     guarantor of its uniqueness. <emphasis role="italic">It is permissible for
                        something to have multiple identifiers, but never acceptable for an
                        identifier to name more than one thing.</emphasis> It is a good practice to
                     keep a master checklist of tag URNs you have created. If you find yourself
                     forgetting, or think you run the risk of creating duplicate tag URNs, you
                     should start afresh by creating a new namespace for your tag URNs, if only by
                     changing the date in the tag URN namespace.</para>
                  <para>
                     <example>
                        <title>Tag URNs</title>
                        <programlisting>tag:jan@example.com,1999-01-31:TAN-T001
tag:example.com,2001-04:work:usc22.1
tag:evagriusponticus.net,2014:tan-a-lm:Evagrius_Praktikos_grc_Guillaumonts
tag:bbrb@example.org,1995-04-01:pos-grc</programlisting>
                        <para>The first example comes from someone who owned the email address
                              <code>jan@example.com</code> on January 31, 1999 (at the stroke of
                           midnight, Coordinated Universal Time). The other examples follow a
                           similar logic. The namespaces of the second and third examples are tied to
                           the owners of specific domain names. The <code>2014</code> in the third
                           example is shorthand for the first second of January 1, 2014.</para>
                     </example>
                  </para>
                  <para>TAN files are identified and named via tag URNs, not URLs, for several
                     reasons:</para>
                  <para>
                     <itemizedlist>
                        <listitem>
                           <para><emphasis role="bold">Permanence.</emphasis> Authors of TAN data
                              are creating files that are meant to be relevant for decades and
                              centuries from now, well after most domain names today have changed
                              ownership or fallen into obsolescence, and well after the creators are
                              dead. URLs are not designed for such longevity. </para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Responsibility.</emphasis> The TAN format
                              requires every piece of data to be attributable to someone (a person,
                              a group of persons, or an algorithm). A tag URN connects the
                              identifier with the responsible person or group. URLs cannot identify
                              the person or organization responsible for the name.</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Accessibility.</emphasis> Tag URNs have
                              almost no barriers. They can be created by anyone who has an email
                              address. No one has to register with a central authority. You can
                              begin naming anything you want, any time you want, without anyone's
                              approval, and without paying anything.</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Ease</emphasis>. Tag URNs are easy to use.
                              All you need is an email address, which is very easy to get. You can
                              use a domain name too, but many potential TAN authors have never owned
                              a domain name, and never will, which bars them from creating or
                              publishing linked open data under the classic model, where you coin
                              URLs in a domain you own. Many of those who do own domain names cannot
                              or do not wish to configure, populate, maintain, and troubleshoot
                              servers with the referral mechanisms recommended by Semantic Web
                              advocates (see <xref xlink:href="#rdf_and_lod"/>).</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Scholarly citation norms</emphasis>. In the
                              Semantic Web, the conflation of URL <emphasis>qua</emphasis> name with
                              URL <emphasis>qua</emphasis> location is considered by many a virtue
                              because the single string does double duty, both naming the resource
                              and pointing to a location where more can be learned. Although the
                              combination is elegant from the perspective of an engineer, it is
                              confusing to many others. URLs are commonly thought to be merely
                              locations for data, not names for things. It also goes against an
                              important principle in scholarly citation practices, namely, the name
                              of a publication should always be distinguished from where it might be
                              found. </para>
                        </listitem>
                     </itemizedlist>
                  </para>
                  <para>Further reading:</para>
                  <itemizedlist>
                     <listitem>
                        <para><link xlink:href="https://tools.ietf.org/html/rfc4151">RFC
                           4151</link>, the official definition of tag URNs</para>
                     </listitem>
                  </itemizedlist>
               </section>
            </section>
            <section xml:id="regular_expressions">
               <title>Regular expressions</title>
               <para>Regular expressions are patterns for searching text. The term <emphasis
                     role="italic">regular</emphasis> here does not mean ordinary. Rather, alluding
                  to the Latin root <emphasis role="italic">regula</emphasis> (rule), it refers to a
                  rule-based method of finding and replacing text through patterns. Regular
                  expressions come in different flavors, and have several layers of complexity. TAN
                  regular expressions adhere closely to the <link
                     xlink:href="http://www.w3.org/TR/xslt-30/#regular-expressions">recommendation
                     of XSLT 3.0</link> (XML Schema Datatypes plus some extensions), as outlined in
                     <link xlink:href="https://www.w3.org/TR/xpath-functions-31/#regex-syntax">XPath
                     Functions 3.1</link>. <caution>
                     <para>XML Schema Datatypes define regular expressions differently than does Perl,
                        one of the most common forms of regular expression. For example, the pipe
                        symbol, |, is treated as a word character in XML regular expressions
                           (<code>\w</code>), but the opposite is true for Perl. For convenience,
                        here are the codepoints in the range U+0020..U+00FF that are considered word
                        characters according to XML (and therefore TAN):</para>
                     <para><emphasis role="bold">Word characters </emphasis>(<code>\w</code>):
                           <code>$ + 0 1 2 3 4 5 6 7 8 9 &lt; = > A B C D E F G H I J K L M N O P Q
                           R S T U V W X Y Z ^ ` a b c d e f g h i j k l m n o p q r s t u v w x y z
                           | ~ ¢ £ ¤ ¥ ¦ ¨ © ª ¬ ® ¯ ° ± ² ³ ´ µ ¸ ¹ º ¼ ½ ¾ À Á Â Ã Ä Å Æ Ç È É Ê Ë
                           Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð
                           ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ</code>
                     </para>
                     <para><emphasis role="bold">Non-word characters </emphasis>(<code>\W</code>):
                           <code>! " # % &amp; ' ( ) * , - . / : ; ? @ [ \ ] _ { } ¡ § « ­ ¶ · »
                           ¿</code></para>
                     <para>The placement of some of these characters may seem to you
                        counterintuitive or wrong. But at this point complaining will not change the
                        conventions. Any apparent mistakes are definitive ones. Just familiarize
                        yourself with the conventions.</para>
                  </caution></para>
               <para>A regular expression search pattern is treated just like a normal search
                  pattern until the computer reaches a special character: <code>. [ ] \ | ^ $ ? * +
                     { } ( )</code>. Here is a brief key to how those special characters behave in
                  regular expressions when they are first found. (Some of these special characters
                  change their meaning if they are found inside square brackets; on this point, see
                  the recommended reading below):</para>
               <para><table frame="all">
                     <title>Special characters in regular expressions</title>
                     <tgroup cols="2">
                        <colspec colname="c1" colnum="1" colwidth="1*"/>
                        <colspec colname="c2" colnum="2" colwidth="12.33*"/>
                        <thead>
                           <row>
                              <entry>Symbol</entry>
                              <entry>Meaning</entry>
                           </row>
                        </thead>
                        <tbody>
                           <row>
                              <entry><code>.</code></entry>
                              <entry>any character</entry>
                           </row>
                           <row>
                              <entry><code>|</code></entry>
                              <entry>or (union)</entry>
                           </row>
                           <row>
                              <entry><code>?</code></entry>
                              <entry>zero or one</entry>
                           </row>
                           <row>
                              <entry><code>*</code></entry>
                              <entry>zero or more</entry>
                           </row>
                           <row>
                              <entry><code>+</code></entry>
                              <entry>one or more</entry>
                           </row>
                           <row>
                              <entry><code>[ ]</code></entry>
                              <entry>a class of characters</entry>
                           </row>
                           <row>
                              <entry><code>( )</code></entry>
                              <entry>a group</entry>
                           </row>
                           <row>
                              <entry><code>^</code></entry>
                              <entry>beginning of a line or string (doesn't capture any
                                 characters)</entry>
                           </row>
                           <row>
                              <entry><code>$</code></entry>
                              <entry>end of a line or string (doesn't capture any
                                 characters)</entry>
                           </row>
                        </tbody>
                     </tgroup>
                  </table>If you need to use any of those special characters as characters in their
                  own right, then you need to escape them, by prefixing the character with an escape
                  character, <code>\</code>.</para>
               <para>
                  <table frame="all">
                     <title>Escaped special characters in regular expressions</title>
                     <tgroup cols="2">
                        <colspec colname="c1" colnum="1" colwidth="1*"/>
                        <colspec colname="c2" colnum="2" colwidth="12.33*"/>
                        <thead>
                           <row>
                              <entry>Symbol</entry>
                              <entry>Meaning</entry>
                           </row>
                        </thead>
                        <tbody>
                           <row>
                              <entry><code>\\</code></entry>
                              <entry>backslash (an escaped escape character)</entry>
                           </row>
                           <row>
                              <entry><code>\^</code></entry>
                              <entry>caret (escaped)</entry>
                           </row>
                           <row>
                              <entry><code>\$</code></entry>
                              <entry>dollar sign (escaped)</entry>
                           </row>
                           <row>
                              <entry><code>\(</code></entry>
                              <entry>opening parenthesis (escaped)</entry>
                           </row>
                           <row>
                              <entry><code>\[</code></entry>
                              <entry>opening square bracket (escaped)</entry>
                           </row>
                        </tbody>
                     </tgroup>
                  </table>
               </para>
               <para>The escape character appearing before some letters accesses certain classes of
                  characters:</para>
               <para>
                  <table frame="all">
                     <title>Character class escapes in regular expressions</title>
                     <tgroup cols="2">
                        <colspec colname="c1" colnum="1" colwidth="1*"/>
                        <colspec colname="c2" colnum="2" colwidth="12.33*"/>
                        <thead>
                           <row>
                              <entry>Symbol</entry>
                              <entry>Meaning</entry>
                           </row>
                        </thead>
                        <tbody>
                           <row>
                              <entry><code>\w</code></entry>
                              <entry>any word character</entry>
                           </row>
                           <row>
                              <entry><code>\W</code></entry>
                              <entry>any nonword character</entry>
                           </row>
                           <row>
                              <entry><code>\s</code></entry>
                              <entry>any of the four standard spacing characters: space (U+0020),
                                 tab (U+0009), newline (U+000A), carriage return (U+000D)</entry>
                           </row>
                           <row>
                              <entry><code>\S</code></entry>
                              <entry>anything not a spacing character</entry>
                           </row>
                           <row>
                              <entry><code>\d</code></entry>
                              <entry>any decimal digit (0-9, as well as the decimal digits of other
                                 scripts, i.e., Unicode category Nd)</entry>
                           </row>
                           <row>
                              <entry><code>\D</code></entry>
                              <entry>anything not a digit</entry>
                           </row>
                           <row>
                              <entry><code>\p{IsGujarati}</code></entry>
                              <entry>any character from the Unicode block named Gujarati</entry>
                           </row>
                        </tbody>
                     </tgroup>
                  </table>
               </para>
               <para>Some examples of regular expressions:</para>
               <table frame="all">
                  <title>Examples of Regular Expressions</title>
                  <tgroup cols="3">
                     <colspec colname="newCol1" colnum="1" colwidth="1*"/>
                     <colspec colname="c1" colnum="2" colwidth="1.48*"/>
                     <colspec colname="c2" colnum="3" colwidth="6.59*"/>
                     <thead>
                        <row>
                           <entry>Expression</entry>
                           <entry>Meaning</entry>
                           <entry>What the expression matches when applied to "Wi-fi, good. A_hem*
                              isn't!"</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry><code>^.+$</code></entry>
                           <entry>one whole line of characters</entry>
                           <entry>"Wi-fi, good. A_hem* isn't!"</entry>
                        </row>
                        <row>
                           <entry><code>[ae]</code></entry>
                           <entry>a or e</entry>
                           <entry>"e"</entry>
                        </row>
                        <row>
                           <entry><code>[a-e]</code></entry>
                           <entry>a, b, c, d, or e</entry>
                           <entry>"d", "e"</entry>
                        </row>
                        <row>
                           <entry><code>[^ae]+</code></entry>
                           <entry>one or more characters that are anything except a or e</entry>
                           <entry>"Wi-fi, good. A_h", "m* isn't!"</entry>
                        </row>
                        <row>
                           <entry><code>.i</code></entry>
                           <entry>any character followed by i.</entry>
                           <entry>"Wi", "fi", " i"</entry>
                        </row>
                        <row>
                           <entry><code>(.i)</code></entry>
                           <entry>when a character followed by an i is found treat it as a capture
                              group (used only in a search pattern)</entry>
                           <entry>"Wi", "fi", " i"</entry>
                        </row>
                        <row>
                           <entry><code>[aeiou]\w*</code></entry>
                           <entry>any lowercase vowel along with every word character that
                              follows</entry>
                           <entry>"i", "i", "ood", "em", "isn"</entry>
                        </row>
                        <row>
                           <entry><code>[t*].</code></entry>
                           <entry>any t or * and the following character</entry>
                           <entry>"* ", "t!" Note that the asterisk, if inside a character class,
                              represents itself.</entry>
                        </row>
                        <row>
                           <entry><code>\s+</code></entry>
                           <entry>one or more space characters</entry>
                           <entry>" ", " ", " "</entry>
                        </row>
                        <row>
                           <entry><code>\w+</code></entry>
                           <entry>one or more word characters</entry>
                           <entry>"Wi", "fi", "good", "A", "hem", "isn", "t" Note that the underscore
                              is not a word character.</entry>
                        </row>
                        <row>
                           <entry><code>\W+</code></entry>
                           <entry>match one or more nonword characters</entry>
                           <entry>"-", ", ", ". ", "_", "* ", "'", "!"</entry>
                        </row>
                        <row>
                           <entry><code>[^q]+</code></entry>
                           <entry>one or more characters that are not a q</entry>
                           <entry>"Wi-fi, good. A_hem* isn't!"</entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table>
               <para>The examples above provide a taste of how regular expressions are constructed
                  and read.</para>
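               <para>As a rough illustration, and not a description of how any TAN application
                  is implemented, the following XSLT 3.0 stylesheet applies a few of the patterns
                  above through the standard functions <code>matches()</code>,
                     <code>replace()</code>, and <code>tokenize()</code>. It assumes a processor
                  that supports an initial named template; the comments give the value each
                  instruction should produce.</para>
               <programlisting>&lt;xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
   &lt;xsl:output method="text"/>
   &lt;xsl:template name="xsl:initial-template">
      &lt;!-- The pipe is a word character in XML regular expressions: true -->
      &lt;xsl:value-of select="matches('|', '^\w$')"/>
      &lt;xsl:text> / &lt;/xsl:text>
      &lt;!-- The underscore is not a word character: false -->
      &lt;xsl:value-of select="matches('_', '^\w$')"/>
      &lt;xsl:text> / &lt;/xsl:text>
      &lt;!-- $1 in the replacement refers to the first capture group: [Wi]-[fi] -->
      &lt;xsl:value-of select="replace('Wi-fi', '(.i)', '[$1]')"/>
      &lt;xsl:text> / &lt;/xsl:text>
      &lt;!-- Runs of nonword characters act as separators: Wi fi good -->
      &lt;xsl:value-of select="tokenize('Wi-fi, good', '\W+')"/>
   &lt;/xsl:template>
&lt;/xsl:stylesheet></programlisting>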
               <warning xml:id="reg_exp_and_comb_chars">
                  <title>Regular Expressions and Combining Characters</title>
                  <para>A regular expression might be ambiguous in the context of combining
                     characters. Suppose we have a string of three characters, áb (i.e., an acute
                     accent over the a; the codepoints are, in XML entities,
                        <code>&amp;#x61;&amp;#x301;&amp;#x62;</code>). The regular expression
                        <code>a.</code> will match the b in some search engines but not in
                     others.</para>
                  <para>Unicode has differentiated three levels of support for regular expressions
                     (see <link xlink:href="http://www.unicode.org/reports/tr18/">official
                        report</link>). XPath, and therefore TAN, guarantees only level-one
                     conformance. Combining characters fall in level two. In TAN, character counts
                     depend exclusively upon base characters, not combining ones (see <xref
                        linkend="combining_characters"/>).</para>
               </warning>
               <para>TAN includes several functions that usefully extend XML regular expressions.
                  See <xref xlink:href="#tan-function-group-regular_expressions"/>.</para>
               <para>Further reading:<itemizedlist>
                     <listitem>
                        <para>Various <link
                              xlink:href="http://www.google.com/search?q=tutorial+regular+expressions"
                              >tutorials on Regular Expressions</link></para>
                     </listitem>
                     <listitem>
                        <para>Wikipedia, <link
                              xlink:href="http://en.wikipedia.org/wiki/Regular_expression">Regular
                              Expressions</link></para>
                     </listitem>
                     <listitem>
                        <para><link xlink:href="http://www.w3.org/TR/xslt-30/#regular-expressions"
                              >Regular Expressions in XSLT 3.0</link></para>
                     </listitem>
                     <listitem>
                        <para><link xlink:href="http://www.unicode.org/reports/tr18/">Unicode and
                              Regular Expressions</link></para>
                     </listitem>
                     <listitem>
                        <para><link xlink:href="http://www.w3.org/TR/xmlschema-2/#regexs">XML Schema
                              Datatypes</link></para>
                     </listitem>
                     <listitem>
                        <para><link
                              xlink:href="http://www.balisage.net/Proceedings/vol25/html/Kalvesmaki01/BalisageVol25-Kalvesmaki01.html"
                              >A New \u: Extending XPath Regular Expressions for Unicode</link>
                        </para>
                     </listitem>
                  </itemizedlist></para>
            </section>
         </section>
      </chapter>
      <chapter xml:id="class_common">
         <title>Common patterns and structures</title>
         <para>This chapter provides general background to the elements and attributes that are
            common to all TAN files. For more detailed discussion, see <xref
               linkend="elements-attributes-and-patterns"/>.</para>
         <para>This chapter does not discuss TAN catalog files, on which see <xref
               linkend="catalog-files"/>.</para>
         <section xml:id="patterns">
            <title>Common patterns</title>
            <section xml:id="pattern-iri_and_name">
               <title>IRI + name pattern</title>
               <para>Both humans and computers need to read and write TAN metadata. Very often what
                  is readable to humans is unreadable to computers, and vice versa. So the TAN
                  format requires that all metadata be provided whenever possible in both forms.
                  Although this rule may appear to introduce redundancy and therefore opportunities
                  for error, the clarity is critical. It is the only way at present to ensure that
                  any person or algorithm that approaches the data can parse and use it. In
                  addition, doubly expressed metadata provides a safeguard much like a checksum:
                  human- and computer-readable descriptions should comport. Any discrepancy signals
                  a problem that should be checked.</para>
               <para>Some metadata, such as that inside <code><link linkend="element-comment"
                        >&lt;comment></link></code> or <code><link linkend="element-change"
                        >&lt;change></link></code>, are neither easily nor profitably translated
                  into a computer-actionable string. In such cases only the human-readable form is
                  required. Other metadata involve regular expressions (e.g., <code><link
                        linkend="attribute-pattern">@pattern</link></code>) or ISO-compliant dates
                  (e.g., <code><link linkend="attribute-when">@when</link></code>), both of which
                  are well formed and are usually human-legible. Such data are not repeated,
                  although they may be explained via <code><link linkend="element-desc"
                        >&lt;desc></link></code> or <code><link linkend="element-comment"
                        >&lt;comment></link></code>.</para>
               <para>Those exceptions aside, all other metadata takes what is called the <emphasis
                     role="italic">IRI + name</emphasis> pattern: one or more <code><link
                        linkend="namespace">&lt;IRI></link></code>s followed by one or more
                        <code><link linkend="element-name">&lt;name></link></code>s then zero or
                  more <code><link linkend="element-desc">&lt;desc></link></code>s. This is the core
                  pattern for nearly all TAN vocabulary items.</para>
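               <para>A minimal sketch of the pattern follows. The wrapping element
                     <code>&lt;work></code> and the wording of the name and description are
                  assumptions chosen only for illustration; the IRI is the sample tag URN used
                  earlier in these guidelines.</para>
               <programlisting>&lt;work>
   &lt;!-- computer-readable identifier (one or more) -->
   &lt;IRI>tag:pat@example.com,2014:work:aristotle-pseudo:secreta-secretorum&lt;/IRI>
   &lt;!-- human-readable label (one or more) -->
   &lt;name>Secret of Secrets&lt;/name>
   &lt;!-- optional fuller description (zero or more) -->
   &lt;desc>A work attributed to Aristotle.&lt;/desc>
&lt;/work></programlisting>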
            </section>
            <section xml:id="digital_entity_metadata">
               <title>Digital entity metadata pattern</title>
               <para>Some entities identified by the <xref linkend="pattern-iri_and_name"/> will be
                  digital resources. In those cases, the IRI + name pattern is extended.</para>
               <para>There must be one or more <code><link linkend="element-location"
                        >&lt;location></link></code>s, with <code><link linkend="attribute-href"
                        >@href</link></code> and <code><link linkend="attribute-accessed-when"
                        >@accessed-when</link></code>, which signal where the resource is and when
                  it was last consulted. In validation, only the first document available will be
                  used. Extra <code><link linkend="element-location">&lt;location></link></code>s
                  might prove helpful for applications.</para>
               <para>There may be an optional <code><link linkend="element-checksum"
                        >&lt;checksum></link></code>, to more accurately specify which version of a
                  file was consulted.</para>
               <para>If the entity is a TAN file, then <code><link linkend="namespace"
                        >&lt;IRI></link></code> must be a valid tag URN that matches the <code><link
                        linkend="attribute-id">@id</link></code> value of the TAN file being
                  referred to. Because there is only one <code><link linkend="attribute-id"
                        >@id</link></code> in a TAN file, any IRI + name pattern that points to it
                  will have only one <code><link linkend="namespace">&lt;IRI></link></code>. If the
                  entity is not a TAN file, then any IRI may be used, including its resolved
                  URL.</para>
               <para><code><link linkend="attribute-accessed-when">@accessed-when</link></code>
                  states when a file was last accessed. During validation, the target file will be
                  checked. Any changes before that date will be ignored; those after will be
                  reported, normally as warnings. See <xref xlink:href="#versioning-tan-files"
                  />.</para>
               <para>All these requirements may seem excessive, since in other formats (HTML, TEI),
                  to refer to another file one needs simply a link, via <code>@href</code> or
                     <code>@src</code>. But TAN files are meant to be valid long after their
                  creation, by which time <code><link linkend="attribute-href">@href</link></code>
                  may well point to a broken link. An <code><link linkend="namespace">&lt;IRI></link></code> might
                  allow one to find a missing file. It also helps specify which file is intended.
                  Sometimes one file gets overwritten by a different one.</para>
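               <para>The extended pattern might look something like the sketch below. The wrapping
                  element <code>&lt;source></code>, the name, both <code>@href</code> values, and
                  the dates are hypothetical, chosen only to show the shape of the pattern. The
                  optional <code>&lt;checksum></code> is omitted.</para>
               <programlisting>&lt;source>
   &lt;!-- for a TAN file, the IRI is the tag URN in that file's @id -->
   &lt;IRI>tag:pat@example.com,2014:transcription:scriptum:badawi-1954:work:secrets&lt;/IRI>
   &lt;name>Transcription of the Secret of Secrets&lt;/name>
   &lt;!-- validation uses the first location that yields a document -->
   &lt;location href="http://example.com/tan/secrets.xml" accessed-when="2021-01-15"/>
   &lt;!-- a hypothetical local copy, offered as a fallback -->
   &lt;location href="../backup/secrets.xml" accessed-when="2021-01-15"/>
&lt;/source></programlisting>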
            </section>
            <section xml:id="edit_stamp">
               <title>Edit stamp</title>
               <para>Most TAN elements allow for an optional edit stamp, an <code><link
                        linkend="attribute-ed-who">@ed-who</link></code> and an <code><link
                        linkend="attribute-ed-when">@ed-when</link></code>, stating who created or
                  edited the enclosed data and when. Neither attribute is allowed without the other. </para>
               <para><code><link linkend="attribute-ed-when">@ed-when</link></code> is one of the
                  attributes that help determine a file's version. See <xref
                     xlink:href="#versioning-tan-files"/>.</para>
               <para>An edit stamp is much like a <code><link linkend="element-change"
                        >&lt;change></link></code> without a narrative. The attributes simply mark
                  the element where a change has been made. If a description of the alteration is
                  considered necessary, <code><link linkend="element-change"
                     >&lt;change></link></code> should be used.</para>
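               <para>By way of illustration only, an edit stamp on a <code>&lt;name></code> might
                  be written as follows. The value <code>pat</code> is a hypothetical identifier,
                  assumed to refer to someone declared elsewhere in the file, and the date is
                  invented.</para>
               <programlisting>&lt;!-- both attributes appear together; neither is allowed alone -->
&lt;name ed-who="pat" ed-when="2021-03-04">Secret of Secrets&lt;/name></programlisting>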
            </section>
         </section>
         <section xml:id="structure">
            <title>Overall structure</title>
            <para>All TAN-compliant files, no matter the type or class, follow a common basic
               structure: (1) a prolog normally with at least two processing instruction nodes; (2) a root
               element; and (3) a head, a body, and an optional teiHeader and tail.</para>
            <para><emphasis role="italic">Prolog and processing instruction nodes</emphasis>: The
               standard prolog of every XML file should begin: <code>&lt;?xml version="1.0"
                  encoding="UTF-8"?></code>
               <footnote>
                  <para>XML version 1.1 is a permissible alternative, and
                        <code>encoding="UTF-8"</code> is optional.</para>
               </footnote></para>
            <para>After that come two processing instructions specifying the two schema files
               required for validation:<itemizedlist>
                  <listitem>
                     <para><code>&lt;?xml-model href="[PATH]/[ROOT-ELEMENT-NAME].rn[g OR
                           c]"?></code></para>
                  </listitem>
                  <listitem>
                     <para><code>&lt;?xml-model href="[PATH]/TAN.sch"?></code></para>
                  </listitem>
               </itemizedlist></para>
            <para>The first processing instruction node points to the RELAX-NG schema that declares
               the major, structural rules. The second points to the finely tuned rules, written in
               Schematron. Both processing instructions are required, except in systems where those
               processing instructions are implicitly understood (e.g., an Oxygen project or
               framework). <code>[PATH]</code> represents the pathname to the schema file, whether
               local or on a server, and <code>[ROOT-ELEMENT-NAME]</code> stands for the name of the
               file's root element (the element that is the ancestor of all other elements in the
               document and the descendant of none). It is your choice whether you use
                  <code>.rnc</code> or <code>.rng</code> as the extension for the RELAX-NG schema.
               The former is the compact syntax and the latter, the XML format. They are equivalent.
               The schemas are written initially in the compact syntax, then converted to the XML
               format.</para>
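            <para>Filled in for a plain text transcription, the prolog described above might read as
               follows. The relative path <code>../schemas/</code> is an assumption made only for
               illustration; substitute the location of your own copy of the schemas. Editors
               commonly add further pseudoattributes such as <code>type</code> or
                  <code>schematypens</code>, which are omitted here to match the template
               above.</para>
            <para>
               <programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="../schemas/TAN-T.rnc"?>
&lt;?xml-model href="../schemas/TAN.sch"?></programlisting>
            </para>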
            <para>TAN files permit three different levels of Schematron validation:
                  <code>terse</code>, <code>normal</code>, and <code>verbose</code>. A phase may be
               specified with a pseudoattribute <code>phase</code> in the prolog, e.g.,
                  <code>&lt;?xml-model href="TAN.sch" phase="verbose"?></code>. But it is customary
               not to specify the phase, since most users will want to pick the level of validation
               desired at a given time. Verbose takes the longest time, and terse the shortest.
               Verbose provides the most feedback, terse the least. But some files will not show any
               difference in results from one phase to the next. For more on validation, see <xref
                  xlink:href="#validating_tan_files"/>.</para>
            <para><emphasis role="italic">Root element</emphasis>: The name of the root element
               identifies the type of TAN file:<table frame="all">
                  <title>Root TAN elements</title>
                  <tgroup cols="3">
                     <colspec colname="c1" colnum="1" colwidth="1.19*"/>
                     <colspec colname="c2" colnum="2" colwidth="1.19*"/>
                     <colspec colname="newCol3" colnum="3" colwidth="1*"/>
                     <thead>
                        <row>
                           <entry>Root element name</entry>
                           <entry>Type of data</entry>
                           <entry>TAN class</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry><code><link linkend="element-TAN-T"
                              >&lt;TAN-T></link></code></entry>
                           <entry>plain text transcriptions</entry>
                           <entry><link linkend="class_1">1</link></entry>
                        </row>
                        <row>
                           <entry><code>&lt;TEI></code></entry>
                           <entry>TEI transcriptions</entry>
                           <entry><link linkend="class_1">1</link></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-TAN-A"
                              >&lt;TAN-A></link></code></entry>
                           <entry>division-based alignments and annotations</entry>
                           <entry><link linkend="class_2">2</link></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-TAN-A-tok"
                                 >&lt;TAN-A-tok></link></code></entry>
                           <entry>token-based alignments</entry>
                           <entry><link linkend="class_2">2</link></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-TAN-A-lm"
                              >&lt;TAN-A-lm></link></code></entry>
                           <entry>lexico-morphological annotations</entry>
                           <entry><link linkend="class_2">2</link></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-TAN-mor"
                              >&lt;TAN-mor></link></code></entry>
                           <entry>part of speech / morphology patterns</entry>
                           <entry><link linkend="class_3">3</link></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-TAN-voc"
                              >&lt;TAN-voc></link></code></entry>
                           <entry>glossaries</entry>
                           <entry><link linkend="class_3">3</link></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-collection"
                                 >&lt;collection></link></code></entry>
                           <entry>catalog of TAN files</entry>
                           <entry><link linkend="class_3">3</link></entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table><footnote>
                  <para><code><link linkend="element-collection">&lt;collection></link></code> is
                     provided here only to complete the table. None of the material in this chapter
                     applies to this special class 3 format. See <xref linkend="catalog-files"
                     />.</para>
               </footnote></para>
            <para>Each root element takes a mandatory <code><link linkend="attribute-id"
                  >@id</link></code> and <code><link linkend="attribute-TAN-version"
                     >@TAN-version</link></code>. On <code><link linkend="attribute-id"
                  >@id</link></code>, see below. <code><link linkend="attribute-TAN-version"
                     >@TAN-version</link></code> must be <code>2021</code>, the current version of
               TAN.</para>
            <para>All TAN elements fall under the namespace <code>tag:textalign.net,2015:ns</code>.
               In most cases, the namespace is declared in the root element. (The only exceptions
               are TAN-TEI transcription files, which take as a default namespace
                  <code>http://www.tei-c.org/ns/1.0</code> everywhere but in <code>/TEI/head</code>,
               which takes the TAN namespace.) For more about namespaces, see <xref
                  linkend="namespace"/>.</para>
            <para><emphasis>Root element children:</emphasis> Most root elements take two mandatory
               children: <code><link linkend="element-head">&lt;head></link></code> and <code><link
                     linkend="element-body">&lt;body></link></code>, the latter containing data and
               the former, metadata (data about the data). Root elements of TAN-TEI files take three
               children: <code>&lt;teiHeader></code>, <code><link linkend="element-head"
                     >&lt;head></link></code>, and <code>&lt;text></code>. The apparent duplication
               of a head element is necessary: the <code>&lt;teiHeader></code> does not satisfy TAN
               metadata requirements, and the TAN header does not try to do what the teiHeader does.
               See <xref linkend="tan-tei"/>. </para>
            <para>All TAN files may take one final optional child, <link linkend="element-tail"
                     ><code>&lt;tail></code></link>, a private use element that allows any
               well-formed XML. It was introduced initially to experiment with methods for improving
               the efficiency of validation and applications, but it can be used for a variety of
               tasks or applications. Nothing in a TAN file should be dependent upon the <link
                  linkend="element-tail"><code>&lt;tail></code></link>. That is, if you are editing
               a TAN file and you add a <link linkend="element-tail"><code>&lt;tail></code></link>,
               assume that it will be disregarded by other users. Similarly, you may delete any TAN
               file's <link linkend="element-tail"><code>&lt;tail></code></link> without
               consequence.</para>
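            <para>Put together, the parts described in this section suggest a skeleton like the
               following for a plain text transcription. This is only a sketch: the value of
                  <code><link linkend="attribute-id">@id</link></code> is a hypothetical tag URN,
               and the comments stand in for content that a valid file would have to supply.</para>
            <para>
               <programlisting>&lt;TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2021"
   id="tag:example.com,2014:tan-t.sample">
   &lt;head>
      &lt;!-- metadata (data about the data) -->
   &lt;/head>
   &lt;body>
      &lt;!-- data -->
   &lt;/body>
   &lt;tail>
      &lt;!-- optional; any well-formed XML, which others may ignore or delete -->
   &lt;/tail>
&lt;/TAN-T></programlisting>
            </para>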
            <section xml:id="tan-file-id">
               <title>Identifying TAN files: <code><link linkend="attribute-id"
                  >@id</link></code></title>
               <para>Every TAN file requires in its root element an <code><link
                        linkend="attribute-id">@id</link></code>, which must take the form of a tag
                  URN (see <xref linkend="tag_urn"/> for syntax). The file's <code><link
                        linkend="attribute-id">@id</link></code> is the primary way other TAN files
                  will refer to it, and it may be used in RDFa, JSON-LD, and linked open data (see
                     <xref linkend="IRIs_and_linked_data"/>).</para>
               <para>A tag URN begins with a namespace component, and concludes with the identifying
                  string. The namespace of <code><link linkend="attribute-id">@id</link></code> must
                  match at least one other tag URN namespace from the <code><link
                        linkend="element-IRI">&lt;IRI></link></code> of a <code><link
                        linkend="element-person">&lt;person></link></code> identified by <code><link
                        linkend="element-file-resp">&lt;file-resp&gt;</link></code>. See <xref
                     xlink:href="#responsibility"/>.</para>
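               <para>The namespace rule can be sketched as follows. Everything specific here is
                  hypothetical: the tag namespace <code>tag:example.com,2014</code>, the identifier
                     <code>editor1</code>, and the person's name. The use of <code>@who</code> on
                        <code><link linkend="element-file-resp">&lt;file-resp&gt;</link></code> and
                  the position of <code><link linkend="element-person">&lt;person></link></code>
                  within the head are assumptions as well; consult the documentation of each
                  element for the authoritative content model. The point is only that the file's
                     <code><link linkend="attribute-id">@id</link></code> and the responsible
                  person's <code><link linkend="element-IRI">&lt;IRI></link></code> share the same
                  tag namespace.</para>
               <para>
                  <programlisting>&lt;TAN-T xmlns="tag:textalign.net,2015:ns" TAN-version="2021"
   id="tag:example.com,2014:tan-t.sample">
   &lt;head>
      &lt;!-- ... -->
      &lt;file-resp who="editor1"/>
      &lt;!-- ... -->
      &lt;person xml:id="editor1">
         &lt;IRI>tag:example.com,2014:self&lt;/IRI>
         &lt;name>Example Editor&lt;/name>
      &lt;/person>
      &lt;!-- ... -->
   &lt;/head>
   &lt;!-- ... -->
&lt;/TAN-T></programlisting>
               </para>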
               <para>In choosing a value for <code><link linkend="attribute-id">@id</link></code>
                  you might imitate the filename, but this is normally not a good idea, since files
                  are frequently renamed, often with good reason. A TAN file's <code><link
                        linkend="attribute-id">@id</link></code> should not be changed, especially
                  after public release. The name should remain permanent and stable, even if flaws
                  in the name are recognized. </para>
               <para>On occasion during editing, it will become clear that revisions are so deep
                  that the file is altogether a different kind of thing. If a previous version has
                  been published, then coining a new <code><link linkend="attribute-id"
                     >@id</link></code>
                  <emphasis>is</emphasis> advised, to make a clean break. You may document the
                  connection by supplying <code><link linkend="element-predecessor"
                        >&lt;predecessor&gt;</link></code>, which establishes a line of
                  ancestry.</para>
               <para>If you take someone else's data and alter it, then you should <emphasis
                     role="italic">not</emphasis> change the <code><link linkend="attribute-id"
                        >@id</link></code>. To ensure that you are credited with any revisions you
                  make to the file (if you are allowed—see <code><link linkend="element-license"
                        >&lt;license&gt;</link></code>), you should add yourself as a <link
                     linkend="element-person"><code>&lt;person></code></link> and then document your
                  alterations through <link linkend="element-change"><code>&lt;change></code></link>
                  or <link linkend="attribute-ed-when"><code>@ed-when</code></link> and <link
                     linkend="attribute-ed-who"><code>@ed-who</code></link>. You might also add a
                        <code><link linkend="element-predecessor">&lt;predecessor&gt;</link></code>
                  element, pointing to the previous version of the file.</para>
               <para>The <code><link linkend="attribute-id">@id</link></code> is the only
                  file-specific metadatum positioned outside <code><link linkend="element-head"
                        >&lt;head></link></code>. It is placed as rootward in the document as
                  possible to make clear that it names the entire document.</para>
            </section>
            <section>
               <title xml:id="versioning-tan-files">TAN file versions</title>
               <para>The version of a TAN file is identified by the most recent date in a file's
                        <code><link linkend="attribute-when">@when</link></code>, <code><link
                        linkend="attribute-ed-when">@ed-when</link></code>, and <code><link
                        linkend="attribute-accessed-when">@accessed-when</link></code>. </para>
               <para>Whenever you change a TAN file that has already been published, provide at
                  least an edit stamp (<xref linkend="edit_stamp"/>) in the part of the file you
                  changed, or add a new <code><link linkend="element-comment"
                     >&lt;comment></link></code> or <code><link linkend="element-change"
                        >&lt;change></link></code>, so that anyone validating a TAN file dependent
                  upon yours will be warned that changes have been made. The user may then either
                  continue to process the file (the changes may be minor or inconsequential) or
                  pause and see if anything on their end needs to be changed. </para>
            </section>
         </section>
         <section xml:id="inheritable_attributes">
            <title>Attribute inheritability and priority</title>
            <para>Some attributes affect not merely their parent element but all their parent's
               descendants. This phenomenon is called <emphasis>inheritability</emphasis>.</para>
            <para>Some attributes are non-inheritable. That is, the attribute relates only to the
               parent element. Examples: <code><link linkend="attribute-pattern"
                  >@pattern</link></code>, <code><link linkend="attribute-flags"
                  >@flags</link></code>. If TAN schema documentation for an attribute does not state
               anything about the inheritability of an attribute's values, it should be treated as
               non-inheritable.</para>
            <para>Most inheritable attributes are weakly inheritable. That is, inheritance stops at
               any descendant that has the same attribute. For example, an element whose <code><link
                     linkend="attribute-xmllang">@xml:lang</link></code> is set to <code>eng</code>
               declares that its text nodes are in English, but it might contain another element
               whose <code><link linkend="attribute-xmllang">@xml:lang</link></code> is set to
                  <code>lat</code>. If text has multiple ancestors with different <code><link
                     linkend="attribute-xmllang">@xml:lang</link></code>s, the closest
               (leafward-most) is the only one that counts. </para>
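            <para>A short sketch of weak inheritance follows. The values of <code>@type</code> and
                  <code><link linkend="attribute-n">@n</link></code> are illustrative only. The
               first inner <code><link linkend="element-div">&lt;div></link></code> inherits
                  <code>eng</code> from <code><link linkend="element-body">&lt;body></link></code>;
               the second declares its own language, so inheritance stops there and its text is
               treated as Latin.</para>
            <para>
               <programlisting>&lt;body xml:lang="eng">
   &lt;div type="chapter" n="1">
      &lt;div type="section" n="1">An English sentence.&lt;/div>
      &lt;div type="section" n="2" xml:lang="lat">Sententia Latina.&lt;/div>
   &lt;/div>
&lt;/body></programlisting>
            </para>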
            <para>Other inherited attributes are cumulative. That is, their values combine as one
               goes from root to leaf. For example, if an element with <code><link
                     linkend="attribute-cert">@cert</link></code> wraps another, and each one has a
                     <code><link linkend="attribute-cert">@cert</link></code> value of
                  <code>0.5</code>, it means that the claim behind the wrapped element has only 25%
               certainty. <code><link linkend="attribute-n">@n</link></code> in a <code><link
                     linkend="element-div">&lt;div></link></code> is indirectly cumulative for the
               purposes of resolving values of <code><link linkend="attribute-ref"
                  >@ref</link></code>. Any given <code><link linkend="element-div"
                  >&lt;div></link></code> has one or more implied references, formed by all
               permutations of concatenating values of inherited <code><link linkend="attribute-n"
                     >@n</link></code>s. Cumulative inherited attributes are infrequent, and the
               documentation specifies how each one behaves.</para>
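            <para>The certainty figure above is simply the product of the two values (0.5 × 0.5 =
               0.25). The indirect accumulation of <code><link linkend="attribute-n"
               >@n</link></code> can be sketched as follows, assuming for the sake of illustration
               that the outer <code><link linkend="element-div">&lt;div></link></code> carries two
               alternative names, <code>1</code> and <code>a</code>. Under that assumption the inner
                  <code><link linkend="element-div">&lt;div></link></code> would have two implied
               references, one built from <code>1</code> and <code>2</code> and the other from
                  <code>a</code> and <code>2</code>. How the parts of a reference are delimited is
               governed by the rules for <code><link linkend="attribute-ref">@ref</link></code> and
               is not shown here.</para>
            <para>
               <programlisting>&lt;div type="chapter" n="1 a">
   &lt;div type="section" n="2">Some text.&lt;/div>
&lt;/div></programlisting>
            </para>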
            <para>Some attributes within the same element have interpretive priority. <code><link
                     linkend="attribute-claimant">@claimant</link></code>, for example, has priority
               over <code><link linkend="attribute-cert">@cert</link></code>. That is, the two
               attributes in the same element are to be interpreted to mean: "<code><link
                     linkend="attribute-claimant">@claimant</link></code> has <code><link
                     linkend="attribute-cert">@cert</link></code> confidence about the following
               claim:...." It does not mean that one is uncertain whether the claimant made the
               claim.</para>
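            <para>As a hedged illustration, consider the fragment below. The element name
                  <code>claim</code> and the identifier <code>scholar1</code> are used only for the
               sake of the example; check the schema documentation for the elements that actually
               permit both attributes. The fragment should be read as "scholar1 has 80% confidence
               in the enclosed claim," not as "it is 80% certain that scholar1 made the
               claim."</para>
            <para>
               <programlisting>&lt;claim claimant="scholar1" cert="0.8">
   &lt;!-- the content of the claim -->
&lt;/claim></programlisting>
            </para>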
         </section>
         <section xml:id="defining_tokens">
            <title>Defining words and tokens</title>
            <para>At the heart of interaction between class-1 and class-2 files is the need to
               identify words. This poses a problem at the outset. The term <emphasis role="italic"
                  >word</emphasis> is notoriously difficult to define, no matter the context or
               language. For example, "New York" and "didn't" can each be reasonably defined as
               being either one or two words. Furthermore, some scholars consider punctuation to be
               words (e.g., commas in modern prose, representing "and"), whereas others ignore them
               as being anachronistic or capricious (e.g., medieval manuscripts or modern editions
               of ancient texts). In the end, the many meanings for "word" reflect the diversity of
               scholarship.</para>
            <para>TAN follows the field of corpus linguistics and avoids <emphasis>word</emphasis>
               in favor of the proximate term <emphasis role="italic">token</emphasis>—one or more
               characters defined not according to grammar but according to a regular expression
               (see <xref linkend="regular_expressions"/>). </para>
            <para>In TAN, a token is purely a string definition, used to segment and to point. A
               token in TAN does not entail any linguistic categories. Neither editors nor users of
               TAN data should infer that a <link linkend="element-tok"><code>&lt;tok></code></link>
               points to a morpheme, a lexeme, or any other linguistic entity. There will frequently
               be a fortuitous correlation between the two, but it is not guaranteed. </para>
            <para>TAN was developed with a concern for ancient literature, where punctuation is
               generally ignored as being late or not central to the text. Happily, even in
               contemporary use, most people ignore punctuation when they count words. Therefore the
               default <link linkend="element-token-definition"
                  ><code>&lt;token-definition></code></link> defines a token as being any continuous
               string of word characters (<code>\w</code>), the soft hyphen, the zero-width space,
               or the zero-width joiner, formally defined by <xref
                  xlink:href="#variable-token-definition-default"/>:</para>
            <para>
               <programlisting>&lt;token-definition regex="[\w&amp;#xad;&amp;#x200b;&amp;#x200d;]+"/></programlisting>
            </para>
            <para>This pattern closely resembles what is ordinarily thought of as words, but perhaps
               with some surprises (see above, <xref linkend="regular_expressions"/>). If no <link
                  linkend="element-token-definition"><code>&lt;token-definition></code></link> is
               explicitly given, the default token definition above will be used.</para>
            <para>If you are working with modern texts, where punctuation might be important to name
               and number, try the built-in keyword <code>letters and punctuation</code>:</para>
            <para>
               <programlisting>&lt;token-definition regex="[\w&amp;#xad;&amp;#x200b;&amp;#x200d;]+|[^\w&amp;#xad;&amp;#x200b;&amp;#x200d;\s]"/></programlisting>
            </para>
            <para>This expression defines a token as a sequence of word characters or any single
               character that is neither a word character nor a space. The string <code>(I go!)</code> would
               have five tokens: <code>(</code>, <code>I</code>, <code>go</code>, <code>!</code>,
               and <code>)</code>.</para>
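            <para>The effect of the expression can be tried outside TAN with a few lines of XSLT.
               The stylesheet below is only a sketch, not TAN's own tokenization code. It assumes
               an XSLT 2.0 (or later) processor and a named entry template, here called
                  <code>main</code> (a hypothetical name), and it writes each token of <code>(I
                  go!)</code> on its own line, which should yield the five tokens listed
               above.</para>
            <para>
               <programlisting>&lt;xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
   &lt;xsl:output method="text"/>
   &lt;!-- Named entry point; start the transformation at this template -->
   &lt;xsl:template name="main">
      &lt;xsl:analyze-string select="'(I go!)'"
         regex="[\w&amp;#xad;&amp;#x200b;&amp;#x200d;]+|[^\w&amp;#xad;&amp;#x200b;&amp;#x200d;\s]">
         &lt;!-- Each match is one token; whitespace matches nothing and is dropped -->
         &lt;xsl:matching-substring>
            &lt;xsl:value-of select="."/>
            &lt;xsl:text>&amp;#xa;&lt;/xsl:text>
         &lt;/xsl:matching-substring>
      &lt;/xsl:analyze-string>
   &lt;/xsl:template>
&lt;/xsl:stylesheet></programlisting>
            </para>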
            <para>For other standard TAN <link
                  linkend="element-token-definition"><code>&lt;token-definition></code></link>s, see <xref
                  xlink:href="#vocabularies-token-definitions"/>. You
               may customize your own <link linkend="element-token-definition"
                     ><code>&lt;token-definition></code></link>. But keep in mind that TAN files
               were meant to be shared across fields and disciplines. You should define tokens in a
               way users of your texts expect. Two class-2 TAN annotation files with different
               tokenization systems can be challenging to collate.</para>
         </section>
         <section xml:id="metadata_head">
            <title>Metadata (<code><link linkend="element-head">&lt;head></link></code>)</title>
            <para>No matter how much one TAN format differs from another, the metadata follows the
               same basic structure. Anyone getting a TAN file, no matter its class or type, is
               assumed to want to know, and therefore to find easily and predictably, the following:<orderedlist>
                  <listitem>
                     <para>the stable name of the file;</para>
                  </listitem>
                  <listitem>
                     <para>its version;</para>
                  </listitem>
                  <listitem>
                     <para>its sources;</para>
                  </listitem>
                  <listitem>
                     <para>other files upon which it depends or otherwise has an important
                        relationship;</para>
                  </listitem>
                  <listitem>
                     <para>the most significant parts of the editorial history;</para>
                  </listitem>
                  <listitem>
                     <para>the linguistic or scholarly conventions that have been adopted in
                        creating and editing the data;</para>
                  </listitem>
                  <listitem>
                     <para>the license, i.e., who holds what rights to the data, and what kind of
                        reuse is allowed;</para>
                  </listitem>
                  <listitem>
                     <para>the persons, organizations, or entities that helped create the data, and
                        the roles played by each.</para>
                  </listitem>
               </orderedlist></para>
            <para>To answer these questions completely and consistently, the
                     <code><link linkend="element-head">&lt;head></link></code>, a mandatory child
               of the root element, follows a common pattern across all TAN formats, so that the
               metadata of any TAN file can be read predictably. The TAN <code><link linkend="element-head"
                     >&lt;head></link></code>, intended to be concise and focused, compels you to
               provide metadata for the data that is governed by <code><link linkend="element-body"
                     >&lt;body></link></code>, but it does not accommodate metadata for the
               metadata. TAN metadata centers on the data itself and not on other things. For
               example, <code><link linkend="element-head">&lt;head></link></code> requires you to name
               the people who helped create or edit the data, but you are not expected to tell us
               about them. Merely give good <code><link linkend="element-IRI"
               >&lt;IRI></link></code>s to point to authoritative sources that provide background information.<footnote>
                  <para>The principles above explain why the TEI extension of TAN requires two
                     heads, one for TEI and the other for TAN. The <code>&lt;teiHeader></code>
                     supports the creation of metadata that has little or no relevance to the
                     content of <code><link linkend="element-body">&lt;body></link></code>, has its
                     own unique structure, has very few metadata that are required, and is not
                     designed to incorporate IRIs. Although <code>&lt;teiHeader></code> and TAN's
                           <code><link linkend="element-head">&lt;head></link></code> overlap in
                     some respects, they cannot be mapped onto each other. Each has a different
                     purpose, so both must be retained.</para>
               </footnote></para>
            <para>In what follows we provide a general overview of the TAN <code><link
                     linkend="element-head">&lt;head></link></code>, focusing on its general
               structure, and some of the principles that affect other parts of the TAN
               ecosystem.</para>
            <section>
               <title>Key Information</title>
               <para>Key information about the file as a whole is the first section of a <code><link
                        linkend="element-head">&lt;head></link></code>. This includes <code><link
                        linkend="element-name">&lt;name></link></code>, perhaps one or more
                        <code><link linkend="element-desc">&lt;desc></link></code>s, and perhaps one
                  or more <code><link linkend="element-master-location"
                     >&lt;master-location></link></code>s, which point to locations for
                  authoritative versions. <code><link linkend="element-master-location"
                        >&lt;master-location></link></code> is optional, but not if <code><link
                        linkend="element-to-do">&lt;to-do&gt;</link></code> (see below) is
                  empty.</para>
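               <para>A sketch of this opening section might look like the following. The name,
                  description, and URL are hypothetical, and the use of <code><link
                        linkend="attribute-href">@href</link></code> on <code><link
                        linkend="element-master-location">&lt;master-location></link></code> is an
                  assumption to be checked against that element's documentation.</para>
               <para>
                  <programlisting>&lt;head>
   &lt;name>Sample transcription of a hypothetical text&lt;/name>
   &lt;desc>One or more optional descriptions may follow the name.&lt;/desc>
   &lt;master-location href="http://example.com/tan/sample.xml"/>
   &lt;!-- declarations, networked files, and the rest of the head follow -->
&lt;/head></programlisting>
               </para>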
            </section>
            <section xml:id="key_declarations">
               <title>Key Declarations</title>
               <para>Each <code><link linkend="element-head">&lt;head></link></code> in a TAN file
                  has a declaration section, pertaining to how the file should be used: <code><link
                        linkend="element-license">&lt;license></link></code> and <code><link
                        linkend="element-numerals">&lt;numerals&gt;</link></code>.</para>
               <para><code><link linkend="element-license">&lt;license></link></code> stipulates the
                  license(s) under which the persons or organizations listed in its <code><link
                        linkend="attribute-licensor">@licensor</link></code> are releasing the data.
                     <emphasis role="bold">The license applies only to the data in <code><link
                           linkend="element-body">&lt;body></link></code>, not to its
                     sources.</emphasis> The distinction is important, and helpful. It is much
                  easier for you to decide and state the rights and license behind your own work
                  than to speak for others. Declaring who holds what rights over your source(s) may
                  be not only difficult but risky, and is therefore optional, best handled in a
                        <code><link linkend="element-desc">&lt;desc&gt;</link></code> or <code><link
                        linkend="element-comment">&lt;comment></link></code>.</para>
               <para>When using a TAN file, you should investigate the entire chain of rights. You
                  may find discrepancies between the license of a TAN file and that of its sources.
                  For example, you might create a complete TAN-based lexico-morphological analysis
                  of a 20th-century novel, and legitimately release the TAN data under a public
                  domain license, even though the novel itself is under copyright. Users must be
                  aware of and respect licenses, and know that the license in a TAN file may not be
                  the license of its sources. </para>
               <para>TAN adopts the Creative Commons licenses as its default license vocabulary. See
                     <xref linkend="vocabularies-licenses"/>.</para>
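               <para>By way of illustration only, a license declaration might be written out in
                  full as below. The identifier <code>editor1</code> is hypothetical, and the
                  pairing of <code><link linkend="element-IRI">&lt;IRI></link></code> and <code><link
                        linkend="element-name">&lt;name></link></code> as children is an assumption;
                  the entry for <code><link linkend="element-license">&lt;license></link></code>
                  and the standard license vocabulary describe the forms actually
                  supported.</para>
               <para>
                  <programlisting>&lt;license licensor="editor1">
   &lt;IRI>http://creativecommons.org/licenses/by/4.0/&lt;/IRI>
   &lt;name>Creative Commons Attribution 4.0 International&lt;/name>
&lt;/license></programlisting>
               </para>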
               <para><code><link linkend="element-numerals">&lt;numerals&gt;</link></code> may be
                  used to declare whether an ambiguous numeral should be interpreted as an
                  alphabetic numeral or a Roman numeral (default). See the entry for <code><link
                        linkend="element-numerals">&lt;numerals&gt;</link></code> as well as the
                     <link xlink:href="#numeration-systems">section on numeration
                  systems</link>.</para>
               <para>Many TAN files allow in this section <code><link
                        linkend="element-token-definition">&lt;token-definition></link></code>,
                  which specifies a definition for tokens, perhaps tailored via <code><link
                        linkend="attribute-src">@src</link></code> to a specific class-2 file. See
                     <xref xlink:href="#defining_tokens"/> and <code><link
                        linkend="element-token-definition"
                  >&lt;token-definition></link></code>.</para>
            </section>
            <section xml:id="inclusions-and-vocabularies">
               <title>Networked Files</title>
               <para>The third major section of <code><link linkend="element-head"
                     >&lt;head></link></code> accommodates links and references to other files. Some
                  files are essential to processing the TAN file, while others are less
                  important.</para>
               <para>The two most critical types of files are marked by <code><link
                        linkend="element-inclusion">&lt;inclusion></link></code> and <code><link
                        linkend="element-vocabulary">&lt;vocabulary></link></code>. The files
                  pointed to by these elements should be considered constituent parts of the
                  dependent TAN file. In the validation process, failure to access any one of them
                  (calculated recursively) is a fatal error.</para>
               <para><code><link linkend="element-inclusion">&lt;inclusion></link></code> and
                        <code><link linkend="element-vocabulary">&lt;vocabulary></link></code> were
                  developed to reduce duplication (and therefore potential error) in collections of
                  TAN files. Many if not most TAN files are created alongside or in the context of a
                  project, where certain data patterns are repeated. Explicit repetition from one
                  file to the next makes them prone to error. Changes might be made in one file but
                  not in another, introducing version conflicts. <code><link
                        linkend="element-inclusion">&lt;inclusion></link></code> and <code><link
                        linkend="element-vocabulary">&lt;vocabulary></link></code> provide a
                  specialized method of inclusion that leads to cleaner, smaller files.</para>
               <para>In general, you should first try using <code><link linkend="element-vocabulary"
                        >&lt;vocabulary></link></code>, which points to TAN-voc files that collect
                  vocabulary items common to the project. If that element does not do what you want,
                  then try <code><link linkend="element-inclusion">&lt;inclusion></link></code>. It
                  is normally easier to diagnose a complex set of <code><link
                        linkend="element-vocabulary">&lt;vocabulary></link></code>s than a complex
                  set of <code><link linkend="element-inclusion"
                  >&lt;inclusion></link></code>s.</para>
               <section>
                  <title>Vocabularies</title>
                  <para>Oftentimes, from one file to the next, an editor needs to refer repeatedly
                     to a common set of things, e.g., manuscripts, works of literature, or persons
                     who helped edit the files. </para>
                  <para>Projects are advised to create their own <code><link
                           linkend="element-TAN-voc">&lt;TAN-voc&gt;</link></code> files, populated
                     with commonly used vocabulary. Once set up, the TAN-voc file must be linked to
                     via a <code><link linkend="element-vocabulary">&lt;vocabulary></link></code> in
                     the <code><link linkend="element-head">&lt;head></link></code> of each TAN file
                     that draws from the vocabulary. Vocabulary items can then be invoked either by
                     pointing to <code><link linkend="element-name">&lt;name&gt;</link></code>
                     values, or by assigning an <code><link linkend="attribute-xmlid"
                        >@xml:id</link></code> to a vocabulary item placed in the <link
                        linkend="element-head"><code>&lt;head></code></link>'s <code><link
                           linkend="element-vocabulary-key">&lt;vocabulary-key></link></code>. If
                     you draw upon <code><link linkend="element-name">&lt;name&gt;</link></code>,
                     you may make alterations to capitalization. Hyphens, spaces, and underscores
                     are treated as interchangeable. Capitalization and spelling of <code><link
                           linkend="attribute-xmlid">@xml:id</link></code>, however, must be
                     strictly followed.</para>
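                  <para>A minimal sketch of this arrangement might look like the following. The
                     IRI, file name, date, and personal name are hypothetical placeholders, and the
                     rest of the <code>&lt;head></code> is omitted; consult the entries for the
                     individual elements for their exact content models.</para>
                  <programlisting language="xml">&lt;head&gt;
   &lt;!-- other metadata omitted --&gt;
   &lt;!-- link to the project's TAN-voc file (hypothetical IRI and location) --&gt;
   &lt;vocabulary&gt;
      &lt;IRI&gt;tag:example.org,2021:tan-voc:project&lt;/IRI&gt;
      &lt;name&gt;Project vocabulary&lt;/name&gt;
      &lt;location href="project.TAN-voc.xml" accessed-when="2021-06-01"/&gt;
   &lt;/vocabulary&gt;
   &lt;vocabulary-key&gt;
      &lt;!-- assign a local id to an item that the TAN-voc file names "Jane Doe" --&gt;
      &lt;person xml:id="jdoe" which="Jane Doe"/&gt;
   &lt;/vocabulary-key&gt;
&lt;/head&gt;</programlisting>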
                  <para>Vocabulary (TAN-voc) files tend to require frequent change and expansion, so
                     it is recommended that you depend upon only those TAN-voc files that are part
                     of your project, and not those from a different project.</para>
                  <para>In the host file, any attribute that takes multiple IDrefs, e.g.,
                           <code><link linkend="attribute-who">@who</link></code>, <code><link
                           linkend="attribute-type">@type</link></code>, <code><link
                           linkend="attribute-subject">@subject</link></code>, may take a mixture of
                     values that refer to numerous vocabulary items via <code><link
                           linkend="attribute-xmlid">@xml:id</link></code> or <code><link
                           linkend="element-name">&lt;name&gt;</link></code>. But in these
                     attributes spaces are reserved to delimit multiple values, which means that if
                     you refer to a <code><link linkend="element-name">&lt;name&gt;</link></code>,
                     spaces must be replaced with the underscore or hyphen. A <link
                        linkend="attribute-which"><code>@which</code></link> in the host file,
                     however, can take no more than one value, so using spaces is fine. </para>
                  <para><emphasis role="bold"><code><link linkend="attribute-id">@id</link></code>
                        and <code><link linkend="attribute-xmlid">@xml:id</link></code> are
                        case-sensitive, and do not allow spaces. <link linkend="attribute-which"
                              ><code>@which</code></link> and therefore <code><link
                              linkend="element-name">&lt;name&gt;</link></code> are not
                        case-sensitive, and the space, hyphen, and underscore are
                        equivalent.</emphasis></para>
                  <para><emphasis role="bold">If you point to </emphasis><code><link
                           linkend="attribute-id"><emphasis role="bold"
                        >@id</emphasis></link></code><emphasis role="bold"> or
                           </emphasis><code><link linkend="attribute-xmlid"><emphasis role="bold"
                              >@xml:id</emphasis></link></code><emphasis role="bold"> you must
                        respect case and punctuation. If you are pointing to a
                           </emphasis><code><link linkend="element-name"><emphasis role="bold"
                              >&lt;name&gt;</emphasis></link></code><emphasis role="bold"> you can
                        ignore case, and you should probably replace the space with a
                     _.</emphasis></para>
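                  <para>The difference can be illustrated with a short, hypothetical fragment. It
                     assumes that the linked vocabulary defines persons named <code>Jane Doe</code>
                     and <code>John Smith</code>, and that only the former has been given a local
                     id; the names, ids, and date are invented for the illustration.</para>
                  <programlisting language="xml">&lt;!-- @which takes a single name, so spaces and changes of case are fine --&gt;
&lt;person xml:id="jdoe" which="jane doe"/&gt;

&lt;!-- @who takes several values separated by spaces:
     "jdoe" is an id and must match the @xml:id above exactly;
     "john_smith" points to the name "John Smith", with the space replaced --&gt;
&lt;change when="2021-06-01" who="jdoe john_smith"&gt;Corrected typos in chapter 2.&lt;/change&gt;</programlisting>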
                  <para>TAN includes a number of standard vocabulary (TAN-voc) files for a variety
                     of concepts commonly used in textual scholarship (see <xref
                        linkend="vocabularies-master-list"/>). Vocabulary items have been defined
                     for more than one hundred types of textual divisions, and any of these can be
                     invoked simply by using their names (see <xref
                        xlink:href="#vocabularies-div-types"/>).</para>
                  <para><code><link linkend="element-vocabulary">&lt;vocabulary></link></code>
                     itself may take <link linkend="attribute-which"><code>@which</code></link>, but
                     only to point to one of the extra TAN vocabularies listed in <xref
                        xlink:href="#vocabularies-vocabularies"/>. You cannot point to a customized
                     TAN-voc file via <link linkend="attribute-which"><code>@which</code></link>.
                     This restriction avoids some complexity in the validation routine. See <xref
                        linkend="extra_n_vocabulary"/> on how to use this feature.</para>
                  <para>Files pointed to by <code><link linkend="element-vocabulary"
                           >&lt;vocabulary></link></code> are considered an essential part of any
                     TAN file. Failure to find the target file will throw a fatal error during
                     validation.</para>
               </section>
               <section>
                  <title>Inclusions</title>
                  <para>Whereas vocabularies do not change the host document, inclusions do. Unlike
                     other forms of inclusion you might be familiar with, TAN inclusion is targeted
                     at select elements, <emphasis>never</emphasis> an entire file. TAN inclusion is
                     a two-step process. </para>
                  <para>First, a TAN file is linked to, and therefore made available for inclusion,
                     via <code><link linkend="element-inclusion">&lt;inclusion></link></code>s
                     (inside <link linkend="element-head"><code>&lt;head></code></link>). Like
                           <code><link linkend="element-vocabulary"
                     >&lt;vocabulary&gt;</link></code>, an <code><link linkend="element-inclusion"
                           >&lt;inclusion></link></code> does nothing on its own. It merely points
                     to a file that is eligible for inclusions. No actual inclusions occur until the
                     next step.</para>
                  <para>Second, select parts of the included file are invoked in the dependent file.
                     To do so, insert an element X in a valid location, but with nothing but
                           <code><link linkend="attribute-include">@include</link></code>, with one
                     or more values (space-delimited), each pointing to the <code><link
                           linkend="attribute-xmlid">@xml:id</link></code> value of an <code><link
                           linkend="element-inclusion">&lt;inclusion></link></code>. In the
                     validation process, that element X will be replaced with all element Xs found
                     in the inclusion file, resolved recursively, and ignoring duplications (deeply
                     equal elements).</para>
                  <para>For example, a TAN-T file might have a <code>&lt;div
                     include="poem1"></code>. The validation routine will replace that element with
                     every rootmost <code><link linkend="element-div">&lt;div></link></code> in the
                     included file called <code>poem1</code>. </para>
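                  <para>The two steps might be sketched as follows. The IRI, file name, date, and
                     division type are hypothetical, and most of the host file's
                     <code>&lt;head></code> is omitted.</para>
                  <programlisting language="xml">&lt;!-- host TAN-T file (fragment) --&gt;
&lt;head&gt;
   &lt;!-- step 1: make the other file available for inclusion --&gt;
   &lt;inclusion xml:id="poem1"&gt;
      &lt;IRI&gt;tag:example.org,2021:tan-t:poem1&lt;/IRI&gt;
      &lt;name&gt;Poem 1&lt;/name&gt;
      &lt;location href="poem1.TAN-T.xml" accessed-when="2021-06-01"/&gt;
   &lt;/inclusion&gt;
&lt;/head&gt;
&lt;body&gt;
   &lt;div n="1" type="poem"&gt;Text that is native to the host file.&lt;/div&gt;
   &lt;!-- step 2: invoke the inclusion; this element carries nothing but @include --&gt;
   &lt;div include="poem1"/&gt;
&lt;/body&gt;</programlisting>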
                  <para>Any host file that includes elements from another file inherits any
                     vocabulary associated with the inclusion, and along with it <code><link
                           linkend="attribute-xmlid">@xml:id</link></code> values. This may result
                     in IDrefs pointing to two or more distinct vocabulary items, which may be a
                     benefit or a hindrance. Be familiar with the items you are including. </para>
                  <para>TAN inclusion is very practical for texts. Textual works commonly nest
                     inside each other. By setting up your class-1 files as a series of inclusions,
                     you can reduce validation time, both in the file and in class-2 files that
                     depend upon the transcriptions. See the <code>examples</code> subdirectory for
                     a sample of a Gospel of Matthew including the Sermon on the Mount including the
                     Lord's Prayer. </para>
                  <para>The inclusion technique is also especially useful for vocabulary (TAN-voc)
                     files. A single master TAN-voc file can include other vocabulary files, each
                     devoted to a particular type of item (e.g., one for works, one for scripta).
                     Project files then need to link merely to the master TAN-voc file.</para>
                  <para>You can include a TAN file that itself includes other TAN files. Inclusion
                     is recursive. In any recursive system, circularity is fatal. That is true for
                     TAN inclusion as well, but only within the scope of specified element names. It
                     is perfectly legal for two files to include each other, as long as they do not
                     try to include (directly or indirectly) the same elements, or try to consult
                     each other to resolve any vocabulary.</para>
                  <para>Files pointed to by <code><link linkend="element-inclusion"
                           >&lt;inclusion></link></code> are considered an essential part of any TAN
                     file. Failure to find the target file will throw a fatal error during
                     validation.</para>
               </section>
               <section xml:id="other_related_files">
                  <title>Other related files</title>
                  <para>A TAN file may point to a number of other types of files. The more that are
                     mentioned, the richer the network. <code><link linkend="element-predecessor"
                           >&lt;predecessor&gt;</link></code> and <code><link
                           linkend="element-successor">&lt;successor&gt;</link></code> point to
                     versions of the file that precede and postdate it. </para>
                  <para><code><link linkend="element-source">&lt;source&gt;</link></code> is another
                     type of related file, but it may or may not link to another file. In class-2
                     files <code><link linkend="element-source">&lt;source&gt;</link></code> always
                     points to a class-1 TAN file. In class-1 and class-3 files, <code><link
                           linkend="element-source">&lt;source&gt;</link></code> may point either to
                     a file or to a scriptum (see <xref xlink:href="#domain_model"/>).</para>
                  <para><code><link linkend="element-see-also">&lt;see-also></link></code> can be
                     used to point to any file that has some relationship to a TAN file. The
                     required <code><link linkend="attribute-relationship"
                        >@relationship</link></code> points to one or more <code><link
                           linkend="element-relationship">&lt;relationship&gt;</link></code>
                     vocabulary items. There is no standard TAN vocabulary for relationships.
                     Normally, when a file-to-file relationship is considered important, it becomes
                     a full-fledged standard TAN element.</para>
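                  <para>Because no standard vocabulary exists, a relationship has to be defined
                     before it can be used. The following sketch assumes a locally coined
                     relationship; every IRI, name, file name, and date in it is hypothetical.</para>
                  <programlisting language="xml">&lt;!-- in the networked-files part of the head --&gt;
&lt;see-also relationship="commentary"&gt;
   &lt;IRI&gt;tag:example.org,2021:notes:sample-commentary&lt;/IRI&gt;
   &lt;name&gt;Commentary on the sample transcription&lt;/name&gt;
   &lt;location href="sample-commentary.xml" accessed-when="2021-06-01"/&gt;
&lt;/see-also&gt;

&lt;!-- in the vocabulary-key, the local definition that @relationship points to --&gt;
&lt;vocabulary-key&gt;
   &lt;relationship xml:id="commentary"&gt;
      &lt;IRI&gt;tag:example.org,2021:relationship:commentary&lt;/IRI&gt;
      &lt;name&gt;commentary on&lt;/name&gt;
   &lt;/relationship&gt;
&lt;/vocabulary-key&gt;</programlisting>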
                  <para>Some TAN formats allow special types of related files (e.g., <code><link
                           linkend="element-redivision">&lt;redivision&gt;</link></code> and
                           <code><link linkend="element-model">&lt;model&gt;</link></code> for
                     class-1 files). See metadata descriptions under specific classes or formats. </para>
               </section>
            </section>
            <section xml:id="adjustments">
               <title>Adjustments</title>
               <para>The fourth major section of <code><link linkend="element-head"
                     >&lt;head></link></code>, which is optional, consists of <code><link
                        linkend="element-adjustments">&lt;adjustments></link></code>, which
                  specifies changes that have been made (class 1), or should be made (class 2), to
                  the sources. </para>
               <para>In class-1 files, these consist of <code><link linkend="element-normalization"
                        >&lt;normalization></link></code>s and <code><link linkend="element-replace"
                        >&lt;replace&gt;</link></code>s; see <xref
                     xlink:href="#normalizing_transcriptions"/>. </para>
               <para>Class-2 files allow <code><link linkend="element-skip"
                     >&lt;skip&gt;</link></code>, <code><link linkend="element-rename"
                        >&lt;rename&gt;</link></code>, <code><link linkend="element-equate"
                        >&lt;equate&gt;</link></code>, and <code><link linkend="element-reassign"
                        >&lt;reassign&gt;</link></code> as adjustments; see <xref
                     xlink:href="#class_2_metadata"/>.</para>
            </section>
            <section>
               <title>Local vocabulary items and ID assignments: <code><link
                        linkend="element-vocabulary-key">&lt;vocabulary-key></link></code></title>
               <para>The fifth major part of <code><link linkend="element-head"
                     >&lt;head></link></code>, <code><link linkend="element-vocabulary-key"
                        >&lt;vocabulary-key></link></code>, allows you to declare any vocabulary
                  items specific to the file. It also allows you to take vocabulary items existing
                  in other TAN-voc files (whether defined in <code><link
                        linkend="element-vocabulary">&lt;vocabulary&gt;</link></code> or standard
                  TAN vocabulary), and assign them <code><link linkend="attribute-xmlid"
                        >@xml:id</link></code>s that are valid only in the current file. Anything in
                        <code><link linkend="element-vocabulary-key"
                     >&lt;vocabulary-key></link></code>, and any TAN-voc files pointed to via
                        <code><link linkend="element-vocabulary">&lt;vocabulary&gt;</link></code>,
                  will overwrite default TAN vocabulary.</para>
               <para>These id assignments can be supplemented with <code><link
                        linkend="element-alias">&lt;alias&gt;</link></code>es, which are used to
                  assign an id to one or more ids. This practice resembles what text editors do when
                  naming groups of manuscripts. Each manuscript is given a siglum, say a single
                  lowercase Greek or Latin letter, and the manuscripts are grouped together into
                  families, with each family given its own siglum, say an uppercase letter. If the
                  editor wishes to indicate that a whole family of manuscripts departs from a
                  particular reading, the family siglum is all that is needed. An <code><link
                        linkend="element-alias">&lt;alias&gt;</link></code> works much the same way,
                  and can be used for any vocabulary items. For example, if a textual division can
                  be legitimately called both a rubric and a heading, you could assign
                     <code>rubr</code> and <code>hd</code> as ids in the <code><link
                        linkend="element-vocabulary-key">&lt;vocabulary-key></link></code> to the
                  vocabulary items for the rubric and the heading, and then insert <code>&lt;alias
                     xml:id="rubrichead" idrefs="rubr hd"&gt;</code>. Then, in that file,
                     <code>&lt;div n="1" type="rubrichead"></code> would identify that <code><link
                        linkend="element-div">&lt;div></link></code> as being both a rubric and a
                  heading.</para>
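               <para>Put together, the rubric example might look like the sketch below. It assumes
                  that vocabulary items named <code>rubric</code> and <code>heading</code> are
                  available for division types; check the division-type vocabulary for the names
                  actually defined.</para>
               <programlisting language="xml">&lt;vocabulary-key&gt;
   &lt;!-- local ids for two vocabulary items (names assumed) --&gt;
   &lt;div-type xml:id="rubr" which="rubric"/&gt;
   &lt;div-type xml:id="hd" which="heading"/&gt;
   &lt;!-- one id that stands for both --&gt;
   &lt;alias xml:id="rubrichead" idrefs="rubr hd"/&gt;
&lt;/vocabulary-key&gt;

&lt;!-- later, in the body --&gt;
&lt;div n="1" type="rubrichead"&gt;Here begins the first book.&lt;/div&gt;</programlisting>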
               <para>Unlike other pointing attributes, the <code><link linkend="attribute-idrefs"
                        >@idrefs</link></code> of an <code><link linkend="element-alias"
                        >&lt;alias&gt;</link></code> cannot point to the <code><link
                        linkend="element-name">&lt;name&gt;</link></code> value of vocabulary items.
                  They can refer only to the id values of locally defined instances of <code><link
                        linkend="attribute-xmlid">@xml:id</link></code>. This restriction reduces
                  confusion, and avoids some complexity in the resolution and validation of a TAN
                  file.</para>
               <para><code><link linkend="element-alias">&lt;alias&gt;</link></code>es may recurse,
                  as long as there is no circularity. That is, <code><link
                        linkend="attribute-idrefs">@idrefs</link></code> in an <code><link
                        linkend="element-alias">&lt;alias&gt;</link></code> may refer to any
                        <code><link linkend="attribute-xmlid">@xml:id</link></code> or <code><link
                        linkend="attribute-id">@id</link></code>, not only to a vocabulary item but
                  to another <code><link linkend="element-alias">&lt;alias&gt;</link></code>. </para>
               <para>In most cases <code><link linkend="element-alias">&lt;alias&gt;</link></code>
                  should refer to items of the same type. In a few situations mixed groups do not
                  pose a problem, for example mixing <code><link linkend="element-person"
                        >&lt;person&gt;</link></code>s, <code><link linkend="element-algorithm"
                        >&lt;algorithm&gt;</link></code>s, and <code><link
                        linkend="element-organization">&lt;organization&gt;</link></code>s. TAN
                  validation will indicate whether mixed typology introduces errors.</para>
               <para>Because <code><link linkend="attribute-xmlid">@xml:id</link></code> may not
                  contain certain types of characters, such as common punctuation marks, and because
                        <code><link linkend="element-alias">&lt;alias&gt;</link></code> must be able
                  to coin unusual ids (especially for grammatical features), <code><link
                        linkend="attribute-id">@id</link></code> may be used instead of <code><link
                        linkend="attribute-xmlid">@xml:id</link></code> in <code><link
                        linkend="element-alias">&lt;alias&gt;</link></code>.</para>
            </section>
            <section xml:id="responsibility">
               <title>Responsibility</title>
               <para>The sixth section of a <code><link linkend="element-head"
                     >&lt;head></link></code> declares who is responsible for the file. It consists
                  of a <code><link linkend="element-file-resp">&lt;file-resp&gt;</link></code> and
                  one or more <code><link linkend="element-resp">&lt;resp></link></code>s. The
                  persons, organizations, or algorithms pointed to in <code><link
                        linkend="element-file-resp">&lt;file-resp&gt;</link></code> must include at
                  least one who has a tag URN whose namespace matches the namespace in the tag URN
                  of the root element's <code><link linkend="attribute-id">@id</link></code>. </para>
               <para>This requirement strengthens the effort to make sure that each TAN file is
                  associated with the person or persons who are or were responsible for the file.
                        <code><link linkend="element-person">&lt;person></link></code>s so
                  identified by <code><link linkend="element-file-resp"
                     >&lt;file-resp&gt;</link></code> are called primary agents, and are bound to
                  the global variable <code><link linkend="variable-primary-agents"
                        >$primary-agents</link></code>. If a claim is made in a TAN file, and no
                        <code><link linkend="attribute-claimant">@claimant</link></code> is
                  explicitly declared, it is assumed that the <code><link
                        linkend="variable-primary-agents">$primary-agents</link></code> are making
                  the claim.</para>
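               <para>The matching requirement can be seen in a small, hypothetical fragment. The
                  root element's <code>@id</code> and the person's <code>&lt;IRI></code> are both
                  tag URNs in the invented namespace <code>example.org,2021</code>; the role
                  <code>editor</code> is assumed to be available in the vocabulary, and other
                  attributes and metadata are omitted.</para>
               <programlisting language="xml">&lt;TAN-T id="tag:example.org,2021:tan-t:sample"&gt;
   &lt;head&gt;
      &lt;!-- other metadata omitted --&gt;
      &lt;vocabulary-key&gt;
         &lt;person xml:id="jdoe"&gt;
            &lt;!-- same tag namespace (example.org,2021) as the root @id --&gt;
            &lt;IRI&gt;tag:example.org,2021:agent:jdoe&lt;/IRI&gt;
            &lt;name&gt;Jane Doe&lt;/name&gt;
         &lt;/person&gt;
      &lt;/vocabulary-key&gt;
      &lt;!-- jdoe becomes a primary agent of the file --&gt;
      &lt;file-resp who="jdoe"/&gt;
      &lt;resp who="jdoe" roles="editor"/&gt;
   &lt;/head&gt;
&lt;/TAN-T&gt;</programlisting>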
            </section>
            <section>
               <title>Change log</title>
               <para>The change log, the seventh section of the <code><link linkend="element-head"
                        >&lt;head></link></code>, consists of one or more <code><link
                        linkend="element-change">&lt;change></link></code>s, which provide a partial
                  history of the file. The entire history is calculated from every attribute that
                  has a date or dateTime value, and it can be fetched via the function <code><link
                        linkend="function-get-doc-history">tan:get-doc-history</link>()</code> or
                  the global variable <code><link linkend="variable-doc-history"
                     >$doc-history</link></code>.</para>
               <para>The change log is an effective way to communicate with those who might use your
                  files. In all likelihood, a user will download from the master location a local
                  copy. You might make changes or updates to your master copy. Anyone depending upon
                  a copy will be warned, during Schematron validation, of each <code><link
                        linkend="element-change">&lt;change></link></code> that postdates the value
                  of their <code><link linkend="attribute-accessed-when"
                     >@accessed-when</link></code>. If you have introduced an important or
                  disruptive change, you can mark your <code><link linkend="element-change"
                        >&lt;change></link></code> with <code><link linkend="attribute-flag"
                        >@flag</link></code>, which takes one of the following values: <code>warning</code>
                  (the default), <code>error</code>, <code>info</code>, or <code>fatal</code>. By
                  marking a change as <code>info</code>, you lower the level of a change's
                  importance; <code>error</code> raises the level. The value <code>fatal</code> will
                  halt the validation process in the dependent file altogether.</para>
               <para>If you receive change messages during validation, and you want to stop them,
                  merely update the value of <code><link linkend="attribute-accessed-when"
                        >@accessed-when</link></code> to the current date.</para>
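               <para>As a hypothetical illustration, suppose the master copy of a file carries the
                  two changes below and a dependent file last consulted it on 10 April 2021. All
                  ids, IRIs, file names, and dates here are invented.</para>
               <programlisting language="xml">&lt;!-- master file: change log --&gt;
&lt;change when="2021-03-01" who="jdoe"&gt;Started file.&lt;/change&gt;
&lt;change when="2021-06-15" who="jdoe" flag="error"&gt;Renumbered chapters 3 through 5;
   references made against earlier copies are likely to be wrong.&lt;/change&gt;

&lt;!-- dependent file: pointer to the master file --&gt;
&lt;source xml:id="sample"&gt;
   &lt;IRI&gt;tag:example.org,2021:tan-t:sample&lt;/IRI&gt;
   &lt;name&gt;Sample transcription&lt;/name&gt;
   &lt;location href="sample.TAN-T.xml" accessed-when="2021-04-10"/&gt;
&lt;/source&gt;</programlisting>
               <para>Under these assumptions only the second change postdates the value of
                  <code>@accessed-when</code>, so only it would be reported, at the level of an
                  error. Once the editor of the dependent file has dealt with the renumbering,
                  changing <code>@accessed-when</code> to a date on or after 15 June 2021 would
                  silence the message.</para>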
            </section>
            <section>
               <title>Pending work</title>
               <para>The last section of a <code><link linkend="element-head"
                     >&lt;head></link></code> lists all pending tasks that have yet to be applied to
                  a file. These are itemized as a list of <code><link linkend="element-comment"
                        >&lt;comment></link></code>s in <code><link linkend="element-to-do"
                        >&lt;to-do&gt;</link></code>. A file with an empty <code><link
                        linkend="element-to-do">&lt;to-do&gt;</link></code> is assumed to be no
                  longer in progress, so there must be a <code><link
                        linkend="element-master-location">&lt;master-location></link></code>
                  provided.</para>
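               <para>A file that is still in progress might carry a list like the following sketch,
                  in which the id, dates, and tasks are hypothetical.</para>
               <programlisting language="xml">&lt;to-do&gt;
   &lt;comment when="2021-06-15" who="jdoe"&gt;Proofread chapters 4 through 6 against the
      print edition.&lt;/comment&gt;
   &lt;comment when="2021-06-20" who="jdoe"&gt;Check the numbering of the final
      section.&lt;/comment&gt;
&lt;/to-do&gt;</programlisting>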
               <para>Like the change log, the <code><link linkend="element-to-do"
                        >&lt;to-do&gt;</link></code> effectively communicates cautionary notes to
                  those who might use your files. Anyone depending upon a copy will be warned,
                  during Schematron validation, of each item in the list. The report is not
                  dependent upon when the file was last consulted (<code><link
                        linkend="attribute-accessed-when">@accessed-when</link></code>), because
                  this is a collection of standing, unresolved issues. </para>
               <para>One benefit of <code><link linkend="element-to-do">&lt;to-do&gt;</link></code>
                  is that you can release your material before it is finished. Other users will have
                  fair warning about what is imperfect or incomplete.</para>
            </section>
         </section>
      </chapter>
      <chapter xml:id="class_1">
         <title>Class-1 TAN files, representations of textual objects
            (<emphasis>scripta</emphasis>)</title>
         <para>This chapter provides general background to class-1 TAN files and their elements and
            attributes. For detailed discussion of individual elements or attributes, see <xref
               linkend="elements-attributes-and-patterns"/>.</para>
         <para>Class-1 TAN files preserve segmented transcriptions of books, manuscripts, papyri,
            stones, or any other objects with writing on them—collectively termed here
               <emphasis>scripta</emphasis> (sg. <emphasis>scriptum</emphasis>). Class-1 files are
            the foundation of any TAN project. No TAN-A-tok or TAN-A-lm file can be created without
            at least one class-1 file, and most TAN-A files depend upon many of them. </para>
         <para>There are two types of class-1 formats, identified by the root element. <code><link
                  linkend="element-TAN-T">&lt;TAN-T></link></code> is a simple, generic format, with
            plain text inside a simple tree structure. <code>&lt;TEI></code> (also referred to in
            this manual as TAN-T(EI)), on the other hand, can be complex and highly expressive.
            Because the two formats function almost identically, the generic TAN-T format is
            described first, followed by supplemental comments on TAN-TEI.</para>
         <section xml:id="transcription_principles">
            <title>Principles and assumptions</title>
            <section>
               <title>General</title>
               <para>(For more general principles and assumptions applying to all TAN files, not
                  just class 1, see <xref linkend="design_principles"/>.)</para>
               <para>Class-1 formats are designed for faithful but judiciously normalized digital
                  transcriptions. Each TAN-T(EI) file is devoted exclusively to a single version of
                  a single work found in a single scriptum (text-bearing object), segmented and
                  uniquely labeled with a single, preferably familiar, reference system. </para>
               <para>Editors of TAN-T(EI) files should be able to read, write, and proofread texts
                  in the languages of the transcriptions. They should understand the texts well
                  enough to segment them and label them according to the conventions used for those
                  works. They should be able to distinguish the text of a primary source from its
                  editorial apparatus. They should be familiar with normalizing conventions for
                  texts from the period, language, and culture. They should know how the
                  transcription might be used in other scholarly fields, e.g., translation studies,
                  corpus linguistics.</para>
               <para>Editors need not understand everything about their texts, and they need not
                  have any specialized skill in grammar or lexicography. They need not know the
                  morphology of individual words, or how individual parts of the text have been
                  translated. Those skills are more profitably applied to other TAN formats. </para>
               <para>TAN-T(EI) editors stand at the foundation level of the Text Alignment Network.
                  Because other files will depend upon TAN-T(EI) files, careful proofreading is
                  important. Eliminating as many typographical errors as possible before publication
                  will maximize the utility of a TAN-T(EI) file. On the other hand, TAN has been
                  designed with the assumption that most files in circulation have typographical
                  errors that can and should be corrected as they are found. If you are aware that a
                  text needs proofreading, but you still want to make it available, simply leave a
                        <code><link linkend="element-comment">&lt;comment></link></code> in the
                        <code><link linkend="element-to-do">&lt;to-do&gt;</link></code> part of the
                     <link linkend="element-head"><code>&lt;head></code></link>.</para>
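               <para>As a minimal sketch of such a note, a proofreading reminder might look
                  something like the following. The <code>@when</code> and <code>@who</code>
                  attributes, their values, and the wording of the comment are illustrative
                  assumptions, not requirements stated in this section.</para>
               <programlisting>&lt;head>
   . . . . .
   &lt;to-do>
      &lt;!-- hypothetical note; "editor1" is assumed to be defined elsewhere in the head -->
      &lt;comment when="2021-06-01" who="editor1">This transcription has not yet been
         proofread against the source.&lt;/comment>
   &lt;/to-do>
&lt;/head></programlisting>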
               <para>If you are creating a TAN-T(EI) file, you are doing so primarily to facilitate
                  alignment and annotation, which requires use of a suitable reference system (see
                     <link linkend="reference_system">reference systems</link>). Transcription files
                  should be segmented and labeled according to a reference system that is familiar
                  and can be easily applied to other versions of the same text in other languages.
                  If possible, semantic mileposts (clauses, sentences, paragraphs, chapters) should
                  be prioritized over visual ones (lines, columns, pages, volumes). Any transcription
                  can be furnished with multiple reference systems, but it is advisable to do so in
                  separate files, linked by <code><link linkend="element-redivision"
                        >&lt;redivision&gt;</link></code>s in the <code><link linkend="element-head"
                        >&lt;head></link></code>. See <xref linkend="reference_system"/>.</para>
            </section>
            <section xml:id="domain_model">
               <title>Domain model</title>
               <para>Contributors and users of TAN files must sharply distinguish between a scriptum
                  (text-bearing object) and a conceptual work, e.g., between a specific printed copy
                  of the <emphasis>Iliad</emphasis> and the <emphasis>Iliad</emphasis> conceived
                  generally. The former has materiality (digital files are treated here as being
                  material) and the latter does not. Even though both are constitutively necessary
                  for any transcription, the two are always differentiated in the TAN-T(EI) format:
                        <code><link linkend="element-source">&lt;source&gt;</link></code> and
                        <code><link linkend="attribute-src">@src</link></code> point to physical
                  exemplars; <code><link linkend="element-work">&lt;work&gt;</link></code>,
                        <code><link linkend="attribute-work">@work</link></code>, and <code><link
                        linkend="element-version">&lt;version></link></code> to the conceptual.
                  Adherence to this distinction is quite important.</para>
               <para>Some readers may be reminded at this point of the domain model defined by the
                  Functional Requirements for Bibliographic Records (FRBR), which identifies in
                  its Group 1 (Products of intellectual &amp; artistic endeavor) four types of
                  entities: <emphasis>work</emphasis>, <emphasis>expression</emphasis>,
                     <emphasis>manifestation</emphasis>, and <emphasis>item</emphasis>. A work is "a
                  distinct intellectual or artistic creation" and an expression is the conceptual,
                  immaterial realization of a work. Both <emphasis>work</emphasis> and
                     <emphasis>expression</emphasis> are terms for conceptual, non-material
                  entities. A manifestation, on the other hand, is "the physical embodiment of an
                  expression" and an item is a single exemplar of a manifestation. <footnote>
                     <para>Quotations in this section come from International Federation of Library
                        Associations and Institutions, <emphasis>Functional Requirements for
                           Bibliographic Records: Final Report</emphasis>, amended and corrected
                        (February 2009), <link xlink:href="http://www.ifla.org/VII/s13/frbr/"
                        />.</para>
                  </footnote></para>
               <table frame="all">
                  <title>Examples of FRBR Group 1 Entities</title>
                  <tgroup cols="4">
                     <colspec colname="c2" colnum="1" colwidth="1*"/>
                     <colspec colname="c3" colnum="2" colwidth="1*"/>
                     <colspec colname="c4" colnum="3" colwidth="1*"/>
                     <colspec colname="c5" colnum="4" colwidth="1*"/>
                     <thead>
                        <row>
                           <entry>Work</entry>
                           <entry>Expression</entry>
                           <entry>Manifestation</entry>
                           <entry>Item</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry><emphasis>Iliad</emphasis></entry>
                           <entry>Caroline Alexander's English translation of the
                                 <emphasis>Iliad</emphasis>.</entry>
                           <entry>the print run identified with ISBN 978-0062046284</entry>
                           <entry>A specific copy</entry>
                        </row>
                        <row>
                           <entry>The Psalms</entry>
                           <entry>The (Hebrew) Masoretic Psalter</entry>
                           <entry>The 1820 printing of George Offor's edition of the Hebrew
                              Psalms</entry>
                           <entry>Biblioteca Palatina Cod. Parm. 1699</entry>
                        </row>
                        <row>
                           <entry><emphasis>A River Runs Through It</emphasis></entry>
                           <entry>
                              <para>Norman Maclean's original version</para>
                              <para>The 1992 film version</para>
                           </entry>
                           <entry>
                              <para>Print run ISBN 0226500608</para>
                              <para>Blu-ray disc UPC code 004339632533</para>
                           </entry>
                           <entry>
                              <para>Author's personal print copy</para>
                              <para>Reference print CGB 7432-7438 (deposited in the Library of
                                 Congress)</para>
                           </entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table>
               <para>TAN's domain model differs slightly. The most important difference is
                  abandonment of FRBR's <emphasis>expression</emphasis>, which was found to be
                  problematic when developing sample TAN data. The term
                     <emphasis>expression</emphasis> was intended to describe a conceptual,
                  non-material entity, but the FRBR guidelines defined and explained it in vague or
                  material terms. <footnote>
                     <para>The problems are illustrated by wording in the specifications:
                           "<emphasis>Expression</emphasis> encompasses, for example, the <emphasis
                           role="bold">specific</emphasis> words, sentences, paragraphs, etc. that
                        result from the realization of a work <emphasis role="bold">in the form of a
                           text</emphasis>....defined, however, so as to exclude aspects of physical
                        form, such as typeface and page layout, <emphasis role="bold">that are not
                           integral</emphasis> to the intellectual or artistic <emphasis role="bold"
                           >realization</emphasis> of the work as such." (ibid., p. 19, emphasis
                        added) That is, <emphasis>expression</emphasis> includes integral aspects of
                        physical form (e.g., typeface that <emphasis>is</emphasis> integral to the
                        realization). "Inasmuch as <emphasis role="bold">the form of expression is
                           an inherent characteristic of the expression</emphasis>, any change in
                        form (e.g., from alpha-numeric notation to spoken word) results in a new
                        expression." (p. 20, emphasis added)</para>
                  </footnote>Even the very term <emphasis>expression</emphasis> and FRBR's preferred
                  synonym, <emphasis>realization</emphasis>, imply materiality (nothing can be
                  expressed or realized without a material medium). Further, FRBR's
                     <emphasis>expression</emphasis> does not easily handle creative adaptations of
                  works that are themselves arguably works in their own right. For example,
                  Euripides' <emphasis>Medea</emphasis> was adapted several centuries later by
                  Seneca the Younger. Seneca's <emphasis>Medea</emphasis> is arguably merely an
                  expression, yet it has itself been subject to various editions and performances,
                  i.e., expressions. But FRBR does not accommodate expressions of expressions. If
                  Seneca's <emphasis>Medea</emphasis> is treated as a work in its own right, its
                  expression relationship to Euripides' original is lost, since FRBR does not
                  accommodate works that are expressions of other works.</para>
               <para>In the TAN domain model, <emphasis>expression</emphasis> is altogether dropped.
                  There is only one type of conceptual, non-material entity, namely, a work.</para>
               <para>The term <emphasis>version</emphasis> in TAN is applied to a work that
                  substantially follows some other work, e.g., translations and adaptations. But
                  such versions are themselves still works. One work is indicated to be the version
                  of another in a class-1 file through the <code><link linkend="element-work"
                        >&lt;work></link></code> and <code><link linkend="element-version"
                        >&lt;version></link></code> declarations.</para>
               <para>As for material entities, FRBR's <emphasis>manifestation</emphasis> and
                     <emphasis>item</emphasis> are combined in TAN through the term
                     <emphasis>scriptum</emphasis>. A scriptum is a text-bearing object, e.g., book,
                  manuscript, pamphlet, tombstone, traffic sign, digital file (digital media is
                  interpreted as being material). When <emphasis>scriptum</emphasis> is used in a
                  TAN file, it points either to a single physical item or to a set of physical items
                  that for all intents and purposes are indistinguishable (i.e., a scriptum
                  reproduced mechanically). A scriptum that points to a manuscript points only to
                  that one particular manuscript. But a scriptum that points to a printed book or a
                  digital file is understood as applying to all copies of that printed book or
                  digital file. </para>
               <para>There is at present no formal mechanism to specify whether a scriptum points to
                  one object or a set of objects. The distinction must be inferred from a scriptum's
                  IRI + name pattern. In cases of potential ambiguity, it is up to creators of a TAN
                  file to assign to the scriptum IRIs that avoid confusion. For example, to point to
                  Edward Gibbon's personally annotated copy of the 1763 edition of Herodotus (now
                  held by the Wren Library, Trinity College, Cambridge University), one should not
                  use <link xlink:href="https://lccn.loc.gov/92189906"/> or <link
                     xlink:href="http://www.worldcat.org/oclc/27188122"/>, which point to the set of
                  all copies. In this case, one may need to mint their own IRI, based on the Wren
                  Library's acquisition number, RW.50.15.</para>
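               <para>A minimal sketch of such a declaration follows. The tag URN is a hypothetical
                  one minted for illustration (the domain <code>example.com</code> and the date are
                  placeholders), and the wording of the name is likewise an assumption.</para>
               <programlisting>&lt;source>
   &lt;!-- hypothetical IRI for the single annotated copy, built on the Wren Library number -->
   &lt;IRI>tag:example.com,2021:scriptum:wren:RW.50.15&lt;/IRI>
   &lt;name>Edward Gibbon's annotated copy of the 1763 edition of Herodotus (Wren Library
      RW.50.15)&lt;/name>
&lt;/source></programlisting>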
               <para>In summary, the TAN domain model defines two kinds of entities: works and
                  scripta. Works, which are immaterial, conceptual entities, may contain other
                  works, or they may be versions of other works or work-versions. Scripta, which are
                  material entities, may contain other scripta, and they may refer either to a
                  single object or to a set of copies. A work may be instantiated in many scripta,
                  and similarly, a scriptum may contain many works. Most work-scriptum relationships
                  can be inferred from the <code><link linkend="element-head"
                     >&lt;head></link></code> of a class-1 file, and they may be expressed in a
                        <code><link linkend="element-TAN-A">&lt;TAN-A></link></code> file.</para>
               <table frame="all">
                  <title>Examples of TAN Entities</title>
                  <tgroup cols="2">
                     <colspec colname="c3" colnum="1" colwidth="1*"/>
                     <colspec colname="c4" colnum="2" colwidth="1*"/>
                     <thead>
                        <row>
                           <entry>Work</entry>
                           <entry>Scriptum</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry>
                              <para><emphasis>Iliad</emphasis></para>
                              <para>Caroline Alexander's English translation of the
                                    <emphasis>Iliad</emphasis>.</para>
                           </entry>
                           <entry>
                              <para>the print run identified with ISBN 978-0062046284</para>
                              <para>a specific copy</para>
                           </entry>
                        </row>
                        <row>
                           <entry>
                              <para>The Psalms</para>
                              <para>The (Hebrew) Masoretic Psalter</para>
                           </entry>
                           <entry>
                              <para>The 1820 printing of George Offor's edition of the Hebrew
                                 Psalms</para>
                              <para>Biblioteca Palatina Cod. Parm. 1699</para>
                           </entry>
                        </row>
                        <row>
                           <entry>
                              <para>Norman Maclean's <emphasis>A River Runs Through
                                 It</emphasis></para>
                              <para>The 1992 film <emphasis>A River Runs Through
                                 It</emphasis></para>
                           </entry>
                           <entry>
                              <para>Print run ISBN 0226500608</para>
                              <para>Author's personal print copy</para>
                              <para>Blu-ray disc UPC code 004339632533</para>
                              <para>Reference print CGB 7432-7438 (deposited in the Library of
                                 Congress)</para>
                           </entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table>
            </section>
            <section>
               <title>One version, one work, one scriptum, one reference system</title>
               <para><emphasis>Every TAN-T(EI) file must be restricted to a transcription of a
                     single version of a single work found on a single scriptum, segmented and
                     labeled according to a single reference system</emphasis>. </para>
               <para>The principle above is critical to the success of the network. It reduces
                  the risk of confusion and simplifies the files. It follows the generally advisable
                  principle that complex data should be disaggregated into several simple data
                  structures. Different types of complexity can be built later, as
                  needed.</para>
               <section xml:id="textual_objects">
                  <title>One scriptum</title>
                  <para>Each TAN-T(EI) file must transcribe one and only one text-bearing object or
                     scriptum. It may be a digital file, a book, a manuscript, a stone, a sign, or a
                     bottlecap. If the object you've chosen has been made mechanically and is
                     virtually indistinguishable from other objects created by the same process
                     (e.g., copies of a printed book or copies of a digital file), then the entire
                     set of copies (what some cataloguers call a <emphasis>manifestation</emphasis>)
                     is to be regarded as the scriptum. </para>
                  <para>Identifying and naming a scriptum might require an editor's discernment and
                     judgment. For example, some manuscripts have been split up, their parts now
                     residing in multiple libraries around the world; other manuscripts are
                     composites, made of several manuscripts. In such cases, you may need to define
                     your scriptum in a way that might not match the way others define it. But the
                     decision is your prerogative, not theirs. You have both the right and
                     responsibility to define your object in the way that you think will most
                     benefit users of your files.</para>
                  <para>The scriptum is declared via <code><link linkend="element-source"
                           >&lt;source></link></code>, which either takes the IRI + name pattern, or
                     points to a <code><link linkend="element-scriptum"
                        >&lt;scriptum&gt;</link></code> vocabulary item. It is a good idea to name
                     your scriptum with an <code><link linkend="element-IRI">&lt;IRI></link></code>
                     value in the form of an <code>http</code> URL that points to a detailed entry
                     in a library catalogue. Doing so allows users to retrieve extensive, structured
                     bibliographical information. You also save yourself the hassle of having to
                     write a detailed, structured bibliographical description. If a URL cannot be
                     found for <code><link linkend="element-IRI">&lt;IRI></link></code>, you may
                     simply coin a tag URN or a UUID. Alternatively, if you find another TAN file
                     that uses the same scriptum-source, you can add its <code><link
                           linkend="element-name">&lt;name&gt;</link></code>s and <code><link
                           linkend="element-IRI">&lt;IRI></link></code>s with the existing IRI +
                     name pattern. Multiple <code><link linkend="element-name"
                        >&lt;name&gt;</link></code>s and <code><link linkend="element-IRI"
                           >&lt;IRI></link></code>s for a vocabulary item are encouraged.</para>
                  <para>If you need to specify exactly where on a scriptum a work-version appears
                     (e.g., page range), <code><link linkend="element-comment"
                        >&lt;comment></link></code> or <code><link linkend="element-desc"
                           >&lt;desc&gt;</link></code> should be used.</para>
               </section>
               <section xml:id="conceptual_works">
                  <title>One Work</title>
                  <para>The transcription must be restricted to a single creative work, identified
                     by <code><link linkend="element-work">&lt;work></link></code> (part of the
                     declarations section of <code><link linkend="element-head"
                        >&lt;head></link></code>). </para>
                  <para>Many scripta have more than one work. Identifying the creative work you
                     transcribe is, once again, your prerogative. Suppose the scriptum you have is a
                     Bible. You define the work. Perhaps you wish to encode the entire Bible and
                     treat it as a single work. Or maybe you wish to treat only the New Testament as
                     the work, or the Tetraevangelion, or the Gospel of Matthew, or a specific
                     episode in that gospel, or merely the Beatitudes. Use whichever work you like,
                     but make sure that the TAN-T(EI) file contains nothing but the work you have
                     declared. It should be a complete representation of what is found on the
                     object, even if only partially preserved, and respect as far as is practical
                     the order of the text in the scriptum. </para>
                  <para>Normally the order in which the text appears in the scriptum will match the
                     logical order provided by the reference system (see below). But when there are
                     discrepancies, the order of the scriptum should take priority.</para>
                  <para>The requirement to provide the entirety of the work-version as found on the
                     scriptum is a significant departure from the fourth principle of <xref
                        xlink:href="#assumptions_creating_data"/>, which allows for incomplete
                     assertions or data. <emphasis>The transcription in a class-1 file should
                        include the entirety of the work-version chosen, within the particular
                        scriptum</emphasis>. If you are aware that the transcription is incomplete,
                     leave a <code><link linkend="element-comment">&lt;comment></link></code> to
                     that effect in the <code><link linkend="element-head">&lt;head></link></code>'s
                           <code><link linkend="element-to-do">&lt;to-do&gt;</link></code>,
                     identifying which portions are missing from the transcription.</para>
                  <para>Well-known works may have a suitable IRI already assigned to them, say by
                     means of a <link xlink:href="http://wiki.dbpedia.org/About">DBPedia</link>
                     entry. Most works have not been assigned IRIs or are named in IRI vocabularies
                     that are not well known. You may assign any work your own URN, through a UUID
                     or a tag URN. </para>
               </section>
               <section xml:id="work-versions">
                  <title>One version</title>
                  <para>The transcription must be restricted to a single version of the work,
                     optionally identified by <code><link linkend="element-version"
                        >&lt;version></link></code> (part of the declarations section of <code><link
                           linkend="element-head">&lt;head></link></code>). In most cases,
                           <code><link linkend="element-version">&lt;version></link></code> is
                     unnecessary, because <code><link linkend="element-work">&lt;work></link></code>
                     in conjunction with <code><link linkend="element-source"
                        >&lt;source></link></code> will normally identify a particular work-version.
                     But if the source carries multiple versions (e.g., a bilingual edition of a
                     text), then <code><link linkend="element-version">&lt;version></link></code>
                     should be included, to specify which version has been transcribed. <code><link
                           linkend="element-version">&lt;version></link></code> can also be used to
                     declare explicitly that the work mentioned in <code><link
                           linkend="element-version">&lt;version></link></code> is a version of the
                     work mentioned in <code><link linkend="element-work"
                     >&lt;work></link></code>.</para>
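                  <para>The declarations for a bilingual edition might be sketched as follows. All
                     IRIs and names here are hypothetical placeholders, each element is assumed to
                     take the IRI + name pattern, and the order of the elements is not meant to be
                     authoritative.</para>
                  <programlisting>&lt;head>
   . . . . .
   &lt;!-- the conceptual work -->
   &lt;work>
      &lt;IRI>tag:example.com,2001:work:a&lt;/IRI>
      &lt;name>Sample work&lt;/name>
   &lt;/work>
   &lt;!-- the version transcribed, needed because the source carries more than one -->
   &lt;version>
      &lt;IRI>tag:example.com,2001:work:a:ver:1987:deu&lt;/IRI>
      &lt;name>German translation of 1987&lt;/name>
   &lt;/version>
   &lt;!-- the scriptum: a bilingual edition with both the original and the translation -->
   &lt;source>
      &lt;IRI>tag:example.com,2001:scriptum:bilingual-edition&lt;/IRI>
      &lt;name>Bilingual edition of the sample work (1987)&lt;/name>
   &lt;/source>
   . . . . .
&lt;/head></programlisting>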
                  <para>If you have a scriptum with multiple versions of a work, and you wish to
                     transcribe them all, each version should be given its own separate TAN-T(EI)
                     file. </para>
                  <para>There may be cases where individual textual divisions are repeated, not so
                     much because they represent a different version, but because they are variants
                     that are integral to the work-version chosen. For example, an edition of a poem
                     may occasionally have a line that is repeated by the editor as a possible local
                     variation. Creating a separate file for such individual cases would be both
                     impractical and misleading. The standard TAN vocabulary for div types includes
                     the item <code>variant</code>, to accommodate such occasional variants. For
                     example:</para>
                  <programlisting>. . . . .
&lt;div type="title" n="title">
   &lt;div type="variant" n="orig">The Place&lt;/div>
   &lt;div type="variant" n="subscript" xml:lang="grc">Ὁ Τόπος&lt;/div>
&lt;/div>
. . . . .</programlisting>
                  <para>Notes should be included only if they are an integral part of the primary
                     work (i.e., by the same author, not by a later editor). If you think the notes
                     to a work are important, and legitimately a work in their own right, consider
                     putting them in their own TAN-T(EI) file, or converting them to claims in a
                     TAN-A file.</para>
                  <para>Very few work-versions have IRIs. It is advisable to assign a tag URN or a
                     UUID. If the IRI you have used for <code><link linkend="element-work"
                           >&lt;work></link></code> is in a namespace that you own or control, then
                     you are entitled to modify it, and you may wish merely to add a suffix to the
                     work IRI. For example, you might have
                        <code>tag:example.com,2001:work:a</code> defined for the work; a 1987
                     German translation might be specified as
                        <code>tag:example.com,2001:work:a:ver:1987:deu</code>.</para>
               </section>
               <section xml:id="reference_system">
                  <title>One reference system</title>
                  <para>Every TAN transcription must be segmented into a hierarchy of labeled
                     divisions, defined in the <code><link linkend="element-body"
                        >&lt;body></link></code> through <code><link linkend="element-div"
                           >&lt;div></link></code>s and their <code><link linkend="attribute-n"
                           >@n</link></code> values. </para>
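                  <para>For instance, a work divided into books and chapters might be structured
                     along the following lines. The <code>type</code> values, the <code>n</code>
                     values, and the text are placeholders chosen for illustration, and other
                     attributes a real file may need are omitted.</para>
                  <programlisting>&lt;body>
   &lt;div type="book" n="1">
      &lt;div type="chapter" n="1">Text of book 1, chapter 1.&lt;/div>
      &lt;div type="chapter" n="2">Text of book 1, chapter 2.&lt;/div>
   &lt;/div>
   &lt;div type="book" n="2">
      &lt;div type="chapter" n="1">Text of book 2, chapter 1.&lt;/div>
   &lt;/div>
&lt;/body></programlisting>
                  <para>Under this sketch, the second chapter of the first book would be
                     identified by the <code>n</code> values of its ancestors and itself, i.e., 1
                     and 2.</para>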
                  <para>Those divisions, whenever possible, should align with the reference system
                     that prevails for the work across different versions or translations, in what
                     is sometimes called a canonical reference system. Because even the most
                     familiar reference system admits of degrees and dispute, the term
                        <emphasis>canonical</emphasis> is problematic. It is avoided in these
                     guidelines. We refer simply to a work's <emphasis role="italic">reference
                        system</emphasis>. </para>
                  <para>If you have your choice, preference should be given to reference systems
                     that follow the semantic contours of the work, not the physical features of a
                     particular scriptum. Chapter, paragraph, and sentence numbers are preferable to
                     volume, page, and line numbers, because other versions of the work (e.g.,
                     translations, paraphrases) will only roughly, if at all, follow a reference
                     system based on features found in a particular scriptum. </para>
                  <para>Sometimes a scriptum-based reference system is inescapable, or is the most
                     common reference system for a work (e.g., Porphyry's commentary on Aristotle's
                        <emphasis>Categories</emphasis>). It is perfectly acceptable to adopt that
                     system, but it may entail more labor during the alignment process. Translations
                     using this system will rarely correspond to the points of division.</para>
                  <para>If a given work has more than one common reference system (e.g., the works
                     of Plato and Aristotle, which have two reference systems—logical and
                     scriptum-oriented—both of which are standard and important), then one good
                     practice is to create two class-1 files with identical transcriptions, each one
                     structured by its own reference system. Place in each file a <code><link
                           linkend="element-redivision">&lt;redivision&gt;</link></code> pointing to
                     the other. When you validate either file in the verbose phase, you will be
                     notified if there are textual discrepancies between the transcriptions. If you
                     are using Oxygen or another XML editor that supports Schematron Quick Fixes,
                     you will be provided a way to update one text to match the other with just a
                     few keystrokes. </para>
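                  <para>One half of such a pair might be sketched as follows, placed in the head of
                     the logically divided file. The IRI, name, and file path are hypothetical, and
                     the <code>&lt;location></code> child with its attributes is an assumption
                     about how the other file is found; the documentation for the element itself
                     should be consulted for its exact content model.</para>
                  <programlisting>&lt;redivision>
   &lt;!-- hypothetical identifier of the file that uses the scriptum-oriented system -->
   &lt;IRI>tag:example.com,2001:work:a:scriptum-oriented&lt;/IRI>
   &lt;name>The same transcription, divided by page and line&lt;/name>
   &lt;location href="work-a.scriptum-oriented.xml" accessed-when="2021-06-01"/>
&lt;/redivision></programlisting>
                  <para>The scriptum-oriented file would carry a corresponding declaration that
                     points back to the logically divided one.</para>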
                  <para>Having two or more alternatively divided editions can be quite useful. They
                     could serve as the basis for reference cross-indexes, or to help convert other
                     versions of the work from one reference system to the other.</para>
                  <para>Alternatively, you can use the TAN-TEI approach. Choose one reference system
                     as the primary way to label your <code><link linkend="element-div"
                           >&lt;div></link></code>s, and convert the other references to anchors
                     such as <code>&lt;pb></code>, <code>&lt;cb></code>, <code>&lt;lb></code>,
                        <code>&lt;milestone></code>. Under this method, the logical references
                     (those based on logical units such as paragraphs, chapters, sections) are best
                     given to the <code><link linkend="element-div">&lt;div></link></code>s, and the
                     material ones to the anchors. Bear in mind, however, that typological semantics
                     are diminished with anchors, and there is no convenient way to retain
                     hierarchical structures, or disambiguate one anchor-based reference scheme from
                     another.</para>
                  <para>If there is a good reference system, but the divisions are overly lengthy,
                     you may introduce subdivisions. But there is no guarantee that the provisional
                     subdivisions you introduce will be adopted by other editors who create or edit
                     TAN versions of the same work. Editors working independently upon the same text
                     and subdividing it will likely produce their own schemes. Class-2 formats
                     provide a mechanism via <code><link linkend="element-adjustments"
                           >&lt;adjustments></link></code> to reconcile some basic differences. But
                     a discordant scheme might be best handled simply by creating a copy, and
                     restructuring it according to the preferred system, making sure related files
                     refer to each other through <code><link linkend="element-redivision"
                           >&lt;redivision&gt;</link></code>.</para>
                  <para>If a work does not have a reference system, or if you think that the ones
                     that exist are inadequate or misguided, create one of your own. If you develop
                     your own reference system, be sure to design it so that it can be easily
                     applied to any version of the work, including translations. Prefer logical
                     divisions of text over scriptum-based ones.</para>
                  <para xml:id="numeration-systems">TAN supports five major methods of reference numeration:<orderedlist>
                        <listitem>
                           <para><emphasis role="bold">Arabic numerals</emphasis>. 1, 2, 3,
                              etc.</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Roman numerals</emphasis>. Values up to 5000,
                              utilizing i, v, x, l, c, d, and m, uppercase or lowercase, with
                              liberal syntactic rules (within a roman numeral, any digit preceding
                              one of a higher value will be deducted from the total value; all
                              others are added).</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Alphabetic sequences</emphasis>. The
                              26-letter Latin alphabet, with numbers higher than 26 (or any multiple
                              of 26) beginning with the letter a incrementally repeated, e.g., y
                              (25), z (26), aa (27), bb (28), … aaa (53). Uppercase or lowercase allowed.<footnote>
                                 <para>This is not the hexavigesimal (base 26) system, where a is 0,
                                    b is 1, z is 25, aa is 00, ab is 01, etc.</para>
                              </footnote></para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Arabic numerals + alphabetic
                                 sequences</emphasis>. Arabic numerals followed immediately by an
                              alphabetic sequence. The second item is to be calculated as a
                              subsequence of the first item, with the lack of a second item taking
                              highest priority. E.g., 4, 4a, 4b, 4c....</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Alphabetic sequences + Arabic
                                 numerals</emphasis>: As above, but with alphabetic sequence
                              preceding Arabic numerals.</para>
                        </listitem>
                     </orderedlist></para>
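                  <para>As a rough illustration, the fragment below sketches how some sample values
                     of <code><link linkend="attribute-n">@n</link></code> would be read under the
                     rules as stated in the list above. The <code>chapter</code> div type and the
                     sample values are hypothetical, chosen so that none is ambiguous between
                     letters and Roman numerals.</para>
                  <programlisting>&lt;div type="chapter" n="14">...&lt;/div>     &lt;!-- Arabic numeral: 14 -->
&lt;div type="chapter" n="xiv">...&lt;/div>    &lt;!-- Roman numeral: 10 - 1 + 5 = 14 -->
&lt;div type="chapter" n="XIIII">...&lt;/div>  &lt;!-- Roman numeral, liberal syntax: 10 + 1 + 1 + 1 + 1 = 14 -->
&lt;div type="chapter" n="y">...&lt;/div>      &lt;!-- alphabetic sequence: 25 -->
&lt;div type="chapter" n="bb">...&lt;/div>     &lt;!-- alphabetic sequence: 28 -->
&lt;div type="chapter" n="4a">...&lt;/div>     &lt;!-- Arabic numeral + alphabetic sequence: sorts after 4, before 4b -->
&lt;div type="chapter" n="a4">...&lt;/div>     &lt;!-- alphabetic sequence + Arabic numeral --></programlisting>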
                  <para>See <code><link linkend="function-letter-to-number"
                           >tan:letter-to-number()</link></code> and references there to TAN
                     functions for converting numbering systems.</para>
                  <para>The TAN validation process attempts to convert all values of <code><link
                           linkend="attribute-n">@n</link></code> to Arabic numerals. Some values
                     are ambiguously Roman numerals or alphabetic sequences. For example,
                        <code>c</code> could mean 3 (alphabetic sequence) or 100 (Roman numeral).
                     Such numerals are assumed to be Roman, unless you supply in the <code><link
                           linkend="element-head">&lt;head></link></code> a <code><link
                           linkend="element-numerals">&lt;numerals&gt;</link></code> and assign
                           <code><link linkend="attribute-priority">@priority</link></code> to
                     specify <code>letters</code> (or <code>roman</code>, the implicit, default
                     value).</para>
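                  <para>A minimal sketch of such a declaration follows. The <code>section</code> div
                     type is hypothetical, and the position of the declaration relative to the other
                     children of the head is not shown.</para>
                  <programlisting>&lt;!-- in the head: resolve ambiguous values of @n as letters -->
&lt;numerals priority="letters"/>

&lt;!-- in the body: now read as 3 (alphabetic sequence), not 100 (Roman numeral) -->
&lt;div type="section" n="c">...&lt;/div></programlisting>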
                  <section xml:id="extra_n_vocabulary">
                     <title>Extra <code><link linkend="attribute-n">@n</link></code>
                        vocabulary</title>
                     <para>Sometimes <code><link linkend="attribute-n">@n</link></code> is given
                        values consisting not of numerals but of names, commonly to identify works
                        within works, e.g., poems within a cycle, books in the Bible, or Surahs of
                        the Qur'an. For non-numerical values of <code><link linkend="attribute-n"
                              >@n</link></code>, different conventions are a common problem. The
                        abbreviation chosen by one person or project is rarely the same as that
                        adopted by the next. To avoid this long-standing issue, you may want to use
                        extra TAN vocabulary for <code><link linkend="attribute-n">@n</link></code>.
                        If you include in the head of your TAN file <code>&lt;vocabulary
                           which="bible eng"/></code>, then any non-numeric values of <code><link
                              linkend="attribute-n">@n</link></code> will be checked against the
                        corresponding TAN-voc file (in this case, the TAN-voc file at
                           <code>/vocabularies/extra/n.bible.eng.tan-voc.xml</code>, one of several
                        available in that directory). This, in turn, will allow other files to
                        refer to that <code><link linkend="element-div">&lt;div></link></code> by
                        any other <code><link linkend="element-name">&lt;name&gt;</link></code> that
                        is a synonym. For example, in a class-1 file pointing to the TAN English
                        Bible vocabulary above, a <code>&lt;div type="book"
                           n="matt">...&lt;/div></code> would be regarded as containing the work the
                        Gospel of Matthew. Any class-2 file that refers to that class-1 file as a
                        source may use any synonym listed in the extra vocabulary file
                           <code>n.bible.eng.tan-voc.xml</code>, i.e., <code>Mt</code>,
                           <code>Mat</code>, <code>Matt</code>, or <code>Matthew</code> (or their
                        lowercase equivalents). An extra benefit of this method is that such
                              <code><link linkend="element-div">&lt;div></link></code>s are also
                        marked as works in their own right, identified by the <code><link
                              linkend="element-IRI">&lt;IRI></link></code>s of the target TAN
                        vocabulary items.</para>
                     <para>If you use extra TAN vocabulary, it is recommended you include in the
                        declarations section of your <code><link linkend="element-head"
                              >&lt;head></link></code> an <code><link linkend="element-n-alias"
                              >&lt;n-alias&gt;</link></code>. This element, along with its
                              <code><link linkend="attribute-div-type">@div-type</link></code>,
                        specifies exactly which types of <code><link linkend="element-div"
                              >&lt;div></link></code>s are eligible for this kind of aliasing on
                              <code><link linkend="attribute-n">@n</link></code>. Technically, it is
                        not necessary, but including it can considerably speed the validation
                        process on long files.</para>
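                     <para>Put together, the declarations discussed above might look something like
                        the following sketch. It assumes that a div type <code>book</code> has been
                        declared elsewhere in the head; the surrounding elements are omitted.</para>
                     <programlisting>&lt;!-- in the head -->
&lt;vocabulary which="bible eng"/>
&lt;n-alias div-type="book"/>

&lt;!-- in the body: "matt" is checked against the extra vocabulary -->
&lt;div type="book" n="matt">...&lt;/div></programlisting>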
                     <para>The goal behind the extra vocabularies is to eliminate the need to worry
                        about what abbreviations are used to name well-known, unnumbered <code><link
                              linkend="element-div">&lt;div></link></code>s. It is hoped that in
                        future releases of TAN these extra vocabularies will grow in number and
                        quality. You can write your own TAN-voc file to build your own set of n
                        aliases. The standard extra TAN <code><link linkend="attribute-n"
                           >@n</link></code> vocabularies should provide a good model:<itemizedlist>
                           <listitem>
                              <para><xref linkend="vocabularies-n-bible-eng"/></para>
                           </listitem>
                           <listitem>
                              <para><xref linkend="vocabularies-n-bible-spa"/></para>
                           </listitem>
                           <listitem>
                              <para><xref linkend="vocabularies-n-quran-eng-ara"/></para>
                           </listitem>
                           <listitem>
                              <para><xref linkend="vocabularies-n-unlabeled-divs-1-eng"/></para>
                           </listitem>
                        </itemizedlist></para>
                  </section>
               </section>
            </section>
            <section xml:id="normalizing_transcriptions">
               <title>Normalizing transcriptions</title>
               <para>You should declare how you have normalized the transcription via <code><link
                        linkend="element-adjustments">&lt;adjustments></link></code> and its
                  children, e.g., <code><link linkend="element-normalization"
                        >&lt;normalization></link></code> or <code><link linkend="element-replace"
                        >&lt;replace&gt;</link></code>. (For suggestions on values for <code><link
                        linkend="element-normalization">&lt;normalization></link></code> see <xref
                     linkend="vocabularies-normalizations"/>.)</para>
               <para>Generally speaking, normalization entails the suppression of things extraneous
                  to or separable from the work-version you have chosen. You are encouraged to omit
                  parenthetical editorial insertions (especially quotation references inserted by a
                  modern editor), stray handwritten remarks, discretionary word-breaking hyphens,
                  editorial comments, inserted cross-references, and reference numerals (page
                  numbers, section numbers, etc.). If chapter 4 of a text begins "4." or "IV" then
                  leave out that labeling numeral—you've already indicated it in <code><link
                        linkend="attribute-n">@n</link></code>, so there's no need to clutter the
                  transcription with it. Remember, scholars who use your file will be concerned with
                  things like word-for-word alignments and lexico-morphological analysis, and
                  putting in a modern editor's "4" might contaminate research results. For the same
                  reason, you should resolve ligatures and correct unintended typographical errors. </para>
               <para>The goal is a transcription whose text is free of the interpretive voice of
                  later editors. You should remove from the text anything that is not part of the
                  work proper and would interfere with detailed word-for-word alignment, or would
                  require extra preprocessing or postprocessing work for other users. If you are
                  breaking a transcription into individual lines, and you are required to break a
                  word, do so with either the soft hyphen (<code>&amp;#xad;</code>), the zero-width
                  space (<code>&amp;#x200b;</code>), or the zero-width joiner
                     (<code>&amp;#x200d;</code>). TAN processors will automatically normalize the
                  space of every leaf <code><link linkend="element-div">&lt;div></link></code>. If
                  either of those special characters is found at the end then it will be deleted
                  and the text from the next leaf <code><link linkend="element-div"
                     >&lt;div></link></code> (if there is one) will immediately follow without
                  intervening space; if those two characters do not occur at the end, then a space,
                     <code>&amp;#x20;</code>, will be added at the end. Regardless of how a leaf div
                  ends, the rest of its space will be normalized. For more details, see <xref
                     linkend="whitespace"/>.</para>
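               <para>The following sketch illustrates the end-of-div behavior described above, using
                  a hypothetical pair of lines (the <code>line</code> div type and the text are
                  invented for illustration). The first leaf div ends in a soft hyphen; the second
                  does not.</para>
               <programlisting>&lt;div type="line" n="1">The scribe copied the manu&amp;#xad;&lt;/div>
&lt;div type="line" n="2">script by hand.&lt;/div>

&lt;!-- text as processed: "The scribe copied the manuscript by hand. " -->
&lt;!-- (soft hyphen deleted, no space inserted; a space is added after line 2) --></programlisting>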
               <para>In a digital source, variable lengths of special spacing marks (e.g., General
                  Punctuation U+2000..U+200B) and other Unicode points not allowed by TAN (see <xref
                     xlink:href="#deprecated-unicode-points"/>) should be converted to ordinary
                  characters, and superscript combining Roman letters (U+0363..U+036F) should
                  probably be converted to their non-combining counterparts. All Unicode must be
                  normalized to NFC forms (see <xref linkend="normalization"/>). </para>
               <para>In some ambiguous areas, you can use TAN-TEI both to normalize and to preserve
                  what is in the scriptum. Suppose, for example, a manuscript has reference numerals
                  that are <emphasis>sui generis</emphasis>. That is, these reference numbers do not
                  correspond to the "canonical" reference scheme, and are scribal adjustments to the
                  text's structure (sometimes mistaken). On the one hand, such reference numerals
                  are metadata, and should arguably be deleted; on the other, they are part of the
                  text, and witness to how a text was read and changed over time. A middle-ground
                  approach would move these references to TAN-TEI's <code>&lt;milestone
                     rend="[TEXT]"></code>, substituting <code>[TEXT]</code> for the reference text.
                  In that way, the numerals are properly removed from the main text, but the
                  information is retained. Generally speaking, TEI's <code>@rend</code> is an
                  excellent way to remove something from a transcription while keeping it in the
                  file.</para>
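               <para>For instance, suppose a manuscript labels as "13" a section that the canonical
                  scheme numbers 12. A sketch of the middle-ground approach might look like this.
                  The numbers, the div type, and the value of <code>@unit</code> (an attribute that
                  TEI P5 requires on <code>&lt;milestone></code>) are hypothetical.</para>
               <programlisting>&lt;div type="section" n="12">
   &lt;p>&lt;milestone unit="section" rend="13"/>Text of the section, without the scribe's numeral.&lt;/p>
&lt;/div></programlisting>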
               <para>Overall, normalization is a difficult, understudied topic. Scholars are not in
                  the habit of documenting everything they normalize. Many have so internalized
                  their normalization principles that they are unaware of them. Not all decisions
                  will be clear-cut. You may justly hesitate before normalizing orthography,
                  punctuation, accentuation, or capitalization. Some aspects of Unicode that permit
                  different conventions may need special consideration. You may need to deliberate
                  on whether an unusual or rarely used Unicode character might be misinterpreted or
                  hinder searches. Document any decisions in the <code><link
                        linkend="element-adjustments">&lt;adjustments></link></code>. </para>
               <para>Whether you use <code><link linkend="element-normalization"
                        >&lt;normalization></link></code> or <code><link linkend="element-replace"
                        >&lt;replace&gt;</link></code> is up to you. The former can be used to apply
                  a class of changes to a vocabulary item. The latter provides a precise,
                  regular-expression-based method of describing exactly what has been changed, and
                  the order in which those changes took place. A <code><link
                        linkend="element-replace">&lt;replace&gt;</link></code> might help users to
                  better understand the path that led from the input to the output, but the process
                  cannot be reverse-engineered to produce the original. If it is important to
                  document exactly what the pre-normalized version of a text was like, use
                        <code><link linkend="element-predecessor">&lt;predecessor&gt;</link></code>
                  or a similar element available in the key links section of the <code><link
                        linkend="element-head">&lt;head></link></code> (see <xref
                     xlink:href="#other_related_files"/>) to point to the original.</para>
               <para>If you find it very difficult to bring yourself to normalize to the depth
                  advised above, try first making a (non-TAN) TEI file, and create the transcription
                  you have in mind as the ideal. Once that is finished, create a second, TAN
                  version, and be more aggressive in your normalization, with <code><link
                        linkend="element-see-also">&lt;see-also></link></code> pointing to the first
                  approach. </para>
               <section>
                  <title xml:id="normalizing-annotations">Normalizing annotations</title>
                  <para>The footnotes or endnotes in a scriptum should be normalized. Many, most, or
                     all should likely be deleted. Before deciding, distinguish those that
                     are an intrinsic part of the work you're transcribing from those that aren't.
                     Those that aren't can be removed, or they can be put into a separate TAN-T(EI)
                     file, perhaps linking the two through <code><link linkend="element-see-also"
                           >&lt;see-also></link></code>, and hopefully structuring both files with
                     the same reference system, to facilitate alignment. Another way to approach the
                     task is to convert some or all of the notes you're removing into <code><link
                           linkend="element-TAN-A">&lt;TAN-A></link></code>
                     <code><link linkend="element-claim">&lt;claim></link></code>s.</para>
                  <para>Footnotes, endnotes, glosses, or marginalia that are intrinsic parts of the
                     work present special challenges for encoding in general, and normalization in
                     particular. </para>
                  <para>First is the issue of connecting an annotation to the text annotated. When
                     we encounter a superscript number—a note signal—while reading the text of a
                     printed book, we infer that we are being invited to find a companion footnote
                     that comments on the text we have just read. But specifically what text? Is it
                     only the preceding word? Is it a word or phrase that occurs earlier in the
                     sentence? Does the annotation cover earlier sentences, the entire paragraph, or
                     even prior paragraphs? For some notes, identifying the text being annotated
                     requires interpretation.</para>
                  <para>In a digital file, the connection between an annotation and its text cannot
                     be so vague; it requires a decision and a commitment. Here are three possible
                     ways to approach annotations in a TAN file:</para>
                  <para><orderedlist>
                        <listitem>
                           <para>Use the <code>&lt;note></code> feature of TAN-TEI (see related
                                 <link
                                 xlink:href="http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-note.html"
                                 >TEI documentation</link>). This will allow you to connect the
                              annotation to merely an anchor in the text, i.e., to no text
                              whatsoever. </para>
                           <programlisting>&lt;div n="1" type="p">
   &lt;p>The process occurred in New York, among other places.&lt;ref rend="1"/>
      &lt;note>&lt;p>&lt;ref rend="1"/>On New York, see: X.&lt;/p>&lt;/note>
   &lt;/p>
&lt;/div></programlisting>
                        </listitem>
                        <listitem>
                           <para>Move each annotation into a <code><link linkend="element-div"
                                    >&lt;div></link></code> with a <code><link
                                    linkend="attribute-type">@type</link></code> that implies that
                              it is an annotation (e.g., <code>scholium</code>) and place it
                              immediately after the <code><link linkend="element-div"
                                    >&lt;div></link></code> it annotates.</para>
                           <programlisting>&lt;div n="1" type="p">The process occurred in New York, among other places.&lt;/div>
&lt;div n="n1" type="footnote">On New York, see: X.&lt;/div></programlisting>
                           <para>Note in the example above that <code>n1</code> is used to make sure
                              that <code>1</code> unambiguously points to only one <code><link
                                    linkend="element-div">&lt;div></link></code>.</para>
                        </listitem>
                        <listitem>
                           <para>As #2, but also write a <code><link linkend="element-TAN-A"
                                    >&lt;TAN-A></link></code> file that more precisely connects each
                              annotation to the text it
                              annotates.<programlisting>&lt;claim verb="annotates">
   &lt;subject src="text" ref="n1"/>
   &lt;object src="text">
      &lt;from-tok ref="1" val="The"/>
      &lt;through-tok ref="1" val="York"/>
   &lt;/object>
&lt;/claim></programlisting></para>
                        </listitem>
                     </orderedlist>The first option is expeditious, and will allow you to be as
                     precise or imprecise as you like. Validation is not affected, but you should be
                     aware that the <code>&lt;note></code> will be treated as a constituent part of
                     its parent <code><link linkend="element-div">&lt;div></link></code>. The second
                     option is also relatively easy, but it entails a decrease in precision. The
                     third option provides immense precision, permits multiple annotations on the
                     same text range, and allows notes to target overlapping ranges of text. But the
                     task could be time-consuming, if only because you will need to determine the
                     range of text targeted by each annotation, and the targeted text might be quite
                     messy or vague. You will need to take stock of how precise and comprehensive
                     you choose to make your connections. (See also <link
                        linkend="accuracy-precision-comprehensiveness">accuracy, precision, and
                        comprehensiveness</link>.)</para>
                  <para>Remember that the note signals in the main text and in the footnote area are
                     metadata meant to help readers link corresponding passages of texts and, in the
                     spirit of normalizing, should be deleted. In a TAN-TEI file you can replace a
                     note signal with <code>&lt;ref></code> (see above). </para>
               </section>
            </section>
         </section>
         <section>
            <title>Class 1 metadata</title>
            <para>The <code><link linkend="element-head">&lt;head></link></code> of a class-1 file
               is much like that of other formats, with some extra options. </para>
            <para>In the key declarations area (see <xref xlink:href="#key_declarations"/>), class-1
               files may allow <code><link linkend="element-n-alias">&lt;n-alias&gt;</link></code>.
               See <xref xlink:href="#reference_system"/> for context on how to use this
               element.</para>
            <para>In the section devoted to links to other digital resources (see <xref
                  xlink:href="#inclusions-and-vocabularies"/>), class-1 files allow several extra
               types of files.</para>
            <para>One <code><link linkend="element-model">&lt;model&gt;</link></code> is allowed, to
               point to another class-1 file that provides a model for the reference system that has
               been adopted. The model should be the same work. It may be in a different language,
               or come from a different source/scriptum. During verbose validation, any differences
               between a class-1 file and its model will be presented as warnings, since small
               differences are nearly always inevitable.</para>
            <para>Zero or more <code><link linkend="element-redivision"
                  >&lt;redivision&gt;</link></code>s are allowed. Each one points to an alternative
               transcription that restructures the same text according to a different
               reference system. A class-1 file and any redivisions must have identical text in the
                     <code><link linkend="element-body">&lt;body></link></code>. <code><link
                     linkend="element-redivision">&lt;redivision&gt;</link></code> is an important
               answer to the knotty, longstanding problem that besets texts that admit multiple
               reference systems. In a traditional TEI file, one must adopt a primary reference
               system, and add other reference systems through milestone-like anchors. TEI anchors
               do not have the semantic underpinnings needed to switch the primary reference system
               from one scheme to another. The TAN alternative is to encode the
               same transcription in multiple files, one per reference system, linked through
                     <code><link linkend="element-redivision">&lt;redivision&gt;</link></code>. This
               may appear to contradict another principle, that one should not repeat oneself.
               But that is the easier problem to repair. During verbose validation, a file's text
               will be checked against every <code><link linkend="element-redivision"
                     >&lt;redivision&gt;</link></code>, and specific areas that differ will be
               flagged. Should users wish, a Schematron Quick Fix will allow a user to synchronize a
               local file against a redivided version.</para>
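            <para>To illustrate, here is a sketch of the bodies of two hypothetical class-1 files
               that divide the same text in different ways. The text and div types are invented, and
               the contents of each file's <code>&lt;redivision></code> are omitted. After space
               normalization, the concatenated leaf divs of both files yield the same text, which is
               the condition that verbose validation checks.</para>
            <programlisting>&lt;!-- file A: logical reference system -->
&lt;body xml:lang="eng">
   &lt;div type="chapter" n="1">
      &lt;div type="section" n="1">In the beginning was the text,&lt;/div>
      &lt;div type="section" n="2">and the text was divided.&lt;/div>
   &lt;/div>
&lt;/body>

&lt;!-- file B: scriptum-based reference system -->
&lt;body xml:lang="eng">
   &lt;div type="page" n="1">In the beginning was&lt;/div>
   &lt;div type="page" n="2">the text, and the text was divided.&lt;/div>
&lt;/body></programlisting>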
            <para>Zero or more <code><link linkend="element-annotation"
                  >&lt;annotation&gt;</link></code>s point to class-2 files that use the file as a
                     <code><link linkend="element-source">&lt;source&gt;</link></code>. This type of
               linked resource is helpful for keeping track of key alignments and
               annotations.</para>
            <para>Zero or more <code><link linkend="element-companion-version"
                     >&lt;companion-version&gt;</link></code>s point to different versions of the
               same work in the same scriptum. This feature is useful for correlating multiple
               versions of a work that appear in a single scriptum, e.g., the original text and a
               facing translation in a bilingual edition.</para>
            <para>The adjustment section of the <code><link linkend="element-head"
                  >&lt;head></link></code> (see <xref xlink:href="#adjustments"/>) allows zero or
               more <code><link linkend="element-normalization">&lt;normalization></link></code>s
               and <code><link linkend="element-replace">&lt;replace&gt;</link></code>s. See <xref
                  xlink:href="#normalizing_transcriptions"/>.</para>
         </section>
         <section xml:id="tan-t_data">
            <title>Class 1 data</title>
            <para>The sole purpose of the <code><link linkend="element-body">&lt;body></link></code>
               of a class-1 file is to contain an ordered, segmented transcription of a single
               version of a single work from a scriptum. <code><link linkend="element-body"
                     >&lt;body></link></code> must take <code><link linkend="attribute-xmllang"
                     >@xml:lang</link></code>, specifying the predominant language of the text. If a
               change in language occurs in a descendant <code><link linkend="element-div"
                     >&lt;div></link></code>, ensure that its <code><link
                     linkend="attribute-xmllang">@xml:lang</link></code> also changes.</para>
            <para><code><link linkend="element-body">&lt;body></link></code> takes one or more
                     <code><link linkend="element-div">&lt;div></link></code>s, each of which govern
               either other <code><link linkend="element-div">&lt;div></link></code>s, or text (or
               TEI elements), but never both. TAN files adopt a non-mixed content model (see <xref
                  linkend="non-mixed_content"/>).</para>
            <para>The term <emphasis>leaf div</emphasis> refers to those <code><link
                     linkend="element-div">&lt;div></link></code>s that contain only text, and not
               other <code><link linkend="element-div">&lt;div></link></code>s.</para>
            <para>Within this treelike structure of <code><link linkend="element-div"
                     >&lt;div></link></code>s, the concatenation of <code><link
                     linkend="attribute-n">@n</link></code> values, starting from the most rootward
                     <code><link linkend="element-div">&lt;div></link></code>, provides the
               reference system used by class-2 files to refer to parts of TAN-T(EI) files. A given
                     <code><link linkend="element-div">&lt;div></link></code> may have more than one
               reference, if its <code><link linkend="attribute-n">@n</link></code> or any
                     <code><link linkend="attribute-n">@n</link></code> it inherits has multiple
               values. Every permutation is calculated, and they are treated as synonymous ways to
               refer to that <code><link linkend="element-div">&lt;div></link></code>.</para>
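            <para>As a rough illustration of how references are built from <code>@n</code> values,
               consider the following fragment. The <code>@type</code> values and the text are
               hypothetical, and the references mentioned afterward are written with a space between
                  <code>@n</code> values purely for readability.</para>
            <programlisting>&lt;body xml:lang="eng">
   &lt;div type="book" n="1 intro">
      &lt;div type="chapter" n="1">Text of the first leaf div.&lt;/div>
      &lt;div type="chapter" n="2">Text of the second leaf div.&lt;/div>
   &lt;/div>
&lt;/body></programlisting>
            <para>Because the outer <code>&lt;div></code> has two values in its <code>@n</code>,
               each leaf div inherits both. The first leaf div could therefore be cited either as
                  <code>1 1</code> or as <code>intro 1</code>, and the second either as <code>1
                  2</code> or as <code>intro 2</code>.</para>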
            <para>The rule of combinatorial inheritance also applies when <code><link
                     linkend="attribute-n">@n</link></code> has as its value a range of numbers. For
               example, if <code><link linkend="attribute-n">@n</link></code> has the value "1-3"
               then it will match for 1, 2, and 3. Such ranges are important for translations, where
               there might not be precise one-to-one correlation with the divisions in the original.
               Applications that handle texts with one-to-many alignment mappings can use different
               strategies to reconcile the differences. See <code><link
                     linkend="function-merge-expanded-docs">tan:merge-expanded-docs</link>()</code>
               for discussion.</para>
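            <para>A range in <code>@n</code> might look like the following sketch of a translation,
               in which one sentence covers three verses of the original. The <code>@type</code>
               values and the wording are invented for the example.</para>
            <programlisting>&lt;div type="chapter" n="1">
   &lt;div type="verse" n="1-3">One sentence that renders verses 1, 2, and 3 of the original.&lt;/div>
   &lt;div type="verse" n="4">A sentence that renders verse 4.&lt;/div>
&lt;/div></programlisting>
            <para>Here the first leaf div would match references to verse 1, verse 2, or verse 3 of
               chapter 1, whereas the second would match only verse 4.</para>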
            <para>In previous versions of TAN, there was a requirement that each leaf <code><link
                     linkend="element-div">&lt;div></link></code> should have a unique reference.
               That requirement has been relaxed, because there are cases where non-unique leaf
                     <code><link linkend="element-div">&lt;div></link></code>s are required.<footnote>
                  <para>Some scripta are encoded such that leaf divs are broken up (see Bodéüs's
                     edition of Aristotle's <emphasis>Categories</emphasis>, at 2a35, 2b5, and
                     2b6b). And some translations must be encoded so that leaf divs interleave.
                     Further, one TAN-T's leaf divs might easily become another TAN-T's non-leaf
                     divs, and vice versa. The distinction between leaf and non-leaf div is
                     arbitrary, so both types should be expected to adhere to the same kind of
                     reference system rules.</para>
               </footnote>In a TAN-T(EI) file, for any two <code><link linkend="element-div"
                     >&lt;div></link></code>s that share the same reference, it is not allowed that
               one be a leaf <code><link linkend="element-div">&lt;div></link></code> and the other
               not. To do otherwise would entail a mixed content model. It is further assumed
               that all <code><link linkend="element-div">&lt;div></link></code>s that share the
               same reference are consecutive, constituent parts of the same <code><link
                     linkend="element-div">&lt;div></link></code>. That is, any two <code><link
                     linkend="element-div">&lt;div></link></code>s with the same reference are not
               alternatives to each other, but are rather disjoint parts. For true alternatives, see
               discussion above on using <code>variant</code> in <code><link
                     linkend="attribute-type">@type</link></code>.</para>
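            <para>The following sketch shows what shared references might look like in practice.
               The <code>@type</code> values and the text are hypothetical. Two leaf divs carry the
               same <code>@n</code>, and so share a reference, because the content of line 1 is
               interrupted by line 2.</para>
            <programlisting>&lt;div type="section" n="1">
   &lt;div type="line" n="1">The first part of line 1,&lt;/div>
   &lt;div type="line" n="2">the whole of line 2,&lt;/div>
   &lt;div type="line" n="1">and the remainder of line 1.&lt;/div>
&lt;/div></programlisting>
            <para>Under the assumption described above, the two leaf divs numbered 1 would be read
               as disjoint parts of a single line 1, not as competing versions of it. Both are leaf
               divs, so the non-mixed content model is preserved.</para>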
         </section>
         <section xml:id="tan-tei">
            <title>Transcriptions using the Text Encoding Initiative (<code>&lt;TEI></code>)</title>
            
               
            <para>This section is to be read in conjunction with <xref linkend="class_1"/> and <xref
                  linkend="TEI"/>, which address related technical issues.</para>
               
            
            <section>
               <title>TAN-TEI</title>
            <para>Some creators and editors of transcriptions will find the rather stripped-down
               TAN-T format inadequate. Some may wish to mark up the text further. Some may already
               have a library of transcriptions whose annotations are desirable to keep, even if
               uninteresting to every user. In these cases, you should use TAN-TEI, a customization
               of the Text Encoding Initiative (TEI) format, which is well known for its
               expressiveness, its stability, its flexibility, and its widespread use in textual
               scholarship.</para>
            <para>TEI was designed to be maximally expressive and flexible, to serve the detailed
               needs of scholars in the humanities. In serving this mission, TEI has come to define
               more than five hundred different elements, and more than two hundred attributes
               (roughly six times more than are defined in TAN). Of course, any given TEI file uses
               only a small subset of those elements and attributes, and TEI itself comes in
               different flavors, from TEI Lite, which uses only 75 attributes and 140 elements, to
               TEI All, which opens up almost the entire library. </para>
            <para>Although TEI XML is oftentimes described as a standard, it lacks characteristics one
               normally expects of a standard. It is very flexible, admits flavors and
               interpretation, and is best used when it is customized. Individuals and projects may
               define their own subset of TEI elements, to constrict or expand the allowable rules
               as they see fit. TAN-TEI is one of those customizations, based on TEI All. The major
               difference between TEI All and TAN-TEI is that the latter imposes extra strictures,
               to ensure that transcriptions are maximally likely to be interchangeable with other
               TAN-TEI files.</para>
            <para>All TEI files are validated against a <link
                  xlink:href="https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ST.html#STIN"
                  >TEI-conformant schema</link>, normally expressed as an XML DTD, RELAX NG, or W3C
               Schema. TAN's TEI-conformant schema is based upon the <code>TAN-TEI.odd</code> file
               in the <code>schemas</code> directory, converted to two RELAX NG files,
                  <code>TEI.rnc</code> and <code>TEI.rng</code>, which define the structural rules
               of TAN-TEI files. There is
               an additional layer of validation, through the related Schematron process
                  (<code>TEI.sch</code>), which performs detailed validation not possible in a
               TEI-conformant schema. In the discussion below, it is important to distinguish
               between structural validation and Schematron validation. See <xref
                  linkend="validating_tan_files"/>.</para>
               </section>
            <section>
               <title>TEI customization</title>
            <para>TAN's customization of the TEI can be summarized as follows (the default namespace
               in this section is the TEI namespace,
               <code>http://www.tei-c.org/ns/1.0</code>):</para>
            <para>
               <table frame="all">
                  <title>Synopsis of TAN-TEI customization</title>
                  <tgroup cols="2">
                     <colspec colname="c1" colnum="1" colwidth="1*"/>
                     <colspec colname="c3" colnum="2" colwidth="3.21*"/>
                     <thead>
                        <row>
                           <entry>TEI element</entry>
                           <entry>Strictures</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry><code>&lt;TEI></code></entry>
                           <entry>
                              <itemizedlist>
                                 <listitem>
                                    <para>must have <code><link linkend="attribute-id"
                                          >@id</link></code> with tag URN</para>
                                 </listitem>
                                 <listitem>
                                    <para>must have <code><link linkend="attribute-TAN-version"
                                             >@TAN-version</link></code></para>
                                 </listitem>
                                 <listitem>
                                    <para>takes a new child element, <code><link
                                             linkend="element-head">&lt;head></link></code>, placed
                                       between <code>&lt;teiHeader></code> and
                                          <code>&lt;text></code>; it and its descendants must be in
                                       the TAN namespace,
                                          <code>xmlns:tan="tag:textalign.net,2015:ns"</code>
                                    </para>
                                 </listitem>
                              </itemizedlist>
                           </entry>
                        </row>
                        <row>
                           <entry><code>&lt;text></code></entry>
                           <entry>
                              <itemizedlist>
                                 <listitem>
                                    <para>There are no extra strictures, but during Schematron
                                       validation (not RELAX-NG), this element and any children
                                          <code>&lt;front></code> and <code>&lt;back></code> will be
                                       ignored. Of its children, only <code><link
                                             linkend="element-body">&lt;body></link></code> will be
                                       Schematron validated. </para>
                                 </listitem>
                              </itemizedlist>
                           </entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-body">&lt;body></link></code></entry>
                           <entry>
                              <itemizedlist>
                                 <listitem>
                                    <para>must take <code><link linkend="attribute-xmllang"
                                             >@xml:lang</link></code></para>
                                 </listitem>
                                 <listitem>
                                    <para>any non-<code><link linkend="element-div"
                                          >&lt;div></link></code> children will be ignored during
                                       Schematron validation; most often only <code><link
                                             linkend="element-div">&lt;div></link></code> should be
                                       children</para>
                                 </listitem>
                                 <listitem>
                                    <para>contents must be restricted to a single version of a
                                       single work</para>
                                 </listitem>
                                 <listitem>
                                    <para>any and all text nodes will be treated as part of the
                                       transcription</para>
                                 </listitem>
                              </itemizedlist>
                           </entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-div">&lt;div></link></code></entry>
                           <entry>
                              <itemizedlist>
                                 <listitem>
                                    <para>may encompass a textual division of whatever size you like
                                       (TEI defines <code><link linkend="element-div"
                                             >&lt;div></link></code> as being larger than block-like
                                       or paragraph-like textual divisions; TAN's <code><link
                                             linkend="element-div">&lt;div></link></code> is much
                                       more like HTML's).</para>
                                 </listitem>
                                 <listitem>
                                    <para>must take elements; either they all are <code><link
                                             linkend="element-div">&lt;div></link></code>s (perhaps
                                       interleaved with anchors such as <code>&lt;pb></code>) or
                                       none of them are <code><link linkend="element-div"
                                             >&lt;div></link></code>s (non-mixed model)</para>
                                 </listitem>
                                 <listitem>
                                    <para>must take <code><link linkend="attribute-type"
                                             >@type</link></code> and <code><link
                                             linkend="attribute-n">@n</link></code> (or only <link
                                          linkend="attribute-include"
                                       ><code>@include</code></link>)</para>
                                 </listitem>
                                 <listitem>
                                    <para><code><link linkend="attribute-type">@type</link></code>
                                       may take multiple values, space delimited, pointing via IDref
                                       to a vocabulary item</para>
                                 </listitem>
                                 <listitem>
                                    <para><code><link linkend="attribute-n">@n</link></code> allows
                                          synonyms, sequences, and ranges, and must match the
                                          regular expression defined by <code><link
                                                linkend="variable-attr-n-regex"
                                                >$tan:attr-n-regex</link></code>. If <code><link
                                                linkend="attribute-n">@n</link></code> is to be
                                          given more than one value, those items must be separated
                                          by a space or a comma. A hyphen-minus, - (U+002D, the most
                                          common form of hyphen), always has special meaning in
                                                <code><link linkend="attribute-n">@n</link></code>,
                                          specifying a range. This feature is useful for cases where
                                          a <code><link linkend="element-div">&lt;div></link></code>
                                          straddles more than one standard reference number (e.g., a
                                          translation of Aristotle that cannot be easily tied to
                                          Bekker numbers). If you need to use a hyphen-like
                                          character in an <code><link linkend="attribute-n"
                                                >@n</link></code> that does not specify a range of
                                          numbers, consider ‐ (U+2010 HYPHEN), ‑ (U+2011
                                          NON-BREAKING HYPHEN), ‒ (U+2012 FIGURE DASH), – (U+2013 EN
                                          DASH), or − (U+2212 MINUS SIGN).</para>
                                 </listitem>
                              </itemizedlist>
                           </entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table>
            </para>
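            <para>The strictures in the table can be pictured with a minimal skeleton such as the
               following. It is only a sketch and is not valid as written: the tag URN, the
                  <code>@TAN-version</code> value, the <code>@type</code> values, and the text are
               placeholders, and the contents of both headers have been replaced by comments.</para>
            <programlisting>&lt;TEI xmlns="http://www.tei-c.org/ns/1.0"
   id="tag:example.com,2021:tan-tei-sketch" TAN-version="2021">
   &lt;teiHeader>
      &lt;!-- TEI metadata, ignored by TAN's Schematron validation -->
   &lt;/teiHeader>
   &lt;head xmlns="tag:textalign.net,2015:ns">
      &lt;!-- TAN metadata, in the TAN namespace -->
   &lt;/head>
   &lt;text>
      &lt;body xml:lang="eng">
         &lt;div type="chapter" n="1">
            &lt;div type="section" n="1">
               &lt;p>Text of the first leaf div.&lt;/p>
            &lt;/div>
            &lt;div type="section" n="2 3">
               &lt;p>Text of a leaf div that may be cited as section 2 or section 3.&lt;/p>
            &lt;/div>
         &lt;/div>
      &lt;/body>
   &lt;/text>
&lt;/TEI></programlisting>
            <para>Note that the outer <code>&lt;div></code> contains only <code>&lt;div></code>s,
               and each leaf <code>&lt;div></code> contains only non-<code>&lt;div></code> elements,
               in keeping with the non-mixed model.</para>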
            <para>TAN-TEI files have two heads, each designed for different purposes. Whereas the
               TAN <link linkend="element-head"><code>&lt;head></code></link> is meant to be brief
               and restricted to only those matters relevant to the transcription, the
                  <code>&lt;teiHeader></code> permits quite an expansive range of metadata, and may
               be used to encode a variety of things, including those that are tangential or
               irrelevant to the data. Unlike the TAN <link linkend="element-head"
                     ><code>&lt;head></code></link>, whose data is designed to be both computer- and
               human-readable, <code>&lt;teiHeader></code> was designed for data to be read
               principally by humans; although it can accommodate IRIs, it was not designed around
               them. Further, a TAN <link linkend="element-head"><code>&lt;head></code></link> can
               never be empty and valid; a bare-bones <code>&lt;teiHeader></code> with no actual
               text content, such as the following, is considered
               valid:<programlisting>&lt;teiHeader>
   &lt;fileDesc>
      &lt;titleStmt>&lt;title/>&lt;/titleStmt>
      &lt;publicationStmt>&lt;p/>&lt;/publicationStmt>
      &lt;sourceDesc>&lt;p/>&lt;/sourceDesc>
   &lt;/fileDesc>
&lt;/teiHeader></programlisting></para>
            <para>TAN's Schematron validation process ignores the contents of
                  <code>&lt;teiHeader></code>, since its contents are unpredictable and therefore
               not reliably parsable. If your <code>&lt;teiHeader></code> has any kind of metadata
               that needs to appear in the TAN <link linkend="element-head"
                  ><code>&lt;head></code></link> (see <xref linkend="metadata_head"/> and <xref
                  linkend="transcription_principles"/>), the conversion needs to be performed
               manually, since (as mentioned above) the two headers are incommensurate, and writing
               each one requires a different mentality.</para>
            <para>In a TAN-TEI file, the TAN <code><link linkend="element-head"
                  >&lt;head></link></code> must be in the TAN namespace, i.e., <code>&lt;head
                  xmlns="tag:textalign.net,2015:ns"></code>. Alternatively you might write
                  <code>&lt;tan:head xmlns:tan="tag:textalign.net,2015:ns"></code>, but this would
               require all descendant elements to be prefixed <code>tan:</code>.</para>
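            <para>The two ways of declaring the namespace can be compared side by side. In this
               sketch the child element <code>&lt;name></code> and its content are purely
               illustrative, and the comments stand in for the rest of the metadata.</para>
            <programlisting>&lt;!-- Option 1: a default namespace declared on the head -->
&lt;head xmlns="tag:textalign.net,2015:ns">
   &lt;name>Sample transcription&lt;/name>
   &lt;!-- other metadata, unprefixed -->
&lt;/head>

&lt;!-- Option 2: a prefix, which every descendant must then carry -->
&lt;tan:head xmlns:tan="tag:textalign.net,2015:ns">
   &lt;tan:name>Sample transcription&lt;/tan:name>
   &lt;!-- other metadata, each element prefixed tan: -->
&lt;/tan:head></programlisting>
            <para>The first form is usually less cumbersome, since the declaration is made once and
               applies to every descendant.</para>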
            <para>Within any leaf <code><link linkend="element-div">&lt;div></link></code>, you may
               use whatever TEI markup you wish, to whatever level of depth or complexity. Most
               users of your TAN-TEI file will be interested in the text; only a subset will care
               about any markup within leaf <code><link linkend="element-div"
               >&lt;div></link></code>s. </para>
            <para>TEI files are flexible, permitting different approaches to markup. A TAN-TEI file
               should not be scriptum-oriented, i.e., it should not try to replicate how the text
               appears or looks on the object. That is because the TAN-TEI file will be used in
               intertextual comparisons, where the transcription is compared to transcriptions from
               a wide variety of sources. </para>
               </section><section>
                  <title>Converting TEI to TAN-TEI</title>
            <para>You may have a TEI file that you wish to convert to TAN-TEI. As a matter of
               practicality, it is helpful to envision the conversion process as falling into three
               steps:</para>
            <para>
               <orderedlist>
                  <listitem>
                     <para>Structure: insert new processing instructions (pointing to files to
                        perform TAN-TEI structural and Schematron validation); adjust root element
                        by supplying a tag URN for <code><link linkend="attribute-id"
                           >@id</link></code> and <code><link linkend="attribute-TAN-version"
                              >@TAN-version</link></code>.</para>
                  </listitem>
                  <listitem>
                     <para>Metadata: create new <code><link linkend="element-head">&lt;head
                              xmlns="tag:textalign.net,2015:ns"></link></code> and populate
                        it.</para>
                  </listitem>
                  <listitem>
                     <para>Data: edit <code><link linkend="element-body">&lt;body></link></code> to
                        make sure all text nodes are restricted to the content of a single version
                        of a single work; restructure <code><link linkend="element-body"
                              >&lt;body></link></code> content into nesting <code><link
                              linkend="element-div">&lt;div></link></code>s with correct <code><link
                              linkend="attribute-type">@type</link></code> and <code><link
                              linkend="attribute-n">@n</link></code> values.</para>
                  </listitem>
               </orderedlist>
            </para>
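            <para>Step 1 might result in a prolog and root element along the following lines. This
               is a sketch under several assumptions: the relative paths are hypothetical and depend
               on where your copy of the <code>schemas</code> directory sits, the schema file names
               are those mentioned earlier in this section and should be checked against your
               copy, and the tag URN and <code>@TAN-version</code> value are placeholders.</para>
            <programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;!-- structural validation (RELAX NG, compact syntax) -->
&lt;?xml-model href="../schemas/TEI.rnc" type="application/relax-ng-compact-syntax"?>
&lt;!-- Schematron validation -->
&lt;?xml-model href="../schemas/TEI.sch" type="application/xml"
   schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;TEI xmlns="http://www.tei-c.org/ns/1.0"
   id="tag:example.com,2021:my-transcription" TAN-version="2021">
   &lt;!-- teiHeader, TAN head, and text follow -->
&lt;/TEI></programlisting>
            <para>Once both processing instructions are in place, a validating editor can report
               structural errors and Schematron errors separately, which helps when working through
               steps 2 and 3.</para>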
            <para>It has been the experience of those who have made TEI to TAN-TEI conversions that
               step 2 is the most time-consuming, particularly in finding suitable IRIs. But step 3
               should not be underestimated, either. Many people write TEI files with a focus on the
               original textual object, and they do not normalize to the level expected in a TAN
               file. Some TEI files have been written with little attention paid to space and space
               normalization. Some TEI files are so laden with annotations that the text is
               impossible to read. In general, the simpler the TEI file the better, with
               annotations pushed to external files.</para>
            <para>Some TEI markup is already implicit, or is easily calculable (e.g.,
                  <code>&lt;w></code> to mark words, which should already comport with the
               tokenization declared in the <code><link linkend="element-head"
                  >&lt;head></link></code>; users of <code>&lt;w></code> easily lose track of where
               space is and isn't). Some TEI markup can be expressed in a class-2 file (e.g.,
               lexico-morphological data, which should be expressed in a TAN-A-lm file).</para>
               <para>If you have a TEI odd file that you wish to preserve while also incorporating
                  the TAN .odd file, you may be able to do so manually, by integrating your odd file
                  with TAN's. In the future, an application may be written to assist in this process.
                  When you write your new odd file, you will want to generate a set of .rng or .rnc
                  files and place them in the TAN <code>schemas</code> directory. Be sure to give
                  them a unique name, something other than <code>TEI.*</code> or
                     <code>TAN-TEI.*</code>, so that your generated schema files do not overwrite
                  the standard TAN ones.</para>
         </section></section>
      </chapter>
      <chapter xml:id="class_2">
         <title>Class-2 TAN files, annotations of texts</title>
         <para>This chapter provides general background to class-2 TAN files. For detailed
            discussion of individual elements and attributes see <xref
               linkend="elements-attributes-and-patterns"/>.</para>
         <para>There are three types of class-2 files:<orderedlist>
               <listitem>
                  <para><emphasis role="bold">TAN-A</emphasis> files provide broad, macroscopic
                     alignment of multiple versions of any number of works. They also support a wide
                     variety of annotations on texts.</para>
               </listitem>
               <listitem>
                  <para><emphasis role="bold">TAN-A-tok</emphasis> files provide narrow, microscopic
                     alignment of any two class-1 files, annotating word-for-word or
                     character-for-character correspondences between the two texts.</para>
               </listitem>
               <listitem>
                  <para><emphasis role="bold">TAN-A-lm</emphasis> files express annotations
                     pertaining to lexico-morphology (grammatical part-of-speech), for either a
                     single class-1 file or a language in general.</para>
               </listitem>
            </orderedlist></para>
         <para>In translation studies, it is common to use the term <emphasis>source</emphasis> (or
               <emphasis>sources</emphasis>) to refer to a translated text and the term
               <emphasis>target</emphasis> to refer to the translation. TAN, however, has been
            designed for situations where it may not be clear which text is the target and which is
            the source. Further, there is a more generic use of <emphasis>source</emphasis> and
               <emphasis>target</emphasis> that prevails in many other contexts. In these
            guidelines, therefore, the term <emphasis role="italic">target</emphasis> never refers
            to a text as such (rather, it normally refers to a file that is being pointed to), and
            when we use the word <emphasis role="italic">source</emphasis>, we are referring only to
            one of the class-1 files upon which a class-2 alignment depends.</para>
         <section xml:id="class_2_common">
            <title>Common elements</title>
            <section xml:id="class_2_metadata">
               <title>Class 2 metadata (<code><link linkend="element-head"
                  >&lt;head></link></code>)</title>
               <para>Class-2 files share a few common features in their metadata, mostly to
                  facilitate the human-friendly reference system discussed below.</para>
               <para>All class-2 files have as their sources nothing other than class-1 files.
                  Therefore each <code><link linkend="element-source">&lt;source></link></code> must
                  take the <xref xlink:href="#digital_entity_metadata"/>. </para>
               <para>Editors of class-2 files must be able to name or number word-tokens in a
                  transcription, and to determine an appropriate definition of "token," via an
                  optional <code><link linkend="element-token-definition"
                        >&lt;token-definition></link></code>. See <xref linkend="defining_tokens"
                  />.</para>
               <para>Inevitably, some class-1 sources for the same work will differ from each other.
                  Perhaps works or div types were not defined with the same IRIs, or perhaps one
                  version follows an idiosyncratic reference system. If sources need to be
                  reconciled, alterations may be specified in <code><link
                        linkend="element-adjustments">&lt;adjustments&gt;</link></code>, which
                  stipulates a set of actions that should be applied to the sources that have been
                  named. The following adjustment actions are supported:</para>
               <orderedlist>
                  <listitem>
                     <para><code><link linkend="element-skip">&lt;skip&gt;</link></code>, to allow
                        you to ignore specific <code><link linkend="element-div"
                           >&lt;div&gt;</link></code>s, deeply or shallowly.</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="element-rename">&lt;rename&gt;</link></code>, to
                        allow you to rename specific <code><link linkend="element-div"
                              >&lt;div&gt;</link></code>s.</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="element-equate">&lt;equate&gt;</link></code>, to
                        allow you to provisionally establish some <code><link linkend="attribute-n"
                              >@n</link></code> values as being synonymous.</para>
                  </listitem>
                  <listitem>
                     <para><code><link linkend="element-reassign">&lt;reassign&gt;</link></code>, to
                        allow you to split leaf <code><link linkend="element-div"
                           >&lt;div&gt;</link></code>s and move their parts elsewhere in the
                        structure. </para>
                  </listitem>
               </orderedlist>
               <para>These adjustment actions allow you to reconcile discordant sources without
                  changing them directly. </para>
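               <para>A set of adjustments might be sketched as follows. The element names come from
                  the list above, and <code>@ref</code> and <code>@n</code> are the two bases for
                  renaming discussed in this section. The remaining details (the <code>@src</code>
                  and <code>@new</code> attributes, the source name <code>kjv</code>, and every
                  value) are assumptions made for illustration; consult the documentation of each
                  element for its actual content model.</para>
               <programlisting>&lt;adjustments src="kjv">
   &lt;!-- ignore a div altogether -->
   &lt;skip ref="1 0"/>
   &lt;!-- rename one specific div, identified by its full reference -->
   &lt;rename ref="19 9" new="19 10"/>
   &lt;!-- rename every div that has a particular @n value -->
   &lt;rename n="canticles" new="song"/>
   &lt;!-- treat two @n values as synonymous -->
   &lt;equate n="pr preface"/>
&lt;/adjustments></programlisting>
               <para>The source file itself is left untouched. The adjustments describe only how the
                  class-2 file should read that source.</para>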
               <para>Skips, renames, and equates are first applied to the source as received. If a
                  particular source <code><link linkend="element-div">&lt;div&gt;</link></code> is
                  the target of more than one adjustment action, only the first one will be applied
                  according to action priority: <code><link linkend="element-skip"
                        >&lt;skip&gt;</link></code>, <code><link linkend="element-rename"
                        >&lt;rename&gt;</link></code> based on <code><link linkend="attribute-ref"
                        >@ref</link></code>, <code><link linkend="element-rename"
                        >&lt;rename&gt;</link></code> based on <code><link linkend="attribute-n"
                        >@n</link></code>, then <code><link linkend="element-equate"
                        >&lt;equate&gt;</link></code>. This action priority also corresponds to the
                  amount of time needed to process the adjustments. Numerous <code><link
                        linkend="element-skip">&lt;skip&gt;</link></code> actions are applied very
                  quickly. Numerous <code><link linkend="element-reassign"
                     >&lt;reassign&gt;</link></code>s, however, can be time-consuming, because they
                  require tokenizing the text.</para>
               <para>Because of this priority order, some actions might not be performed. For
                  example, if you deeply skip a <code><link linkend="element-div"
                     >&lt;div></link></code>, no renaming adjustments will be made to its children. </para>
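               <para>The effect of the priority order can be pictured with a hypothetical pair of
                  actions. The attribute names <code>@src</code> and <code>@new</code>, the source
                  name, and the references are assumptions made for the sake of the example, as is
                  the supposition that the skip shown here is a deep one.</para>
               <programlisting>&lt;adjustments src="kjv">
   &lt;!-- deeply skips div 3 and everything inside it -->
   &lt;skip ref="3"/>
   &lt;!-- targets a child of the skipped div, so it would never be applied -->
   &lt;rename ref="3 1" new="3 2"/>
&lt;/adjustments></programlisting>
               <para>If the rename is the action you really want, the skip must be narrowed so that
                  it no longer covers the div being renamed.</para>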
               <para>Skips, renames, and equates are applied in one pass, based on the original
                  reference system, then <code><link linkend="element-reassign"
                        >&lt;reassign&gt;</link></code>s are applied to the newly adjusted
                  source. If you rename a div, then want to reassign it, you must do so based on the
                  new name, not the original. </para>
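               <para>The two passes might be sketched as follows. The child elements of
                     <code>&lt;reassign></code> shown here (<code>&lt;passage></code>,
                     <code>&lt;from-tok></code>, <code>&lt;through-tok></code>, and
                     <code>&lt;to></code>), along with all attributes and values, are assumptions
                  made for illustration and should be checked against the documentation for <code
                     >&lt;reassign></code>.</para>
               <programlisting>&lt;adjustments src="kjv">
   &lt;!-- first pass: based on the original reference system -->
   &lt;rename ref="2 5" new="2 6"/>
   &lt;!-- second pass: based on the adjusted source -->
   &lt;reassign>
      &lt;!-- the div is cited by its new name, 2 6, not by its original name, 2 5 -->
      &lt;passage ref="2 6">
         &lt;from-tok pos="1"/>
         &lt;through-tok pos="4"/>
      &lt;/passage>
      &lt;to ref="2 7"/>
   &lt;/reassign>
&lt;/adjustments></programlisting>
               <para>Had the passage been cited as 2 5, the reassignment would find nothing to act
                  upon, because by the second pass that reference no longer exists in the adjusted
                  source.</para>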
               <para>Each adjustment action adds time to the validation routines. On lengthy texts
                  these can become quite time-consuming. Take, for example, the Tanakh / Old
                  Testament in Hebrew, Greek Septuagint, and English (King James Version). Each of
                  these differs from the other in the names of books, and the numeration of some
                  chapters and verses (primarily the books of Psalms, Jeremiah, Joel, and Hosea). To
                  completely reconcile these three versions requires at least 1 <code><link
                        linkend="element-skip">&lt;skip&gt;</link></code>, 237 <code><link
                        linkend="element-rename">&lt;rename&gt;</link></code>s and 3 <code><link
                        linkend="element-equate">&lt;equate&gt;</link></code>s, and 31 <code><link
                        linkend="element-reassign">&lt;reassign&gt;</link></code>s. Applying these
                  actions to all three versions can take about two minutes (tested on a computer
                  with an Intel i5-8250U and 12 GB of RAM), before any other significant validation
                  checks are made on anything inside the <code><link linkend="element-body">&lt;body></link></code> of
                  the class-2 file.<footnote>
                     <para>In earlier generations of TAN, this process took upwards of an
                        hour.</para>
                  </footnote> If such processing times are unacceptable, you are advised to keep
                        <code><link linkend="element-adjustments">&lt;adjustments&gt;</link></code>s
                  to a minimum or to apply them to relatively small texts. </para>
               <para>Further, adjustment actions were intended primarily to address common
                  irregularities between files, to apply some last-minute touches, or perhaps to
                  drop certain parts of texts. Adjustments were not designed to provide extensive,
                  deep corrections. If a source must be changed in numerous places to reconcile it
                  with other sources, you should create a new version of the source, reorganized as
                  you prefer. Then in both the new and original versions of the class-1 files insert
                        <code><link linkend="element-redivision">&lt;redivision&gt;</link></code>,
                        <code><link linkend="element-predecessor">&lt;predecessor&gt;</link></code>,
                        <code><link linkend="element-successor">&lt;successor&gt;</link></code>, or
                        <code><link linkend="element-see-also">&lt;see-also></link></code> to link
                  the two versions.</para>
               <para>There is a TAN application that remodels one text in the image of another. See
                     <code>applications/remodel/remodel text.xsl</code>. The output of that
                  application requires editing, but it can reduce the amount of work required. TAN
                  tools for Oxygen's author mode can also be used to correct that newly segmented
                  text. </para>
            </section>
            <section>
               <title>Class 2 data (<code><link linkend="element-body"
                  >&lt;body></link></code>)</title>
               <para>Data types differ greatly between the class 2 formats. However, they all share
                  one thing in common: the <code><link linkend="element-body"
                     >&lt;body></link></code> consists of a series of claims, and responsibility for
                  those claims should be attributed to the persons, organizations, or algorithms
                  making the claims. Therefore, each <code><link linkend="element-body"
                        >&lt;body></link></code> may take <code><link linkend="attribute-claimant"
                        >@claimant</link></code> and perhaps <code><link
                        linkend="attribute-claim-when">@claim-when</link></code>, specifying by
                  IDref who should be credited with or blamed for the material. If either attribute
                  is missing, it is assumed that the claims are the responsibility of the persons
                  listed in <code><link linkend="element-file-resp">&lt;file-resp&gt;</link></code>. The values
                  of <code><link linkend="attribute-claimant">@claimant</link></code> and
                        <code><link linkend="attribute-claim-when">@claim-when</link></code> are
                  weakly inheritable.</para>
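               <para>As a minimal sketch, a <code><link linkend="element-body"
                     >&lt;body></link></code> that attributes all of its claims to one claimant
                  might begin as follows. The IDref <code>jk</code> and the date are hypothetical;
                  the sketch assumes that <code>jk</code> is defined as a person, organization, or
                  algorithm in the file's vocabulary.</para>
               <para>
                  <programlisting>&lt;!-- hypothetical values: "jk" must resolve to a vocabulary item -->
&lt;body claimant="jk" claim-when="2021-03-15">
   . . . . . .
&lt;/body></programlisting>
               </para>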
            </section>
            <section xml:id="pointer-syntax">
               <title>Class 2 pointer syntax: referencing texts</title>
               <para>The class 2 formats have been designed to be human readable, particularly text
                  references. In ordinary conversation, when referring to specific parts of a work,
                  we prefer to use the numbers or names of pages, paragraphs, sentences, lines,
                  words, letters, and so forth, and sometimes relational words (e.g., "first"). We
                  might say, for example, "See page 4, second paragraph, the last four words."
                  Sometimes we quote the very text itself: "See page 4, second paragraph, first
                  sentence, second occurrence of 'pull'." </para>
               <para>Those familiar conventions are the basis for the TAN pointer syntax, which
                  differs from other pointer systems (e.g., URLs, XPath, and XPointer). TAN pointers
                  apply common reference terminology to four strata of a text: works, divisions,
                  word tokens, and characters. <emphasis>Works</emphasis>, defined above (see <xref
                     linkend="conceptual_works"/>), are declared by the <emphasis>source</emphasis>
                  (which may not have more than one work). <emphasis>Divisions</emphasis> are
                  defined by the <code><link linkend="element-div">&lt;div></link></code> structure
                  of each source. <emphasis>Tokens</emphasis> are words of the text in those
                  divisions, defined according to one or more <code><link
                        linkend="element-token-definition">&lt;token-definition&gt;</link></code>s
                  declared in the class-2 file. And <emphasis>characters</emphasis> are defined as
                  individual base letters in a word token (any modifier character is treated in
                  concert with the last preceding base character; see <xref
                     linkend="combining_characters"/>).</para>
               <para>This approach not only makes the syntax human readable but mitigates the effect
                  of changes to the sources. For example, if a <code><link linkend="element-div"
                        >&lt;div></link></code> is deleted, moved, or changed, the alteration
                  affects only references specific to that <code><link linkend="element-div"
                        >&lt;div></link></code> and its descendants; the rest of the reference
                  system remains intact. </para>
               <para>The four parts of TAN's reference system are explained below, but you should
                  consult other parts of the guidelines, or study TAN examples, to see how they are
                  used in practice.</para>
               <section>
                  <title>Referencing works: <code><link linkend="attribute-work"
                     >@work</link></code></title>
                  <para>This section applies only to TAN-A files, because the other class-2 files do
                     not make claims about works <emphasis>per se</emphasis>.</para>
                  <para>TAN-A files refer to works via meaningful IDrefs that point to the class-1
                     sources that transcribe the work/work-version, e.g., <code><link
                           linkend="attribute-work">work</link>="hamlet"</code>. The reference is
                     understood to apply not merely to that particular source, but to any TAN-T file
                     that claims to transcribe that work or work-version. (On the relationship
                     between works and work-versions see <xref linkend="domain_model"/>.) Thus, the
                     id of the source-scriptum becomes a proxy or alias for the work. </para>
                  <para>Any work may also be defined through a vocabulary item <code><link
                           linkend="element-work">&lt;work&gt;</link></code>, either locally in the
                           <code><link linkend="element-vocabulary-key"
                        >&lt;vocabulary-key></link></code> or in a TAN-voc file linked via
                           <code><link linkend="element-vocabulary"
                     >&lt;vocabulary&gt;</link></code>. The work would then be referred to by
                           <code><link linkend="attribute-xmlid">@xml:id</link></code> or
                           <code><link linkend="element-name">&lt;name&gt;</link></code> of the
                     particular vocabulary item.</para>
               </section>
               <section xml:id="referencing-divisions">
                  <title>Referencing textual divisions: <code><link linkend="attribute-ref"
                           >@ref</link></code></title>
                  <para>Portions of text, i.e., <code><link linkend="element-div"
                        >&lt;div></link></code>s, perhaps altered if <code><link
                           linkend="element-adjustments">&lt;adjustments&gt;</link></code>s have
                     been invoked (see <xref linkend="metadata_head"/>), are pointed to via
                           <code><link linkend="attribute-ref">@ref</link></code>. A <code><link
                           linkend="attribute-ref">@ref</link></code> is constructed by taking the
                     values of <code><link linkend="attribute-n">@n</link></code> in the <code><link
                           linkend="element-div">&lt;div></link></code> in question along with its
                     ancestor <code><link linkend="element-div">&lt;div></link></code>s, and joining
                     them with non-word characters. For example, <code><link linkend="attribute-ref"
                           >@ref</link>="I.1.1"</code> might point to the following:</para>
                  <para>
                     <programlisting>&lt;div type="act" n="1">
   &lt;div type="scene" n="1">
      <emphasis role="bold">&lt;div type="line" n="1">
         . . . . . .
      &lt;/div></emphasis>
      . . . . . .
   &lt;/div>
   . . . . . .
&lt;/div></programlisting>
                  </para>
                  <para>A <code><link linkend="attribute-ref">@ref</link></code> can express
                     sequences and ranges of <code><link linkend="element-div"
                        >&lt;div></link></code>s. In the example <code><link linkend="attribute-ref"
                           >ref</link>="1.2-4, 1.5"</code>, the hyphen and comma are reserved
                     characters that signify ranges and series, respectively. <emphasis role="bold">A hyphen
                        always means "from...through" and a comma always means "and"</emphasis>. In
                     the TAN format, commas are always paratactic, not hypotactic. For example, if
                     referring to Hamlet, <code><link linkend="attribute-ref"
                        >ref</link>="I,2,3"</code> is not a reference to a single <code><link
                           linkend="element-div">&lt;div></link></code> (act I, scene 2, line 3), but
                     rather to three of them: act I, act 2, and act 3 (notice how the commas in the
                     attribute value behave like the commas in the written phrase). If you mean to
                     say act I, scene 2, line 3, try <code><link linkend="attribute-ref"
                        >ref</link>="I.2.3"</code> or <code><link linkend="attribute-ref"
                        >ref</link>="I 2 3"</code>.</para>
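                  <para>The distinction can be sketched with a few attribute values. The fragments
                     below are illustrations only, and assume a source whose <code><link
                           linkend="element-div">&lt;div></link></code>s are organized by act, scene,
                     and line:</para>
                  <para>
                     <programlisting>ref="I.2.3"          &lt;!-- one div: act I, scene 2, line 3 -->
ref="I 2 3"          &lt;!-- the same div, with spaces joining the levels -->
ref="I,2,3"          &lt;!-- three divs: act I, act 2, and act 3 -->
ref="I.2.3-I.2.5"    &lt;!-- a range: from act I, scene 2, line 3 through line 5 -->
ref="I.2.3, I.2.5"   &lt;!-- a series: line 3 and line 5 of act I, scene 2 --></programlisting>
                  </para>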
                  <para>The periods (full stops) in <code><link linkend="attribute-ref"
                        >@ref</link>="I.1.1"</code> are hypotactic markers, but they are arbitrary,
                     and could be replaced with any mix of non-word characters you like (except the
                     hyphen or comma), including spaces, e.g., <code><link linkend="attribute-ref"
                           >ref</link>="I:1 1"</code>. The numeral system is also arbitrary. You may
                     use any supported numeration system (see <link xlink:href="#numeration-systems"
                        >section on numeration systems</link>), even if the source uses a different
                     one. Semantic equivalents to the preceding example are <code><link
                           linkend="attribute-ref">ref</link>="A I i"</code> and <code><link
                           linkend="attribute-ref">ref</link>="1:a:I"</code>. Just remember, if you
                     use either the Roman numeral system or alphabetic sequences, include a
                           <code><link linkend="element-numerals">&lt;numerals&gt;</link></code> in
                     the <code><link linkend="element-head">&lt;head></link></code> to specify which
                     system should prevail in case of ambiguities (e.g., whether <code>c</code>
                     means 3 or 100). Roman numerals are the default, but it is a good idea to be
                     explicit.</para>
               </section>
               <section xml:id="attr_pos_and_val">
                  <title>Referencing tokens: <code><link linkend="attribute-pos">@pos</link></code>
                     and <code><link linkend="attribute-val">@val</link></code></title>
                  <para>To point to a token one normally uses <code><link linkend="element-tok"
                           >&lt;tok></link></code>, with one or more attributes, in three possible
                     configurations:</para>
                  <para>
                     <orderedlist>
                        <listitem>
                           <para><emphasis role="italic"><code><link linkend="attribute-val"
                                       >@val</link></code> or <code><link linkend="attribute-rgx"
                                       >@rgx</link></code> alone</emphasis>: one or more tokens are
                              pointed to by value. For example, <code><link linkend="attribute-val"
                                    >val</link> = "bird"</code> points to every occurrence of the
                              token <code>bird</code>; <code><link linkend="attribute-rgx"
                                    >rgx</link> = "b.+d"</code> finds every word that begins with a
                              b, ends with a d, and has one or more characters in between. Every value of
                                    <code><link linkend="attribute-rgx">@rgx</link></code> is
                              implicitly bound to the beginning and end of the string (see
                              below).</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="italic"><code><link linkend="attribute-pos"
                                       >@pos</link></code> alone</emphasis>: one or more tokens are
                              pointed to by numerical position, via one or more digits, or the
                              phrase <code>last</code> or <code>last-</code> plus a digit, joined by
                              hyphens or commas. For example, <code>2, 4-6, last-2 - last</code>
                              refers to the second, fourth, fifth, sixth, antepenult, penult, and
                              final tokens in a passage. The numerical value to which the keyword
                                 <code>last</code> resolves depends upon the number of tokens in the
                              context.</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="italic"><code><link
                                          linkend="attribute-val">@val</link></code> or <code><link
                                          linkend="attribute-rgx">@rgx</link></code> combined with
                                          <code><link linkend="attribute-pos"
                                    >@pos</link></code></emphasis>: a combination of the
                              previous two methods. For example, <code><link linkend="attribute-val"
                                    >val</link>="bird" <link linkend="attribute-pos"
                                 >pos</link>="2, 4"</code> picks the second and fourth occurrences
                              of the token <code>bird</code>.</para>
                        </listitem>
                     </orderedlist>
                  </para>
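                  <para>The three configurations might look as follows in practice. This is only a
                     sketch: the <code><link linkend="attribute-ref">@ref</link></code> value is
                     hypothetical, its use on <code><link linkend="element-tok"
                        >&lt;tok></link></code> to supply the context is assumed, and the passage is
                     assumed to contain the token <code>bird</code> at least four times.</para>
                  <para>
                     <programlisting>&lt;!-- 1. by value: every occurrence of "bird", then every token matching b.+d -->
&lt;tok ref="1.1" val="bird"/>
&lt;tok ref="1.1" rgx="b.+d"/>
&lt;!-- 2. by position: the second token and the last three tokens -->
&lt;tok ref="1.1" pos="2, last-2 - last"/>
&lt;!-- 3. by value and position: the second and fourth occurrences of "bird" -->
&lt;tok ref="1.1" val="bird" pos="2, 4"/></programlisting>
                  </para>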
                  <para>During Schematron validation, if <code><link linkend="attribute-pos"
                           >@pos</link></code> is missing, it is assumed to mean <code>*</code> or
                        <code>1 - last</code>; if neither <code><link linkend="attribute-val"
                           >@val</link></code> nor <code><link linkend="attribute-rgx"
                        >@rgx</link></code> appears, the assumption is <code><link
                           linkend="attribute-rgx">@rgx</link></code> with value <code>.+</code>
                     (any characters). That is, by default, <code><link linkend="attribute-pos"
                           >@pos</link></code> points to every instance and <code><link
                           linkend="attribute-val">@val</link></code>/<code><link
                           linkend="attribute-rgx">@rgx</link></code> to every token.</para>
                  <para>When using <code><link linkend="attribute-pos">@pos</link></code> make sure
                     you know the context. For example, the attribute combination <code>val="bird"
                        pos="last-1"</code> will produce an error if the token <code>bird</code>
                     does not occur at least two times in the given context.</para>
                  <para>It is advisable to use <code><link linkend="attribute-val"
                        >@val</link></code> or perhaps <code><link linkend="attribute-rgx"
                           >@rgx</link></code>, and not merely <code><link linkend="attribute-pos"
                           >@pos</link></code>. If your source's text changes, and there is no
                           <code><link linkend="attribute-val">@val</link></code>, it may be
                     difficult to determine the original intent of a claim, and thus whether
                     changes need to be made. <code><link linkend="attribute-val">@val</link></code>
                     is easier than <code><link linkend="attribute-rgx">@rgx</link></code> to
                     process in applications, particularly when compiling statistics or estimating
                     probabilities. Furthermore, <code><link linkend="attribute-val"
                        >@val</link></code> is generally speaking more efficient to process than is
                           <code><link linkend="attribute-rgx">@rgx</link></code>. A <code><link
                           linkend="attribute-rgx">@rgx</link></code> is more efficient only if it
                     replaces numerous instances of <code><link linkend="attribute-val"
                        >@val</link></code>.</para>
                  <para><code><link linkend="attribute-rgx">@rgx</link></code> is a regular
                     expression that must match an entire word-token. For example, <code><link
                           linkend="attribute-rgx">@rgx</link>="re.d"</code> will match the tokens
                     "rend" and "read" but will not match "already", "rends", or "bread". If you
                     wish to allow for characters at the beginning or end, use
                        <code>".*re.d.*"</code>. For more on regular expressions, see <xref
                        linkend="regular_expressions"/>.</para>
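                  <para>In other words, the pattern behaves as if it were anchored at both ends. A
                     sketch, using the tokens just mentioned (the <code><link
                           linkend="attribute-ref">@ref</link></code> value is hypothetical):</para>
                  <para>
                     <programlisting>&lt;!-- matches the tokens "rend" and "read" only -->
&lt;tok ref="1.1" rgx="re.d"/>
&lt;!-- also matches "already", "rends", and "bread" -->
&lt;tok ref="1.1" rgx=".*re.d.*"/></programlisting>
                  </para>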
               </section>
               <section>
                  <title>Referencing characters: <code><link linkend="attribute-chars"
                        >@chars</link></code></title>
                  <para>Individual letters are always specified by <code><link
                           linkend="attribute-chars">@chars</link></code>, which points to one or
                     more positions within a token, e.g., <code><link linkend="attribute-chars">chars</link>="2,
                        7, last"</code>. Combining characters are excluded from these counts; see
                        <xref xlink:href="#combining_characters"/>.</para>
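                  <para>As a worked illustration, take a hypothetical token <code>seraphim</code>,
                     which has eight base characters and no combining characters:</para>
                  <para>
                     <programlisting>token:     s  e  r  a  p  h  i  m
position:  1  2  3  4  5  6  7  8 (= last)

chars="2, 7, last"   &lt;!-- selects e, i, and m --></programlisting>
                  </para>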
               </section>
            </section>
         </section>
         <section xml:id="TAN-A">
            <title>General annotations and alignments (<code><link linkend="element-TAN-A"
                     >&lt;TAN-A></link></code>)</title>
            <para>TAN-A is the format for macroscopic, division-based alignment and annotations of
               class-1 sources. It allows you to align any number of versions of any number of works
               on the basis of <code><link linkend="element-div">&lt;div></link></code>s. The A also
               stands for annotations, because the TAN-A format allows you to make general
               assertions, usually but not necessarily about texts. TAN-A is a type of advanced RDF
               for textual scholarship (see <xref xlink:href="#rdf_and_lod"/>).</para>
            <section>
               <title>Root element and header</title>
               <para>The root element of a TAN division-based alignment file is <code><link
                        linkend="element-TAN-A">&lt;TAN-A></link></code>.</para>
               <para>TAN-A's <code><link linkend="element-head">&lt;head></link></code> has zero or
                  more <code><link linkend="element-source">&lt;source></link></code>s.</para>
               <para>Any concepts that will be mentioned in the <code><link linkend="element-claim"
                        >&lt;claim&gt;</link></code>s (the only children of <code><link
                        linkend="element-body">&lt;body></link></code>) need to be supplied in
                        <code><link linkend="element-vocabulary-key"
                     >&lt;vocabulary-key></link></code> or an associated TAN-voc file invoked by
                        <code><link linkend="element-vocabulary"
                  >&lt;vocabulary&gt;</link></code>.</para>
            </section>
            <section xml:id="tan-a_body">
               <title>Data (<code><link linkend="element-body">&lt;body></link></code>)</title>
               <para>The <code><link linkend="element-body">&lt;body></link></code> of a TAN-A file
                  takes, in addition to the customary optional attributes (see <xref
                     linkend="edit_stamp"/>), <code><link linkend="attribute-claimant"
                        >@claimant</link></code>, <code><link linkend="attribute-object"
                        >@object</link></code>, <code><link linkend="attribute-subject"
                        >@subject</link></code>, or <code><link linkend="attribute-verb"
                        >@verb</link></code>, stipulating the default values for any enclosed
                  claims.</para>
               <para>The rest of the body consists of zero or more <code><link
                        linkend="element-claim">&lt;claim&gt;</link></code>s, each of which
                  represents one or more claims. Claims can be made for a variety of purposes,
                  e.g.:</para>
               <para>
                  <itemizedlist>
                     <listitem>
                        <para>to index quotations and allusions;</para>
                     </listitem>
                     <listitem>
                        <para>to specify the subjects and topics dealt with in particular textual
                           passages;</para>
                     </listitem>
                     <listitem>
                        <para>to connect commentary or notes from one source to another;</para>
                     </listitem>
                     <listitem>
                        <para>to indicate where other scripta have different readings (apparatus
                           criticus);</para>
                     </listitem>
                     <listitem>
                        <para>to establish work-version relationships.</para>
                     </listitem>
                  </itemizedlist>
               </para>
               <para><code><link linkend="element-claim">&lt;claim&gt;</link></code>'s data model is
                  inspired by the Resource Description Framework (RDF; see <xref
                     linkend="rdf_and_lod"/>), where each statement consists of three items termed a
                  subject, a predicate, and an object. The first and third are thought of as nodes,
                  and the second as a connector (or edge) between the nodes. RDF adopts a graph
                  model, where the connector (edge) always links exactly two nodes.</para>
               <para>RDF is adequate for a limited range of scholarly assertions. An RDF statement
                  lacks context or qualifiers. No simple RDF statement, called a triple, can
                  indicate who made the assertion, or when, or if it was uttered with any doubt or
                  nuance. Sometimes we wish to claim a bare negation, e.g., "Aristotle was not the
                  author of <emphasis>De mundo</emphasis>"—which cannot be expressed in RDF.</para>
               <para>Any TAN claim that is exported to any RDF format should adopt the principles of
                     <link xlink:href="RDF*">RDF*</link>, which allows for complex, reified RDF
                  statements. As of this writing, the specifications for RDF* are still being
                  written.</para>
               <para>TAN's <code><link linkend="element-claim">&lt;claim&gt;</link></code> extends
                  the RDF graph model into a hypergraph, where the connector (edge) links two or
                  more nodes. The following adjustments are made:<orderedlist>
                     <listitem>
                        <para>Every claim <emphasis>must</emphasis> have at least one <emphasis
                              role="bold">claimant</emphasis>, some person, organization, or
                           algorithm to be credited/blamed for the assertion.</para>
                     </listitem>
                     <listitem>
                        <para>Every claim <emphasis>must</emphasis> have at least one <emphasis
                              role="bold">subject</emphasis>, the topic of the claim.</para>
                     </listitem>
                     <listitem>
                        <para>Every claim <emphasis>must</emphasis> have at least one <emphasis
                              role="bold">verb</emphasis> (in RDF called
                              <emphasis>predicate</emphasis>), specifying something about the
                           subject.</para>
                     </listitem>
                     <listitem>
                        <para>Every claim may have one or more <emphasis role="bold"
                              >adverb</emphasis>, qualifying the verb.</para>
                     </listitem>
                     <listitem>
                        <para>Every claim may assert a level or range of <emphasis role="bold"
                              >certainty</emphasis>, between zero and one, reflecting how certain
                           the claimant is of the claim.</para>
                     </listitem>
                     <listitem>
                        <para>Every claim may have one or more <emphasis role="bold"
                              >object</emphasis>, an entity or value expected by the verb.</para>
                     </listitem>
                     <listitem>
                        <para>Every claim may have one or more <emphasis role="bold">temporal
                              qualifier</emphasis>, restricting the claim to a specific time.</para>
                     </listitem>
                     <listitem>
                        <para>Every claim may have one or more <emphasis role="bold">locative
                              qualifier</emphasis>, restricting the claim to a specific geographical
                           region.</para>
                     </listitem>
                     <listitem>
                        <para>Every claim may have other components specially defined by the verb.
                           Currently, this entails for select verbs a language qualifier
                                 (<code><link linkend="attribute-in-lang">@in-lang</link></code>,
                                 <code><link linkend="element-in-lang"
                           >&lt;in-lang&gt;</link></code>) and a reference qualifier (<code><link
                                 linkend="element-at-ref">&lt;at-ref&gt;</link></code>).</para>
                     </listitem>
                  </orderedlist>Items 1-3 above are required parts of any claim. Items 4-9 may be
                  rendered as being required, optional, or disallowed by a <code><link
                        linkend="element-verb">&lt;verb&gt;</link></code>'s definition. For example,
                  a <code><link linkend="element-verb">&lt;verb&gt;</link></code> representing an
                  idea that in normal discourse is intransitive (e.g., sleep) can be defined such
                  that <code><link linkend="element-object">&lt;object></link></code> is not
                  allowed. </para>
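               <para>A minimal sketch of how these components might combine in a single claim,
                  using the negation about Aristotle mentioned above. All IDrefs are hypothetical
                  and would need to be defined in the vocabulary, and the use of an
                     <code>adverb</code> attribute to carry the negation is an assumption, not
                  something this section confirms.</para>
               <para>
                  <programlisting>&lt;!-- hypothetical IDrefs; the claimant asserts that Aristotle did not write De mundo -->
&lt;claim claimant="jk" subject="aristotle" adverb="not" verb="wrote" object="de-mundo"/></programlisting>
               </para>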
               <para>Furthermore, a <code><link linkend="element-verb">&lt;verb&gt;</link></code>
                  may be defined to restrict what kinds of objects or subjects are allowed. For
                  example, the standard TAN verb <code>lacks_text_at</code> (see
                     <code>vocabularies/verbs.TAN-voc.xml</code>) is defined to allow only scripta
                  as a subject. No objects are allowed. Rather, a <code><link
                        linkend="element-claim">&lt;claim></link></code> with this verb expects one
                  or more <code><link linkend="element-at-ref">&lt;at-ref&gt;</link></code>s, which
                  restrict the claim to particular passages in a TAN-T file.</para>
               <para>A <code><link linkend="element-verb">&lt;verb&gt;</link></code> can specify
                  that an object must be data, and it can also define the type of data allowed and
                  its permitted lexical form. <code><link linkend="element-verb"
                     >&lt;verb&gt;</link></code>s take a special extension to their IRI + name
                  pattern, permitting constraints that specify what is allowed, required, or
                  prohibited. Some examples of <code><link linkend="element-verb"
                        >&lt;verb&gt;</link></code> vocabulary items:</para>
               <para>
                  <example>
                     <title>Examples of verb vocabulary items</title>
                     <programlisting>&lt;verb xml:id="wrote">
   &lt;IRI>http://rdaregistry.info/Elements/u/P60663&lt;/IRI>
   &lt;name>is author of&lt;/name>
   &lt;constraints>
      &lt;subject status="required" item-type="person"/>
      &lt;object status="required" item-type="work version"/>
   &lt;/constraints>
&lt;/verb>
. . . . . . .
&lt;verb group="zero_objects">
    &lt;IRI>tag:textalign.net,2015:verb:lacks-text&lt;/IRI>
    &lt;name>lacks text&lt;/name>
    &lt;name>lacks text at&lt;/name>
    &lt;desc>At the &amp;lt;at-ref>, the textual entity referred to by the subject lacks
        any text. The claim takes no object.&lt;/desc>
    &lt;constraints>
        &lt;subject status="required" item-type="scriptum"/>
        &lt;object status="disallowed"/>
        &lt;at-ref status="required"/>
    &lt;/constraints>
&lt;/verb>
. . . . . . .
&lt;verb xml:id="survives-in-original-language">
   &lt;IRI>tag:kalvesmaki.com,2014:verb:work-survives-in-original-language&lt;/IRI>
   &lt;name>original work is extant to what degree&lt;/name>
   &lt;desc>This verb is used to describe the degree to which a work survives in the
      original language of composition. It takes as object an xs:double between 0 and 1,
      representing the approximate percentage that is extant. This property does not
      stipulate how close to the first or earliest version the extant material
      is.&lt;/desc>
   &lt;constraints>
      &lt;subject status="required" item-type="work version"/>
      &lt;object status="required" content-datatype="double"
         content-lexical-constraint="[01]\.0*|0\.\d+"/>
   &lt;/constraints>
&lt;/verb></programlisting>
                  </example>
               </para>
               <para>Other examples of <code><link linkend="element-verb"
                  >&lt;verb&gt;</link></code>s can be found at
                     <code>vocabularies/verbs.TAN-voc.xml</code>.</para>
               <para>Claims may refer to other claims. That is, <code><link linkend="element-claim"
                        >&lt;claim></link></code>s can nest inside each other (e.g., X claims that Y
                  claims that Z claims that...). Or a <code><link linkend="element-claim"
                        >&lt;claim></link></code> may take an <code><link linkend="attribute-xmlid"
                        >@xml:id</link></code>, whose value can then be cited as the object or
                  subject of any other <code><link linkend="element-claim"
                  >&lt;claim></link></code>.</para>
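               <para>As a minimal sketch of the second approach, one <code><link
                        linkend="element-claim">&lt;claim></link></code> can be given an <code><link
                        linkend="attribute-xmlid">@xml:id</link></code> and then cited in the
                        <code><link linkend="attribute-object">@object</link></code> of another.
                  The IDrefs <code>c1</code>, <code>A</code>, <code>B</code>, and <code>X</code>,
                  and the verbs <code>taught</code> and <code>denies</code>, are hypothetical; the
                  sketch assumes that matching vocabulary items are declared elsewhere in the
                  file.</para>
               <para>
                  <example>
                     <title>Sketch of one claim citing another by ID</title>
                     <programlisting>&lt;!-- Hypothetical IDrefs and verbs, for illustration only -->
&lt;claim xml:id="c1" subject="A" verb="taught" object="X"/>
. . . . . . .
&lt;!-- B denies the claim identified as c1 -->
&lt;claim subject="B" verb="denies" object="c1"/></programlisting>
                  </example>
               </para>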
               <para>If a <code><link linkend="element-claim">&lt;claim></link></code> is about a
                  work or source in general, as a whole, one or more IDrefs may be placed in
                        <code><link linkend="attribute-subject">@subject</link></code> or
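                        <code><link linkend="attribute-object">@object</link></code>. But if the
                  claim is about a specific part of the textual object, then more information is
                  needed than an attribute can carry, so the elements <code><link
                        linkend="element-subject">&lt;subject></link></code> and <code><link
                        linkend="element-object">&lt;object></link></code> must be used
                  instead.</para>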
               <para>Such textual references come in three flavors: assertions pertaining to a work,
                  assertions pertaining to a work in only some versions, and assertions pertaining
                  to scripta. In the first case, <code><link linkend="element-subject"
                        >&lt;subject></link></code> or <code><link linkend="element-object"
                        >&lt;object></link></code> must take <code><link linkend="attribute-work"
                        >@work</link></code>, with IDrefs pointing to vocabulary items for
                        <code><link linkend="element-work">&lt;work&gt;</link></code>s. In the
                  second case, <code><link linkend="attribute-src">@src</link></code> is used,
                  pointing by IDref to the applicable <code><link linkend="element-source"
                        >&lt;source></link></code>s. In the third case <code><link
                        linkend="attribute-scriptum">@scriptum</link></code> is used, pointing to
                  vocabulary items for <code><link linkend="element-scriptum"
                        >&lt;scriptum&gt;</link></code>. Remember, you may combine commonly grouped
                  IDrefs in an <code><link linkend="element-alias"
                  >&lt;alias&gt;</link></code>.</para>
               <para>A <code><link linkend="attribute-work">@work</link></code> means that the claim
                  applies to every version of the work, whether a source or not; a <code><link
                        linkend="attribute-src">@src</link></code> specifies that the claim applies
                  only to the specific <code><link linkend="element-source"
                     >&lt;source></link></code>s, and not to every possible version. In each case,
                        <code><link linkend="element-subject">&lt;subject></link></code> or
                        <code><link linkend="element-object">&lt;object></link></code> may be given
                  more attributes and elements to restrict the claim to specific parts of the work
                  or source, with <code><link linkend="attribute-ref">@ref</link></code>,
                        <code><link linkend="element-tok">&lt;tok></link></code>, <code><link
                        linkend="attribute-val">@val</link></code>, <code><link
                        linkend="attribute-pos">@pos</link></code>, and <code><link
                        linkend="attribute-chars">@chars</link></code>, following the conventions
                  used in pointing to parts of texts (see <xref xlink:href="#pointer-syntax"
                  />).</para>
               <para>If a <code><link linkend="element-subject">&lt;subject></link></code> or
                        <code><link linkend="element-object">&lt;object></link></code> points via
                        <code><link linkend="attribute-scriptum">@scriptum</link></code> to a
                  scriptum, specifying the claim necessarily takes a different approach than that
                  used for <code><link linkend="attribute-work">@work</link></code> or <code><link
                        linkend="attribute-src">@src</link></code>. Bear in mind that these
                  guidelines encourage you to avoid scriptum-oriented methods of dividing class 1
                  files. Specifying a portion of a scriptum (e.g., a particular manuscript folio
                  number) therefore requires an apparatus that likely does not correspond to a TAN
                  file. For this reason, a <code><link linkend="element-subject">&lt;subject></link></code> or
                        <code><link linkend="element-object">&lt;object></link></code> with a
                        <code><link linkend="attribute-scriptum">@scriptum</link></code> can be
                  restricted to a particular region through descendant <code><link
                        linkend="element-div">&lt;div></link></code>s that specify via <code><link
                        linkend="attribute-n">@n</link></code> and <code><link
                        linkend="attribute-type">@type</link></code> specific parts of the scriptum.
                  These scriptum filters, unlike TAN-T <code><link linkend="element-div"
                        >&lt;div></link></code>s, are always empty; their sole purpose is to point
                  in native terms to a specific region on a scriptum.</para>
               <para>Multiple values in any component of a <code><link linkend="element-claim"
                        >&lt;claim></link></code> are distributed, which means that one <code><link
                        linkend="element-claim">&lt;claim></link></code> might contain multiple
                  assertions. For example, <code>&lt;claim subject="A B" verb="taught promoted"
                     object="X Y Z"/></code> contains twelve claims (2 × 2 × 3: one for every
                  combination of the three attributes' individual values). The exception to this
                  general rule is <code><link linkend="attribute-adverb">@adverb</link></code>,
                  whose multiple values are taken as ampliative and restrictive. For example,
                     <code>&lt;claim subject="A" adverb="probably not" verb="taught"
                     object="X"/></code> is a single claim, not two, even though <code><link
                        linkend="attribute-adverb">@adverb</link></code> has two values.</para>
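               <para>The distribution can be written out by hand. As an illustration only, the
                  compact <code><link linkend="element-claim">&lt;claim></link></code> discussed
                  above is equivalent to the twelve single-valued claims that follow. The IDrefs
                  are the same placeholders used in that paragraph and would need matching
                  vocabulary items in a real file.</para>
               <para>
                  <example>
                     <title>A distributed claim written out in full</title>
                     <programlisting>&lt;!-- Compact form: 2 subjects x 2 verbs x 3 objects -->
&lt;claim subject="A B" verb="taught promoted" object="X Y Z"/>
. . . . . . .
&lt;!-- Equivalent expanded form: 12 claims, placeholder IDrefs -->
&lt;claim subject="A" verb="taught" object="X"/>
&lt;claim subject="A" verb="taught" object="Y"/>
&lt;claim subject="A" verb="taught" object="Z"/>
&lt;claim subject="A" verb="promoted" object="X"/>
&lt;claim subject="A" verb="promoted" object="Y"/>
&lt;claim subject="A" verb="promoted" object="Z"/>
&lt;claim subject="B" verb="taught" object="X"/>
&lt;claim subject="B" verb="taught" object="Y"/>
&lt;claim subject="B" verb="taught" object="Z"/>
&lt;claim subject="B" verb="promoted" object="X"/>
&lt;claim subject="B" verb="promoted" object="Y"/>
&lt;claim subject="B" verb="promoted" object="Z"/></programlisting>
                  </example>
               </para>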
               <para>A limited set of verbs has been defined in standard TAN vocabulary; see <xref
                     xlink:href="#vocabularies-verbs"/>. The strictures defined in these verbs are
                  checked during Schematron validation. For a brief discussion on defining your own
                  verbs in a TAN-voc file see <xref linkend="tan-voc-data"/>.</para>
               <para>Aspects of the discussion can be illustrated with select examples of
                  claims:</para>
               <para>
                  <example>
                     <title>Examples of claims</title>
                     <programlisting>&lt;claim subject="cpg2440-syr" verb="translates" object="cpg2440"/>
. . . . . . .
&lt;claim subject="Λ" adverb="perhaps" verb="reads">
   &lt;at-ref src="grc" ref="1 a 5">
      &lt;tok pos="1-2"/>
   &lt;/at-ref>
   &lt;object>τις ἀποδιδῷ&lt;/object>
&lt;/claim>
. . . . . . .
&lt;claim subject="cpg2430" verb="has-incipit">
   &lt;object xml:lang="grc">Ἐπειδή μοι πρώην δεδήλωκας ἀπὸ τοῦ ἁγίου ὄρους ἐν τῇ Σκίτει
      καθεζομένῳ&lt;/object>
&lt;/claim>
. . . . . . .
&lt;claim verb="edits" adverb="partially" object="cpg2430">
   &lt;subject which="Muyldermans_1932">
      &lt;div n="84-89, 91-92" type="page"/>
   &lt;/subject>
&lt;/claim>
. . . . . . .
&lt;claim verb="paraphrases">
   &lt;subject work="pr" ref="13"/>
   &lt;object work="nt" ref="1Th 2:6"/>
&lt;/claim>
. . . . . . .
&lt;claim verb="quotes">
   &lt;subject src="grc-Mu1931" ref="I 87"/>
   &lt;object work="lxx" ref="Wis 13:5"/>
&lt;/claim>
. . . . . . .
&lt;claim verb="alludes_to">
   &lt;subject work="KG" ref="II 86"/>
   &lt;object work="lxx" ref="Ex 25:30"/>
   &lt;object work="nt" ref="Heb 9:2"/>
&lt;/claim></programlisting>
                  </example>
               </para>
            </section>
         </section>
         <section xml:id="tan-a-tok">
            <title>Token-based annotations and alignments (<code><link linkend="element-TAN-A-tok"
                     >&lt;TAN-A-tok></link></code>)</title>
            <para>TAN-A-tok files facilitate the microscopic alignment of two related sources. The
               format is intended to allow you to specify exactly where, how, and why two
               transcriptions align, and to do so on the most granular level possible. TAN-A-tok
               files also allow you to express levels of confidence or alternative opinions. A
               TAN-A-tok file takes two class-1 sources, which should be two different versions of
               the same work. Most often, one will be a translation of the other, but the format can
               be used for two versions of the text in the same language, e.g., a paraphrase or a
               revision.</para>
            <para>Creators and editors of TAN-A-tok files should be able to read the languages of
               their sources and to explain as precisely as possible the relationship between the
               two sources. They should be prepared to think about and specify types of textual
               reuse. TAN-A-tok files tend to be more demanding to create and edit than TAN-A files
               are because of the level of detail involved.</para>
            <para>To simplify the file, token alignment is restricted to two texts, referred to
               jointly as a <emphasis role="italic">bitext</emphasis>. Each half of the bitext must
               be a TAN-T(EI) file. It is assumed that those two sources share some special
               relationship, direct or indirect, and relate through one or more types of textual
               reuse: translation, paraphrase, commentary, and so forth. Some of these bitexts, such
               as literal translations, may line up quite nicely word for word. Others, such as
               paraphrases, may line up sporadically, vaguely, ambiguously, or, in places, not at
               all. Annotating a bitext is oftentimes not easy, and requires you to consider and
               declare assumptions you have made in two key areas: the relationship that holds
               between two scripta and the types of reuse that were involved in turning one version
               into the other (or a common ancestor into both).</para>
            <para><emphasis role="bold">Relationship of sources' scripta</emphasis>. What is the
               physical relationship or history that connects the two sources' scripta? Is one a
               direct descendant (copy) of the other? If not, what common ancestor do they share?
               Here you should consider the material aspect of the bitext, because you are trying to
               answer how object A's text relates to object B's. See <xref
                  xlink:href="#vocabularies-bitext-relations"/>.</para>
            <para><emphasis role="bold">Types of reuse</emphasis>. What categories of text reuse do
               you consider operative? Users of your data should be informed of the paradigm you
               bring to your analysis. You may wish to keep your categories nondescript and somewhat
               vague, using generic terms such as <emphasis>translation</emphasis>,
                  <emphasis>paraphrase</emphasis>, <emphasis>quotation</emphasis>, without much
               specificity. On the other hand, you may subscribe to a detailed view of text reuse.
               Perhaps you have adopted field-specific categories such as <emphasis>obligatory
                  explicitation</emphasis>, <emphasis>optional explicitation</emphasis>,
                  <emphasis>pragmatic explicitation</emphasis>, or <emphasis>translation-inherent
                  explicitation</emphasis>. You may also wish to declare secondary types of reuse
               that may have intervened, such as <emphasis role="italic">scribal omission</emphasis>
               or <emphasis role="italic">dittography</emphasis>. You must declare at least one type
               of reuse, whether one of your own or one of those built into the TAN format. See
                  <xref xlink:href="#vocabularies-reuse-types"/>.</para>
            <section>
               <title>Root Element and Header</title>
               <para>The root element of a token-based alignment file is <code><link
                        linkend="element-TAN-A-tok">&lt;TAN-A-tok></link></code>.</para>
               <para>The TAN-A-tok header builds upon the core and class 2 headers (see <xref
                     linkend="metadata_head"/> and <xref linkend="class_2_metadata"/>).</para>
               <para>TAN-A-tok files take exactly two <code><link linkend="element-source"
                        >&lt;source></link></code>s. The sequence is arbitrary. Each <code><link
                        linkend="element-source">&lt;source></link></code> must take an <code><link
                        linkend="attribute-xmlid">@xml:id</link></code>.</para>
               <para><code><link linkend="element-vocabulary-key">&lt;vocabulary-key></link></code>
                  takes, in addition to all the elements allowed in class-2 files (see <xref
                     linkend="class_2_metadata"/>), two elements unique to TAN-A-tok: <code><link
                        linkend="element-bitext-relation">&lt;bitext-relation></link></code> and
                        <code><link linkend="element-reuse-type">&lt;reuse-type></link></code>. The
                  former describes the genealogical relationship between the two sources' scripta.
                  The latter attends to the qualitative aspect of the bitext relationship. See
                  above.</para>
            </section>
            <section>
               <title>Data (<code><link linkend="element-body">&lt;body></link></code>)</title>
               <para>The <code><link linkend="element-body">&lt;body></link></code> of a TAN-A-tok
                  file takes, in addition to the customary optional attributes (see <xref
                     linkend="edit_stamp"/>), required <code><link
                        linkend="attribute-bitext-relation">@bitext-relation</link></code> and
                        <code><link linkend="attribute-reuse-type">@reuse-type</link></code>, which
                  take one or more IDrefs from <code><link linkend="element-bitext-relation"
                        >&lt;bitext-relation></link></code> and <code><link
                        linkend="element-reuse-type">&lt;reuse-type></link></code>, indicating the
                  default values that govern the alignment. </para>
               <para><code><link linkend="element-body">&lt;body></link></code> has only one type of
                  child: one or more <code><link linkend="element-align">&lt;align></link></code>s,
                  each of which collects sets of <code><link linkend="element-tok"
                     >&lt;tok></link></code>s from one or both sources, known collectively as a
                     <emphasis role="italic">token cluster</emphasis>. Clusters may overlap, to
                  handle translations in which words fall in one-to-one, one-to-many, many-to-one,
                  and many-to-many relationships. The independence of token clusters allows you to
                  register differences of opinion about the same set of tokens. An <code><link
                        linkend="element-align">&lt;align></link></code> may take an <code><link
                        linkend="attribute-xmlid">@xml:id</link></code>, in case you or someone else
                  wishes to refer to a particular <code><link linkend="element-align"
                        >&lt;align></link></code>.</para>
               <para>Nothing should be inferred from silence in a TAN-A-tok file. There is no
                  requirement that everything in a source <emphasis>must</emphasis> be encoded or
                  described. In writing and editing a TAN-A-tok file you do not commit yourself to
                  saying everything possible about the bitext. You might choose to encode only a few
                  token clusters. Tokens that are not referred to should not be interpreted as gaps
                  in a translation. All that can be inferred is that the creators and editors of the
                  TAN-A-tok file have said nothing about the tokens. (See discussion on <link
                     linkend="accuracy-precision-comprehensiveness">comprehensiveness</link>.) In
                  fact it is oftentimes preferable to have a TAN-A-tok file that points to only a
                  selection of tokens; a file with tens of thousands of <code><link
                        linkend="element-align">&lt;align></link></code>s could take a very long
                  time to validate, or to process in applications.</para>
               <para>Any token may be a member of as many <code><link linkend="element-align"
                        >&lt;align></link></code>s as you like. In fact, this is preferred if you
                  wish to register competing claims or alternatives.</para>
               <para>If you wish to declare that one or more words in a source were omitted from a
                  translation or inserted into one—that is, words in one source have no match in the
                  other—you must do so through a <emphasis role="italic">one-sided
                     alignment</emphasis>, i.e., a token cluster that has tokens from only one
                  source. A one-sided alignment implies insertions or omissions.</para>
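               <para>A one-sided alignment might look like the following sketch. The source IDref
                     <code>grc</code> and the reference <code>Col 1 4</code> are borrowed from
                  examples used elsewhere in this section; the token positions are hypothetical,
                  chosen only to show a cluster whose tokens all come from one source.</para>
               <para>
                  <example>
                     <title>Sketch of a one-sided alignment</title>
                     <programlisting>&lt;!-- Hypothetical: tokens 3-4 of the grc source have no match in the other source -->
&lt;align>
   &lt;tok src="grc" ref="Col 1 4" pos="3 - 4"/>
&lt;/align></programlisting>
                  </example>
               </para>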
               <para>If there are multiple values in <code><link linkend="attribute-reuse-type"
                        >@reuse-type</link></code> or <code><link
                        linkend="attribute-bitext-relation">@bitext-relation</link></code>, the
                  intersection, not the union, of those values is to be understood. For example,
                     <code>reuse-type="translation paraphrase"</code> would indicate that the token
                  cluster results from an activity that is both translation and paraphrase, not one
                  or the other. If a particular <code><link linkend="element-align"
                        >&lt;align></link></code> might be one reuse type or the other, but not
                  both, then create two <code><link linkend="element-align"
                  >&lt;align></link></code>s, qualifying each one with a different value for
                        <code><link linkend="attribute-reuse-type">@reuse-type</link></code>. Then
                  add <code><link linkend="attribute-cert">@cert</link></code>, indicating through a
                  decimal number between 0 and 1 how confident you are that that particular
                  reuse-type is accurate. <code><link linkend="attribute-cert2">@cert2</link></code>
                  can also be added, in case you do not want to commit yourself to such a precise
                  number.</para>
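               <para>The two-alternatives approach can be sketched as follows. The IDrefs
                     <code>translation</code> and <code>paraphrase</code> are assumed to be
                  declared as <code><link linkend="element-reuse-type"
                     >&lt;reuse-type></link></code>s in the header; the sources, references,
                  positions, and <code><link linkend="attribute-cert">@cert</link></code> values
                  are hypothetical.</para>
               <para>
                  <example>
                     <title>Sketch of competing reuse types for the same token cluster</title>
                     <programlisting>&lt;!-- Hypothetical values: the same cluster, judged more likely translation than paraphrase -->
&lt;align reuse-type="translation" cert="0.7">
   &lt;tok src="grc" ref="Col 1 4" pos="11 - 12"/>
   &lt;tok src="syr" ref="Col 1 4" pos="7"/>
&lt;/align>
&lt;align reuse-type="paraphrase" cert="0.3">
   &lt;tok src="grc" ref="Col 1 4" pos="11 - 12"/>
   &lt;tok src="syr" ref="Col 1 4" pos="7"/>
&lt;/align></programlisting>
                  </example>
               </para>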
               <para>Commonly, <code><link linkend="element-tok">&lt;tok></link></code>s include
                        <code><link linkend="attribute-ref">@ref</link></code>, pointing to a leaf
                        <code><link linkend="element-div">&lt;div></link></code>. But this is not
                  required. The <code><link linkend="attribute-ref">@ref</link></code> may point to
                  a <code><link linkend="element-div">&lt;div></link></code> that takes other
                        <code><link linkend="element-div">&lt;div></link></code>s, or <code><link
                        linkend="attribute-ref">@ref</link></code> may be altogether absent. If a
                        <code><link linkend="element-tok">&lt;tok></link></code> lacks a <code><link
                        linkend="attribute-ref">@ref</link></code> then it means that the claim is
                  true for all instances of that word in the source, no matter where found.</para>
               <para>
                  <example>
                     <title>Examples of TAN-A-tok alignments</title>
                     <programlisting>&lt;align>
    &lt;tok src="ring1881" ref="2" val="pocket"/>
    &lt;tok src="ring1987" ref="2" val="pocket"/>
&lt;/align>
. . . . . . .
&lt;align reuse-type="stylistic_minus">
   &lt;tok src="grc" ref="Col 1 4" pos="11 - 12"/>
   &lt;tok src="syr" ref="Col 1 4" pos="7" chars="last-2 - last"/>
&lt;/align></programlisting>
                  </example>
               </para>
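               <para>The first example above can be adapted to show a <code><link
                        linkend="element-tok">&lt;tok></link></code> that lacks <code><link
                        linkend="attribute-ref">@ref</link></code>. In this sketch, offered only as
                  an illustration, the alignment is asserted for every instance of the word in each
                  source, not just those in a particular <code><link linkend="element-div"
                        >&lt;div></link></code>.</para>
               <para>
                  <example>
                     <title>Sketch of an alignment without @ref</title>
                     <programlisting>&lt;!-- No @ref: applies to all instances of "pocket" in each source -->
&lt;align>
    &lt;tok src="ring1881" val="pocket"/>
    &lt;tok src="ring1987" val="pocket"/>
&lt;/align></programlisting>
                  </example>
               </para>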
            </section>
         </section>
         <section xml:id="tan-a-lm">
            <title>Lexico-morphology (<code><link linkend="element-TAN-A-lm"
                  >&lt;TAN-A-lm&gt;</link></code>)</title>
            <para>TAN-A-lm files are used to annotate a class-1 source by specifying the lexical and
               morphological properties of its tokens or morphemes. </para>
            <para>Every TAN-A-lm file has two different types of dependencies: a class 1 source
               (optional) and the grammatical rules defined in one or more TAN-mor files. This
               section therefore should be read in close conjunction with <xref linkend="TAN-mor"
               />.</para>
            <para>TAN-A-lm files are either <emphasis>source-specific</emphasis> or
                  <emphasis>language-specific</emphasis>. </para>
            <para>Source-specific TAN-A-lm files depend exclusively upon one class-1 source.
               Source-specific TAN-A-lm files are useful for closely analyzing the grammatical
               properties of the words in one particular text. Well-curated source-specific TAN-A-lm
               files are enormously useful for other applications, e.g., quotation detection. Any
               source-specific TAN-A-lm file can be converted into a language-specific one, to be
               used as noted below.</para>
            <para>Language-specific TAN-A-lm files depend upon an unknown number of sources. Some
               language-specific TAN-A-lm files might be based upon a small, specific corpus,
               perhaps just one text. Others might rely upon a vast, general one. Language-specific
               TAN-A-lm files are useful for building language resources for computer applications.
               Many language-specific TAN-A-lm files become the basis for a local language catalog,
               which can be used to populate a new source-specific TAN-A-lm file.</para>
            <section>
               <title>Principles and assumptions</title>
               <para>Editors of TAN-A-lm files should understand the vocabulary and grammar of the
                  languages of their sources. They should have a good sense of the rules established
                  by the lexical and grammatical authorities adopted. They should be familiar with
                  the conventions and assumptions of the TAN-mor files being used.</para>
               <para>Although you must assume the point of view of a particular grammar and lexicon,
                  you need not hold to a single one. In addition, you may bring to the analysis your
                  own expertise and supply lexical headwords unattested in published
                  authorities.</para>
               <para>Although TAN-A-lm files are simple, they can be laborious to write and edit,
                  more than any other type of TAN file. They can also be hard to read if the
                  morphological codes are cryptic. It is customary for an editor of a TAN-A-lm file
                  to use tools to create and edit the data.</para>
            </section>
            <section>
               <title>Root Element and Header</title>
               <para>The root element of a lexico-morphological file is <code><link
                        linkend="element-TAN-A-lm">&lt;TAN-A-lm></link></code>.</para>
               <para>If the file is source-specific, <code><link linkend="element-source"
                        >&lt;source></link></code> points to the one and only TAN-T(EI) file that is
                  the object of analysis. If the file is language-specific, <code><link
                        linkend="element-for-lang">&lt;for-lang></link></code> is used in the
                  declarations section of the <code><link linkend="element-head"
                     >&lt;head></link></code> to indicate the languages that are covered. </para>
               <para>For highly inflected languages, language-specific TAN-A-lm files can be
                  enormous in size or quantity. To improve performance when validating and
                  processing numerous or large language-specific TAN-A-lm files, the <code><link
                        linkend="element-head">&lt;head></link></code> may also include <code><link
                        linkend="element-tok-starts-with">&lt;tok-starts-with&gt;</link></code> and
                        <code><link linkend="element-tok-is">&lt;tok-is&gt;</link></code>. It is
                  common for language-specific TAN-A-lm files to be cataloged in a <code><link
                        linkend="element-collection">&lt;collection></link></code> file. These
                  become part of the local language catalog, bound to the global parameter
                     <code>$tan:lang-catalog-map</code>, found in
                     <code>parameters/params-application-language.xsl</code>. By adding your
                  collections of language-specific TAN-A-lm files to that parameter, you open up
                  those resources for use in a variety of other applications. In that <code><link
                        linkend="element-collection">&lt;collection></link></code> file, the
                  individual <code><link linkend="element-doc">&lt;doc></link></code>s that point to
                  language-specific TAN-A-lm files should include as children any <code><link
                        linkend="element-tok-starts-with">&lt;tok-starts-with&gt;</link></code> and
                        <code><link linkend="element-tok-is">&lt;tok-is&gt;</link></code> as in the
                  original.</para>
               <para>
                  <example>
                     <title>Example of a catalog entry for a language-specific TAN-A-lm file</title>
                     <programlisting>&lt;doc href="lat-tan-a-lm-abu.xml" TAN-version="2021" 
   id="tag:kalvesmaki.com,2015:tan-a-lm:lat:perseus:abu" 
   lexicon="LS" morphology="perseus-dik" claimant="xslt1" root="TAN-A-lm">
  &lt;name xmlns="tag:textalign.net,2015:ns">Perseus lexico-morphological permutations 
   devoted exclusively to abu&lt;/name>
  &lt;license xmlns="tag:textalign.net,2015:ns" which="Attribution-ShareAlike 3.0 Unported" 
   licensor="perseus"/>
  &lt;for-lang xmlns="tag:textalign.net,2015:ns">lat&lt;/for-lang>
  &lt;tok-starts-with xmlns="tag:textalign.net,2015:ns">Abu&lt;/tok-starts-with>
  &lt;tok-starts-with xmlns="tag:textalign.net,2015:ns">abu&lt;/tok-starts-with>
  &lt;tok-starts-with xmlns="tag:textalign.net,2015:ns">abú&lt;/tok-starts-with>
&lt;/doc></programlisting>
                  </example>
               </para>
               <para>Conversion from a source-specific TAN-A-lm to a language-specific one is a
                  one-way operation. There is at present no mechanism for automatically
                  reconstructing the corpus that underlies a language-specific TAN-A-lm file. </para>
               <para><code><link linkend="element-vocabulary-key">&lt;vocabulary-key></link></code>
                  takes the elements other class-2 files take (see <xref linkend="class_2_metadata"
                  />). It also permits two elements unique to TAN-A-lm: <code><link
                        linkend="element-lexicon">&lt;lexicon></link></code> (optional) and
                        <code><link linkend="element-morphology">&lt;morphology></link></code>
                  (mandatory). Any number of lexica and morphologies may be declared; the order is
                  inconsequential. </para>
               <para>There is, at present, no TAN format for lexica and dictionaries. So even if a
                  digital form of a dictionary is identified through the <xref
                     linkend="digital_entity_metadata"/>, the Schematron validation routine will not
                  attempt to check the TAN-A-lm data against the lexical authorities cited. </para>
               <para>Because you or other TAN-A-lm editors are likely to be authorities in your own
                  right, <code><link linkend="element-person">&lt;person&gt;</link></code> can be
                  treated as if a <code><link linkend="element-lexicon">&lt;lexicon></link></code>,
                  and be referred to by <code><link linkend="attribute-lexicon"
                     >@lexicon</link></code>.</para>
            </section>
            <section>
               <title>Data (<code><link linkend="element-body">&lt;body></link></code>)</title>
               <para>The <code><link linkend="element-body">&lt;body></link></code> of a TAN-A-lm
                  file takes, in addition to the customary optional attributes found in other TAN
                  files (see <xref linkend="edit_stamp"/>), <code><link linkend="attribute-lexicon"
                        >@lexicon</link></code> and <code><link linkend="attribute-morphology"
                        >@morphology</link></code>, to specify the default lexicon and
                  grammar.</para>
               <para><code><link linkend="element-body">&lt;body></link></code> has only one type of
                  child: one or more <code><link linkend="element-ana">&lt;ana></link></code>s
                  (short for analysis), each of which matches one or more tokens (<code><link
                        linkend="element-tok">&lt;tok&gt;</link></code>) to one or more lexemes or
                  morphological assertions (<code><link linkend="element-lm"
                     >&lt;lm&gt;</link></code>, which takes zero or more <code><link
                        linkend="element-l">&lt;l&gt;</link></code>s followed by one or more
                        <code><link linkend="element-m">&lt;m&gt;</link></code>s). </para>
               <para>An <code><link linkend="element-ana">&lt;ana></link></code> may take a
                        <code><link linkend="attribute-tok-pop">@tok-pop</link></code>, to specify
                  the number of tokens that the assertion applies to. This is particularly helpful
                  for language-specific files based upon a limited corpus of texts, where the
                  underlying data for the assertion might be difficult or impossible to retrieve.
                  The token population can be used to calibrate levels of certainty, or to compare
                  statistical profiles of one TAN-A-lm file against another.</para>
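               <para>A brief sketch of <code><link linkend="attribute-tok-pop"
                     >@tok-pop</link></code> follows. The figure of 12 is invented for illustration,
                  standing for an assertion that was derived from twelve tokens in some underlying
                  corpus.</para>
               <para>
                  <example>
                     <title>Sketch of <code>@tok-pop</code> (invented value)</title>
                     <programlisting>&lt;!-- the population of 12 is an illustrative value only -->
&lt;ana tok-pop="12">
   &lt;tok val="aberro"/>
   &lt;lm>
      &lt;l>aberro&lt;/l>
      &lt;m>v - 1 s p i a&lt;/m>
   &lt;/lm>
&lt;/ana></programlisting>
                  </example>
               </para>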
               <para>If you wish to point to a linguistic token that straddles more than one token,
                  you should use multiple <code><link linkend="element-tok">&lt;tok></link></code>s,
                  wrapping them in a <code><link linkend="element-group">&lt;group></link></code>. </para>
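               <para>For instance, two adjacent tokens that together form one linguistic unit might
                  be grouped as in the sketch below. The reference <code>1 2</code>, the positions,
                  the lexeme, and the code <code>NN</code> are all hypothetical.</para>
               <para>
                  <example>
                     <title>Sketch of a linguistic token spanning two tokens (hypothetical
                        values)</title>
                     <programlisting>&lt;!-- assumes the third and fourth tokens of div "1 2" are "ice" and "cream" -->
&lt;ana>
   &lt;group>
      &lt;tok ref="1 2" pos="3"/>
      &lt;tok ref="1 2" pos="4"/>
   &lt;/group>
   &lt;lm>
      &lt;l>ice-cream&lt;/l>
      &lt;m>NN&lt;/m>
   &lt;/lm>
&lt;/ana></programlisting>
                  </example>
               </para>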
               <para>Any token may be the object of as many <code><link linkend="element-ana"
                        >&lt;ana&gt;</link></code>s as you like. In fact, this is preferred if you
                  wish to register competing claims or alternatives.</para>
               <para>Claims within an <code><link linkend="element-ana">&lt;ana&gt;</link></code>
                  are distributed. That is, every combination of <code><link linkend="element-l"
                        >&lt;l&gt;</link></code> and <code><link linkend="element-m"
                        >&lt;m&gt;</link></code> (governed by <code><link linkend="element-lm"
                        >&lt;lm&gt;</link></code>) is asserted to be true for every <code><link
                        linkend="element-tok">&lt;tok></link></code> or <code><link
                        linkend="element-group">&lt;group></link></code>. </para>
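               <para>The distribution can be counted out in a small sketch. With two <code><link
                        linkend="element-tok">&lt;tok></link></code>s, one <code><link
                        linkend="element-l">&lt;l&gt;</link></code>, and two <code><link
                        linkend="element-m">&lt;m&gt;</link></code>s, the <code><link
                        linkend="element-ana">&lt;ana&gt;</link></code> below makes 2 × 1 × 2 = 4
                  claims. The references and positions are hypothetical, and in practice such
                  alternatives would likely be weighted.</para>
               <para>
                  <example>
                     <title>Sketch of distributed claims (hypothetical references)</title>
                     <programlisting>&lt;ana>
   &lt;tok ref="1 1" pos="2"/>
   &lt;tok ref="1 4" pos="7"/>
   &lt;lm>
      &lt;l>οὗτος&lt;/l>
      &lt;m>p d - s - - - m d&lt;/m>
      &lt;m>p d - s - - - n d&lt;/m>
   &lt;/lm>
&lt;/ana>
&lt;!-- Claims made:
     1. token "1 1" no. 2 = οὗτος + p d - s - - - m d
     2. token "1 1" no. 2 = οὗτος + p d - s - - - n d
     3. token "1 4" no. 7 = οὗτος + p d - s - - - m d
     4. token "1 4" no. 7 = οὗτος + p d - s - - - n d --></programlisting>
                  </example>
               </para>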
               <para>If an <code><link linkend="element-lm">&lt;lm&gt;</link></code> lacks an
                        <code><link linkend="element-l">&lt;l&gt;</link></code>, the token value
                  itself, calculated by each <code><link linkend="element-tok"
                     >&lt;tok></link></code>, is taken to be the default value of the lexeme. </para>
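               <para>In the sketch below, for example, no <code><link linkend="element-l"
                        >&lt;l&gt;</link></code> is supplied, so the lexeme would default to
                     <code>et</code>. The code <code>c</code> is hypothetical, assumed to mark a
                  conjunction in the governing morphology.</para>
               <para>
                  <example>
                     <title>Sketch of an <code>&lt;lm></code> without <code>&lt;l></code>
                        (hypothetical code)</title>
                     <programlisting>&lt;!-- no &lt;l>: the lexeme defaults to the token value, "et" -->
&lt;ana>
   &lt;tok val="et"/>
   &lt;lm>
      &lt;m>c&lt;/m>
   &lt;/lm>
&lt;/ana></programlisting>
                  </example>
               </para>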
               <para>All assertions are assumed to be made with 100% confidence unless <code><link
                        linkend="attribute-cert">@cert</link></code> is invoked. This still holds
                  even when a <code><link linkend="element-tok">&lt;tok></link></code> is the
                  subject of multiple <code><link linkend="element-ana">&lt;ana&gt;</link></code>s,
                  because it is possible to be completely confident that a given word has two
                  different grammatical profiles in the target text (e.g., puns, wordplay).</para>
               <para>Many TAN-A-lm files will be generated by an algorithm that automatically lists
                  all possible morphological values of each token. It is advised that such automatic
                  calculations always include in their output <code><link linkend="attribute-cert"
                        >@cert</link></code>, with weighted values. That is, if an algorithm
                  identifies two possible lexico-morphological profiles for a word, but one occurs
                  nine times as often as the other, then it is advised that this be reflected in the
                  two resultant elements, e.g.: <code>&lt;lm cert="0.9">...&lt;/lm></code> and
                     <code>&lt;lm cert="0.1">...&lt;/lm></code>. If an algorithm is written with a
                  more sophisticated way to weigh possibilities, then adjust the value of
                        <code><link linkend="attribute-cert">@cert</link></code> accordingly. Be
                  certain that the <code><link linkend="element-algorithm"
                     >&lt;algorithm&gt;</link></code> is credited in the <code><link
                        linkend="element-vocabulary-key">&lt;vocabulary-key></link></code> and in a
                        <code><link linkend="element-resp">&lt;resp></link></code>.</para>
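               <para>A fuller sketch of the weighting follows. It assumes, purely for illustration,
                  that an algorithm found the masculine reading nine times and the neuter reading
                  once in its training data, giving 9 / (9 + 1) = 0.9 and 1 / (9 + 1) = 0.1.</para>
               <para>
                  <example>
                     <title>Sketch of weighted <code>@cert</code> values (invented
                        frequencies)</title>
                     <programlisting>&lt;!-- weights derive from invented counts of 9 and 1 -->
&lt;ana>
   &lt;tok val="τούτῳ"/>
   &lt;lm cert="0.9">
      &lt;l>οὗτος&lt;/l>
      &lt;m>p d - s - - - m d&lt;/m>
   &lt;/lm>
   &lt;lm cert="0.1">
      &lt;l>οὗτος&lt;/l>
      &lt;m>p d - s - - - n d&lt;/m>
   &lt;/lm>
&lt;/ana></programlisting>
                  </example>
               </para>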
               <para>As with TAN-A-tok files, not every word needs to be explained or described. In
                  fact, exhaustive coverage is often undesirable, because it produces files that are
                  overly long and time-consuming to validate or process.</para>
               <para>A TAN-A-lm file is rendered more efficient when claims can be grouped. If a
                  particular token invariably has a single lexico-morphological profile, this can be
                  declared once, in a <code><link linkend="element-tok">&lt;tok></link></code> that
                  does not have <code><link linkend="attribute-ref">@ref</link></code>. If the token
                  has a particular profile in a given region of text, it can be specified through a
                        <code><link linkend="attribute-ref">@ref</link></code> that encompasses the
                  specified region. You do not need to provide a <code><link linkend="element-tok"
                        >&lt;tok></link></code> for every token, which would entail restricting
                        <code><link linkend="attribute-ref">@ref</link></code> to leaf divs. You may
                  do so, but such an approach can result in very long files that are time-consuming
                  to validate, process, and edit. It is more advantageous to declare
                  lexico-morphological properties more generally, thereby replacing numerous leaf-div
                        <code><link linkend="element-tok">&lt;tok></link></code>s.</para>
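               <para>The two kinds of general declaration might look like the sketch below. It
                  assumes that <code>aberro</code> has a single profile throughout the text, and that
                  the form <code>Σωκράτους</code> has the given profile everywhere within a
                  hypothetical non-leaf div <code>10 6</code>. Combining <code><link
                        linkend="attribute-ref">@ref</link></code> with <code>@val</code> in this
                  way is likewise an assumption of the sketch.</para>
               <para>
                  <example>
                     <title>Sketch of grouped claims, with and without <code>@ref</code>
                        (hypothetical values)</title>
                     <programlisting>&lt;!-- one profile throughout the text: no @ref -->
&lt;ana>
   &lt;tok val="aberro"/>
   &lt;lm>
      &lt;l>aberro&lt;/l>
      &lt;m>v - 1 s p i a&lt;/m>
   &lt;/lm>
&lt;/ana>
&lt;!-- one profile within a region: @ref names a div that encompasses it -->
&lt;ana>
   &lt;tok ref="10 6" val="Σωκράτους"/>
   &lt;lm>
      &lt;l>Σωκράτης&lt;/l>
      &lt;m>n e - s - - - m g -&lt;/m>
   &lt;/lm>
&lt;/ana></programlisting>
                  </example>
               </para>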
               <para>The benefits in processing time are significant. In early versions of TAN, the
                  lexico-morphological values of the Greek Septuagint (8.3 MB) were converted to a
                  TAN-A-lm file of 407,811 <code><link linkend="element-tok"
                  >&lt;tok></link></code>s, one per token per leaf div, grouped in 52,703
                        <code><link linkend="element-ana">&lt;ana&gt;</link></code>s (25.8 MB).
                  Early 2020 validation routines took about 25 minutes (2018 validation routines
                  took hours). The long processing time is due primarily to the TAN-A-lm file
                  itemizing every single token in the text. That same file was revised to be more
                  declarative along the lines advocated above. If a particular token had only one
                  lexico-morphological profile throughout the text, then every instance was reduced
                  to a single <code><link linkend="element-ana">&lt;ana&gt;</link></code>, with no
                        <code><link linkend="attribute-ref">@ref</link></code> in <code><link
                        linkend="element-tok">&lt;tok></link></code>. When a particular token value
                  had different lexico-morphological profiles, <code><link linkend="attribute-ref"
                        >@ref</link></code> targeted the rootmost <code><link linkend="element-div"
                        >&lt;div></link></code> that encompassed them all. This revision resulted in
                  a smaller file (15.8 MB; 158,376 <code><link linkend="element-tok"
                     >&lt;tok></link></code>s in 54,335 <code><link linkend="element-ana"
                        >&lt;ana&gt;</link></code>s) that validated in about a third of the time
                  (8.5 minutes).</para>
               <para>In general, there is always a trade-off between convenience and efficiency. If
                  your priority is speed, you should break a large file into several smaller ones,
                  perhaps recombining them in a master file via <code><link
                        linkend="element-inclusion">&lt;inclusion></link></code> (see <xref
                     linkend="inclusions-and-vocabularies"/>).</para>
               <para>Applications can be written to convert TAN-A-lm <code><link linkend="element-m"
                        >&lt;m&gt;</link></code> data from one morphological system to another. This
                  is a two-step process facilitated by the functions <code><link
                        linkend="function-morphological-code-conversion-maps"
                        >tan:morphological-code-conversion-maps</link>()</code> and <code><link
                        linkend="function-convert-morphological-codes"
                        >tan:convert-morphological-codes</link>()</code>. See documentation in these
                  guidelines or in
                  <code>functions/language/TAN-fn-language-extended.xsl</code>.</para>
               <para>
                  <example>
                     <title>Examples of TAN-A-lm data</title>
                     <programlisting>&lt;ana>
    &lt;group>
        &lt;tok ref="1" pos="1 - last-1"/>
    &lt;/group>
    &lt;lm>
        &lt;l>ring-a-ring-a-rose&lt;/l>
        &lt;m>NNS&lt;/m>
    &lt;/lm>
&lt;/ana>
. . . . . . .
&lt;ana>
   &lt;tok ref="10 6 3 2" pos="4"/>
   &lt;tok ref="10 6 3 3" pos="15"/>
   &lt;tok ref="10 6 4 2" pos="37"/>
   &lt;lm>
      &lt;l>Σωκράτης&lt;/l>
      &lt;m>n e - s - - - m g -&lt;/m>
   &lt;/lm>
&lt;/ana>
. . . . . . .
&lt;ana>
   &lt;tok val="τούτῳ"/>
   &lt;lm>
      &lt;l>οὗτος&lt;/l>
      &lt;m cert="0.358311302048909457">p d - s - - - m d&lt;/m>
      &lt;m cert="0.241688697951090546">p d - s - - - n d&lt;/m>
      &lt;m cert="0.2">p - - s - - - m d&lt;/m>
      &lt;m cert="0.2">p - - s - - - n d&lt;/m>
   &lt;/lm>
&lt;/ana>
. . . . . . .
&lt;ana>
   &lt;tok val="ABERRO"/>
   &lt;tok val="Aberro"/>
   &lt;tok val="aberro"/>
   &lt;lm>
      &lt;l>aberro&lt;/l>
      &lt;m>v - 1 s p i a&lt;/m>
   &lt;/lm>
&lt;/ana></programlisting>
                  </example>
               </para>
            </section>
         </section>
      </chapter>
      <chapter xml:id="class_3">
         <title>Class-3 TAN Files, Varia</title>
         <para>This chapter provides general background to class-3 TAN files, which are devoted to
            formats that do not fit the other two classes. For detailed discussion of specific
            elements and attributes, see <xref linkend="elements-attributes-and-patterns"/>.</para>
         <section xml:id="TAN-voc">
            <title>Vocabulary (<code>TAN-voc</code>)</title>
            <para>All too often, a project has a set of vocabulary it draws from time and again. To
               repeat the <xref xlink:href="#pattern-iri_and_name"/> can be both tedious and
               treacherous. If a project with hundreds of TAN files decides to change or augment its
                vocabulary, it could take a long time to find and make all the changes, everywhere and
               consistently.</para>
            <para>The TAN-voc format addresses that problem. It is intended to allow a project to
               define, edit, and augment the IRI + name patterns for recurrent vocabulary. TAN
               includes several standard TAN-voc files under the subdirectory
                  <code>vocabularies</code>, supporting commonly used concepts such as token
               definitions, div types, licenses, and many more. For a complete list of predefined
                TAN keywords, see <xref linkend="vocabularies-master-list"/>.</para>
            <para>It is quite common for a person or team to build vocabulary items gradually while
               developing a corpus, which means that TAN-voc files tend to change and grow. You can
               organize your vocabulary in whatever manner makes sense. You might create one large
               TAN-voc file that has all your project's vocabulary. Or you might break out the
               vocabulary, one file per type. Each approach has strengths and weaknesses. If you
               break your vocabulary into many files, you should designate one of them as your point
               of main import, and include the other TAN-voc files via <code><link
                     linkend="element-inclusion">&lt;inclusion></link></code>s (along with
                  <code>&lt;</code><code><link linkend="element-group">group</link></code><code>
                  include="[IDREFS]"/></code> or <code>&lt;</code><code><link linkend="element-item"
                     >item</link></code><code> include="[IDREFS]"/></code>, pointing to the IDrefs
               of the included TAN-voc files). Doing so prevents you from having to insert numerous
                     <code><link linkend="element-vocabulary">&lt;vocabulary&gt;</link></code>s in
               your other TAN files.  </para>
            <para>For more details on how this format relates to other TAN formats, see <xref
                  linkend="inclusions-and-vocabularies"/>.</para>
            <section>
               <title>Root Element and Head</title>
               <para>A TAN-voc file has <code><link linkend="element-TAN-voc"
                     >&lt;TAN-voc&gt;</link></code> as the root element.</para>
               <para>The <code><link linkend="element-vocabulary-key"
                     >&lt;vocabulary-key></link></code> of a TAN-voc file takes, in addition to core
                  vocabulary items, any number of <code><link linkend="element-group-type"
                        >&lt;group-type&gt;</link></code>s. </para>
               <para>A TAN-voc file may draw directly from the vocabulary in its body, as if it were
                  referring to itself via <code><link linkend="element-vocabulary"
                        >&lt;vocabulary&gt;</link></code>.</para>
            </section>
            <section xml:id="tan-voc-data">
               <title>Data (<code><link linkend="element-body">&lt;body></link></code>)</title>
               <para>The <code><link linkend="element-body">&lt;body></link></code> of a TAN-voc
                  file consists simply of <code><link linkend="element-item"
                     >&lt;item&gt;</link></code>s or <code><link linkend="element-verb"
                        >&lt;verb&gt;</link></code>s, perhaps gathered into groups via <code><link
                        linkend="element-group">&lt;group&gt;</link></code> or <code><link
                        linkend="attribute-group">@group</link></code>. These groups have, at
                  present, no effect upon other TAN files that use them, but they have been valuable
                  in certain applications. For example, the standard TAN-voc file for <code><link
                        linkend="element-div-type">&lt;div-type&gt;</link></code>
                     (<code>vocabularies/div-types.TAN-voc.xml</code>) groups textual division types
                  into a rudimentary typology that allows applications to be designed to decide
                  programmatically whether a particular division should be treated as a block or
                  inline element, or whether it should be indented.</para>
               <para>The attributes <code><link linkend="attribute-affects-attribute"
                     >@affects-attribute</link></code> and <code><link
                        linkend="attribute-affects-element">@affects-element</link></code>, both
                  weakly inheritable, define the scope of the vocabulary items, i.e., which elements
                  or attributes the items may legitimately be used for. A vocabulary item is
                  eligible only for the specified attributes or elements.</para>
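               <para>A minimal sketch of scoping follows. It assumes that <code><link
                        linkend="attribute-affects-element">@affects-element</link></code> may be
                  placed on <code><link linkend="element-body">&lt;body></link></code> and inherited
                  by the items beneath it. The IRIs and names are hypothetical.</para>
               <para>
                  <example>
                     <title>Sketch of scoped vocabulary items (hypothetical IRIs and names)</title>
                     <programlisting>&lt;!-- both items are assumed to inherit the scope declared on &lt;body>,
     making them eligible only for &lt;div-type> -->
&lt;body affects-element="div-type">
   &lt;item>
      &lt;IRI>tag:example.com,2021:div-type:stanza&lt;/IRI>
      &lt;name>stanza&lt;/name>
   &lt;/item>
   &lt;item>
      &lt;IRI>tag:example.com,2021:div-type:refrain&lt;/IRI>
      &lt;name>refrain&lt;/name>
   &lt;/item>
&lt;/body></programlisting>
                  </example>
               </para>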
               <para>Nearly all <code><link linkend="element-item">&lt;item&gt;</link></code>s in a
                  TAN-voc file contain the IRI + name pattern or a derived pattern. The only
                  exceptions are <code><link linkend="element-item">&lt;item&gt;</link></code>s
                  pertaining to token definitions, which instead of <code><link
                        linkend="element-IRI">&lt;IRI></link></code>s take <code><link
                        linkend="element-token-definition">&lt;token-definition&gt;</link></code>s.
                  See <xref linkend="defining_tokens"/>.</para>
               <para><code><link linkend="element-verb">&lt;verb&gt;</link></code> includes, in
                  addition to the IRI + name pattern, the option to have <code><link
                        linkend="element-constraints">&lt;constraints&gt;</link></code> added. Those
                  constraints define what components are permitted in any <code><link
                        linkend="element-claim">&lt;claim&gt;</link></code> that uses the
                        <code><link linkend="element-verb">&lt;verb&gt;</link></code>. At this time,
                  verb constraints are an experimental feature. Only those constraints that mirror
                  standard TAN vocabulary for verbs, <code>vocabularies/verbs.TAN-voc.xml</code>,
                  will be supported during validation. Study that file for examples of how to build
                  a <code><link linkend="element-verb">&lt;verb&gt;</link></code>. See <xref
                     linkend="tan-a_body"/> on the use of verbs in a TAN-A file.</para>
            </section>
         </section>
         <section xml:id="TAN-mor">
            <title>Morphological Concepts and Patterns (<code>TAN-mor</code>)</title>
            <para>TAN-mor files are used to delineate the morphological characteristics or features
               of a given language, to assign codes to those features, and to define rules governing
               the application of those codes. The format is a kind of schema language for the
               grammar of human languages. </para>
            <para>The format allows specificity, flexibility, and responsiveness. Grammatical rules
               may be constructed to return warnings and error messages to users who use a code or
               pattern incorrectly, or not in accordance with best practices. Such rules may be
               qualified, or made contingent upon certain conditions.</para>
            <para>This chapter should be read in close conjunction with <xref linkend="tan-a-lm"
               />.</para>
            <section>
               <title>Principles and Assumptions</title>
               <para>Certain assumptions and recommendations are made regarding morphology files,
                  complementing the more general ones; see <xref linkend="design_principles"
                  />.</para>
               <para>TAN-mor files are restricted exclusively to describing the categories and rules
                  for the grammar of a natural language. Editors of these files should be well
                  versed in the grammar of the languages they are describing, and generally
                  acquainted with how the grammars of comparable languages work.</para>
               <para>The TAN-mor format has been designed under the assumption that patterns of word
                  inflection and formation can be categorized, classified, named, and described. It
                  has also been assumed that scholars may reasonably differ, perhaps radically, on
                  how grammatical features should be defined and applied. TAN-mor allows scholars to
                  declare clearly their operative assumptions and views. It is up to other users to
                  decide whether or not to adopt them.</para>
               <para>The TAN-mor format has also been designed to cater to two different approaches
                  to morphological codes: categorized or uncategorized. </para>
               <para>Categorized codes are interpreted according to position. <code>a b c</code>
                  would mean something different than <code>c b a</code>. For example, Perseus
                     (<link xlink:href="http://www.perseus.tufts.edu/hopper/"/>) has traditionally
                  used categorized codes for morphological analysis of Greek, Latin, and other highly
                  inflected languages. Every code has ten positions, each one corresponding to a
                  major grammatical category, with the first two being the major and minor parts of
                  speech, and the subsequent categories devoted to person, number, tense, and so
                  forth. Every position must be given a value for each word analyzed, even if only a hyphen or null. A
                     <code>d</code> in one position means something different from a <code>d</code>
                  in another.</para>
               <para>Uncategorized codes, on the other hand, assign one unique code to each
                  grammatical feature. In this approach, codes may be combined and arranged at will.
                     <code>a b c</code> would be identical to <code>c b a</code>. This approach is
                  viable for any language (including highly inflected ones such as Greek or Latin),
                  but it is in practice most often applied to languages that are not highly
                  inflected, e.g., the Brown and Penn sets for English.</para>
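               <para>The contrast can be seen in two sample <code><link linkend="element-m"
                        >&lt;m&gt;</link></code> values, sketched below. The interpretation of each
                  code depends upon the governing TAN-mor file, so the glosses in the comments are
                  assumptions.</para>
               <para>
                  <example>
                     <title>Sketch of categorized and uncategorized codes</title>
                     <programlisting>&lt;!-- categorized: meaning depends on position; the hyphen marks an
     empty position -->
&lt;m>n e - s - - - m g -&lt;/m>
&lt;!-- uncategorized: each code stands alone, whatever its position -->
&lt;m>NNS&lt;/m></programlisting>
                  </example>
               </para>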
               <para>TAN-mor morphological codes may not include either the space or the hyphen, and
                  unlike IDrefs, they are case insensitive. For example, the codes <code>NOUN</code>
                  and <code>noun</code> are interchangeable.</para>
            </section>
            <section>
               <title>Root Element and Header</title>
               <para>The root element of a morphological rule file is <code><link
                        linkend="element-TAN-mor">&lt;TAN-mor></link></code>.</para>
               <para>Zero or more <code><link linkend="element-source">&lt;source></link></code>s
                  refer to the grammars or related works that account for the morphological rules.
                  If the categories, codes, and rules are not based upon any published work, then
                        <code><link linkend="element-source">&lt;source></link></code> may be
                  omitted. Any TAN-mor file without a source may be inferred to be based upon the
                  personal knowledge of the persons or organizations identified in <code><link
                        linkend="element-file-resp">&lt;file-resp&gt;</link></code>.</para>
               <para>A language declaration is made in the header: one or more <code><link
                        linkend="element-for-lang">&lt;for-lang></link></code>s.</para>
            </section>
            <section>
               <title>Data (<code><link linkend="element-body">&lt;body></link></code>)</title>
               <para>The <code><link linkend="element-body">&lt;body></link></code> of a TAN-mor
                  file takes the customary optional attributes found in other TAN files (see <xref
                     linkend="edit_stamp"/>). </para>
               <para><code><link linkend="element-body">&lt;body></link></code> contains interleaved
                  rules and grammatical codes, either categorized or not.</para>
               <para>Grammatical rules consist of a series of <code><link linkend="element-rule"
                        >&lt;rule&gt;</link></code>s, perhaps filtered by attribute tests, and
                  perhaps filtered by children <code><link linkend="element-where"
                        >&lt;where&gt;</link></code>s with attribute tests. These tests are
                  evaluated against the context of the various <code><link linkend="element-m"
                        >&lt;m></link></code>s in a dependent TAN-A-lm file.</para>
               <para>Attribute tests are as follows:<itemizedlist>
                     <listitem>
                        <para><code><link linkend="attribute-m-matches">@m-matches</link></code>
                           (regular expression): <code><link linkend="element-m"
                              >&lt;m></link></code> matches the pattern. </para>
                     </listitem>
                     <listitem>
                        <para><code><link linkend="attribute-tok-matches">@tok-matches</link></code>
                           (regular expression): one of the values of <code><link
                                 linkend="element-tok">&lt;tok></link></code> in the given
                                 <code><link linkend="element-ana">&lt;ana&gt;</link></code> matches
                           the pattern.</para>
                     </listitem>
                     <listitem>
                        <para><code><link linkend="attribute-m-has-codes">@m-has-codes</link></code>
                           (space-delimited strings): <code><link linkend="element-m"
                              >&lt;m></link></code> has the specified feature codes.</para>
                     </listitem>
                     <listitem>
                        <para><code><link linkend="attribute-m-has-how-many-codes"
                                 >@m-has-how-many-codes</link></code> (integer): <code><link
                                 linkend="element-m">&lt;m></link></code> has the given number of
                           feature codes.</para>
                     </listitem>
                  </itemizedlist></para>
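               <para>A sketch of how several of these tests might be combined follows. It assumes a
                  morphology with uncategorized codes, among them a hypothetical code
                     <code>num</code> for numerals, and it assumes that an assertion may be nested
                  inside a <code><link linkend="element-where">&lt;where&gt;</link></code> that is
                  itself a child of a <code><link linkend="element-rule"
                     >&lt;rule&gt;</link></code>.</para>
               <para>
                  <example>
                     <title>Sketch of combined attribute tests (hypothetical code
                        <code>num</code>)</title>
                     <programlisting>&lt;!-- the rule applies to tokens consisting solely of digits; the
     nested test narrows it to &lt;m>s that have exactly one code -->
&lt;rule tok-matches="^[0-9]+$">
   &lt;where m-has-how-many-codes="1">
      &lt;assert m-has-codes="num">A token made up solely of digits and given a
         single code should be coded as a numeral (num).&lt;/assert>
   &lt;/where>
&lt;/rule></programlisting>
                  </example>
               </para>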
               <para>If all the attributes in a <code><link linkend="element-rule"
                        >&lt;rule&gt;</link></code> or any of its children <code><link
                        linkend="element-where">&lt;where&gt;</link></code>s evaluate true against a
                  context, then the process allows the actual rules to be evaluated. Those rules are
                  found in the enclosed <code><link linkend="element-assert"
                     >&lt;assert></link></code>s or <code><link linkend="element-report"
                        >&lt;report></link></code>s, which declare rules that must be followed, or
                  must never be followed, by any dependent TAN-A-lm file. </para>
               <para>An <code><link linkend="element-assert">&lt;assert></link></code> or
                        <code><link linkend="element-report">&lt;report></link></code> will be
                  checked only if the conditions declared by the attributes in the enclosing
                        <code><link linkend="element-rule">&lt;rule&gt;</link></code> or
                        <code><link linkend="element-where">&lt;where&gt;</link></code> are
                  met.</para>
               <para>An <code><link linkend="element-assert">&lt;assert></link></code> itself takes one
                  or more of the attribute tests above. If the test proves false for a given
                        <code><link linkend="element-m">&lt;m></link></code> then the <code><link
                        linkend="element-m">&lt;m></link></code> will be marked as erroneous and the
                  message included by the <code><link linkend="element-assert"
                     >&lt;assert></link></code> should be returned.</para>
               <para><code><link linkend="element-report">&lt;report></link></code> has the same
                  effect, but the test looks for the opposite boolean value: the error and message
                  will be returned only if the test proves true.</para>
               <para>Mixed with the rules are codes, either categorized or not. </para>
               <para>If categorized, there are zero or more <code><link linkend="element-category"
                        >&lt;category></link></code>s. Each one sorts <link linkend="element-code"
                        ><code>&lt;code></code></link>s into groups, assigning them <link
                     linkend="element-val"><code>&lt;val></code></link>s that are unique within the
                        <code><link linkend="element-category">&lt;category></link></code>. Sequence
                  is important. The first <code><link linkend="element-category"
                        >&lt;category></link></code> defines the features allowed in the first code
                  position, the second in the second, and so forth.</para>
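               <para>The role of sequence can be sketched with two categories. The feature names
                     <code>part_of_speech</code>, <code>noun</code>, and <code>verb</code> are
                  hypothetical. Under these declarations an <code><link linkend="element-m"
                        >&lt;m&gt;</link></code> of <code>v 1</code> would mean a verb in the first
                  person, whereas <code>1 v</code> would not be valid, because <code>1</code> is not
                  defined for the first position.</para>
               <para>
                  <example>
                     <title>Sketch of positional categories (hypothetical feature names)</title>
                     <programlisting>&lt;!-- first position: part of speech -->
&lt;category feature="part_of_speech">
   &lt;code feature="noun">&lt;val>n&lt;/val>&lt;/code>
   &lt;code feature="verb">&lt;val>v&lt;/val>&lt;/code>
&lt;/category>
&lt;!-- second position: person -->
&lt;category feature="feature_person">
   &lt;code feature="first">&lt;val>1&lt;/val>&lt;/code>
   &lt;code feature="second">&lt;val>2&lt;/val>&lt;/code>
   &lt;code feature="third">&lt;val>3&lt;/val>&lt;/code>
&lt;/category></programlisting>
                  </example>
               </para>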
               <para>If not categorized, then there are simply one or more <link
                     linkend="element-code"><code>&lt;code></code></link>s. Each <link
                     linkend="element-code"><code>&lt;code></code></link> has a <code><link
                        linkend="attribute-feature">@feature</link></code> that points to one or
                  more vocabulary items for a grammatical feature, either by IDref or by name. </para>
               <para>TAN has a standard vocabulary file for grammatical features:
                     <code>vocabularies/features.TAN-voc.xml</code>. This vocabulary file encodes
                  746 grammatical features declared in the OLiA Reference Model for Morphology,
                  Morphosyntax and Syntax (<link xlink:href="http://purl.org/olia/olia.owl"/>). See
                     <xref linkend="vocabularies-features"/>.</para>
               <para><link linkend="element-code"><code>&lt;code></code></link> must have a <link
                     linkend="element-val"><code>&lt;val></code></link>, which contains the actual
                  code used, and it may take one or more <code><link linkend="element-desc"
                        >&lt;desc&gt;</link></code>s, to explain how the grammatical features should
                  be interpreted for a given language. This is the ideal place to provide
                  examples.</para>
               <para>In addition to examples below, see sample TAN-mor files in the
                     <code>examples</code> directory.</para>
               <para>
                  <example>
                     <title>Examples of rules and codes</title>
                     <programlisting>&lt;rule m-has-how-many-codes="2-10">
   &lt;report m-matches="^c">A conjunction has no other inflectional
      properties.&lt;/report>
   &lt;report m-matches="^r">A preposition has no other inflectional
      properties.&lt;/report>
   &lt;report m-matches="^i">An interjection has no other inflectional
      properties.&lt;/report>
   &lt;report m-matches="^y">An acronym has no other inflectional properties.&lt;/report>
&lt;/rule>
. . . . . . .
&lt;rule m-matches="^. i">
   &lt;assert m-matches="^[dp]">An interrogative must be either a determiner (d) or a
      pronoun (p).&lt;/assert>
&lt;/rule>
. . . . . .
&lt;code feature="accusative">&lt;val>accusative&lt;/val>&lt;/code>
&lt;code feature="nominative">&lt;val>nominative&lt;/val>&lt;/code>
&lt;code feature="case_dative">&lt;val>dative&lt;/val>&lt;/code>
&lt;code feature="case_genitive">&lt;val>genitive&lt;/val>&lt;/code>
&lt;code feature="case_vocative">&lt;val>vocative&lt;/val>&lt;/code>
. . . . . .
&lt;category feature="feature_person">
   &lt;code feature="first">&lt;val>1&lt;/val>&lt;/code>
   &lt;code feature="second">&lt;val>2&lt;/val>&lt;/code>
   &lt;code feature="third">&lt;val>3&lt;/val>&lt;/code>
&lt;/category></programlisting>
                  </example>
               </para>
            </section>
         </section>
         <section xml:id="catalog-files">
            <title>TAN Catalog Files (<code>collection</code>)</title>
            <para>TAN catalog files are used to locate relevant TAN files and to support the XSLT
               function <code>collection()</code>. They catalog or index any TAN files within a
               local directory and perhaps its subdirectories. </para>
            <para>These catalog files must always be named <code>catalog.tan.xml</code>. They depart
               from all other TAN files in their structure. They have no namespace. They have
               neither body nor head. Rather, they are patterned off the catalog.xml description
               provided by Saxonica (<link xlink:href="https://www.saxonica.com"/>).</para>
            <para>Passing any XML file to the stylesheet <code>applications/create/create TAN catalog
                  file.xsl</code> will automatically generate one of these files, cataloging all the
               files in the local directory.</para>
            <para>The root element of a catalog file is <code><link linkend="element-collection"
                     >&lt;collection></link></code>, with children <code><link linkend="element-doc"
                     >&lt;doc></link></code>s that hold simple metadata about the TAN files that are
               in a directory and its subdirectories. Only TAN files may be registered in a
                     <code><link linkend="element-doc">&lt;doc></link></code>. A <code><link
                     linkend="element-doc">&lt;doc></link></code> may include other material such as
               each file's resolved <code><link linkend="element-head">&lt;head></link></code>, but
               this is not mandated.</para>
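            <para>A minimal sketch of such a file follows. The <code>href</code> values reuse
               filenames suggested elsewhere in these guidelines. The <code>id</code> and
               <code>root</code> attributes, and all of their values, are assumptions made for
               illustration; a catalog file generated by the stylesheet mentioned above shows the
               authoritative form.<example>
                  <title>Sketch of a <code>catalog.tan.xml</code> file (illustrative)</title>
                  <programlisting>&lt;!-- No namespace, no head, no body: only a collection of doc elements -->
&lt;collection>
   &lt;!-- The id and root attributes, and their values, are hypothetical -->
   &lt;doc href="ar.cat.grc.1949.minio-paluello.ref-logical.xml"
      id="tag:example.com,2014:ar.cat.grc.1949" root="TAN-T"/>
   &lt;doc href="ar.cat.general.TAN-voc.xml"
      id="tag:example.com,2014:ar.cat.general" root="TAN-voc"/>
&lt;/collection></programlisting>
               </example></para>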
         </section>
      </chapter>
   </part>
   <part xml:id="working_with_tan">
      <title>Using the Text Alignment Network</title>
      <chapter>
         <title>Working with TAN files</title>
         <para>This chapter presents ways to manage, create, edit, and share TAN files. These
            suggestions, based upon the experience of users, are both brief and general. To get into
            specifics, read the other chapters in this part of the guidelines, as well as the
            appendixes. </para>
         <section xml:id="local_setup">
            <title>Installation and local setup</title>
            <para>The TAN suite can be downloaded from a master data repository listed at <link
                  xlink:href="http://textalign.net/"/>. The project has been developed using the
               version-control software Git. Whether you download the files directly or you use Git,
               place the TAN code wherever is most convenient on your computer. No extra steps are
               necessary. Once you've downloaded the files, you have everything you need.<footnote>
                  <para>The one exception pertains to the <code>output/js</code> directory, which
                     has JavaScript libraries that are designed to handle certain types of output
                     from TAN applications. Documentation in a TAN application will let you know
                     what JavaScript dependencies are required.</para>
               </footnote></para>
            <para>Unlike many other applications, you do not install the TAN suite, and you do not
               have to put it in a specific place on your local drive. There is no executable file
               in the suite. You will work with TAN through Oxygen, another XML editor, a text
               editor, or (if you are a power user) the command line.</para>
            <para>You will be creating and editing TAN files. Those files may be set up in whatever
               directory structure you prefer. Because TAN files are part of a network, they are
               meant to be shared and interlinked. So it is beneficial to develop predictable
               directory structures. However you organize your TAN files, keep them separate from
               the suite of core TAN files.</para>
            <para>Many TAN projects will involve dozens of versions of a particular work, and it is
               easy to get confused as to what file does what. Naming files becomes a challenge (the
               filename, not the <code><link linkend="attribute-id">@id</link></code>, on which see
                  <xref xlink:href="#tan-file-id"/>). In projects with many text versions, it is
               recommended that your names for class-1 files start with an acronym or short
               abbreviation for the author and work, followed by the language code, the last name of
               the editor/author of the scriptum, and the date when the scriptum was created or
               published. If you have a transcription that has been redivided into multiple TAN
               files linked to each other via <code><link linkend="element-redivision"
                     >&lt;redivision&gt;</link></code>, the reference system might need to be
               mentioned in the filename. Some suggestive examples:<itemizedlist>
                  <listitem>
                     <para><code>ar.cat.grc.1949.minio-paluello.ref-logical.xml</code>: Aristotle's
                           <emphasis>Categories</emphasis>, in Greek, from the 1949 edition by Minio
                        Paluello, following a reference system based on semantic units (paragraphs,
                        sentences, independent clauses).</para>
                  </listitem>
                  <listitem>
                     <para><code>apocr.eng.kjv.1760.xml</code>: apocrypha, English, King James
                        Version, 1760 edition. If the file adopted an unusual reference system, that
                        would be important to include in the name.</para>
                  </listitem>
                  <listitem>
                     <para><code>tlg0059.tlg031.perseus-grc1-Pl.Ti.xml</code>: Plato's
                           <emphasis>Timaeus</emphasis> in Greek. This filename has some duplication
                        in that the catalog number <code>tlg0059</code> already implies
                           <code>Pl</code> and <code>tlg031</code>, <code>Ti</code>, but only an
                        elite few know the meaning of the numerical codes used by the Thesaurus
                        Linguae Graecae.</para>
                  </listitem>
                  <listitem>
                     <para><code>pl.ti.grc.1905.burnet.stephanus.xml</code>: Plato's
                           <emphasis>Timaeus</emphasis> in Greek, Burnet's 1905 edition divided into
                        a system that approximates Stephanus numbers.<footnote>
                           <para>Many classicists refer to Stephanus numbers in Plato's corpus and
                              Bekker numbers in Aristotle's as canonical, as if the systems are
                              immutable and unambiguous. But any edition that claims to follow
                              Stephanus or Bekker numbers always makes slight adjustments to that
                              system. Words do not always break exactly where they do in the
                              19th-century edition, and words and phrases here and there get
                              transposed, inserted, or deleted, inevitably throwing off the
                              lineation. Making one's edition conform exactly to the original line
                              numbers is frequently a fool's errand.</para>
                        </footnote></para>
                  </listitem>
               </itemizedlist></para>
            <para>Some TAN applications, such as <xref xlink:href="#Diff_"/>, use filenames to order
               output. If you wish your class-1 files to be read in chronological order according to
               source, then it is a good practice to put the date in ISO form
                  <code>(YYYY(-MM(-DD)?)?)</code>, placed before any alphabetizable elements that
               are less important. </para>
            <para>In sum, a good sequence for ordering components in a filename would be:
               collection, work, language/version, date, editor/author, reference system.</para>
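            <para>The effect of placing an ISO date early in the filename can be seen with any tool
               that sorts strings. The following XPath 3.1 expression is only a sketch: the second
               filename is hypothetical, and the expression illustrates plain alphabetical sorting,
               not the internal logic of any TAN application.<example>
                  <title>Sorting filenames that carry ISO dates (illustrative)</title>
                  <programlisting>(: Both names share the prefix "ar.cat.grc.", so the date is the first
   component that differs. The 1831 filename is hypothetical. :)
sort((
   'ar.cat.grc.1949.minio-paluello.ref-logical.xml',
   'ar.cat.grc.1831.bekker.ref-logical.xml'
))
(: Under the default codepoint ordering, the 1831 file sorts before the 1949 file. :)</programlisting>
               </example></para>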
            <para>Class-2 files are tougher. They unite multiple files and concepts, so
               comprehensive filenames could become very long or unpredictably structured. One
               approach is to make sure that each class-2 file is given a brief but meaningful name
               that points to the research question that motivated its creation. Some examples:<itemizedlist>
                  <listitem>
                     <para><code>ar.cat.grc.1949.minio-paluello-sem-TAN-LM-sample.xml</code>: a
                        sample of lexico-morphological data for Aristotle's
                           <emphasis>Categories</emphasis>, in Greek. Each source-specific TAN-A-lm
                        file has no more than one source, so including the source in the filename
                        does not pose a challenge.</para>
                  </listitem>
                  <listitem>
                     <para><code>nt.grc-syr.selections.TAN-A-tok.xml</code>: a selection of
                        word-for-word correspondences between the Syriac and Greek New
                        Testaments.</para>
                  </listitem>
                  <listitem>
                     <para><code>plato.general.TAN-A.xml</code>: a general alignment and annotation
                        file concerning Plato's works.</para>
                  </listitem>
               </itemizedlist></para>
            <para>Class-3 filenames are a bit easier. It is recommended that TAN-mor files begin
               with the language code, then an acronym for the person or group responsible for
               creating the rules and codes. TAN-voc files are written generally to serve a specific
               project or collection, so the collection name and the type of vocabulary should
               suffice. Examples:<itemizedlist>
                  <listitem>
                     <para><code>eng.example.com,2014.1.xml</code>: tagging scheme #1 for English,
                        by the owner of the domain <code>example.com</code> in 2014.</para>
                  </listitem>
                  <listitem>
                     <para><code>ar.cat.general.TAN-voc.xml</code>: general vocabulary items serving
                        a project dealing with Aristotle's <emphasis>Categories</emphasis>.</para>
                  </listitem>
               </itemizedlist></para>
            <para>If you have a local copy of someone else's TAN collection, and you wish to create
               TAN files that depend on them, you will in all likelihood use relative URLs pointing
               to copies of the files stored locally. If those files have <code><link
                     linkend="element-master-location">&lt;master-location></link></code>s pointing
               to their master copies, you should occasionally validate them, to see if there have
               been any updates. </para>
            <para>If you need to move a TAN file from one directory to another, you should think
               about any internal links that might need to be updated. A standard TAN utility, <xref
                  xlink:href="#File_Copier"/>, will copy a file for you and update any relative
               values of <code><link linkend="attribute-href">@href</link></code>. That application
               does not delete the old file, because file deletion is treated as a security risk in
               XSLT.</para>
         </section>
         <section>
            <title>Working with Oxygen XML Editor</title>
            <para>If you use an advanced XML editor such as Oxygen, you should edit your TAN
               collection through a project file, which will help you easily administer your TAN
               files and validate them automatically. Included with the standard TAN suite is a
               basic Oxygen project file, <code>TAN.xpr</code>. Use it as-is, or make a copy and
               configure it to your tastes. You will find that under Configure Transformation
               Scenarios there are preinstalled generic options for the standard TAN utilities and
               applications. </para>
            <para>When you open a TAN file in Author mode, you will find a variety of editing tools,
               primarily for class-1 files. Browse the options in the menu, the toolbars, and the
               context-click menu, to see what is possible. In a future version of TAN, more
               documentation will be provided on how to use these tools.</para>
            <para>The project file discussed above relies upon an Oxygen framework file,
                  <code>tan.frameworks</code>, which drives the functionality of the project. If you
               have another project already underway, you can incorporate the
                  <code>tan.frameworks</code> file directly, combining it with your other Oxygen
               tools.</para>
         </section>
         <section>
            <title>Creating and populating TAN files</title>
            <para>TAN is a representational format. Every TAN file models some source. If those
               sources are non-digital, it is a relatively straightforward task to create and
               populate a TAN file. Just start editing, using a template (e.g., a file from the
                  <code>examples</code> directory). In some cases, you might benefit by starting
               with an algorithm. For example, optical character recognition (OCR) on an edition
               might give you a dirty but useful start for a TAN-T file. Applying OCR to a printed
               index of quotations might be the first step to a TAN-A file. Despite the computer's
               assistance, most of the effort will be spent correcting the converted output.
               Thoughtful attention is needed to make these files suitable for use.</para>
            <para>In many other cases, you want to take something that already exists digitally and
               convert it into a TAN format. Many times, when you find a Word file, a web page, or a
               plain text file that can serve as the basis for a TAN file, the first impulse is to
               copy the desired content, paste it into the body of a new TAN file, then manually
               adjust and correct it. That solution is quick and easy, but short-sighted. Hours
               into the task, you may find that you made a major mistake so early in the process
               that you cannot backtrack. Perhaps you have accidentally deleted all
               punctuation when you didn't mean to. Or you eliminated line breaks that you didn't
               realize at the time were useful signals about where <code><link linkend="element-div"
                     >&lt;div></link></code>s should be separated. </para>
            <para>Even if all goes well, after all that hard work you might discover that the
               pre-TAN data sources you started out with have been updated, and other things have
               been corrected. If any significant time has elapsed, you may have forgotten what
               procedure you followed to convert the data. And even if you do remember, you will
               have to repeat the steps again, and dread the day when those pre-TAN sources are
               updated yet again. </para>
            <para>Save yourself time and hassle. Stop fixing files by hand. Instead, build a system
               to convert the files. Create an automated or semiautomated workflow that can be
               applied when needed, so that pre-TAN files can be channeled at will into your TAN
               library. This approach to the editorial task takes some extra investment at the
               outset, but in the long run it can save you many hours of labor. </para>
            <para>A very useful utility is <xref xlink:href="#Body_Builder"/>, which allows you to
               create a list of changes to be made to a particular document, to convert it to TAN-T
               or TAN-TEI (or even generic TEI). Or if you or a project member has experience in
               XSLT, develop your own stylesheets. </para>
            <para>When you find mistakes such as those described above, no harm is done. You can
               simply adjust the Body Builder configuration or XSLT file and re-run your process,
               each time getting better and better results. This approach requires extra work,
               initially. Establishing a stable transformation process can be time-consuming, since
               it requires repeated sequences of trial, error, and diagnosis. But the investment
               pays off in the long run, especially if you are dealing with dozens, hundreds, or
               thousands of files. The routines you write for one set of files might be useful for
               the next.</para>
         </section>
         <section xml:base="validating_tan_files" xml:id="validating_tan_files">
            <title>TAN validation</title>
            <section>
               <title>The process</title>
               <para>A TAN file is validated when it, along with its associated TAN schemas,
                  is passed to a validation engine. Validation can be set up either by pointing
                  explicitly to the schemas within a TAN file (via <code>&lt;?xml-model ?></code>
                  statements in the prolog), or by setting up an Oxygen project or framework to
                  automatically apply the schemas to TAN files (see <xref xlink:href="#local_setup"
                  />). There are two types of TAN validation.</para>
               <para>First, the file structure is checked against RELAX-NG files that define the
                  attributes, elements, and patterns that are allowed or required in a given TAN
                  format. These files are kept in the <code>schemas</code> project subdirectory,
                  according to format name. If you are editing a TAN-T file, for example, its
                  RELAX-NG schema is <code>schemas/TAN-T.rnc</code>.<footnote>
                     <para>The RELAX-NG files are written principally in the compact syntax
                           (<code>.rnc</code>), then converted to XML syntax (<code>.rng</code>).
                        The TAN-TEI format is an exception. Behind the schema
                           <code>schemas/TAN-TEI.rnc</code> is a master file
                           <code>schemas/TAN-TEI.odd</code>. This file, linked as it is with the
                        other RELAX-NG files, is processed by TEI stylesheets to generate the master
                           <code>TAN-TEI.rnc</code> and <code>TAN-TEI.rng</code> files that validate
                        TAN-TEI files. The ODD file is processed against TEI All, the largest of the
                        TEI formats, in the version available at the time of the release of a given
                        TAN version.</para>
                  </footnote></para>
               <para>The second type of validation uses Schematron to apply rules that cannot be
                  expressed in RELAX-NG, e.g., no <code><link linkend="attribute-when"
                     >@when</link></code> should have a date in the future. More than one hundred
                  types of errors are checked during Schematron validation. For a comprehensive list
                  see <link xlink:href="../functions/errors/TAN-errors.xml"/> and <xref
                     xlink:href="#errors"/>. Some of these errors can be quite time-consuming for a
                  computer to check. For example, if a class-1 file has a <code><link
                        linkend="element-redivision">&lt;redivision&gt;</link></code>, the text
                  should be identical. On short texts, the comparison can be made in seconds; on
                  longer ones it might take minutes (see next section, on efficiency). Therefore
                  Schematron validation allows three different levels: terse, normal, and verbose.
                  The names reflect not only how fast each phase takes but how much feedback is
                  provided.</para>
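               <para>The prolog statements mentioned above might look like the following sketch for a
                  TAN-T file. The relative path to the <code>schemas</code> directory and the
                  Schematron filename <code>TAN-T.sch</code> are assumptions; adjust them to match
                  your local copy of the suite. The optional <code>phase</code> pseudo-attribute
                  comes from the xml-model specification, and is shown here requesting the terse
                  level, on the assumption that your validation engine honors it.<example>
                     <title>Sketch of a TAN-T prolog that points to its schemas (illustrative)</title>
                     <programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;!-- First type of validation: structure, via RELAX-NG compact syntax -->
&lt;?xml-model href="../schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
&lt;!-- Second type of validation: Schematron (filename assumed), terse phase -->
&lt;?xml-model href="../schemas/TAN-T.sch" type="application/xml"
   schematypens="http://purl.oclc.org/dsdl/schematron" phase="terse"?></programlisting>
                  </example></para>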
               <para>The Schematron files themselves are rather small. The majority of the work is
                  done by the TAN function library, which takes the file, resolves it, and expands
                  it, inserting errors and help messages along the way. A greatly reduced version of
                  the expanded file, containing only warnings and errors, is then passed back to the
                  Schematron processor as a global variable. The Schematron processor returns as
                  messages any errors or warnings found in the generated file, and any suggested
                  corrections as Schematron Quick Fixes.</para>
               <para>For more details about the TAN validation process, see <xref
                     xlink:href="#validation_mechanics"/>. </para>
            </section>
            <section>
               <title>Efficiency</title>
               <para>TAN's Schematron validation specifies a process that is much more
                  computationally intensive than is its RELAX-NG counterpart. The longer and more
                  complex your TAN file and its dependencies, the longer it will take to validate.
                  Files such as the Ring-a-roses examples in the <code>examples</code> subdirectory
                  will take a split second to validate, but a TAN-T file of the Old Testament of the
                  King James Version has been known to take about 25 seconds to validate in the
                  normal phase, and the whole Bible, about a minute. A TAN-A-lm file with a full
                  morphological analysis of a very long TAN-T file will take a much longer time to
                  validate. </para>
               <para>Tests were performed on a TAN-A file that had three very large TAN-T sources
                  (each about 1.6 MB and 8,100 elements). When the TAN-A file had 125 claims,
                  Schematron validation under the normal phase took about 13 seconds (run on Oxygen
                  22.1 on a Windows 10 laptop with an Intel i5-8250U @ 1.60GHz). When the number of
                  claims was expanded to 546, the same process took 63 seconds. When the file had
                  5,421 claims, the file took 78 minutes, 45 seconds to validate.<footnote>
                     <para>Much of the extra time is due to the Schematron evaluation process, not
                        the preparatory work performed by the TAN function library. The library
                        component of the three tests above took up 8.3 seconds, 27.1 seconds, and 23
                        minutes 57 seconds, respectively. The time complexity of the Schematron
                        component grows faster than does that of the XSLT.</para>
                  </footnote></para>
               <para>The figures above are a very significant improvement over the time required in
                  the 2018 version, and no doubt future versions of TAN will bring optimizations to
                  the validation process. Nevertheless, you may need to make decisions that pit
                  speed against convenience. If you want validation to be quick, break files into
                  smaller ones, perhaps to be joined later in a single TAN file via <code><link
                        linkend="element-inclusion">&lt;inclusion></link></code>s. Validating ten
                  component files each with ten thousand elements will take less time in aggregate
                  than validating one long file with one hundred thousand elements. Had the example
                  TAN-A file mentioned above been split into 43 different files, the time required
                  for validating the entire collection would have been reduced by 88%.</para>
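               <para>The 88% figure can be approximated from the measurements reported above, on the
                  assumption (made here for illustration) that each of the 43 files would hold about
                  126 claims and would therefore validate in roughly the 13 seconds measured for the
                  125-claim file. Expressed in XPath 3.0:<example>
                     <title>Estimating the savings from splitting a file (illustrative)</title>
                     <programlisting>let $whole := 78 * 60 + 45,   (: one file, 5,421 claims: 4725 seconds :)
    $split := 43 * 13         (: 43 files at about 13 seconds each: 559 seconds :)
return round(100 * (1 - $split div $whole))   (: 88, i.e., the percentage saved :)</programlisting>
                  </example></para>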
            </section>
         </section>
         <section>
            <title>Sharing TAN files</title>
            <para>TAN files have been designed to be shared and linked, just like any network of
               files. Most often, TAN files will be created and distributed as collections, not
               single files.</para>
            <para>One way to distribute a collection is to make it available as a repository via Git
               or some other version control software (VCS). This approach has many advantages. You
               can collaborate with a wide variety of people, and preserve an editorial history that
               allows you to branch or backtrack, if needed. VCS features and tools are extremely
               fast and useful.</para>
            <para>Collections may also be distributed through shared syncing services (e.g., Drive,
               Box, or Dropbox), or put on a Web server. In the latter case, it may be difficult for
               users to browse or download your collection of TAN files wholesale, so you
               may wish to expose the collection as a compressed ZIP archive. This saves on your
               server's bandwidth, and it still exposes the files for XML processing. But a ZIP
               archive is not suitable for linking from one TAN file to another, nor is it
               appropriate as a target of <code><link linkend="element-master-location"
                     >&lt;master-location></link></code>. Unpacking a compressed file requires
               writing to the disk, which is treated as a security risk during validation. Such
               zipped archives are good ways to distribute a collection, but they should not be used
               as a primary repository or a master location.</para>
            <para>When you share a TAN file, make sure to include its dependencies, the files
               pointed to by <code><link linkend="element-vocabulary"
                  >&lt;vocabulary&gt;</link></code> or <code><link linkend="element-inclusion"
                     >&lt;inclusion></link></code>. If you are simply trying to email a single file,
               you could send a resolved version, which does not require any other dependencies (see
                  <xref xlink:href="#resolution"/>). </para>
         </section>
         
      </chapter>
      <chapter xml:id="tan-applications-and-utilities">
         <title>Using TAN Applications and Utilities</title>
         <para>TAN files are suited for dozens of types of applications. A few have been developed
            and successfully tested on select projects. The most mature of these have been provided
            in the subdirectories <code>applications</code> and <code>utilities</code>. </para>
         <para>Utilities are designed to assist in importing, exporting, creating, and editing TAN files.
            They tend to support straightforward tasks, and the code is relatively stable.</para>
         <para>Applications, on the other hand, support study and research. Most of these take a set
            of TAN files, process them, and create interactive, dynamic HTML files that let you
            study and analyze textual features and relationships. Applications can have quite
            complicated code bases, and tend to have features that are not fully supported, or are
            in the planning phase.</para>
         <para>TAN utilities and applications are written in XSLT. XSLT, which stands for XSL
            Transformations, version 3.0,<footnote>
               <para>XSL, which stands for Extensible Stylesheet Language, was the predecessor
                  language.</para>
            </footnote> is very powerful, and has a distinctive syntax and design. Many people do
            not know how even to begin to use it. Even some seasoned programmers approaching XSLT
            for the first time can find it baffling or impenetrable. An XSLT application is rather
            different from others that may be more familiar to you. </para>
         <para>This chapter begins with a basic orientation to XSLT. You may not be ready to write
            anything in XSLT, but you can begin to read and understand an XSLT file. We then look at
            how to run an XSLT application, and then look at the standard TAN utilities and
            applications.</para>
         <section xml:id="xslt-orientation">
            <title>First things to know about XSLT</title>
               <section>
                  <title>The process</title>
                  <para>In most computer applications, the expected rules are rather
                  straightforward. Given zero or more inputs, zero or more outputs are returned.
                  Many times the application is driven by a graphical user interface (GUI), to allow
                  the user to configure the application.</para>
                  <para>XSLT applications do not have a GUI. They also have a somewhat different
                  approach to input and output. In the classic approach to XSLT, the input consists
                  of an XSLT stylesheet and an XML file, passed to a processor. But there is
                  opportunity for secondary input. And classically there is one output, but XSLT
                  provides the opportunity to create secondary outputs. The basic model is depicted here:<footnote>
                     <para>The classic view presented here does not take into account another way of
                        configuring an XSLT application, where a particular starting point is
                        designated, the initial template. In those cases, primary input is
                        unnecessary.</para>
                  </footnote></para>
                  <figure>
                     <title>The classic XSLT process </title>
                     <mediaobject>
                        <imageobject>
                           <imagedata fileref="img/xslt.jpeg"/>
                        </imageobject>
                     </mediaobject>
                  </figure>
                  <para>In the classic XSLT process, there are three key requirements:<orderedlist>
                        <listitem>
                           <para>an XML file, to catalyze the process;</para>
                        </listitem>
                        <listitem>
                           <para>a master XSLT file, to declare the rules that should be
                              followed;</para>
                        </listitem>
                        <listitem>
                           <para>an XSLT processor.</para>
                        </listitem>
                     </orderedlist></para>
                  <para>The process begins, actually, with the processor, which is normally given
                  URLs that specify where to find the input, the stylesheet, and where to place the
                  output. The processor fetches the XSLT stylesheet, and looks for any associated
                  components. After compiling the master stylesheet and its dependencies, the rules
                  are applied to the catalyzing XML file. Along the way, the processor may fetch
                  secondary input documents, if the XSLT file so instructs.</para>
                  <para>After all the rules have been applied, the processor saves the primary
                  result document, if there is one, to the specified target URL. If the XSLT rules
                  state that secondary result documents should be saved at certain locations, the
                  processor does so.</para>
                  <para>Therefore, in any XSLT operation, there are really two possible types of
                  input and two types of output. We use the terms <emphasis>primary input</emphasis>
                  for the catalyzing XML file and <emphasis>secondary input</emphasis> for input
                  that is added during the process. We use the term <emphasis>primary
                     output</emphasis> for the main result tree and <emphasis>secondary
                     output</emphasis> for any other output created along the way. The terms
                     <emphasis>primary</emphasis> and <emphasis>secondary</emphasis> refer only to
                  their position in the process, not their importance to the application. Indeed,
                  there are XSLT applications where the secondary input and secondary output are far
                  more important than the catalyzing input or primary output. Sometimes the primary
                  input does not matter at all, and sometimes there is no primary output.</para>
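                  <para>The four roles can be seen together in a minimal sketch such as the
                  following. It is only an illustration: the file names
                     <code>lookup.xml</code> and <code>extra/log.xml</code>, and all the element
                  names, are hypothetical.</para>
                  <programlisting>&lt;xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
   &lt;!-- Secondary input (hypothetical file), fetched while the rules are applied -->
   &lt;xsl:variable name="lookup" select="doc('lookup.xml')"/>
   &lt;!-- The document node of the primary input catalyzes this rule -->
   &lt;xsl:template match="/">
      &lt;!-- Primary output: goes wherever the processor was told to put it -->
      &lt;report entries="{count($lookup//entry)}"/>
      &lt;!-- Secondary output: goes to a location named by the stylesheet -->
      &lt;xsl:result-document href="extra/log.xml">
         &lt;log>done&lt;/log>
      &lt;/xsl:result-document>
   &lt;/xsl:template>
&lt;/xsl:stylesheet></programlisting>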
               <para>You will normally have direct control over the primary input, because you will
                  need to select an XML file to catalyze the process. But any control you might
                  exercise over the secondary input could be hidden. The application might derive
                  secondary input based upon your primary input, or it might provide parameters, to
                  allow you to control the secondary input. </para>
               <para>Likewise, you normally have full control over where the primary output should
                  go. But you may or may not have that kind of control over the secondary output,
                  depending on how the application was written.</para>
                  <para>When you get an XSLT file, try to understand first of all what kinds of
                  input are expected, what types of output are returned, and where. In general,
                  if there is not good documentation and the XSLT does not come from a trusted
                  source, do not try to run it.</para>
               </section>
               <section>
                  <title>Syntax</title>
                  <para>XSLT is itself an XML document, and can be treated in every way as an XML
                     document. If there is something you can do to an XML document, you can do it to
                     an XSLT file too.</para>
               <para>The XML syntax makes the code somewhat more verbose than the syntax of other
                  languages. Many of the instructions are placed in elements, which frequently have
                  opening and closing tags. Unless otherwise specified, white space is flexible, and
                  the document can be reformatted and indented as one likes. Most XSLT files are
                  indented, but in most cases that indentation can be changed or removed without
                  affecting the output.</para>
                  <para>XML in general uses namespaces, to allow mixed vocabularies. So too, XSLT
                  files can interleave elements from different namespaces. In general, most XSLT
                  files do not define a default namespace: that is up to the designer to do. All the
                  XSLT elements are in the namespace <code><link
                        xlink:href="http://www.w3.org/1999/XSL/Transform"/></code>, which is
                  conventionally bound to the prefix <code>xsl</code>.</para>
                  <para>Because an XSLT file is itself XML, it can be designed to be the
                  primary input of an XSLT process, even its own. Running an XSLT file against
                  itself can be useful in cases where the primary input is irrelevant.</para>
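                  <para>As a small sketch of these points, the following hypothetical stylesheet
                  binds the XSLT namespace to the prefix <code>xsl</code>, mixes in an element
                  from no namespace (<code>&lt;summary></code>, an invented name), and can be run
                  with itself as primary input. In that case the <code>@templates</code>
                  attribute of the result would report 1, because the stylesheet contains exactly
                  one template.</para>
                  <programlisting>&lt;xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
   &lt;xsl:template match="/">
      &lt;!-- Counts the templates in whatever stylesheet is supplied as primary input -->
      &lt;summary templates="{count(/xsl:stylesheet/xsl:template)}"/>
   &lt;/xsl:template>
&lt;/xsl:stylesheet></programlisting>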
               </section>
               <section>
                  <title>Modular design</title>
                  <para>An XSLT file may invoke other XSLT files, or be invoked by them, through the
                     <code>&lt;xsl:import></code> and <code>&lt;xsl:include></code> instructions.
                  Inclusions and imports are recursive: the processor looks not just for the ones it
                  imports/includes, but the ones they import/include, and so forth.</para>
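                  <para>A master stylesheet that draws upon two modules might look something like
                  the following sketch. The file names are hypothetical. Declarations brought in
                  by <code>&lt;xsl:import></code> have lower precedence than those of the
                  importing stylesheet, whereas declarations brought in by
                     <code>&lt;xsl:include></code> are treated as if they had been written in the
                  including stylesheet.</para>
                  <programlisting>&lt;xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
   &lt;!-- Imported rules can be overridden by rules in this file -->
   &lt;xsl:import href="lib/common.xsl"/>
   &lt;!-- Included rules stand on an equal footing with rules in this file -->
   &lt;xsl:include href="lib/output-html.xsl"/>
   &lt;!-- The master stylesheet's own declarations go here -->
&lt;/xsl:stylesheet></programlisting>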
                  <para>The modular approach to XSLT allows developers to be more efficient and
                  effective when writing code. Routines that serve one process well can serve
                  another. But it also means that when you first open up an XSLT file, you may not
                  understand what it does until you trace the chain of <code>&lt;xsl:import></code>
                  and <code>&lt;xsl:include></code> instructions, and find all the stylesheets it
                  depends upon.</para>
               <para>That process can be cumbersome, but it is straightforward. More challenging is
                  asking yourself whether the file you began with is a master stylesheet (the
                  intended starting point for a process), or if it is itself a dependency. You may
                  not be able to tell, without documentation. Tracing these lines of dependence is
                  important, because you need to find the appropriate starting point, and understand
                  how it relates to the network of XSLT files.</para>
               </section>
               <section>
                  <title>Declarative statements</title>
                  <para>In most programming languages, you write a list of things for the computer
                  to do, in a specified order, governed by conditional branching. This list-like
                  approach to programming is called imperative programming.</para>
                  <para>XSLT has imperative components, but at its heart, it is a declarative
                  programming language. That is, an XSLT programmer writes not a list of steps to be
                  followed but rather a set of rules or principles that should be observed. It is up
                  to the processor to determine the most efficient path to honor those rules or
                  principles.</para>
                  <para>Imperative and declarative programming can be compared to real-world
                  examples. Suppose you have a pile of candies that need to be sorted. Imperative
                  programming is like telling a child: get one candy; if it is like such-and-such,
                  put it here; repeat. Declarative programming is like telling that same child, I do
                  not care how you do it, but make sure that the final groups look like
                  such-and-such.</para>
                  <para>If you are familiar with Cascading Style Sheets (CSS) you might appreciate
                  better how the XSLT programmer approaches a task. In CSS, styling instructions are
                  provided by selector patterns that match certain elements within the HTML file.
                  CSS instructions can frequently be placed in different groups and orders, and with
                  different levels of specificity, which determine priority. It is up to the browser to
                  take those rules and find the most efficient way to apply the styles. Such a
                  declarative approach allows the writer of CSS to efficiently write, edit, and
                  maintain some rather complex code. </para>
                  <para>Because of its declarative approach, the order of the children of an XSLT
                  file's root element is flexible. Most often, order does not matter. The children
                  of the root element, called declarations, are special, because they stipulate the
                  rules or principles that should be followed. All of the declarations of the
                  stylesheet's modules are also taken into consideration. This means that when you
                  are reading a particular section of an XSLT file, you might think you understand
                  what is being done, when in fact there are declarations in other parts of the
                  file or its inclusions/imports that affect whether the particular component you
                  are looking at is called, or with what priority.</para>
                  <para>As a general rule of thumb, when you read an XSLT file to understand what it
                  does, do not put much importance on the order of its declarations. They will not
                  be followed in that order. There are cases where order is important, but coming
                  freshly to an XSLT file, try to get a bird's-eye overview of all the components.
                  Look at all the declarations, wherever they are found. As you read, don't look for
                  the application's steps. Try to understand the intended outcome.</para>
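                  <para>The following sketch illustrates why the order of declarations usually does
                  not matter. It assumes TEI input; the output element names are invented. The
                  second rule is preferred for chapter divs even though it comes last, because a
                  pattern with a predicate receives a higher default priority than a bare element
                  name. Swapping the two templates would not change the result.</para>
                  <programlisting>&lt;xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:tei="http://www.tei-c.org/ns/1.0" version="3.0">
   &lt;!-- General rule: any div -->
   &lt;xsl:template match="tei:div">
      &lt;section>&lt;xsl:apply-templates/>&lt;/section>
   &lt;/xsl:template>
   &lt;!-- Specific rule: wins for chapter divs, wherever it is placed -->
   &lt;xsl:template match="tei:div[@type = 'chapter']">
      &lt;chapter>&lt;xsl:apply-templates/>&lt;/chapter>
   &lt;/xsl:template>
&lt;/xsl:stylesheet></programlisting>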
               </section>
               <section>
                  <title>Variables and parameters</title>
                  <para>In most programming languages, you can write something like the following
                     pseudocode...</para>
                  <para>
                     <programlisting>x = 1
x = x - 1
return x</programlisting>
                  </para>
                  <para>...and expect the output 0. The variable x starts with the value 1, but then
                  changes, because variables are mutable.</para>
                  <para>In XSLT, variables and parameters are immutable. You cannot change the value
                  of a variable or parameter. A variable can be destroyed (and along with it, its
                  value), and then a new instantiation of the variable can be created, but once
                  again, within its life (scope), it does not change. If you see two
                     <code>&lt;xsl:param></code> or two <code>&lt;xsl:variable></code> instructions
                  that create variables with the same name, they are in different scopes (or the
                  XSLT is invalid).</para>
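                  <para>An XSLT counterpart to the pseudocode above might be sketched as follows.
                  Rather than reassigning <code>$x</code>, the template binds a second variable to
                  the new value. The names <code>x</code>, <code>y</code>, and
                     <code>&lt;result></code> are illustrative only. The content of
                     <code>&lt;result></code> would be 0.</para>
                  <programlisting>&lt;xsl:template match="/">
   &lt;xsl:variable name="x" select="1"/>
   &lt;!-- $x cannot be changed, so the new value gets its own variable -->
   &lt;xsl:variable name="y" select="$x - 1"/>
   &lt;result>&lt;xsl:value-of select="$y"/>&lt;/result>
&lt;/xsl:template></programlisting>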
                  <para>Both variables and parameters might be in a namespace. If there is a colon
                  in the name, the variable or parameter is bound to a particular namespace. Check
                  the prefix to see its namespace.</para>
                  <para>As a user of an XSLT stylesheet, you should not worry too much about any
                  XSLT variables. Certainly, you can change them if you want, but at that point you
                  are stepping into the role of developer. We assume here you are interested
                  primarily in using, not altering, an XSLT application. You should focus, instead,
                  upon parameters, but only a certain kind: global, relevant parameters.</para>
                  <para>Global parameters are found exclusively as children of a root element. That
                  is, they are declarations (see previous section). Any parameters that are more
                  deeply nested are local parameters, and you shouldn't change them.</para>
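                  <para>The difference can be seen in a sketch like the one below, where the
                  parameter names <code>output-language</code> and <code>depth</code> are
                  hypothetical. Only the first is a global parameter.</para>
                  <programlisting>&lt;xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema" version="3.0">
   &lt;!-- Global parameter: a child of the root element, open to the user -->
   &lt;xsl:param name="output-language" as="xs:string" select="'eng'"/>
   &lt;xsl:template match="/">
      &lt;!-- Local parameter: nested inside a template, best left alone -->
      &lt;xsl:param name="depth" as="xs:integer" select="0"/>
      &lt;doc lang="{$output-language}" depth="{$depth}"/>
   &lt;/xsl:template>
&lt;/xsl:stylesheet></programlisting>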
                  <para>Not all global parameters are relevant. If you have a master stylesheet that
                     includes another one, that stylesheet may have global parameters that are
                     designed to accommodate some other including XSLT application. Normally, you
                     will know which global parameters are relevant for your purposes only by
                     studying the file's documentation, or its code.</para>
                  <para>Every global parameter is a developer's invitation to the user to configure
                  the XSLT application. Some parameters exercise an enormous influence over the type
                  of output; others have no effect whatsoever; yet others might cause the
                  application to crash if you put in the wrong value. Before you try to change a
                  parameter, you should understand something about data types. See <xref
                     xlink:href="#configuring-global-parameters"/>.</para>
               </section>
               <section xml:id="xpath-language">
                  <title>XPath language</title>
                  <para>XSLT relies upon a sublanguage called XPath, which is itself a proper subset
                  of another powerful XML programming language, XQuery. You will most commonly read
                  or use XPath expressions in the context of the <code>@select</code> attribute in
                  various XSLT instruction elements.</para>
                  <para>XPath is an enormous topic, and well worth learning. Because this chapter is
                  geared to helping new users quickly get comfortable with using and configuring an
                  XSLT application, we introduce here some very common, useful XPath expressions.
                  They are presented according to four basic concepts: navigation, filter
                  expressions (predicates), operators, and functions.</para>
                  <section>
                     <title>Navigation</title>
                     <para>Every XML file is a tree, and at the heart of XPath is a language for
                     traversing that tree. XPath gets its name because it was designed to provide a
                     path from one point to many. An XPath expression always assumes some kind of
                     starting point for the path. That starting point is called the
                        <emphasis>context</emphasis>, which is commonly a node inside an XML
                     tree.</para>
                     <para>Because this short guide is aimed at users who are configuring global
                     parameters, we will assume in our examples here that the context is the primary
                     input XML document. That means that the context is the document node of the
                     primary input.</para>
                     <para>When an XPath expression begins with a single slash, the document node is
                        selected. The following example shows how to bind to the global parameter
                           <code>$doc-a</code> the document node of the primary input.</para>
                     <programlisting>&lt;xsl:param name="doc-a" as="document-node()" select="/"/></programlisting>
                     <para>Once you start an XPath expression, you add to it by adding new
                     components. This builds the path of traversal. Commonly you want to traverse
                     downward through the tree, toward the leaves. You do this most frequently by
                     element name. If it is in a namespace, you either need to start with the
                     appropriate prefix, or else use an asterisk (which stands for any namespace),
                     followed by a colon. The following example selects the
                        <code>&lt;tei:TEI></code> root element of the primary input XML document. If
                     the root element is not named TEI, or if it is not in the namespace bound to
                     the prefix <code>tei</code>, then you will get an error, because this global
                     parameter expects exactly one item, no more, no less.</para>
                     <programlisting>&lt;xsl:param name="tei-root-element" as="element()" select="tei:TEI"/></programlisting>
                     <para>The previous example would have worked as well with
                     <code>/tei:TEI</code>, which says, in effect, go to the document node, then go
                     to the element TEI. We have left it off because we are assuming that the
                     document node of the primary input document is the context (i.e., the assumed
                     starting point for an XPath expression). Another XPath expression comparable to
                     the example above would be <code>*:TEI</code>, which selects the root element
                     if its name is TEI, regardless of what namespace it is in.</para>
                     <para>The nested elements of the tree can be traversed by separating element
                     names with the slash. The following example navigates from the document node
                     leafward to the TEI's body, three levels deep. This example also shows how to
                     use the asterisk alone, which stands for any element.</para>
                     <programlisting>&lt;xsl:param name="tei-text" as="element()?" select="tei:TEI/*/tei:body"/></programlisting>
                     <para>If you want to go deeply into the document, and select a variety of
                        elements, you can do so with the double-slash operator, which navigates down
                        to all descendants.</para>
                     <programlisting>&lt;xsl:param name="tei-abs" as="element()*" select="tei:TEI//tei:ab"/></programlisting>
                     <para>The example above selects every <code>&lt;ab></code> in a TEI document.
                        If one <code>&lt;ab></code> nests inside another, both are picked.</para>
                     <para>To select an attribute, use the <code>@</code> sign. In the following
                     example the XPath expression points to an attribute that is bound to a
                     namespace via the prefix <code>xml</code>. One commonly finds
                        <code>@xml:id</code>, <code>@xml:lang</code>, <code>@xml:space</code>, but
                     most of the attributes you encounter will not have namespaces, even if their
                     parent elements have them.</para>
                     <programlisting>&lt;xsl:param name="tan-t-lang" as="attribute()" select="tan:TAN-T/tan:body/@xml:lang"/></programlisting>
                     <para>To select any attribute, use <code>@*</code>. The following example
                     selects all the attributes in <code>&lt;change></code> elements in a TAN file.
                     Note the use of the asterisk for the root element. This expression will work no
                     matter which TAN format is used.</para>
                     <programlisting>&lt;xsl:param name="change-attrs" as="attribute()+" select="*/tan:head//tan:change/@*"/></programlisting>
                     <para>You can use parentheses and commas to group and add nodes. In this
                        example, the XPath expression points to the TAN <code>&lt;body></code>, then
                        selects all the children comment nodes, text nodes, and elements.</para>
                     <programlisting>&lt;xsl:param name="interesting-nodes" as="item()*" select="*/tan:body/(text(), comment(), *)"/></programlisting>
                     <para>There is a slightly simpler way to do the preceding example, and it also
                     finds any processing instructions:</para>
                     <programlisting>&lt;xsl:param name="interesting-nodes" as="item()*" select="*/tan:body/node()"/></programlisting>
                     <para>In an XPath expression <code>node()</code> finds everything except
                     attributes and namespaces.</para>
                     <para>There is much, much more about XPath navigation, but the samples above
                        should get you started. See <link
                           xlink:href="https://www.w3.org/TR/2017/REC-xpath-31-20170321/">XPath
                           3.1</link> for comprehensive, technical coverage.</para>
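                     <para>To make the TEI navigation examples above more concrete, consider the
                     following small, invented TEI file as primary input. The comments indicate
                     what some of the expressions would select, assuming that the prefix
                        <code>tei</code> is bound to the TEI namespace.</para>
                     <programlisting>&lt;TEI xmlns="http://www.tei-c.org/ns/1.0">
   &lt;!-- tei:TEI and *:TEI select this root element -->
   &lt;teiHeader/>
   &lt;text>
      &lt;!-- tei:TEI/*/tei:body selects the body; the asterisk matches text -->
      &lt;body>
         &lt;!-- tei:TEI//tei:ab selects both the outer and the inner ab -->
         &lt;ab n="1">Outer &lt;ab n="2">inner&lt;/ab>&lt;/ab>
      &lt;/body>
   &lt;/text>
&lt;/TEI></programlisting>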
                  </section>
                  <section>
                     <title>Filter expressions (predicates)</title>
                     <para>An XPath expression that traverses a tree might return more nodes than
                     you want. You can reduce what is captured by applying a predicate, which is an
                     XPath expression that filters results. A predicate consists of an XPath
                     expression enclosed by two square brackets, inserted in the middle of, or at
                     the end of, another XPath expression. The predicate must be placed in an XPath
                     expression immediately to the right of the step you want to filter. For every
                     context node found, the predicate will be evaluated as a boolean. If the
                     predicate is true, the node is retained, otherwise it is discarded. </para>
                  <para>A very simple example shows how to pick the second <code>&lt;div></code> in
                     the body of a TAN-T file:</para>
                     <programlisting>&lt;xsl:param name="second-div" as="element()?" select="tan:TAN-T/tan:body/tan:div[2]"/></programlisting>
                     <para>This predicate, <code>[2]</code>, returns true if a given node is the
                     second child <code>&lt;div></code> of <code>&lt;body></code>. The simple
                     numeral <code>2</code> in the filter expression is actually shorthand for a
                     slightly longer expression based on XPath functions (discussed below),
                        <code>[position() eq 2]</code>.</para>
                     <para>The next example finds every <code>&lt;div></code> that has an attribute
                     of <code>@xml:lang</code>.</para>
                     <programlisting>&lt;xsl:param name="divs-with-lang" as="element()*" select="tan:TAN-T/tan:body//tan:div[@xml:lang]"/></programlisting>
                     <para>This predicate, too, is shorthand, in this case for
                           <code>[exists(@xml:lang)]</code>, which uses another XPath
                        function.</para>
                     <para>Predicates may nest. Any nesting predicate still takes as its context the
                        step immediately to the left. This example finds every TEI
                           <code>&lt;div></code> element, but only if it has a <code>&lt;p></code>
                        that has a <code>&lt;quote></code>.</para>
                     <programlisting>&lt;xsl:param name="divs-with-quoting-ps" as="element()*" 
          select="tei:TEI/tei:text/tei:body//tei:div[tei:p[tei:quote]]"/></programlisting>
                     <para>Predicates may chain, simply by appending predicates. The following
                        example reduces the previous example to the first qualifying
                           <code>&lt;div></code> within each parent.</para>
                     <programlisting>&lt;xsl:param name="divs-with-quoting-ps" as="element()*" 
          select="tei:TEI/tei:text/tei:body//tei:div[tei:p[tei:quote]][1]"/></programlisting>
                     <para>The position of chained predicates is important. Whereas the preceding
                     example filtered the <code>&lt;div></code>s then picked the first one, the next
                     example finds the first <code>&lt;div></code> (one that does not have a
                     preceding sibling <code>&lt;div></code>), and retains it only if it has a
                        <code>&lt;p></code> with a <code>&lt;quote></code>.</para>
                     <programlisting>&lt;xsl:param name="divs-with-quoting-ps" as="element()*" 
          select="tei:TEI/tei:text/tei:body//tei:div[1][tei:p[tei:quote]]"/></programlisting>
                  <para>The previous two examples look very similar, but they produce very different
                     results.</para>
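                     <para>The difference can be checked against a small, invented TEI file such as
                     the following. With the predicates in the order
                        <code>[tei:p[tei:quote]][1]</code>, the expression would select the
                        <code>&lt;div></code> marked b, the first of the two that qualify. With the
                     order <code>[1][tei:p[tei:quote]]</code>, it would select nothing, because the
                     first <code>&lt;div></code>, marked a, has no <code>&lt;quote></code>.</para>
                     <programlisting>&lt;TEI xmlns="http://www.tei-c.org/ns/1.0">
   &lt;teiHeader/>
   &lt;text>
      &lt;body>
         &lt;div n="a">&lt;p>No quotation here.&lt;/p>&lt;/div>
         &lt;div n="b">&lt;p>She said &lt;quote>yes&lt;/quote>.&lt;/p>&lt;/div>
         &lt;div n="c">&lt;p>He said &lt;quote>no&lt;/quote>.&lt;/p>&lt;/div>
      &lt;/body>
   &lt;/text>
&lt;/TEI></programlisting>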
                     <para>Predicates may be placed anywhere in an XPath expression. The following
                     gets all top-level <code>&lt;div></code>s only if the root element has an
                        <code>@TAN-version</code>, a distinctive marker of all TAN files.</para>
                     <programlisting>&lt;xsl:param name="top-level-divs" as="element()*" 
          select="*[@TAN-version]/*:body/*:div"/></programlisting>
                  </section>
                  <section>
                     <title>Operator expressions</title>
                     <para>We have already seen some basic XPath operator expressions, namely, in
                     the comma and the parentheses. XPath has many more operator expressions, some
                     of which should be immediately recognizable: <code>+</code> for addition,
                        <code>-</code> for subtraction, <code>*</code> for multiplication, and
                        <code>div</code> for division. (The slash is not used for division, to avoid
                     clashes with the step separator.) The keyword <code>to</code>, with an integer
                     on either side (the smaller on the left), creates a range, e.g., <code>(1 to
                        10)</code>.</para>
                     <para>XPath also has comparison expressions. Although <code>&lt;</code> and
                        <code>></code> can be used for "less than" and "greater than", those symbols
                     interfere with XML syntax. Instead, use the expressions <code>lt</code> and
                        <code>gt</code>. The expressions <code>le</code> and <code>ge</code> can
                     also be used, to mean less than or equal to, and greater than or equal to,
                     respectively.</para>
                     <para>For checking equality, you will most often use the <code>=</code>
                     expression. There is also <code>eq</code>, but this can be used only to compare
                     exactly two items. The <code>=</code> is very powerful, because it will return
                     true if there is any item in the sequence on the left hand side that is equal
                     to any item in the sequence on the right. Consider for example, an XPath
                     statement that compares two sequences, each with two integers: <code>(1, 2) =
                        (2, 3)</code>. The statement is true because there is at least one pair of
                     equal items. Because the expression <code>=</code> is used so frequently to
                     compare sequences, you might think of it as meaning "overlaps with."</para>
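                     <para>A few invented parameters illustrate how these comparisons behave. The
                     parameter names are hypothetical, and the prefix <code>xs</code> is assumed to
                     be bound to <code>http://www.w3.org/2001/XMLSchema</code>.</para>
                     <programlisting>&lt;!-- true: 2 occurs on both sides -->
&lt;xsl:param name="overlaps" as="xs:boolean" select="(1, 2) = (2, 3)"/>
&lt;!-- false: no item on the left equals any item on the right -->
&lt;xsl:param name="no-overlap" as="xs:boolean" select="(1, 2) = (3, 4)"/>
&lt;!-- true: eq compares one item with one item -->
&lt;xsl:param name="single" as="xs:boolean" select="2 eq 2"/>
&lt;!-- true: the range 1 to 10 includes 7 -->
&lt;xsl:param name="in-range" as="xs:boolean" select="7 = (1 to 10)"/></programlisting>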
                     <para>Complex expressions can be combined with <code>and</code>,
                        <code>or</code>, and grouped with parentheses, as needed.</para>
                     <para>As you work with XSLT global parameters, you will find that most operator
                     expressions are used within the filtering predicates. The following finds all
                        <code>&lt;div></code>s with an attribute <code>@type</code> whose value is
                     "chapter".</para>
                     <programlisting>&lt;xsl:param name="chapter-divs" as="element()*" select="//*:div[@type = 'chapter']"/></programlisting>
                     <para>This expression finds the top-level divs in 2nd, 3rd, 4th, and 8th
                        place:</para>
                     <programlisting>&lt;xsl:param name="some-divs" as="element()*" select="//*:body/*:div[position() = (2 to 4, 8)]"/></programlisting>
                     <para>The following example returns any <code>&lt;div></code> whose values of
                           <code>@n</code> and <code>@type</code> match.</para>
                     <programlisting>&lt;xsl:param name="dupl-n-and-type-divs" as="element()*" select="//*:div[@type = @n]"/></programlisting>
                  </section>
                  <section>
                     <title>Functions</title>
                     <para>XPath expressions become enormously powerful when combined with the
                     language's 155 standard functions. You have already seen two of them,
                        <code>position()</code> and <code>exists()</code>. In a brief survey like
                     this, it is possible to illustrate only a few of the most common standard
                     functions you are likely to use when configuring the global parameters of an
                     XSLT application.</para>
                     <para><emphasis role="bold">last()</emphasis>: returns an integer representing
                     the size of the context. The following examples contain an implicit
                        <code>position() eq</code>, just the same as the filter expression example
                     above, with <code>[2]</code>.</para>
                     <programlisting>&lt;xsl:param name="last-div" as="element()?" select="//*:body/*:div[last()]"/>
      &lt;xsl:param name="penultimate-div" as="element()?" select="//*:body/*:div[last() - 1]"/></programlisting>
                     <para><emphasis role="bold">count()</emphasis>: returns the number of items in
                        a sequence. The following returns all TAN-T <code>&lt;div></code>s that have
                        more than three children <code>&lt;div></code>s.</para>
                     <programlisting>&lt;xsl:param name="populous-divs" as="element()*" select="//tan:div[count(tan:div) gt 3]"/></programlisting>
                     <para><emphasis role="bold">not()</emphasis>: returns true if the expression it
                     contains is false, or false if it is true. This function is very widely used,
                     to great effect. The first example below finds all leaf divs, and the second,
                     all leaf elements:</para>
                     <programlisting>&lt;xsl:param name="leaf-divs" as="element()*" select="//*:div[not(*:div)]"/>
      &lt;xsl:param name="leaf-elements" as="element()*" select="//*[not(*)]"/></programlisting>
                     <para>Whereas the <code>=</code> operator is very popular, its counterpart,
                           <code>!=</code>, is not used very much, because its results tend to be
                        uninteresting. The true complement of <code>=</code> comes with
                           <code>not()</code>, as illustrated in this example, which retrieves all
                           <code>&lt;div></code>s that are not of a certain type:</para>
                     <para>
                        <programlisting>&lt;xsl:param name="certain-divs" as="element()*" 
             select="//*:div[not(@type = ('ep', 'title', 'pref'))]"/></programlisting>
                     </para>
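                     <para>The contrast between <code>!=</code> and <code>not()</code> can be seen
                     with two literal sequences. The parameter names are hypothetical, and the
                     prefix <code>xs</code> is assumed to be bound to
                        <code>http://www.w3.org/2001/XMLSchema</code>.</para>
                     <programlisting>&lt;!-- true: 'intro' differs from 'ep', and one differing pair is enough -->
&lt;xsl:param name="any-pair-differs" as="xs:boolean" select="('ep', 'intro') != 'ep'"/>
&lt;!-- false: 'ep' is present, so = is true, and not() reverses it -->
&lt;xsl:param name="none-equal" as="xs:boolean" select="not(('ep', 'intro') = 'ep')"/></programlisting>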
                     <para><emphasis role="bold">lower-case() / upper-case()</emphasis>: converts a
                     string to all lowercase / uppercase values. This example looks for any text
                     node that has a certain value, but only after it has been rendered
                     lowercase.</para>
                     <programlisting>&lt;xsl:param name="some-elements" as="text()*" select="//text()[lower-case(.) = 'a b c']"/></programlisting>
                     <para>Note the use of the period, which is shorthand for the context
                     item.</para>
                     <para><emphasis role="bold">normalize-space()</emphasis>: takes a string,
                        removing all space from the beginning and end, and replacing any consecutive
                        block of intermediary space with a single space. This function is very
                        useful when you wish to compare texts that may be indented. The preceding
                        example might have missed some text nodes that had initial or trailing
                        space. It can be adjusted as follows:</para>
                     <programlisting>&lt;xsl:param name="some-elements" as="text()*" 
            select="//text()[normalize-space(lower-case(.)) = 'a b c']"/></programlisting>
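                     <para>XPath functions must often be applied in combination. You may nest them,
                     as in the example above, or you may chain them with the arrow operator,
                        <code>=></code>, which passes the value on its left as the first argument
                     of the function on its right. Use the syntax you are most comfortable
                     with.</para>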
                     <programlisting>&lt;xsl:param name="some-elements" as="text()*" 
            select="//text()[(lower-case(.) => normalize-space()) = 'a b c']"/></programlisting>
                     <para><emphasis role="bold">contains() / starts-with() /
                     ends-with()</emphasis>: tests to see if the string in the first parameter
                     contains / starts with / ends with the string in the second. The following
                     finds all elements that contain the text "straw":</para>
                     <programlisting>&lt;xsl:param name="some-elements" as="element()*" select="//*[contains(., 'straw')]"/></programlisting>
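                     <para>The other two functions are used in the same way. As a sketch, with
                        hypothetical parameter names, the following picks out elements whose text
                        begins with "Chapter", and <code>@href</code> attributes whose values end
                        with ".xml":</para>
                     <programlisting>&lt;xsl:param name="chapter-elements" as="element()*" select="//*[starts-with(., 'Chapter')]"/>
&lt;xsl:param name="xml-links" as="attribute()*" select="//@href[ends-with(., '.xml')]"/></programlisting>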
                     <para><emphasis role="bold">contains-token()</emphasis>: tests to see if the
                     string in the first parameter has as one of its "words" the string in the
                     second, based on segmenting the first string at blocks of space. The preceding
                     example would have picked up "strawberry"; in the next example, using
                        <code>contains-token()</code>, "strawberry" would not be selected:</para>
                     <programlisting>&lt;xsl:param name="some-elements" as="element()*" select="//*[contains-token(., 'straw')]"/></programlisting>
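                     <para>The function can be especially handy for attributes that hold several
                        space-separated values. The following sketch assumes, purely for
                        illustration, that a <code>@type</code> may hold more than one value, and
                        selects every <code>&lt;div></code> that has "title" among them:</para>
                     <programlisting>&lt;xsl:param name="title-divs" as="element()*" 
   select="//*:div[contains-token(@type, 'title')]"/></programlisting>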
                     <para><emphasis role="bold">matches()</emphasis>: tests to see if the string in
                        the first parameter matches the second, which is a regular expression.
                        Several TAN applications rely heavily upon regular expressions, which
                        provide a very powerful way of finding and replacing text. See <xref
                           xlink:href="#regular_expressions"/>. The following example finds any text
                        node with one of the seven weekday names in English:</para>
                     <programlisting>&lt;xsl:param name="text-nodes-with-weekdays" as="text()*" 
           select="//text()[matches(., '(Sun|Mon|Tues|Wednes|Thurs|Fri|Satur)day')]"/></programlisting>
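                     <para>The function also accepts an optional third parameter, a string of
                        flags. As a sketch (the parameter name is hypothetical), the
                        <code>i</code> flag makes the search case-insensitive, so that "MONDAY" and
                        "monday" are found as well:</para>
                     <programlisting>&lt;xsl:param name="text-nodes-with-weekdays-any-case" as="text()*" 
   select="//text()[matches(., '(sun|mon|tues|wednes|thurs|fri|satur)day', 'i')]"/></programlisting>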
                     <para>There are, of course, many, many more XPath functions. For the complete
                        list, along with all the specifications, see <link
                           xlink:href="https://www.w3.org/TR/xpath-functions-31/">XPath Functions
                           and Operators 3.1</link>.</para>
                  </section>
               </section>
            
         </section>
         <section>
            <title>Configuring and running an XSLT application</title>
            <section xml:id="configuring-global-parameters">
               <title>Configuring global parameters</title>
               <para>Once you have determined the master XSLT stylesheet for the application, you
                  may want to configure it by adjusting the values given to the global parameters.
                  You have several possible strategies:<orderedlist>
                     <listitem>
                        <para><emphasis role="bold">Work with a configuration file</emphasis>. If
                           you are comfortable writing some simple XSLT code, you might create a
                           small XSLT file that has nothing but an <code>&lt;xsl:import></code>
                           whose <code>@href</code> value points to the original stylesheet. Copy
                           from the master XSLT stylesheet only those <code>&lt;xsl:param></code>s
                           that you want to change. This method is quick to set up and easy to use,
                           but it also means that you do not have immediate access to
                           documentation.</para>
                     </listitem>
                     <listitem>
                        <para><emphasis role="bold">Overwrite the values in the master XSLT
                              stylesheet directly</emphasis>. This method is quick, but it also
                           means that you might not easily restore the original settings, unless you
                           make a backup copy. Also, if you are using configuration files, their
                           default values will change. That could be good or bad, depending upon
                           your setup.</para>
                     </listitem>
                     <listitem>
                        <para><emphasis role="bold">Work from a copy of the master XSLT
                              file</emphasis>. This method allows you to customize the entire
                           application, and consult as needed the original settings in the master
                           file. As with configuration files (see above), you can make new copies as
                           new situations emerge. You should make certain that any working copies
                           are in the same subdirectory as the original, to keep links
                           intact.</para>
                     </listitem>
                     <listitem>
                        <para><emphasis role="bold">Manage transformations from Oxygen</emphasis>.
                           Oxygen XML Editor has a powerful feature, Configure Transformation
                           Scenarios, which allows you to create custom configurations for an XSLT
                           application. Oxygen has good documentation on how to use this flexible
                           feature, which can be combined with any of the preceding three options.
                           Oxygen allows you not only to configure the parameters but to manage
                           input and output. One drawback is that you are presented with
                              <emphasis>all</emphasis> the global parameters that can be found,
                           whether or not they are really relevant. Documentation associated with a
                           particular parameter may be missing or truncated. You should use this
                           feature in conjunction with any documentation that comes with the XSLT
                           application. </para>
                     </listitem>
                  </orderedlist></para>
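               <para>The first strategy can be sketched as follows. This is only a minimal
                  illustration: the file name in <code>@href</code> is hypothetical, and the
                  parameter is assumed to be one that the master stylesheet declares. Because an
                  importing stylesheet takes precedence over what it imports, the value given here
                  overrides the original.</para>
               <programlisting>&lt;xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema" version="3.0">
   &lt;!-- Hypothetical file name: point @href to the master stylesheet -->
   &lt;xsl:import href="my-application.xsl"/>
   &lt;!-- Copy here only the parameters you want to change -->
   &lt;xsl:param name="ignore-comments" as="xs:boolean" select="true()"/>
&lt;/xsl:stylesheet></programlisting>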
               <para>Whatever method you adopt for configuration, first find the relevant global
                  parameters. Once you have them, you should always ensure you understand what type
                  of data is expected, and in what quantity. </para>
               <para><emphasis>Data types</emphasis>. XSLT is a strongly typed programming language.
                  The data bound to variables and parameters is always at least implicitly
                  typed. Many variables or parameters specify exactly what kind of data is expected.
                  Those that do not are assigned some default type by the XSLT processor. Most data
                  types you encounter will be of two sorts: atomic types, and nodes. Examples of
                  atomic types are integers, booleans, strings, and dates. Examples of nodes are
                  elements, attributes, comments, and processing instructions. There are other
                  types, but we will focus here on the most common.</para>
               <para><emphasis>Quantities</emphasis>. In XSLT, there are four quantity categories:
                  (1) zero or one; (2) exactly one; (3) zero or more; (4) one or more. Each of these
                  is specified by adding a quantifier to a data-type declaration: <code>?</code>,
                  nothing, <code>*</code>, and <code>+</code>, respectively.</para>
               <table frame="all">
                  <title>Quantifiers and data types</title>
                  <tgroup cols="4">
                     <colspec colname="c1" colnum="1" colwidth="1*"/>
                     <colspec colname="c2" colnum="2" colwidth="1*"/>
                     <colspec colname="c3" colnum="3" colwidth="1*"/>
                     <colspec colname="c4" colnum="4" colwidth="1*"/>
                     <thead>
                        <row>
                           <entry>Quantity</entry>
                           <entry>Symbol</entry>
                           <entry>Atomic type example</entry>
                           <entry>Node type example</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry>zero or one</entry>
                           <entry><code>?</code></entry>
                           <entry><code>xs:string?</code></entry>
                           <entry><code>element()?</code></entry>
                        </row>
                        <row>
                           <entry>exactly one</entry>
                           <entry>none</entry>
                           <entry><code>xs:boolean</code></entry>
                           <entry><code>document-node()</code></entry>
                        </row>
                        <row>
                           <entry>zero or more</entry>
                           <entry><code>*</code></entry>
                           <entry><code>xs:dateTime*</code></entry>
                           <entry><code>attribute()*</code></entry>
                        </row>
                        <row>
                           <entry>one or more</entry>
                           <entry><code>+</code></entry>
                           <entry><code>xs:integer+</code></entry>
                           <entry><code>comment()+</code></entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table>
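               <para>To show how the quantifiers look in practice, here is a sketch of four
                  parameter declarations, one per quantity category. All four parameter names are
                  hypothetical. Note that the empty sequence, <code>()</code>, is an acceptable
                  value only where zero items are allowed.</para>
               <programlisting>&lt;!-- zero or one: the empty sequence is allowed -->
&lt;xsl:param name="optional-label" as="xs:string?" select="()"/>
&lt;!-- exactly one -->
&lt;xsl:param name="strict-mode" as="xs:boolean" select="true()"/>
&lt;!-- zero or more: the empty sequence is allowed -->
&lt;xsl:param name="dates-of-interest" as="xs:dateTime*" select="()"/>
&lt;!-- one or more: at least one item must be supplied -->
&lt;xsl:param name="depths-of-interest" as="xs:integer+" select="(1, 2, 3)"/></programlisting>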
               <para>Below are some of the more common data types you will find in global
                  parameters, along with several examples going from simple values up to more
                  complex assignments based upon XPath expressions or XSLT constructions. For more
                  background, see <xref xlink:href="#xpath-language"/>. Focus is placed upon data
                  types and quantities expected in select TAN applications and utilities.</para>
               <para><emphasis role="bold">Strings</emphasis>. A string is a sequence of
                  characters. Even when the value consists only of Arabic numerals, a string will
                  be read and interpreted as text, not as an integer.</para>
               <para>In the following example, the string value is specified by the single quotation
                  marks within the double quotation marks. The double-quotation marks delimit the
                  value of the attribute, and the single-quotation marks specify that the value is a
                  string. If you did not include the single quotation marks, it would be interpreted
                  as an XPath expression pointing to the name of a child element within the
                  context.</para>
               <programlisting>&lt;xsl:param name="text-a-to-compare" as="xs:string?" select="'Every day'"/></programlisting>
               <para>When more than one string is expected, the strings should be separated by a
                  comma. It is also common to surround the series with parentheses, for visual
                  clarity. This example assigns to the parameter a sequence of two strings.</para>
               <programlisting>&lt;xsl:param name="text-a-to-compare" as="xs:string+" select="('day', 'night')"/></programlisting>
               <para>In the next example, <code>@select</code> is replaced by the text node within
                  the parameter. This technique can be useful if the value expected will be
                  space-normalized, and you want to wrap text, and you do not need to create
                  multiple strings.</para>
               <programlisting>&lt;xsl:param name="text-a-to-compare" as="xs:string?">Every day&lt;/xsl:param></programlisting>
               <para>The next example takes the primary input XML and converts it to a string. Such
                  conversion is called <emphasis>casting</emphasis>. Keep in mind that the context
                  node of any global parameter is the primary input XML document.</para>
               <programlisting>&lt;xsl:param name="text-a-to-compare" as="xs:string" select="string(/)"/></programlisting>
               <para>Perhaps you need to supply a path to some input. The following example
                  traverses the tree to a particular <code>@href</code> within the primary input.
                  The string value in that attribute will be treated like a URL, and it will be
                  resolved relative to the base URI of the primary input.</para>
               <programlisting>&lt;xsl:param name="path-to-source" as="xs:string" 
     select="resolve-uri(/*/tan:head/tan:predecessor/tan:location/@href, base-uri(/))"/></programlisting>
               <para>If a parameter allows multiple values, and you need to change those values
                  frequently, you might want to bind options to global parameters or global
                  variables of your own creation... </para>
               <programlisting>&lt;xsl:variable name="dir-1-path" as="xs:string" select="'../../novels/book-a'"/>
&lt;xsl:variable name="dir-2-path" as="xs:string" select="'test/comparanda'"/>
&lt;xsl:variable name="dir-3-path" as="xs:string" select="'test/logs'"/>
&lt;xsl:variable name="dir-4-path" as="xs:string" select="'../brown/texts'"/></programlisting>
               <para>...then update the master global parameter on a case-by-case basis.</para>
               <programlisting>&lt;xsl:param name="secondary-input-relative-uri-directories" as="xs:string+"
   select="$dir-1-path, $dir-4-path"/></programlisting>
               <para>The preceding example allows you to quickly change from one set of data to
                  another.</para>
               <para><emphasis role="bold">Booleans</emphasis>. A boolean is a true/false value. If
                  a parameter expects a boolean, you should use some XPath expression that evaluates
                  to a boolean, even if it is a simple one, such as <code>true()</code> or
                     <code>false()</code>. If you need to express the value as a string, it should
                  be either "true", "false", "0", or "1", and within <code>@select</code> it should
                  be cast explicitly with <code>xs:boolean()</code>.</para>
               <programlisting>&lt;xsl:param name="ignore-comments" as="xs:boolean" select="false()"/>
&lt;xsl:param name="preoptimize-string-order" as="xs:boolean" select="xs:boolean('true')"/></programlisting>
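               <para>A boolean may also be the result of a test upon the primary input. The
                  following sketch uses two hypothetical parameters: the first is true only if the
                  input has at least one comment, and the second only if it has more than one
                  hundred <code>&lt;div></code>s.</para>
               <programlisting>&lt;xsl:param name="input-has-comments" as="xs:boolean" select="exists(//comment())"/>
&lt;xsl:param name="input-is-large" as="xs:boolean" select="count(//*:div) gt 100"/></programlisting>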
               <para><emphasis role="bold">Integers</emphasis>. To supply an integer, you need only
                  use numerals, perhaps preceded by a hyphen if it is negative. You should not use
                  quotation marks, or the parameter's child text node. There will be no confusion of
                  the integer with an XPath step, because no element's name may begin with a
                  digit.</para>
               <programlisting>&lt;xsl:param name="start-at-depth" as="xs:integer" select="1"/>
&lt;xsl:param name="ngram-auras" as="xs:integer+" select="(2, 1)"/></programlisting>
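               <para>Integers can also be computed. In this sketch, with hypothetical parameter
                  names, the first parameter uses the range operator <code>to</code> to supply the
                  sequence 1, 2, 3, and the second counts the <code>&lt;div></code>s in the primary
                  input.</para>
               <programlisting>&lt;xsl:param name="depths-to-process" as="xs:integer+" select="1 to 3"/>
&lt;xsl:param name="div-count" as="xs:integer" select="count(//*:div)"/></programlisting>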
               <para><emphasis role="bold">Decimals</emphasis>. Decimals are much like integers, but
                  require decimal points. If the decimal is between 1.0 and -1.0, the decimal point
                  should be preceded by a zero, e.g., <code>-0.99</code>.</para>
               <programlisting>&lt;xsl:param name="diff-threshold-of-interest" as="xs:decimal" select="0.2"/></programlisting>
               <para><emphasis role="bold">Elements</emphasis>. If a global parameter expects
                  elements as input, you must construct them inline, or provide an XPath expression
                  that directs the processor to the elements in question. The following example
                  shows how to construct a parameter that might be fed into <code><link
                        linkend="function-batch-replace">tan:batch-replace()</link></code>.</para>
               <programlisting>&lt;xsl:param name="additional-batch-replacements" as="element()">
   &lt;replace pattern="(\d\d)/(\d\d)/(\d\d\d\d)" replacement="$3-$1-$2"
      message="Converted U.S.-style date to ISO-style"/>
&lt;/xsl:param></programlisting>
               <para>The parameter used in the previous example might need to be given numerous
                  elements. In those cases it might be convenient to put them in a separate XML file
                  and point to it, with an XPath expression:</para>
               <programlisting>&lt;xsl:param name="additional-batch-replacements" as="element()+"
   select="doc('batch-replacements.xml')/*/tan:replace"/></programlisting>
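               <para>The separate file might look something like the following sketch. The name of
                  the root element and the second replacement are hypothetical. The sketch also
                  assumes that the prefix <code>tan</code> in the XPath expression is bound to the
                  TAN namespace, given here as <code>tag:textalign.net,2015:ns</code>, and that the
                  file sits in the same directory as the stylesheet, since a relative path given to
                    <code>doc()</code> is resolved against the stylesheet's location.</para>
               <programlisting>&lt;!-- batch-replacements.xml (hypothetical content) -->
&lt;replacements xmlns="tag:textalign.net,2015:ns">
   &lt;replace pattern="(\d\d)/(\d\d)/(\d\d\d\d)" replacement="$3-$1-$2"
      message="Converted U.S.-style date to ISO-style"/>
   &lt;replace pattern="\s+" replacement=" " message="Normalized space"/>
&lt;/replacements></programlisting>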
            </section>
            <section>
               <title>Starting the XSLT process</title>
               <para>Running an XSLT application can be done in several ways. As noted above, at the
                  heart of the process is the XSLT processor. The goal is to find the means to feed
                  the primary input and the master stylesheet into the processor, and to tell the
                  processor where to place the output.</para>
               <para><emphasis role="bold">From the command line</emphasis>. Processors such as
                  Saxon allow you to initiate the process from the command line.<itemizedlist>
                     <listitem>
                        <para><emphasis>Windows</emphasis>:<orderedlist>
                              <listitem>
                                 <para>Press the Windows key;</para>
                              </listitem>
                              <listitem>
                                 <para>Type "cmd" and click "Command Prompt";</para>
                              </listitem>
                              <listitem>
                                 <para>Type the letter of the drive where you plan to run the
                                    process, followed by a colon, e.g., <code>e:</code>;</para>
                              </listitem>
                              <listitem>
                                 <para>Using the command <code>cd</code>, navigate to the directory where your
                                    files are, e.g., <code>cd myfiles</code>.</para>
                              </listitem>
                           </orderedlist></para>
                     </listitem>
                     <listitem>
                        <para><emphasis>Macintosh</emphasis>:<orderedlist>
                              <listitem>
                                 <para>Open the Terminal app;</para>
                              </listitem>
                              <listitem>
                                 <para>Using the command <code>cd</code>, navigate to the directory
                                    where your files are, e.g., <code>cd ~/myfiles</code>.</para>
                              </listitem>
                           </orderedlist></para>
                     </listitem>
                  </itemizedlist>From there, follow the instructions provided by the vendor of the
                  XSLT processor. Saxon provides instructions for its product at <link
                     xlink:href="https://www.saxonica.com/documentation10/index.html#!using-xsl/commandline"
                  />. A simple command-line instruction, entered on a single line, might look like
                  the
                  following:<programlisting>java -cp "E:/xslt processors/saxon-he-10.0.jar" net.sf.saxon.Transform -s:init.xml -xsl:app.xsl -o:primary-output.xml</programlisting></para>
               <para><emphasis role="bold">From Oxygen XML Editor</emphasis>. Oxygen provides
                  numerous ways to initiate the XSLT process, including the following:<itemizedlist>
                     <listitem>
                        <para><emphasis>XSLT Debugger Perspective</emphasis>. This editing mode
                           changes the appearance of Oxygen, putting eligible primary input files on
                           the left, XSLT files in the middle, and an output pane on the right. You
                           can choose the processor you prefer, and pick your primary input and
                           master stylesheet. Running the application provides interactive output,
                           with many diagnostic tools, letting you learn how the output came
                           about.</para>
                     </listitem>
                     <listitem>
                        <para><emphasis>Transformation Scenarios</emphasis>. You can choose
                           Configure Transformation Scenarios, and create a highly customized set of
                           conditions for running an XSLT application.</para>
                     </listitem>
                  </itemizedlist>These methods, and other more sophisticated approaches, are
                  described by the vendor in their documentation, <link
                     xlink:href="https://www.oxygenxml.com/"/>.</para>
            </section>
            <section>
               <title>TAN utilities and applications</title>
               <para>All TAN utilities and applications share the same basic architecture. Once you
                  have figured out how to use one TAN application, you are well on your way to being
                  able to use the others as well. Each TAN utility and application has its own
                  purpose, which means that its expected input and output will differ quite a bit
                  from the others. Nevertheless, all TAN utilities and applications share a common
                  set of features, to assist users.</para>
               <section>
                  <title>Application/utility setup</title>
                  <para>All TAN utilities are in the <code>utilities</code> directory of the TAN
                     files; the applications are in the <code>applications</code> directory. Within
                     those directories, there is one subdirectory per utility or application. And
                     within that subdirectory, there are only two XSLT files, accompanied perhaps by
                     further subdirectories. One of the XSLT files has "configuration" in the name,
                     and it allows you to customize a particular application or utility for your
                     projects. The other XSLT file is the master stylesheet for the utility or
                     application in question, and it has the same name as its parent directory.
                     Subdirectories contain the heart of the code, and other important
                     dependencies.</para>
                  <para>The file structure is designed to make quite clear the main point of entry.
                     Having a directory with so few files should hopefully inspire you to fill it up
                     with copies designed for specific situations.</para>
               </section>
               <section>
                  <title>The master stylesheet</title>
                  <para>All master stylesheets for TAN utilities and applications share a common
                     structure. They are designed to be as user-friendly as possible, and to focus
                     exclusively on configuration settings that the user may want to change.<orderedlist>
                        <listitem>
                           <para><emphasis role="bold">Preamble</emphasis>. Every master stylesheet
                              begins with a long series of comments, indicating the name of the
                              application, its version (an ISO date), and a brief
                              description of what it does. The preamble includes a statement of the
                              intended primary input, secondary input, primary output, and secondary
                              output. Cautionary notes may be included. If the utility or
                              application has areas that are known to need development, these will
                              be listed.</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Global parameters</emphasis>. After the
                              preamble a series of global parameters is presented. Each one is
                              preceded by a comment that explains the expected value. The parameters
                              may be organized in blocks according to stages or topics. Some of the
                              parameters may be localized versions of standard TAN global
                              parameters, which are declared by files in the main directory
                                 <code>parameters</code>. The values in the master stylesheet
                              of the application will take precedence over the default
                              values.</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Import statement</emphasis>. At the end of
                              the master stylesheet is an <code>&lt;xsl:import></code> statement,
                              pointing to the core stylesheet. That instruction may be followed as
                              well by other comments and declarations that users should not
                              change.</para>
                        </listitem>
                     </orderedlist></para>
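                  <para>The three parts can be pictured with the following skeleton. It is only a
                     sketch, not a copy of any actual TAN file: the parameter and the path in
                        <code>@href</code> are hypothetical, and placing the import after other
                     declarations assumes an XSLT 3.0 processor.</para>
                  <programlisting>&lt;xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema" version="3.0">
   &lt;!-- 1. Preamble: name, version, description, expected input and output -->

   &lt;!-- 2. Global parameters, each preceded by an explanatory comment -->
   &lt;!-- Should comments in the input be ignored? -->
   &lt;xsl:param name="ignore-comments" as="xs:boolean" select="false()"/>

   &lt;!-- 3. Import of the core stylesheet (hypothetical path); users should not change this -->
   &lt;xsl:import href="incl/my-application-core.xsl"/>
&lt;/xsl:stylesheet></programlisting>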
               </section>
               <section>
                  <title>The core stylesheet</title>
                  <para>Every master stylesheet points via its import statement to a single XSLT
                     file in the <code>incl</code> subdirectory. That XSLT file is the core
                     stylesheet. As an everyday user of the application, you will find this core
                     stylesheet to be of little or no importance. But anyone doing any kind of
                     customization or development should be aware of how it works, and this
                     description is aimed at those developers.</para>
                  <para>Each core stylesheet follows a common structure. It begins with
                        <code>&lt;xsl:include></code> instructions that point to the TAN function
                     library, and perhaps other important components.</para>
                  <para>Next come metadata about the application: its name, its IRI, a change
                     message to be reported, and a variety of descriptions about the application,
                     and its expected input and output. A change log and a list of features to work
                     on may be included. The dates within those parameters dictate the version of
                     the application. All this metadata is used in several ways: to populate the
                     comments of the master stylesheet, to populate the contents of these
                     guidelines, and perhaps to supplement the output. The master data is here in
                     the stylesheet. The development branch of the TAN project includes a
                        <code>maintenance</code> directory. Within it is a Schematron file that
                     makes sure that the master and core stylesheets of any given utility or
                     application are synchronized. </para>
                  <para>After the metadata come the XSLT declarations that drive the process. The
                     output for most TAN utilities and applications requires multiple ordered stages.
                     A given stage might have a strong declarative element, but the stages
                     themselves are set carefully in a sequence, signposted by global variables that
                     incrementally build the primary or secondary output.</para>
                  <para>At the end of the core stylesheet are two unnamed templates. Each one matches
                     the document node of the primary input XML file, and so one of the two will
                     always be the initial, starting template. The first of these templates is for
                     diagnostics and is controlled by a static parameter that allows a developer to
                     turn it on or off. It normally reports back the values of the global variables,
                     set in process order. If that first template is turned off, then the second one
                     takes over: it drives the messaging system, delivers the primary output tree
                     (bound to some global variable), and initiates any processes necessary for
                        <code>&lt;xsl:result-document></code> instructions required to generate
                     secondary output.</para>
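                  <para>One way such a pair of templates could be arranged is sketched below. It is
                     an illustration under stated assumptions, not the actual TAN code: every name
                     is hypothetical, and the two global variables merely stand in for the staged
                     variables described above. The static parameter decides, at compile time,
                     whether the diagnostic template exists at all; when it does, its higher
                     priority lets it win over the default template.</para>
                  <programlisting>&lt;!-- Hypothetical static switch for diagnostics -->
&lt;xsl:param name="diagnostics-on" as="xs:boolean" static="yes" select="false()"/>

&lt;!-- Hypothetical stand-ins for the staged global variables -->
&lt;xsl:variable name="stage-1-output" as="element()*" select="/*/*"/>
&lt;xsl:variable name="primary-output" as="element()*" select="$stage-1-output[1]"/>

&lt;!-- Diagnostic template: compiled only when the static parameter is true -->
&lt;xsl:template match="/" priority="1" use-when="$diagnostics-on">
   &lt;diagnostics>
      &lt;xsl:copy-of select="$stage-1-output"/>
   &lt;/diagnostics>
&lt;/xsl:template>

&lt;!-- Default template: delivers the primary output -->
&lt;xsl:template match="/">
   &lt;xsl:sequence select="$primary-output"/>
&lt;/xsl:template></programlisting>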
                  <para>Any primary or secondary output that results in a TAN file
                        <emphasis>must</emphasis> be credited to or blamed upon the application or
                     utility. The metadata for the application will be added to the output TAN
                     file's vocabulary, and an appropriate entry will be added to the change
                     log.</para>
               </section>
            </section>
         </section>
         <xi:include href="inclusions/utilities.xml"/>
         <xi:include href="inclusions/applications.xml"/>
      </chapter>
      <chapter xml:id="tan-for-developers">
         <title>Developing with TAN</title>
         
         
            
            <para>This chapter addresses anyone who wants to develop their own applications using
            TAN. Some may want to experiment, revise, or extend the code that already exists. Others
            may be developing their own XQuery or XSLT application, and intend to use select TAN
            functions. Yet others may want to customize the standard TAN applications or utilities,
            perhaps as part of a pipeline or workflow, or for populating a website.</para>
         <para>TAN is very developer-friendly. The function library is one of the richest, largest
            of its kind. If you are accustomed to doing natural language processing through the
               <link xlink:href="https://www.nltk.org/">Natural Language Toolkit</link>, <link
               xlink:href="http://cltk.org/">Classical Language Toolkit</link>, or a comparable
            package, you may find that TAN has the building blocks you need to do the same
            activities within an XSLT or XQuery environment.</para>
         <section>
            <title>General design features</title>
            <para>All TAN digital assets are organized primarily by role. At the heart of TAN is its
               function library. This library is the foundation for the schemas that validate TAN
               files, as well as applications and utilities. All of those resources contribute to a
               large share of the content in these guidelines.</para>
            <figure>
               <title>TAN dependencies</title>
               <mediaobject>
                  <imageobject>
                     <imagedata fileref="img/TAN%20dependencies.jpeg"/>
                  </imageobject>
               </mediaobject>
            </figure>
            <para>The TAN function library is so named because it relies heavily upon functions.
               But, because it is written in XSLT, there are also global parameters, global
               variables, templates, keys, and other declarations. Certain design principles have
               been adopted when designing and organizing these declarations.</para>
            <para><emphasis role="bold">Validation mode</emphasis>. The TAN function library was
               designed first and foremost to drive the validation process. That process prioritizes
               dispensing with parts of the primary input file no longer needed for error-checking.
               As the TAN function library grew to support utilities and applications, a sharp
               distinction needed to be drawn between processing for validation and processing for
               other purposes. The static global parameter <code>$tan:validation-mode-on</code>
               exerts a significant influence upon many operations. Files in the
                  <code>functions</code> subdirectory whose names include the keyword
                  <code>extended</code> are excluded from the package when validation mode is on. By
               default validation mode is off, fetching everything in the TAN function
               library.</para>
            <para><emphasis role="bold">Named templates</emphasis>. In general, functions have been
               preferred over named templates. This allows TAN operations to be used in XPath
               expressions, and contributes to more concise code. Named templates have been used
               only when result documents need to be created, or when tunnel parameters need to be
               preserved.</para>
            <para><emphasis role="bold">Functions</emphasis>. All functions have their visibility
               declared public or private. You are welcome to use private functions, but keep in
               mind that they are generally specialized. Some functions have parallel cached and
               non-cached versions, to support environments where memoized functions are not
               allowed. Many functions have multiple versions based on the number of parameters
               (arity). Lower-arity functions contain comments that point to the highest-arity
               version, which is fully annotated by enclosed comments. We place them inside the
                  <code>&lt;xsl:function></code>, so that if a function needs to be copied or moved,
               the documentation always accompanies it. Documentation shares a common structure:
               first, the intended input; second, the intended output; third, other notes; finally:
                  <code>kw:</code> with a comma-delimited list of keywords categorizing the
               function. </para>
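            <para>The documentation pattern just described might look something like the following
               sketch. The function <code>my:join-words()</code> and its namespace are hypothetical,
               invented only to illustrate the layout of the comments; they are not part of the TAN
               function library.</para>
            <programlisting language="xml"><![CDATA[<xsl:stylesheet version="3.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema"
   xmlns:my="tag:example.org,2021:my-app">

   <xsl:function name="my:join-words" as="xs:string" visibility="public">
      <!-- One-parameter version of the fuller one, below -->
      <xsl:param name="words" as="xs:string*"/>
      <xsl:sequence select="my:join-words($words, ' ')"/>
   </xsl:function>

   <xsl:function name="my:join-words" as="xs:string" visibility="public">
      <!-- Input: any strings; a separator -->
      <!-- Output: a single string joining the input strings with the separator -->
      <!-- This hypothetical function exists only to show the comment layout. -->
      <!-- kw: strings, example -->
      <xsl:param name="words" as="xs:string*"/>
      <xsl:param name="separator" as="xs:string"/>
      <xsl:sequence select="string-join($words, $separator)"/>
   </xsl:function>

</xsl:stylesheet>]]></programlisting>
            <para>Because the comments sit inside each <code>&lt;xsl:function></code>, copying
               either declaration to another file carries its documentation along.</para>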
            <para><emphasis role="bold">Template modes</emphasis>. Every template mode has an
               associated <code>&lt;xsl:mode></code> declaration, which always defines the default
               behavior of the template. To reduce the chance of interference with XSLT applications
               that might include the library, there is only one template that defines behavior for
               all template modes (<code>mode="#all"</code>), at a very low priority, for elements
               that contain validation error messages. That means that you can use
                  <code>&lt;xsl:include></code> or <code>&lt;xsl:import></code> without worrying
               about conflicts with template modes in your host application. All mode names are set
               in the TAN namespace, to avoid conflicts with dependent resources.</para>
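            <para>As a rough illustration, a mode declared along these lines pairs a namespaced
               mode name with an explicit default behavior. The mode name
                  <code>tan:example-strip-comments</code> is hypothetical, and the namespace URI
               bound to the <code>tan</code> prefix is an assumption that should be checked against
               the library itself.</para>
            <programlisting language="xml"><![CDATA[<xsl:stylesheet version="3.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:tan="tag:textalign.net,2015:ns">

   <!-- Hypothetical mode; on-no-match states its default behavior -->
   <xsl:mode name="tan:example-strip-comments" on-no-match="shallow-copy"/>

   <!-- Applies only within this mode, so other modes are unaffected -->
   <xsl:template match="comment()" mode="tan:example-strip-comments"/>

</xsl:stylesheet>]]></programlisting>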
            <para><emphasis role="bold">Keys</emphasis>. For convenience, all keys are kept in files
               at <code>functions/setup</code>.</para>
            <para><emphasis role="bold">Character maps</emphasis>. For convenience, all character
               maps are kept in files at <code>functions/setup</code>.</para>
            <para><emphasis role="bold">Global parameters</emphasis>. Most global parameters are
               invitations to the user to configure the environment, and they are placed in the main
                  <code>parameters</code> directory. A few global parameters are reserved for
               technical processes, and they are kept in files at <code>functions/setup</code>. All
               global parameters are bound to the TAN namespace. The exceptions to this general rule
               of thumb are the global parameters unique to specific utilities and applications;
               they are placed in no namespace. Doing this has helped solidify the boundaries of the
               TAN function library.</para>
            <para><emphasis role="bold">Global variables</emphasis>. Development work revealed that
               global variables, even those that were not used, frequently slowed the validation
               process. Therefore global variables are kept to a minimum within the standard
               components, but are used more extensively in the extended components. Each global
               variable is bound to the TAN namespace. Those whose values rely upon the primary
               input file are constructed under the assumption that the primary input file is a TAN
               file.</para>
            <para>For a more specific explanation of individual components see <xref
                  xlink:href="#variables-keys-functions-and-templates"/>.</para>
         </section>
         <section xml:id="using-tan-functions">
            <title>Using TAN functions</title>
            <para>TAN's extensive function library, which drives the validation process, provides a
               foundation for application development in XSLT. If you are writing an XSLT
               application, simply point via <code>&lt;xsl:include></code> or
                  <code>&lt;xsl:import></code> to <code>functions/TAN-function-library.xsl</code>.
               That's it. You now have access to the complete TAN function library. If you are
               developing for XQuery, you can access any of the functions via
                  <code>fn:transform()</code>, taking care to set up that function's parameters
               correctly. See <link
                  xlink:href="https://www.w3.org/TR/xpath-functions-31/#func-transform"/>.</para>
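            <para>A minimal host stylesheet might look like the following sketch. The relative path
               in <code>href</code> is an assumption and must be adjusted to the location of your
               copy of TAN; the namespace URI bound to the <code>tan</code> prefix should likewise be
               checked against the library. The example assumes that the primary input is a TAN
               file.</para>
            <programlisting language="xml"><![CDATA[<xsl:stylesheet version="3.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:tan="tag:textalign.net,2015:ns"
   exclude-result-prefixes="#all">

   <!-- Assumed path: adjust to wherever your copy of TAN lives -->
   <xsl:import href="../TAN-2021/functions/TAN-function-library.xsl"/>

   <xsl:template match="/">
      <report>
         <!-- Copy the head of the resolved form of the primary input -->
         <xsl:copy-of select="$tan:self-resolved/*/tan:head"/>
      </report>
   </xsl:template>

</xsl:stylesheet>]]></programlisting>
            <para>The sketch uses <code>&lt;xsl:import></code> rather than
                  <code>&lt;xsl:include></code> so that the host's own declarations take precedence
               over any that the library might define.</para>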
            <para>Some relatively complex TAN functions may be affected by the settings in the
               subdirectory <code>parameters</code>. Otherwise, the functions have been designed to
               be as orthogonal as possible.</para>
            <para>There are so many TAN functions that you may not know where to begin. Discovering what
               is available will take some time and study. You could simply browse the XSLT files
               that constitute the function library. Or you can use the autocomplete feature in
               Oxygen's editing mode. Either method will provide a complete but perhaps chaotic
               experience. These guidelines provide a more accessible starting point. Begin with the
               grouped index: <xref xlink:href="#tan-function-keyword-index"/>. Find a topic or
               function you are interested in, and follow the links.</para>
         </section>
         <section xml:id="validation_mechanics">
            <title>The mechanics of validation</title>
            <para>In many cases, developers will want to work with TAN files, either as input or as
               output. But TAN files have a number of distinctive constructions: two different
               methods of inclusion (see <xref xlink:href="#inclusions-and-vocabularies"/>),
               space-normalization rules (see <xref xlink:href="#whitespace"/>), numeration systems
               (see <xref xlink:href="#reference_system"/>), tokenization systems (see <xref
                  xlink:href="#defining_tokens"/>), and pointing systems (see <xref
                  xlink:href="#pointer-syntax"/>). You can work directly with raw TAN files, but you
               run the risk of misinterpreting the file.</para>
            <para>Every TAN file is definitively interpreted through the TAN functions that
               undergird the Schematron validation process (see <xref
                  xlink:href="#validating_tan_files"/>). That process is a core part of the standard
               TAN utilities and applications, and it determines the nature of some of the more
               important global variables. </para>
            <para>Every TAN file is subject to two major transformations, both for validation and
               for applications.</para>
            <section xml:id="resolution">
               <title>Resolution</title>
               <para>The first transformation <emphasis>resolves</emphasis> the file. The goal is to
                  get the file into a state where it can be understood on its own terms. A resolved
                  TAN file contains all its relevant vocabulary and components. It can be evaluated
                  without having to consult the files referred to by <code><link
                        linkend="element-vocabulary">&lt;vocabulary&gt;</link></code> or <code><link
                        linkend="element-inclusion">&lt;inclusion></link></code> dependencies. (See
                     <xref xlink:href="#inclusions-and-vocabularies"/> for background on TAN's
                  approach to inclusion.) This process also does some basic file-specific
                  normalization; it will: <orderedlist>
                     <listitem>
                        <para>Prepare the file. This includes stamping the root element with a base
                           URI (the path location of the file), evaluating <code><link
                                 linkend="element-alias">&lt;alias&gt;</link></code>, and inserting
                           into every element a <code>@q</code> that contains an identifier unique to
                           the element. This identifier is used by the Schematron file to match an
                           element with any error messages in the corresponding element in the XSLT
                           output.</para>
                     </listitem>
                     <listitem>
                        <para>Insert required components from <code><link
                                 linkend="element-vocabulary">&lt;vocabulary&gt;</link></code>s or
                                 <code><link linkend="element-inclusion"
                              >&lt;inclusion></link></code>s using the following method:<orderedlist
                              numeration="loweralpha">
                              <listitem>
                                 <para>Relevant external vocabulary items are inserted into the
                                          <code><link linkend="element-head"
                                       >&lt;head&gt;</link></code>, either as descendants of the
                                    appropriate <code><link linkend="element-vocabulary"
                                          >&lt;vocabulary&gt;</link></code> or if derived from TAN
                                    standard vocabulary as new <code>&lt;tan-vocabulary></code>
                                    elements immediately following the <code><link
                                          linkend="element-vocabulary-key"
                                          >&lt;vocabulary-key></link></code>. All vocabulary items
                                    are imprinted with an <code>&lt;id></code> corresponding to an
                                          <code><link linkend="attribute-xmlid"
                                       >@xml:id</link></code> from any corresponding entry from
                                          <code><link linkend="element-vocabulary-key"
                                          >&lt;vocabulary-key></link></code>, to facilitate rapid
                                    retrieval of vocabulary. Any vocabulary <code><link
                                          linkend="element-name">&lt;name&gt;</link></code> that is
                                    not normalized is duplicated with a name-normalized copy
                                    (signaled by <code>@norm</code>): lower-case, hyphens and
                                    underscores changed to spaces, and space-normalized.</para>
                              </listitem>
                              <listitem>
                                 <para>Any element with an <code><link linkend="attribute-include"
                                          >@include</link></code> is replaced by the elements of the
                                    same name found in the target inclusion document (constructed
                                    recursively if need be). In addition, <code><link
                                          linkend="element-inclusion">&lt;inclusion></link></code>
                                    (in the head) is populated with any vocabulary items required to
                                    resolve the newly included material (recursively, if need be).
                                    This last point is important, because all idrefs must be
                                    interpreted in light of the original context. Included idrefs
                                    are made available to the host document, so when you use
                                          <code><link linkend="element-inclusion"
                                          >&lt;inclusion></link></code> you must ensure there are no
                                    id conflicts.</para>
                              </listitem>
                           </orderedlist></para>
                     </listitem>
                     <listitem>
                        <para>Normalize all numbers in original components (i.e., excluding included
                           elements or vocabulary items) as Arabic numerals.</para>
                     </listitem>
                  </orderedlist></para>
               <para>Files are resolved recursively. That is, no <code><link
                        linkend="element-vocabulary">&lt;vocabulary&gt;</link></code> or <code><link
                        linkend="element-inclusion">&lt;inclusion></link></code> components are
                  incorporated or processed until the files pointed to are themselves first
                  resolved.</para>
               <para>Numerals fall at the end of the process because they might need to be resolved
                  in light of resolved vocabulary and inclusions. </para>
               <para>The description above is necessarily generalized. For details consult the
                  function library, particularly the <code>functions/resolution</code> directory. In
                  cases of conflict between the code and the description above, the code should be
                  given priority.</para>
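               <para>The name normalization mentioned in step 2a can be approximated in a few lines.
                  The following is only a sketch of the stated rule (lower-case, hyphens and
                  underscores changed to spaces, space-normalized); the function
                     <code>my:normalize-name()</code> is hypothetical and is not the library's own
                  implementation.</para>
               <programlisting language="xml"><![CDATA[<xsl:stylesheet version="3.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema"
   xmlns:my="tag:example.org,2021:my-app"
   exclude-result-prefixes="#all">

   <xsl:function name="my:normalize-name" as="xs:string">
      <!-- Input: a vocabulary name -->
      <!-- Output: the name in lower case, with hyphens and underscores
         changed to spaces, then space-normalized -->
      <xsl:param name="name" as="xs:string"/>
      <xsl:sequence
         select="normalize-space(translate(lower-case($name), '-_', '  '))"/>
   </xsl:function>

   <!-- Entry point for a run without a source document -->
   <xsl:template name="xsl:initial-template">
      <norm>
         <xsl:value-of select="my:normalize-name('Old_Testament -- Greek')"/>
      </norm>
   </xsl:template>

</xsl:stylesheet>]]></programlisting>
               <para>Run from its initial template, the stylesheet returns
                     <code>&lt;norm>old testament greek&lt;/norm></code>.</para>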
            </section>
            <section xml:id="expansion">
               <title>Expansion</title>
               <para>The second transformation <emphasis>expands</emphasis> the resolved file. You
                  must resolve a TAN file before you try to expand it. The goal behind expansion is
                  to unpack the components of a resolved document and identify any errors along the
                  way (see the <link xlink:href="../functions/errors/TAN-errors.xml">master list of
                     errors</link>). There are three levels of expansion, corresponding to the three
                  levels of Schematron validation: terse, normal, and verbose.</para>
               <para>In terse expansion, for each value of an attribute, an element with the
                  attribute's name is placed within the parent (e.g., <code><link
                        linkend="attribute-type">@type</link>="a b"</code> produces
                     <code>&lt;type>a&lt;/type></code> and <code>&lt;type>b&lt;/type></code>). If
                  the value is an idref, and it points to an alias, a copy is made for the idref of
                  each target vocabulary item. If an idref does not point to a vocabulary item of
                  the expected type, an error message is also copied into the parent. Any values that
                  are ranges are expanded, if need be. Select networked files are checked for basic
                  validity. Class-2 files undergo extra rounds of processing during terse
                  validation: sources are adjusted if need be, and then checked against references
                  in the host class-2 file. (See <xref xlink:href="#pointer-syntax"/>.) In terse
                  expansion, all pointing mechanisms are checked. Because of this basic requirement,
                  some terse expansion can take a long time on lengthy files, or ones with complex
                        <code><link linkend="element-adjustments"
                  >&lt;adjustments&gt;</link></code>.</para>
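               <para>The first step of terse expansion, turning attribute values into child
                  elements, can be imitated with the following simplified sketch. It is not the
                  library's code: it treats every attribute as space-delimited, creates the new
                  elements in no namespace, and ignores aliases, ranges, and error
                  reporting.</para>
               <programlisting language="xml"><![CDATA[<xsl:stylesheet version="3.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   exclude-result-prefixes="#all">

   <!-- Copy everything that is not matched below -->
   <xsl:mode on-no-match="shallow-copy"/>

   <xsl:template match="*[@*]">
      <xsl:copy>
         <xsl:copy-of select="@*"/>
         <!-- One child element per space-delimited attribute value -->
         <xsl:for-each select="@*">
            <xsl:variable name="attr-name" select="local-name(.)"/>
            <xsl:for-each select="tokenize(normalize-space(.), ' ')">
               <xsl:element name="{$attr-name}">
                  <xsl:value-of select="."/>
               </xsl:element>
            </xsl:for-each>
         </xsl:for-each>
         <xsl:apply-templates select="node()"/>
      </xsl:copy>
   </xsl:template>

</xsl:stylesheet>]]></programlisting>
               <para>Applied to <code>&lt;item type="a b"/></code>, the sketch retains the attribute
                  and adds the children <code>&lt;type>a&lt;/type></code> and
                     <code>&lt;type>b&lt;/type></code>.</para>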
               <para>Normal expansion builds on terse expansion by interrogating networked files
                  more closely. Any errors that were reported during the terse stage but were
                  suppressed to avoid clutter are enabled. </para>
               <para>Verbose expansion generally attends to procedures that are complex, or are not
                  essential parts of a validation report. For example, a <code><link
                        linkend="element-model">&lt;model&gt;</link></code> of a class-1 file will
                  be checked, to find references that one has but the other lacks. A class-1
                        <code><link linkend="element-redivision">&lt;redivision&gt;</link></code>
                  will be analyzed, to make sure that the two transcriptions are identical. A
                  catalog file in the same directory will be checked, to see if it has faulty
                  entries.</para>
               <para>Many errors lend themselves to solutions that can be recommended by the TAN
                  function library. Some solutions are returned to the Schematron validation method
                  as Schematron Quick Fixes (SQFs). XML editors that are equipped to handle SQFs
                  (e.g., Oxygen XML Editor) can then prompt users to quickly fix an errant section.
                  For example, if text has not been NFC Unicode-normalized, an SQF will allow a user
                  to make the change in two clicks. Thus, TAN validation does not merely tell you
                  what the problems are; it tries to help fix them.</para>
               <para>The term "expansion" describes the process but possibly not the output. If the
                  global parameter <code>$tan:validation-mode-on</code> is true, then in the course
                  of expanding the file the TAN templates will abandon any parts that are no longer
                  needed. The output is normally much smaller than the input file, restricted as it
                  is to the root element, which merely wraps errors, warnings, or fixes. So although
                  during validation the file is really being expanded, at the end only a small
                  portion of the expanded file is returned to the Schematron processor, to expedite
                  validation. But if <code>$tan:validation-mode-on</code> is false (the default
                  value), the entire expanded file and its dependencies are returned. Such output
                  can be very useful in applications.</para>
               <para>The preceding description about expansion is necessarily generalized. For
                  details consult the function library, especially <code>functions/expansion</code>.
               </para>
            </section>
         </section>
         <section xml:id="using-tan-global-variables">
            <title>Using TAN global variables</title>
            <para>The global variables in the TAN function library provide quick access to some
               important material. For a complete list of global variables, with detailed lists of
               dependencies and dependents, see <xref
                  xlink:href="#variables-keys-functions-and-templates"/>. That technical appendix
               will not provide the context necessary to identify some of the key features of the
               global variables, which this section attempts to provide.</para>
            <para>As noted above, the primary task of the TAN function library is to drive the
               validation process, described in the previous section. If you are developing an
               application that begins with a TAN file, whether as primary or secondary input, it is
               often best to start with it in its resolved or expanded state. If that TAN file is
               the primary (catalyzing) input, use the global variables <code><link
                     linkend="variable-self-resolved">$tan:self-resolved</link></code> and
                     <code><link linkend="variable-self-expanded">$tan:self-expanded</link></code>.
               If it is secondary input, use <code><link linkend="function-resolve-doc"
                     >tan:resolve-doc</link>()</code> and <code><link linkend="function-expand-doc"
                     >tan:expand-doc</link>()</code>. You must resolve a TAN file before you attempt
               to expand it.</para>
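            <para>The two approaches might be combined as in the following sketch. It assumes that
                  <code>tan:resolve-doc()</code> and <code>tan:expand-doc()</code> can each be called
               with a single argument; check the available arities in the function library before
               relying on this. The import path, the parameter <code>$secondary-uri</code>, and its
               default value are hypothetical.</para>
            <programlisting language="xml"><![CDATA[<xsl:stylesheet version="3.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema"
   xmlns:tan="tag:textalign.net,2015:ns"
   exclude-result-prefixes="#all">

   <!-- Assumed path: adjust to wherever your copy of TAN lives -->
   <xsl:import href="../TAN-2021/functions/TAN-function-library.xsl"/>

   <!-- Hypothetical parameter: location of a secondary TAN file -->
   <xsl:param name="secondary-uri" as="xs:string" select="'my-other-file.xml'"/>

   <!-- Secondary input: resolve first, then expand -->
   <xsl:variable name="secondary-resolved"
      select="tan:resolve-doc(doc($secondary-uri))"/>
   <xsl:variable name="secondary-expanded"
      select="tan:expand-doc($secondary-resolved)"/>

   <xsl:template match="/">
      <summary>
         <!-- Primary input: the global variables do the work -->
         <primary root="{name($tan:self-expanded[1]/*)}"/>
         <secondary docs="{count($secondary-expanded)}"/>
      </summary>
   </xsl:template>

</xsl:stylesheet>]]></programlisting>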
            <para>For a class-2 file, <code><link linkend="variable-self-expanded"
                     >$tan:self-expanded</link></code>, or the output of <code><link
                     linkend="function-expand-doc">tan:expand-doc</link>()</code>, is a sequence of
               documents, starting with an expansion of the class-2 file, followed by expansions of
               its dependencies (TAN-T or TAN-mor). Its expanded class-1 sources will be tokenized
               where required, and marked with anchors for each reference in the class-2 file. If a
               token straddles leaf <code><link linkend="element-div">&lt;div&gt;</link></code>s,
               the token will be reconstituted by moving the tail of the token up. These expanded
               sources are excellent candidates for other types of transformation. For example, HTML
               pages can be created to integrate class-2 annotations and their class-1 sources, in a
               variety of ways.</para>
            <para>Even when the validation mode is turned off (default), the validation phase
               (terse, normal, verbose) plays a significant role in the results of expansion. At the
               terse and normal phases, an expanded class-2 file will contain expanded versions of
               both the host file and its sources. </para>
            <para>At the verbose level, an expanded TAN-A file will conclude its <code><link
                     linkend="variable-self-expanded">$tan:self-expanded</link></code> sequence with
               one or more documents with a root element <code>&lt;TAN-T_merge></code>, one file per
               detected work. A TAN-T_merge file has one <code><link linkend="element-head"
                     >&lt;head&gt;</link></code> per class-1 source that has been merged, and the
                     <code><link linkend="element-body">&lt;body></link></code> contains a master
               set of <code><link linkend="element-div">&lt;div&gt;</link></code>s that merge all
               the other sources' <code><link linkend="element-div">&lt;div&gt;</link></code>s that
               share the same reference, after all <code><link linkend="element-adjustments"
                     >&lt;adjustments&gt;</link></code> have been made. Each leaf <code><link
                     linkend="element-div">&lt;div&gt;</link></code> in each source appears in the
               appropriate place, but as a child of a common <code><link linkend="element-div"
                     >&lt;div&gt;</link></code> that encompasses all other leaf <code><link
                     linkend="element-div">&lt;div&gt;</link></code>s with the same reference. For
               each version's leaf div, <code><link linkend="attribute-type">@type</link></code> is
               changed to <code>#version</code>, and other markers signify which source it
               corresponds to. A TAN-T_merge file is a good basis for building parallel displays (e.g.,
                  <xref xlink:href="#Parabola"/>) or statistical analyses. These merge files can be
               created on an ad hoc basis through the function <code><link
                     linkend="function-merge-expanded-docs">tan:merge-expanded-docs</link>()</code>,
               applied to individual class-1 files, after expansion.</para>
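            <para>An ad hoc merge might be requested as in the following sketch, which assumes a
               class-2 file as primary input and a one-argument form of
                  <code>tan:merge-expanded-docs()</code> that accepts a sequence of expanded
               documents. Both the arity and the import path are assumptions to verify against the
               function library.</para>
            <programlisting language="xml"><![CDATA[<xsl:stylesheet version="3.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:tan="tag:textalign.net,2015:ns"
   exclude-result-prefixes="#all">

   <!-- Assumed path: adjust to wherever your copy of TAN lives -->
   <xsl:import href="../TAN-2021/functions/TAN-function-library.xsl"/>

   <!-- The expanded class-1 sources bundled with the class-2 file -->
   <xsl:variable name="expanded-sources"
      select="$tan:self-expanded[tan:TAN-T]"/>

   <!-- Assumed one-argument call; see the lead-in above -->
   <xsl:variable name="ad-hoc-merge"
      select="tan:merge-expanded-docs($expanded-sources)"/>

   <xsl:template match="/">
      <xsl:copy-of select="$ad-hoc-merge"/>
   </xsl:template>

</xsl:stylesheet>]]></programlisting>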
            <para>If your application uses a TAN file as the primary input, you may want to take
               advantage of some other important global variables (see <xref
                  linkend="inclusions-and-vocabularies"/>):</para>
            <para>
               <table frame="all">
                  <title>Global variables for networked files</title>
                  <tgroup cols="4">
                     <colspec colname="c1" colnum="1" colwidth="1.0*"/>
                     <colspec colname="c2" colnum="2" colwidth="1.0*"/>
                     <colspec colname="c3" colnum="3" colwidth="1.0*"/>
                     <colspec colname="newCol4" colnum="4" colwidth="1*"/>
                     <thead>
                        <row>
                           <entry/>
                           <entry>Raw (first document available)</entry>
                           <entry>Resolved</entry>
                           <entry>Expanded</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry><code><link linkend="element-inclusion"
                                 >&lt;inclusion></link></code></entry>
                           <entry>—</entry>
                           <entry><code><link linkend="variable-inclusions-resolved"
                                    >$inclusions-resolved</link></code></entry>
                           <entry>—</entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-vocabulary"
                                 >&lt;vocabulary></link></code></entry>
                           <entry>—</entry>
                           <entry><code><link linkend="variable-vocabularies-resolved"
                                    >$vocabularies-resolved</link></code></entry>
                           <entry>—</entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-source"
                              >&lt;source></link></code></entry>
                           <entry>—</entry>
                           <entry><code><link linkend="variable-sources-resolved"
                                    >$sources-resolved</link></code></entry>
                           <entry><code><link linkend="variable-self-expanded"
                                 >$self-expanded</link>[tan:TAN-T]</code></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-see-also"
                              >&lt;see-also></link></code></entry>
                           <entry><code><link linkend="variable-see-alsos-1st-da"
                                    >$see-alsos-1st-da</link></code></entry>
                           <entry><code><link linkend="variable-see-alsos-resolved"
                                    >$see-alsos-resolved</link></code></entry>
                           <entry>—</entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table>
            </para>
            <para>The column labeled "raw" lists variables that hold the first documents available,
               without alteration. Variables in the next column hold the resolved form, following
               the same process described above for <code><link linkend="variable-self-resolved"
                     >$tan:self-resolved</link></code>. The resolved forms of <code><link
                     linkend="element-inclusion">&lt;inclusion></link></code> and <code><link
                     linkend="element-vocabulary">&lt;vocabulary></link></code> are sufficient for
               validation, therefore they do not have expanded versions. Expanded sources are always
               bundled with their class-2's <code><link linkend="variable-self-expanded"
                     >$tan:self-expanded</link></code>.</para>
            <para>For most applications, a resolved file is a sufficient starting point. But even
               then, there will be places where you will want to fetch the vocabulary bound to a
               particular attribute or element. One of the more important functions to familiarize
               yourself with is <code><link linkend="function-vocabulary"
                  >tan:vocabulary()</link></code>, which can be used to get the IRI + name pattern
               of a specific node, or to get all the vocabulary available for a given type.</para>
            <para>Some developers will find even <code><link linkend="function-vocabulary"
                     >tan:vocabulary()</link></code> a hassle to use. Consider setting the global
               parameter <code>$tan:distribute-vocabulary</code> (default <code>false</code>) to
                  <code>true</code>. When that parameter is true, whenever an attribute with an idref appears,
               it will be imprinted with the corresponding IRI + name pattern for the referred
               vocabulary item. Exercise this option with care: the expanded document will grow
               significantly larger.</para>
         </section>
         
      </chapter>
   </part>
   <part xml:id="appendixes">
      <title>Appendixes</title>
      <xi:include href="inclusions/vocabularies.xml"/>
      <xi:include href="inclusions/elements-attributes-and-patterns.xml"/>
      <xi:include href="inclusions/variables-keys-functions-and-templates.xml"/>
      <xi:include href="inclusions/errors.xml"/>
   </part>
</book>
