<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://docbook.org/xml/5.0/rng/docbook.rng" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="http://docbook.org/xml/5.0/rng/docbook.rng" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<book xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink"
   xmlns:xi="http://www.w3.org/2001/XInclude" version="5.0">
   <info>
      <title>The Text Alignment Network: Official Guidelines</title>
      <legalnotice>
         <info>
            <title>Text Alignment Network: Official Guidelines</title>
            <copyright>
               <year>2015-present</year>
               <holder>Joel Kalvesmaki</holder>
            </copyright>
            <author>
               <personname>Joel Kalvesmaki</personname>
               <email>kalvesmaki@gmail.com</email>
            </author>
         </info>
         <remark> This document and the files it describes are licensed under a Creative Commons
            Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0/
         </remark>
      </legalnotice>
      <revhistory>
         <info>
            <releaseinfo>Latest version:
               http://textalign.net/release/TAN-1-dev/guidelines/.</releaseinfo>
         </info>
         <revision>
            <revnumber>1 dev</revnumber>
            <date>2017-05-24</date>
            <revdescription>
               <para>Working draft. Please send corrections to the author (see above).</para>
               <para>Formats: <link xlink:href="http://textalign.net/release/TAN-1-dev/guidelines/xhtml/index.xhtml">HTML</link> • <link
                  xlink:href="http://textalign.net/release/TAN-1-dev/guidelines/pdf/tan-1-dev-guidelines.pdf">PDF</link> • <link xlink:href="http://textalign.net/release/TAN-1-dev/guidelines/main.xml">Docbook</link>
                  (master)</para>
               <warning>
                  <para>In case of contradictions, apparent or not, between these guidelines and the
                     core TAN files, priority should be given to the RELAX-NG schemas (compact
                     syntax), then to the functions, and then to these guidelines.</para>
                  <para>Chapters 1-7 and 10 are written by hand, and are relatively accurate.
                     Chapters 8, 9, 11, and 12 are written by an algorithm that selectively
                     reformats normative TAN files. Errors or inconsistencies found in those
                     chapters will be due to the XSLT stylesheets that produce them or to the files
                     upon which they are based.</para>
               </warning>
            </revdescription>
         </revision>
      </revhistory>
   </info>
   <part xml:id="general_overview">
      <title>General Overview</title>
      <chapter>
         <title>Introduction</title>
         <section xml:id="tan_definition">
            <title>Definition and Purpose</title>
            <para>The Text Alignment Network (TAN) is a suite of highly regulated XML formats
               intended for scholars to align and share texts and textual analysis at a maximal
               level of syntactic and semantic interoperability. TAN is particularly suited to
               textual works with multiple versions (translations, paraphrases), and to related
               datasets on quotations, word-for-word alignments, and lexicomorphological
               features.</para>
            <para>TAN files are simple, modular, and networked, allowing users, working
               independently and collaboratively, to edit, study, and annotate shared files. The
               extensive validation rules depend upon a library of processing functions that
               definitively interpret the format, thereby informing and helping editors in research
               and publication, and providing a basis for developing tools and
               applications.</para>
            <para>Although expressive of scholarly nuance and complexity, the TAN format has been
               designed to benefit everyone, scholars and non-scholars alike, and can be used
               broadly for multilingual publishing, language learning, and machine translation. </para>
         </section>
         <section>
            <title>Rationale and Purpose</title>
            <para>Different versions of texts—translations, quotations, paraphrases, and so
               forth—are important sources for scholars. Some texts have been lost in their original
               form and can be studied only through later translations, paraphrases, or fragmentary
               quotations. Even when an original survives, its later versions are often worth study,
               revealing as they do something of the genius or idiosyncrasies of those who
               translated or quoted the original, which in turn sheds light on how words, concepts,
               and works were preserved, altered, or combined across the generations and cultures
               who read and circulated the versions.</para>
            <para>The comparison of versions of texts requires words, sentences, paragraphs, and
               other text segments to be aligned. Such alignment can be challenging. Some versions
               might be defective, or follow an idiosyncratic sequence. One editor may have chosen a
               segmentation system not easily applied to other versions. Identifying which words or
               phrases in a translation correspond to which words or phrases in the original might
               result in complex, overlapping spans. And even larger segments such as sentences and
               paragraphs may not line up well. Further, every version of a text is part of a much
               larger, complex history of text reuse, and a proper study of that context requires
               not only multiple versions of different works, but collaboration across projects and
               fields of study.</para>
            <para>The Text Alignment Network (TAN) XML format facilitates the exchange and scholarly
               analysis of multiple versions of texts. TAN files adopt a syntax suitable for humans
               to read and edit, expressive enough to allow scholars to register doubt and nuance,
               and sufficiently structured to permit complex computer-based queries across
               independent datasets. The format is modular, with each module designed to allow an
               editor to focus on a single set of tasks without having to worry about other related
               but separable ones. The format encourages or requires editors to declare their views
               or assumptions about language and texts in a structured manner, so that other users
               of the data (both human and computer) can determine whether the data is suitable for
               their needs. Because nearly all TAN data must be expressed in a way that computers can
               parse, the information can be used in semantic web applications.</para>
            <para>TAN has been designed to support specific research desiderata such as the
               following:</para>
            <para>
               <itemizedlist>
                  <listitem>
                     <para>I want to share the transcription of a particular version of a textual
                        work. How do I encode it such that it is most likely to align with any other
                        version of that text created by someone else?</para>
                  </listitem>
                  <listitem>
                     <para>I have an index of quotations I wish to make available. How do I encode
                        it such that the data is semantically rich and can be applied to other,
                        perhaps unknown versions of the same work?</para>
                  </listitem>
                  <listitem>
                     <para>How do I align multiple versions of a single work when those versions may
                        not match very well, or when the reason for alignment may be vague or
                        ambiguous?</para>
                  </listitem>
                  <listitem>
                     <para>How do I publish a word-for-word analysis of a source and its
                        translation, when there may be messy overlapping or ambiguous relationships,
                        and where I might need to express doubt or alternative possibilities of
                        alignment?</para>
                  </listitem>
                  <listitem>
                     <para>How do I publish a dataset that lists passages in two or more works that
                        share a common feature, such as verbatim text or a parallel topic?</para>
                  </listitem>
                  <listitem>
                     <para>How can I share my data with others, and notify or warn them when I make
                        corrections or changes to the master version?</para>
                  </listitem>
               </itemizedlist>
            </para>
            <para>The last question is especially significant. As TAN files are published, there
               emerges a web of primary sources—a decentralized corpus of texts that "talk" to each
               other. As this TAN-compliant corpus expands across linguistic, chronological, and
               spatial boundaries, the interoperability of its parts allows the development of
               third-party tools and applications to expand the repertoire of research questions
               beyond any single corpus, to help scholars fruitfully investigate broader,
               comparative questions such as:<itemizedlist>
                  <listitem>
                     <para>For classical Greek texts, how were words with the root -ιστημι ("stand")
                        translated into ancient Latin? In what specific ways did the vocabulary of
                        technical terms shift from pre-Christian translations into later, Christian
                        ones?</para>
                  </listitem>
                  <listitem>
                     <para>How does the reformed Chinese translation technique of Sanskrit Buddhist
                        texts, attested by Dao An (312-385 CE), compare to reforms in the seventh
                        and eighth centuries of Syriac translations of Greek texts?</para>
                  </listitem>
                  <listitem>
                     <para>How do Arabic translations of Greek texts from the Abbasid period differ
                        from those of Sanskrit?</para>
                  </listitem>
                  <listitem>
                     <para>Can an anonymous English translation of a modern French novel be
                        identified with known translators of French novels from the same
                        period?</para>
                  </listitem>
                  <listitem>
                     <para>How do present-day translations of official United Nations documents
                        differ across languages?</para>
                  </listitem>
               </itemizedlist></para>
            <para>Optimism that TAN could be used to address such research questions should be
               tempered:<itemizedlist>
                  <listitem>
                     <para>Although TAN comes with an extensive library of functions and templates,
                        it is not a tool per se. It does not provide software or applications to
                        create, edit, or display TAN-compliant files, nor does it dictate the
                        behavior of such tools.</para>
                  </listitem>
                  <listitem>
                     <para>TAN does not on its own create alignments or answer research questions.
                        It merely lays a framework within which such questions can be investigated.
                     </para>
                  </listitem>
                  <listitem>
                     <para>TAN has a restricted field of inquiry (defined and explained in these
                        guidelines). The format is not suitable for many lines of inquiry, e.g.,
                        reconstructing the format of an original book or article.</para>
                  </listitem>
                  <listitem>
                     <para>TAN is just one of many formats for texts. It supplements, and does not
                        replace, other common markup formats such as TEI, Docbook, and so forth, or
                        other alignment formats such as XLIFF or TMX. Conversion to and from TAN to
                        these formats is usually straightforward, but may not be lossless, and
                        should be given some thoughtful planning.</para>
                  </listitem>
                  <listitem>
                     <para>TAN has not been designed to prioritize computational efficiency. It
                        sacrifices repetition and explicitness in favor of terseness and human
                        readability. The extensive TAN validation routines—essential to aiding
                        interoperability—can be taxing to run on numerous or enormous files. This
                        choice has been made upon the principle that users of the format prioritize
                        quality and readability over speed.</para>
                  </listitem>
               </itemizedlist></para>
         </section>
         <section xml:id="design_principles">
            <title>Design Principles</title>
            <para>To facilitate the research questions mentioned above, the TAN encoding formats and
               this manual have been designed around a few core principles.</para>
            <para><emphasis><emphasis role="bold">Scholarly freedom: </emphasis>Scholars should be
                  able to create data within their sphere of inquiry simply, expressively,
                  independently, and with fidelity to their guiding lights.</emphasis></para>
            <para>
               <itemizedlist>
                  <listitem>
                     <para>Given two ways of expressing the same idea, simplicity is better than
                        complexity, expressiveness than silence. Simplicity and expressiveness
                        should be treated as complementary ideals. In cases where one must be chosen
                        over the other, simplicity is to be preferred. </para>
                  </listitem>
                  <listitem>
                     <para>Editors should be able to register doubt about claims. If in doubt about
                        an assertion, an editor should be able to state alternatives.</para>
                  </listitem>
                  <listitem>
                     <para>Editors should be able to work on the same material independently but
                        interoperably.</para>
                  </listitem>
                  <listitem>
                     <para>Editors should work freely within their theories, opinions, and
                        assumptions about language. They should declare those positions, not
                        suppress or alter them. </para>
                  </listitem>
               </itemizedlist>
            </para>
            <para><emphasis><emphasis role="bold">Scholarly responsibility: </emphasis>Scholars must
                  make their data uniquely citable, and responsibly describe how that data was
                  created.</emphasis></para>
            <para>
               <itemizedlist>
                  <listitem>
                     <para>Each TAN file should have an expressive, unique, persistent name that can
                        be cited and used independent of the file's location or availability.</para>
                  </listitem>
                  <listitem>
                     <para>Editors must supply, at the very minimum, the core statements of
                        responsibility that are normally expected in any scholarly work:</para>
                     <para>
                        <itemizedlist>
                           <listitem>
                              <para>What was done by whom, when.</para>
                           </listitem>
                           <listitem>
                              <para>What sources have been used.</para>
                           </listitem>
                           <listitem>
                              <para>Who holds rights over the data, and what reuse is
                                 permitted.</para>
                           </listitem>
                           <listitem>
                              <para>What editorial assumptions and decisions were made in creating
                                 the data.</para>
                           </listitem>
                        </itemizedlist>
                     </para>
                  </listitem>
               </itemizedlist>
            </para>
            <para><emphasis><emphasis role="bold">Utility to both computers and humans:
                  </emphasis>Data should be easy for both humans and computers to read and write;
                  the latter should be able to import, process, and create the data reliably,
                  consistently, and interoperably.</emphasis>
            </para>
            <para>
               <itemizedlist>
                  <listitem>
                     <para>The format should depend upon stable technologies or standards.</para>
                  </listitem>
                  <listitem>
                     <para>All classes and types of formats in the TAN suite should be structured
                        consistently and predictably.</para>
                  </listitem>
                  <listitem>
                     <para>As many computable inconsistencies or errors as possible should be
                        flagged by validation rules.</para>
                  </listitem>
                  <listitem>
                     <para>Every datum should be expressed in both a form that is as human readable
                        as possible and a form that is computer-readable, to make the material
                        suitable for linked data (semantic web) or for processing via an
                        algorithm.</para>
                  </listitem>
                  <listitem>
                     <para>In a given file, data should not be redundant, irrelevant to the
                        immediate points of inquiry, or more reliably and authoritatively found
                        elsewhere.</para>
                  </listitem>
                  <listitem>
                     <para>References to textual units or linguistic concepts should be expressed
                        .</para>
                  </listitem>
                  <listitem>
                     <para>Each TAN file, or collection of files, should be integrally complete and
                        fully useful, independent of any other software such as text processors or
                        version control software. </para>
                  </listitem>
               </itemizedlist>
            </para>
         </section>
         <section xml:id="tan_participation">
            <title>Participation</title>
            <para>Participants in testing, using, and developing the Text Alignment Network are
               welcome. Our core purpose is to develop and maintain the schemas, the guidelines,
               and the functions and templates. Inquiries about participation
               should be sent to the project manager, <link xlink:href="http://kalvesmaki.com/">Joel
                  Kalvesmaki</link>, by email: kalvesmaki at gmail.com.</para>
            <para>Official announcements are made by <link
               xlink:href="http://groups.google.com/group/textalign?hl=en">email (Google Group)</link> and
               by 
               <link xlink:href="https://twitter.com/textalign">Twitter</link>.</para>
         </section>
      </chapter>
      <chapter xml:id="gentle_guide">
         <title>Starting off with the TAN Format</title>
         <para>If you are new to markup languages, or if you are unfamiliar with acronyms such as
               <emphasis role="italic">XML</emphasis>, <emphasis role="italic">RDF</emphasis>,
               <emphasis role="italic">XPath</emphasis>, or technical terms such as
               <emphasis>Unicode</emphasis>, you should start with this chapter, which uses a simple
            example to illustrate the steps typically taken to create and edit TAN files. By the end
            of this chapter, you will be able to create and edit a simple collection of TAN
            transcriptions and alignments. If you are familiar with basic markup concepts, you may
            wish to read through the chapter very quickly, or skip it altogether.</para>
         <para>The discussion touches on a number of general concepts, some of which may be new.
            These concepts will be introduced only briefly. Further reading elsewhere will give you
            better grounding in a particular topic or technology. </para>
         <section>
            <title>Creating TAN Transcription and Alignment Data</title>
            <para>Let us take a simple example, that of aligning two English versions of the nursery
               rhyme <emphasis role="italic">Ring-a-ring-a-roses</emphasis>, sometimes known as
                  <emphasis role="italic">Ring around the Rosie</emphasis>. Our goal here is to
               publish two versions of the nursery rhyme in the TAN format so that they are most
               likely alignable with any other TAN version of the poem that someone might
               create.</para>
            <para>We begin by finding previously published versions. In this case we have taken an
               interest in the versions published in <link xlink:href="http://lccn.loc.gov/12032709"
                  >1881</link> and <link xlink:href="http://lccn.loc.gov/87042504">1987</link> (one
               published in the UK and the other, the US). Each of these books has other rhymes,
               but we've already decided to focus upon the one particular nursery rhyme, so we
               transcribe those parts and nothing else:<table frame="all">
                  <title>Ring around the Rosie</title>
                  <tgroup cols="2">
                     <colspec colname="c1" colnum="1" colwidth="1.0*"/>
                     <colspec colname="c2" colnum="2" colwidth="1.0*"/>
                     <thead>
                        <row>
                           <entry>1881 (UK) version</entry>
                           <entry>1987 (US) version</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry>
                              <para>Ring-a-ring-a-roses,</para>
                              <para>A pocket full of posies;</para>
                              <para>Hush! Hush! Hush! Hush!</para>
                              <para>We're all tumbled down.</para>
                           </entry>
                           <entry>
                              <para>Ring-a-round the rosie,</para>
                              <para>A pocket full of posies,</para>
                              <para>Ashes! Ashes!</para>
                              <para>We all fall down.</para>
                           </entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table></para>
            <para>We must be sure to save each of the two transcriptions as plain Unicode text,
               preferably with <code>.xml</code> at the end of each file name. Do not bother with
               word processors (Word, OpenOffice, Google Docs, and so forth), because those programs
               are too sophisticated for our work. They sometimes generate erroneous data, even when
               you export to plain text. We will be working with raw text, and will not be concerned
               with italics, colors, fonts, margins, and so forth. Much better for our work is a
                  <link xlink:href="http://en.wikipedia.org/wiki/Text_editor">text editor</link>,
               which handles nothing but plain text. But even those are inadequate, because they do
               not check to see if the rules of the format have been followed. So the best tool is
               an <link xlink:href="http://en.wikipedia.org/wiki/XML_editor">XML editor</link>,
               which does the same thing a text editor does, but with shortcuts that save much
               typing and prevent syntax errors. More important, an XML editor will tell us when
               our TAN file is invalid, and will provide information and help in our TAN files.<note>
                  <para>Software suitable for your needs comes in many styles and prices. In
                     addition to the links in the paragraph above, you may wish to visit the
                     comparative lists for both <link
                        xlink:href="http://en.wikipedia.org/wiki/Comparison_of_text_editors">text
                        editors</link> and <link
                        xlink:href="http://en.wikipedia.org/wiki/Comparison_of_XML_editors">XML
                        editors</link>. TAN was developed using <link
                        xlink:href="https://www.oxygenxml.com">oXygen</link>, which is so powerful
                     it may be very confusing to use at first. To avoid exasperation or despair,
                     take advantage of tutorials and documentation associated with the XML editor
                     you have chosen. </para>
               </note></para>
            <para>Our first task is to get these two versions into separate files with the
               appropriate markup. Each TAN transcription file has two major parts: a head and a
               body. For now, we focus on only the second part, the body, as well as a few of the
               necessary preliminary lines that stand above both the head and the body. First, the
               1881 (UK) version:
               <programlisting><emphasis role="bold">&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring01">
    &lt;head>
    . . . . . . .
    &lt;/head>
    &lt;body xml:lang="eng" in-progress="false">
        &lt;div type="line" n="1"></emphasis>Ring-a-ring-a-roses,<emphasis role="bold">&lt;/div>
        &lt;div type="line" n="2"></emphasis>A pocket full of posies;<emphasis role="bold">&lt;/div>
        &lt;div type="line" n="3"></emphasis>Hush! Hush! Hush! Hush!<emphasis role="bold">&lt;/div>
        &lt;div type="line" n="4"></emphasis>We're all tumbled down.<emphasis role="bold">&lt;/div>
    &lt;/body>
&lt;/TAN-T></emphasis></programlisting>
               And now the 1987 (US) version:
               <programlisting><emphasis role="bold">&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring02">
   &lt;head>
   . . . . . . .
   &lt;/head>
   &lt;body xml:lang="eng" in-progress="false">
      &lt;div type="l" n="1"></emphasis>Ring-a-round the rosie,<emphasis role="bold">&lt;/div>
      &lt;div type="l" n="2"></emphasis>A pocket full of posies,<emphasis role="bold">&lt;/div>
      &lt;div type="l" n="3"></emphasis>Ashes! Ashes!<emphasis role="bold">&lt;/div>
      &lt;div type="l" n="4"></emphasis>We all fall down.<emphasis role="bold">&lt;/div>
   &lt;/body>
&lt;/TAN-T></emphasis></programlisting>
            </para>
            <para>These are standard eXtensible Markup Language (XML) files. (If you are already
               familiar with XML you may wish to skip ahead to the next section.) XML is rather
               simple. It provides a way to take a text or a collection of data and give it some
               structure through markup. In the examples above, the markup is in boldface.</para>
            <para>Each file begins with a prolog, marked by the lines that begin with
                  <code>&lt;?</code>. The first line in the prolog simply states that what follows
               is an XML document. The next two lines point to the files that will be used to check
               to see whether or not our data is valid. For now we will skip the specific details of
               those first three lines, which will be identical, or nearly so, from one TAN file to
               the next. We can simply cut and paste those lines when we want to start a new
               one.</para>
            <para>The fourth line is the opening tag of what is called the root element, here called
                     <code><link linkend="element-TAN-T">&lt;TAN-T></link></code>. That opening tag,
                  <code>&lt;TAN-T...></code>, is answered by a closing tag, <code>&lt;/TAN-T></code>,
               on the last line. The paired-tag relationship is true for all the other elements in this
               example. <code><link linkend="element-head">&lt;head></link></code> is answered by
                  <code>&lt;/head></code>, <code><link linkend="element-body"
                  >&lt;body></link></code> by <code>&lt;/body></code> and each
                  <code>&lt;div...></code> by <code>&lt;/div></code>. These elements nest within or
               beside each other, but they never overlap. (The prohibition on overlapping elements
               is one of the cardinal rules of XML.) This relationship means that every XML file can
               be thought of as a tree, with the root at the trunk and the enveloped elements as
               branches, terminating in metaphorical leaves. It is helpful to use the tree metaphor
               when we describe the path we take, toward either the leaves or the root. In this
               manual, we may use the terms <emphasis role="italic">rootward</emphasis> and
                  <emphasis role="italic">leafward</emphasis> when we want to trace movement within
               an XML document.</para>
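            <para>The rule against overlap can be shown with a small sketch. The fragments below
               are illustrative only and are not taken from the example files; the second one
               contains a deliberate, hypothetical mistake, included only to show what overlapping
               elements look like.
               <programlisting>&lt;!-- Well formed: each div closes before the next one opens -->
&lt;body xml:lang="eng" in-progress="false">
    &lt;div type="line" n="1">Ring-a-ring-a-roses,&lt;/div>
    &lt;div type="line" n="2">A pocket full of posies;&lt;/div>
&lt;/body>

&lt;!-- Not well formed (hypothetical mistake): the div is still open when body closes -->
&lt;body xml:lang="eng" in-progress="false">
    &lt;div type="line" n="1">Ring-a-ring-a-roses,
&lt;/body>
&lt;/div></programlisting>
               In the second fragment an XML parser would stop at <code>&lt;/body></code> with a
               well-formedness error, because the most recently opened element,
                  <code>&lt;div></code>, has not yet been closed.</para>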
            <para>An XML document is also profitably thought of as a family tree, a metaphor that
               provides commonly used terminology. In our examples above, <code><link
                     linkend="element-TAN-T">&lt;TAN-T></link></code> is the <emphasis role="italic"
                  >parent</emphasis> of <code><link linkend="element-body">&lt;body></link></code>,
               and <code><link linkend="element-body">&lt;body></link></code> the parent of the four
                     <code><link linkend="element-div">&lt;div></link></code> elements. Likewise,
               each <code><link linkend="element-div">&lt;div></link></code> is the <emphasis
                  role="italic">child</emphasis> of <code><link linkend="element-body"
                     >&lt;body></link></code>, and <code><link linkend="element-body"
                     >&lt;body></link></code> is the child of <code><link linkend="element-TAN-T"
                     >&lt;TAN-T></link></code>. Distant parental relationships can be described with
               the terms <emphasis role="italic">ancestor</emphasis> and <emphasis role="italic"
                  >descendant</emphasis>. <code><link linkend="element-TAN-T"
                  >&lt;TAN-T></link></code> is the ancestor of every element it encompasses, and
               every element encompassed by <code><link linkend="element-TAN-T"
                  >&lt;TAN-T></link></code> is its descendant. Paratactic relationships are also
               important. <code><link linkend="element-head">&lt;head></link></code> and <code><link
                     linkend="element-body">&lt;body></link></code> are <emphasis role="italic"
                  >siblings</emphasis> to each other, and every <code><link linkend="element-div"
                     >&lt;div></link></code> is a sibling to every other <code><link
                     linkend="element-div">&lt;div></link></code>.</para>
            <para>Inside of the opening tags for the <code><link linkend="element-TAN-T"
                     >&lt;TAN-T></link></code>, <code><link linkend="element-body"
                  >&lt;body></link></code>, and <code><link linkend="element-div"
                  >&lt;div></link></code> elements are pairs of text joined by an equals sign,
               collectively called an attribute. The left side of the equals sign is the attribute
               name, and on the right side, within the quotation marks, is the attribute value.
                     <code><link linkend="element-TAN-T">&lt;TAN-T></link></code> has two
               attributes, <code>@xmlns</code> and <code><link linkend="attribute-id"
                  >@id</link></code> (when we discuss an attribute outside its original context, we
               often preface the name with @). We will skip <code>@xmlns</code> for now; this
               attribute (actually, a pseudo-attribute) specifies the
                  <emphasis>namespace</emphasis> of the XML file, a somewhat advanced topic.</para>
            <para>The value of <code><link linkend="attribute-id">@id</link></code>, however, is
               quite important and our first item of business. Every TAN file has an <code><link
                     linkend="attribute-id">@id</link></code> that uniquely and permanently
               identifies the file itself. It is quite similar to the name we give a file when we
               save it, and to the names we see when we browse the local contents of our computer,
               except that it should not be changed from one revision to the next. When we want to
               record changes to our file, we will not alter the <code><link linkend="attribute-id"
                     >@id</link></code> value, but simply note the change elsewhere in the document
               (see below).</para>
            <para>The value of <code><link linkend="attribute-id">@id</link></code> is always what
               is called a tag uniform resource name (tag URN). It always starts with
                  <code>tag:</code>, followed by an email address or domain name that we own or
               owned. (It is okay to use an obsolete address.) After that email address or domain
               name comes a comma (no spaces) and a date on which we owned it, in the international
               standard format of year, month, and day, joined by hyphens, e.g., 2014-12-31. If we
               leave off a day value, it is assumed to be the first of the month; if we leave off
               the month value, it is assumed to be January. In the examples above,
                  <code>[USER@DOMAIN.NET],2014</code> indicates that the email address was owned at
               the stroke of midnight (Coordinated Universal Time) on January 1, 2014. After that comes
               a colon, and then any name we wish to assign to the file. </para>
            <para>We have anticipated a simple collection of texts, so we've called the files
                  <code>ring01</code> and <code>ring02</code>. (If we run out of names, or want to
               restart, we can simply use a new email-date preface, e.g.,
                  <code>[USER@DOMAIN.NET],2014-01-02</code>.)</para>
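            <para>As a rough sketch of the anatomy just described, the identifier of the first
               file, assembled from the parts named above, can be read in four pieces. The
               annotations on the right are explanatory only and are not part of the
               value:<programlisting>tag:                  scheme prefix, always the same
[USER@DOMAIN.NET]     email address or domain name that we own or owned
,2014                 comma plus a date of ownership, here January 1, 2014
:ring01               colon plus the name we assign to the file

tag:[USER@DOMAIN.NET],2014:ring01</programlisting></para>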
            <para>The element <code><link linkend="element-body">&lt;body></link></code> contains
               our transcription. <code><link linkend="attribute-xmllang">@xml:lang</link></code>,
               required, specifies the principal language of the transcribed text. We use the
               standard 3-letter abbreviation for English. (See later in the guide for more complex
               language requirements.) By saying that <code><link linkend="attribute-in-progress"
                     >@in-progress</link></code> is <code>false</code>, we indicate that we have
               finished our transcription and have no further plans to develop it. It doesn't mean
               that the file is free of errors. We can still make corrections later. It just means
               that we have no more revisions planned, and any further changes will be restricted to
               corrections of errors. This attribute is optional. If it is left off, our TAN file is
               assumed to be a work in progress, and it serves as a kind of warning to anyone who
               might want to use it.</para>
            <para>Our transcription has been divided into four <code><link linkend="element-div"
                     >&lt;div></link></code> elements. How we divide up the work is entirely up to
               us. But we must make sure that every bit of text is enclosed by a leafmost
                     <code><link linkend="element-div">&lt;div></link></code>. That is, every
                     <code><link linkend="element-div">&lt;div></link></code> must be the parent of
               only other <code><link linkend="element-div">&lt;div></link></code>s, or none at all.
               We cannot have a <code><link linkend="element-div">&lt;div></link></code> that mixes
               text with other elements (such as other <code><link linkend="element-div"
                     >&lt;div></link></code>s). The values of <code><link linkend="attribute-type"
                     >@type</link></code> and <code><link linkend="attribute-n">@n</link></code>
               indicate, respectively, the type of division and the name of the division. We have
               used <code>line</code> in the first example, but we could easily have also used
                  <code>l</code> (as we did in the second) or <code>ln</code> or any other phrase
               that we think will make intuitive sense to other users. The choice is arbitrary (we
               will see why below). We have used Arabic numerals for the values of <code><link
                     linkend="attribute-n">@n</link></code>, but the value, once again, could have
               been anything. We could have used Roman numerals, or some other naming scheme that is
               standard in the field.</para>
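            <para>The rule about leafmost divisions can be illustrated with a minimal sketch. The
               division type <code>stanza</code> and the bracketed placeholder text are
               hypothetical, introduced only for illustration; they do not come from our two
               transcriptions:<programlisting>&lt;!-- Acceptable: each div holds either other divs or text, never both -->
&lt;div type="stanza" n="1">
    &lt;div type="line" n="1">[text of line 1]&lt;/div>
    &lt;div type="line" n="2">[text of line 2]&lt;/div>
&lt;/div>

&lt;!-- Not acceptable: text sits beside a child div -->
&lt;div type="stanza" n="1">
    [stray text]
    &lt;div type="line" n="1">[text of line 1]&lt;/div>
&lt;/div></programlisting></para>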
            <para>Aside from the <code><link linkend="element-head">&lt;head></link></code> element
               (discussed later), that's all we need in the transcription. We can now move to
               alignment.</para>
            <para>There are two different types of alignment, one emphasizing breadth, the other,
               depth. The broad type of alignment, called TAN-A-div, allows us to specify TAN
               transcriptions of as many versions of as many works as we wish, and to fine-tune the
               alignment upon the basis of the <code><link linkend="element-div"
                  >&lt;div></link></code> elements within the transcription. We do not specify why
               we wish to align the versions. We only declare our interest in doing so. The other
               type of alignment, emphasizing depth, is called TAN-A-tok and allows us to take any
               two (and no more) TAN transcriptions, create word-to-word (or better put,
               token-to-token) relationships, and specify what type of relationship holds between
               each set of aligned words. TAN-A-div is suitable for work that focuses on the general
               alignment of multiple versions of one or more works at a single time. TAN-A-tok is
               for highly detailed, precise alignment of two text versions.</para>
            <para>For our example, we start with a TAN-A-div file (once again suppressing
                     <code><link linkend="element-head"
               >&lt;head></link></code>):<programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-A-div.rnc" type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-A-div.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;TAN-A-div xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring-alignment">
    &lt;head>
    . . . . . . .
    &lt;/head>
    &lt;body/>
&lt;/TAN-A-div></programlisting></para>
            <para>In the prolog, the first line is identical to the first line of our transcription
               files. The second and third lines are identical, aside from pointing to the
               validation files for alignment. Even the fourth line looks like the transcription
               file, other than the new name for the root element, <code><link
                     linkend="element-TAN-A-div">&lt;TAN-A-div></link></code>, and the new value for
                     <code><link linkend="attribute-id">@id</link></code>.</para>
            <para>The penultimate line, <code>&lt;body/></code>, is what is called an empty element,
               and is equivalent to <code><link linkend="element-body"
                  >&lt;body></link>&lt;/body></code>. Collapsing the opening and the closing tags of
               the element into a single tag provides a shorthand syntax for elements that contain
               nothing. It will become apparent, when we discuss <code><link linkend="element-head"
                     >&lt;head></link></code> below, why our <code><link linkend="element-body"
                     >&lt;body></link></code> can be empty.</para>
            <para>The other kind of alignment, TAN-A-tok, takes a bit more work, because we must
               first identify words that correspond with each other. Even before we do that, we need
               to decide what kind of relationship holds between the two texts. Let us pretend, for
               the sake of example, that the 1987 version is a direct descendant (and therefore
               variation) of the 1881 one. So our task is to show exactly what parts of the
               older version correspond to those of the newer one. We will simplify in this case,
               and assume an interest only in words, ignoring spaces and punctuation. We will
               also adopt the term <emphasis>tokens</emphasis> instead of <emphasis>words</emphasis>
                  (<emphasis role="italic">word</emphasis> is notoriously difficult to define, and
               has connotations lacking from <emphasis>token</emphasis>).</para>
            <para>We now create a TAN-A-tok
               file:<programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-A-tok.rnc" type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-A-tok.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;TAN-A-tok xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:TAN-A-tok,ring01+ring02">
    &lt;head>
    . . . . . . .
    &lt;/head>
    &lt;body bitext-relation="B-descends-from-A" reuse-type="adaptation" in-progress="false">
        &lt;!-- Examples of picking tokens by number -->
        &lt;align>
            &lt;tok src="ring1881" ref="1" pos="1"/>
            &lt;tok src="ring1987" ref="1" pos="1"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="1" pos="2"/>
            &lt;tok src="ring1987" ref="1" pos="2"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="1" pos="3"/>
            &lt;tok src="ring1987" ref="1" pos="3"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="1" pos="4"/>
            &lt;tok src="ring1987" ref="1" pos="4"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="1" pos="5"/>
            &lt;tok src="ring1987" ref="1" pos="5"/>
        &lt;/align>
        &lt;!-- Examples of picking tokens by value -->
        &lt;align>
            &lt;tok src="ring1881" ref="2" val="A"/>
            &lt;tok src="ring1987" ref="2" val="A"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="2" val="pocket"/>
            &lt;tok src="ring1987" ref="2" val="pocket"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="2" val="full"/>
            &lt;tok src="ring1987" ref="2" val="full"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="2" val="of"/>
            &lt;tok src="ring1987" ref="2" val="of"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="2" val="posies"/>
            &lt;tok src="ring1987" ref="2" val="posies"/>
        &lt;/align>
        &lt;!-- Examples of picking ranges of tokens -->
        &lt;align>
            &lt;tok src="ring1881" ref="3" pos="1, 2"/>
            &lt;tok src="ring1987" ref="3" pos="1"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="3" pos="3 - 4"/>
            &lt;tok src="ring1987" ref="3" pos="2"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="4" pos="1"/>
            &lt;tok src="ring1987" ref="4" pos="1"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="4" pos="2"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="4" pos="3"/>
            &lt;tok src="ring1987" ref="4" pos="2"/>
        &lt;/align>
        &lt;!-- Examples of using "last" -->
        &lt;align>
            &lt;tok src="ring1881" ref="4" pos="last-1"/>
            &lt;tok src="ring1987" ref="4" pos="last-1"/>
        &lt;/align>
        &lt;align>
            &lt;tok src="ring1881" ref="4" pos="last"/>
            &lt;tok src="ring1987" ref="4" pos="last"/>
        &lt;/align>
    &lt;/body>
&lt;/TAN-A-tok></programlisting></para>
            <para>Once again, the first four lines, the prolog and root element, should look
               familiar, with the only significant changes being the names of the validation files,
               the name of the root element (<code><link linkend="element-TAN-A-tok"
                     >&lt;TAN-A-tok></link></code>) and the value of <code><link
                     linkend="attribute-id">@id</link></code>.</para>
            <para>The heart of the data is <code><link linkend="element-body"
                  >&lt;body></link></code>, which has, in addition to <code><link
                     linkend="attribute-in-progress">@in-progress</link></code>, two more
               attributes, <code><link linkend="attribute-reuse-type">@reuse-type</link></code>,
               which specifies the default type of relationship between the two sources, and
                     <code><link linkend="attribute-bitext-relation">@bitext-relation</link></code>,
               which specifies how the versions relate to each other. Our two values,
                  <code>adaptation</code> and <code>B-descends-from-A</code> respectively, are arbitrary names
               that we define in the <code><link linkend="element-head">&lt;head></link></code>
               (discussed later). </para>
            <para><code><link linkend="element-body">&lt;body></link></code> is the parent of one or
               more <code><link linkend="element-align">&lt;align></link></code> elements, each of
               which correlates a set of tokens in the two texts through its <code><link
                     linkend="element-tok">&lt;tok></link></code> children. Each <code><link
                     linkend="element-tok">&lt;tok></link></code> has, in this example, three
               attributes. <code><link linkend="attribute-src">@src</link></code> takes a nickname
               (an <code><link linkend="attribute-id">@id</link></code> reference) that points to
               one of the two transcriptions; we have used <code>ring1881</code> and
                  <code>ring1987</code> but we could have just as easily used anything else such as
                  <code>uk</code> and <code>us</code>. <code><link linkend="attribute-ref"
                     >@ref</link></code> has a value that points to a specific <code><link
                     linkend="element-div">&lt;div></link></code> in the source transcription; and
                     <code><link linkend="attribute-pos">@pos</link></code> or <code><link
                     linkend="attribute-val">@val</link></code> specifies which token is intended,
               either by word number (<code><link linkend="attribute-pos">@pos</link></code>) or
               text of the actual word (<code><link linkend="attribute-val">@val</link></code>).
               Either technique is fine, and the two can be mixed, as in the example. You may also notice
               that the comma and hyphen can be used in <code><link linkend="attribute-pos"
                     >@pos</link></code> to point to multiple words within the same <code><link
                     linkend="element-div">&lt;div></link></code>, and that <code>last</code> and
                  <code>last-X</code> (where <code>X</code> is a digit) can be used to point to a
               word token relative to the last one in a <code><link linkend="element-div"
                     >&lt;div></link></code>.</para>
            <para>Each <code><link linkend="element-align">&lt;align></link></code> can establish
               one-to-one, one-to-many, many-to-one, or many-to-many relationships between the two
               texts. Words may feature in multiple <code><link linkend="element-align"
                     >&lt;align></link></code> elements (that is, overlapping is permissible). And
               if an <code><link linkend="element-align">&lt;align></link></code> has <code><link
                     linkend="element-tok">&lt;tok></link></code> elements belonging to only one
               source, such as in the fourth-to-last <code><link linkend="element-align"
                     >&lt;align></link></code> above, we have what is called, in these guidelines, a
                  <emphasis>half-null alignment</emphasis>. This half-null alignment indicates that
               the second word of line four of the 1881 version is excluded from the act that we
               have called <code>adaptation</code> (which is, as we shall see, defined in the
                     <code><link linkend="element-head">&lt;head></link></code>). If this were a
               translation, it would be as if we were saying that this word was excluded from the
               translation. (A half-null alignment containing only tokens of the later source might
               point to words that the translator added.) </para>
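            <para>The kinds of relationship just described might be sketched as follows. These
               pairings are hypothetical, chosen only to show the syntax with tokens from line 2;
               they are not claims about how the two versions actually
               correspond:<programlisting>&lt;!-- One-to-many: one token of the 1881 version, two of the 1987 version -->
&lt;align>
    &lt;tok src="ring1881" ref="2" val="pocket"/>
    &lt;tok src="ring1987" ref="2" val="pocket"/>
    &lt;tok src="ring1987" ref="2" val="full"/>
&lt;/align>
&lt;!-- Half-null: only the later source is present, as if the word had been added -->
&lt;align>
    &lt;tok src="ring1987" ref="2" val="of"/>
&lt;/align></programlisting></para>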
            <para>A half-null alignment should not be confused with our own silence. As creators of
               this file, we are under no obligation to indicate every word-for-word correspondence.
               If we fail to mention certain words, all that can be implied is that we opted not to
               say anything about them.</para>
            <para>We could have aligned the two texts in different ways. Perhaps further study will
               reveal that we were in error to associate the second "ring" with "round" in line 1.
               We can make corrections, even after publication, and signal the change to users of
               our data. There are also ways to express doubt or alternative opinions. We can even
               correlate fragments of tokens (letters, prefixes, infixes, or suffixes). All these
               more advanced uses are discussed in the detailed parts of these guidelines.</para>
         </section>
         <section>
            <title>The Principles of TAN Metadata (<code><link linkend="element-head"
                     >&lt;head></link></code>)</title>
            <para>At this point, we have finished four TAN files: two transcriptions, one TAN-A-div
               file, and one TAN-A-tok file. So far we have suppressed the <code><link
                     linkend="element-head">&lt;head></link></code> in all of them. Before
               getting into details, we first need to discuss a few principles that TAN
               relies upon.</para>
            <para>Unlike <code><link linkend="element-body">&lt;body></link></code>, which carries
               the raw data, <code><link linkend="element-head">&lt;head></link></code> contains
               what is oftentimes called metadata. That is, <code><link linkend="element-head"
                     >&lt;head></link></code> contains data that describes the data. Because the TAN
               format is intended primarily to serve scholars, and because the format is heavily
               regulated (that is, there are numerous validation rules that supplement the basic
               ones behind XML), the metadata requirements are stricter than those of other formats.
               Scholars who use our data really need to know some essential things before they can
               responsibly use the data we produce. For example, what are the sources we have used?
               Who produced the data? When? What key assumptions have been made in producing the
               data? What rights do other people have to use the data? The questions are not
               difficult to answer, but they are critical, and we should take the time we need to
               get correct answers.</para>
            <para>Some of these questions are specific to certain types of data. For example, in a
               TAN-A-tok file, we ask what relationship the two sources hold to each other, a question that
               makes no sense for a TAN-T file. Other questions apply universally across all TAN
               files, no matter what kind of data. As we go from one TAN format to the next, we
               need to deal as much as we can with similar structures and expectations. This reduces
               any potential confusion in creating and editing a TAN file, and helps other people
               using our data to find the information they want. More importantly, what we write in
               one file might save us some work in another.</para>
            <para>The rigorous scholarly requirements for TAN metadata are offset somewhat by
               another principle that was adopted in the design of TAN, namely, that each format's
                     <code><link linkend="element-head">&lt;head></link></code> should focus
               exclusively upon the data in <code><link linkend="element-body"
                  >&lt;body></link></code> and not other things. That is to say, in a transcription,
               we should definitely indicate what our source is. But we should not try to write a
               catalog entry, or even a structured citation, for the book we have used. We are not
               library catalogers. Our obligation is merely to point somewhere a reader can get more
               complete information. The <code><link linkend="element-head">&lt;head></link></code>
               is designed to help us to stay focused on the task and data at hand.</para>
            <para>TAN was also designed with the assumption that all metadata should be useful to
               both humans and computers. For our example above, we must describe the work we have
               chosen in such a way that the phrase <emphasis role="italic">Ring around the
                  Rosie</emphasis> is comprehensible not just to the reader but to the computer,
               using syntax that a computer can be programmed to act upon.</para>
            <para>Take for example the 1881 book we have used for our first transcription. For the
               human reader we can say simply something like "Kate Greenaway, <emphasis>Mother
                  Goose</emphasis>, New York, G. Routledge and sons [1881]". But computers need a
               more controlled, predictable syntax before they can be directed to the correct
               edition of <emphasis>Mother Goose</emphasis> (or rather to a digital surrogate of the
               edition). The human-readable string is too complex and syntactically opaque. A more
               computer-friendly kind of identifier is the international standard book number (ISBN),
               which distinguishes the 1984 version of <emphasis>Mother Goose</emphasis> illustrated
               by Kayoko Okumura from the one of the same year illustrated by William Joyce. The
               ISBNs for the Okumura version, 0671493159, and for Joyce's, 0394865340, can be
               converted into machine-actionable strings called uniform resource names (URNs), in
               this case <code>urn:isbn:0-671493159</code> and <code>urn:isbn:0-394865340</code>.
               (Our 1881 version was published before the ISBN program was introduced. We will see
               below other ways to name it.)</para>
            <para>URNs are families of formalized naming schemes regulated by a central body
               (Internet Assigned Numbers Authority, IANA) to ensure that people and organizations
               can legitimately coin and use permanent, persistent, unique names for various types
               of things. There are URN schemes for journals (via ISSNs), articles (DOIs), and
               movies (ISANs), which means that anyone can refer to them unambiguously in a manner
               that is computer-friendly.</para>
            <para>All URNs are simply names. They don't tell you where an object is, just what its
               name is. To provide a unique <emphasis role="italic">location</emphasis>, however, we
               have uniform resource locators (URLs), which might be much more familiar from daily
               use of the Internet, e.g., <code>http://academia.edu</code>. Like URNs, URLs are also
               centrally regulated, with individuals or organizations buying the rights to domain
               names from a central registry (usually through a third-party vendor).</para>
            <para>Both URNs and URLs can be thought of as the same type of thing, namely, a
               uniform resource identifier (URI), whose generalization is the internationalized resource
               identifier (IRI). An IRI extends the URI to allow characters from any script in Unicode, not
               just the Latin letters of ASCII. URIs/IRIs are, in essence, nothing more than the set of all URNs and
               URLs. These four acronyms can be easily confused, and it is best to disambiguate them
               by thinking of the last letter in each. UR<emphasis role="bold"
                  >I</emphasis>s/IR<emphasis role="bold">I</emphasis>s <emphasis role="bold"
                  >I</emphasis>ncorporate both <emphasis role="bold">L</emphasis>ocators
                  (UR<emphasis role="bold">L</emphasis>) and <emphasis role="bold">N</emphasis>ames
                  (UR<emphasis role="bold">N</emphasis>).</para>
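            <para>Using only identifiers that have appeared in these guidelines so far, the
               distinction can be laid out as follows. The annotations on the right are explanatory
               only; all three strings are URIs/IRIs:<programlisting>urn:isbn:0-671493159                           URN: a name, with no location implied
tag:parkj@textalign.net,2015:ring-alignment    tag URN: a name we coin ourselves
http://academia.edu                            URL: a location</programlisting></para>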
            <para>IRIs are essential to a system frequently called the semantic web or linked (open)
               data, an agreed way of writing and processing data that relies upon IRIs and a simple
               data model to connect them. The semantic web allows independent parties to make
               assertions about things, and if they happen to use the same IRI vocabulary to
               describe those things, then we can program computers to make associations between
               disparate, heterogeneous datasets. This allows us to find connections across
               disciplines and projects, to marshal computers to make inferences we might not make on
               our own, and to create a network of linked data.</para>
            <para>TAN has been designed to be linked-data friendly, and so requires in its
                     <code><link linkend="element-head">&lt;head></link></code> almost all data to
               be representable not just in a human-readable form but also computer-readable, as an
               IRI. </para>
            <para>Our first task, then, in writing the <code><link linkend="element-head"
                     >&lt;head></link></code> sections of our four TAN files is to look for IRI
               vocabulary that will be familiar to the community of practice most likely to use our
               files. In trying to find suitable IRIs, we will find that the persons, things, and
               concepts we want to describe will range from the highly familiar to the
               unfamiliar.</para>
            <para><emphasis role="italic">Highly familiar</emphasis>: The two books that provide the
               basis of our transcription are well catalogued and generally known. A number of
               services provided by librarians provide a controlled IRI vocabulary that can be used
               by anyone to describe uniquely a particular version of a book. <link
                  xlink:href="http://www.worldcat.org">WorldCat</link> (run by OCLC) and the <link
                  xlink:href="http://catalog.loc.gov">Library of Congress</link> are good examples.
               In our case, we have found accurate Library of Congress IRIs for both editions of
                  <emphasis>Mother Goose</emphasis>: <code>http://lccn.loc.gov/12032709</code> and
                  <code>http://lccn.loc.gov/87042504</code>. Observe that these two IRIs are also,
               perhaps confusingly, URLs. If we paste these strings into our browser, we retrieve a
               record that describes the book. This locator does not lead us to the book per se,
               only to information <emphasis role="italic">about</emphasis> the book. Nevertheless,
               the Library of Congress has decided to coin this URL also as an IRI name for the
               book. Anyone who owns a domain name can designate a URL as a name for an object. And
               that allows them to set up their server to also return information about the object
               the IRI names. This subtle ambiguity—that the URL both names an entity and is a
               location for a webpage—can sometimes be confusing to those who are new to the
               semantic web, because such URLs name in reality two types of things: an entity and a
               location to find out more information about that entity. </para>
            <para>We now have IRIs for the sources. Let's now find an IRI to name the work,
                  <emphasis role="italic">Ring around the Rosie</emphasis>. The work is widely
               known, and even has a <link
                  xlink:href="http://en.wikipedia.org/wiki/Ring_a_Ring_o%27_Roses">Wikipedia
                  entry</link>. That Wikipedia entry is fortuitous. The Universities of Leipzig and
               Mannheim and Openlink Software have collaborated on a project called <link
                  xlink:href="http://wiki.dbpedia.org/About">DBPedia</link>, which is committed to
               providing a unique IRI for every Wikipedia entry in the major languages. The DBPedia
               IRI for the work we have chosen is
                  <code>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses</code>. Once again, this
               is both a name and a locator. It names a specific intangible object, namely a nursery
               rhyme that we've called <emphasis>Ring around the Rosie</emphasis>, no matter what
               specific version. But if you put that name into your browser, you will get back more
               information about that named object.</para>
            <para><emphasis role="italic">Familiar, but only in small circles</emphasis>: We will
               need to have names for some of the people who edited the file. Here we're not
               interested in the authors of our books. We are interested in crediting the people who
               helped make the TAN file. Most people who contribute to the creation of the data file
               will not be well-known, public figures. If they are, and if they are famous enough to
               have a Wikipedia entry, then a DBPedia IRI could be used. Or if some of the
               contributors are also published authors, there is a good chance that they are listed
               in the databases of either <link xlink:href="http://viaf.org">VIAF</link> or <link
                  xlink:href="http://isni.org">ISNI</link>, both of which publish unique IRIs for
               persons. </para>
            <para>Many contributors to TAN files, however, will not be listed in these general
               databases. In these cases, we can assign our own IRI to name these participants. We
               have already done something like this by assigning tag URNs to our four TAN
               files (the value of <code><link linkend="attribute-id">@id</link></code> in
               the root element). We can do the same for our editors. If a student Robin Smith has
               been helping with proofreading, we can take an email address for Robin (even one that
               doesn't work any more) and a date when the email address was used and construct a tag
               URN such as <code>tag:smith.robin@example.com,2012:self</code>. This has a slight
               drawback in that we cannot type this string into our browser to find out more about
               Robin, but it at least allows us to assign a name that will not be confused with
               the Robin Smith identified by ISNI as
                  <code>http://isni.org/isni/0000000043306406</code>. (If we want to go a step
               further, we could mint an IRI from a domain name that we own, and set up a linked data
               service that offers more information, human- and computer-readable, about Robin, but
               this is not required. And it can be a lot of work to maintain.)</para>
            <para>Another example of field-specific IRIs is the concept of relationship between two
               text-bearing objects. We are assuming for the sake of illustration that the version
               published in the 1987 <emphasis>Mother Goose</emphasis> is a direct descendant of the
               1881 version. Our assumption is important to declare, because if we had a different
               view on how one related to the other, it would probably affect the specifics of our
               word-for-word alignments. Because no suitable IRI vocabulary yet exists for such
               concepts, TAN has coined an IRI that can be used by anyone wishing to declare that
               the second of two sources descends from the first through an unknown number of
               intermediaries: <code>tag:textalign.net,2015:bitext-relation:a/x+/b</code>.</para>
            <para>We face a similar issue when thinking about text reuse. We generally consider the
               1987 version to be an adaptation of the 1881 version. But there are no stable,
               well-published IRI vocabularies for text reuse. So we adopt a TAN-coined IRI,
                  <code>tag:textalign.net,2015:reuse-type:adaptation:general</code>.</para>
            <para>For other examples of IRIs coined by TAN, see <xref linkend="keywords-master-list"
               />.</para>
            <para><emphasis role="italic">Generally unfamiliar</emphasis>: Some things or concepts
               will be known to very few people, perhaps only to us. If we plan to refer to that
               thing or concept often, it is preferable to coin a tag URN, as described above. But
               in some cases, we might find that a tag URN we minted for some concept or thing was,
               in hindsight, misleading or poorly constructed, because we hadn't taken into account
               other things that should be named. So if we wish to avoid these kinds of situations,
               we can assign a random IRI called a universally unique identifier (UUID), e.g.,
                  <code>urn:uuid:3fd9cece-b246-4556-b229-48f22a5ae2e0</code>. These uuid URNs, which
               are generated by computers through randomizing functions, are very useful. The
               likelihood that a randomly generated uuid will be identical to any other uuid is
               astronomically small, making them reliably unique names for anything (barring
               someone copying and reusing that uuid URN to name some other object or concept).
               Numerous free UUID generators can be found online.</para>
            <para>To humans, a UUID on its own is meaningless, and rather ugly. But it is a good
               start. We always have the option, later, of adding another IRI. It's perfectly fine to
               give one object or concept multiple IRIs. But the reverse is never true. One should
               never use the same IRI to identify more than one object or concept.</para>
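            <para>As a minimal sketch of this principle, the following fragment anticipates the
                  <code><link linkend="element-agent">&lt;agent></link></code> element described in
               the next section. It assumes, purely for illustration, that the UUID above was first
               assigned to Robin Smith and that the tag URN was added later. The
                  <code>xml:id</code> and <code>roles</code> values are hypothetical, and
                  <code>roles</code> presumes that a matching role is declared elsewhere in the same
               file. Both IRIs name the same
               person:<programlisting>    &lt;agent xml:id="smith" roles="proofreader">
        &lt;!-- assumption: the UUID was assigned first, the tag URN added later -->
        &lt;IRI>urn:uuid:3fd9cece-b246-4556-b229-48f22a5ae2e0&lt;/IRI>
        &lt;IRI>tag:smith.robin@example.com,2012:self&lt;/IRI>
        &lt;name>Robin Smith&lt;/name>
    &lt;/agent></programlisting></para>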
         </section>
         <section>
            <title>Creating TAN Metadata (<code><link linkend="element-head"
               >&lt;head></link></code>)</title>
            <para>Now that we have explored various IRI vocabularies for concepts around our
               versions of <emphasis>Ring-a-ring-a-roses</emphasis>, we can complete the
               metadata in our four TAN files. Let us start with the TAN-T file of the 1881
               version:<programlisting>    &lt;head>
        &lt;name>TAN transcription of Ring a Ring o' Roses&lt;/name>
        &lt;master-location>ring-o-roses.eng.1881.xml&lt;/master-location>
        &lt;rights-excluding-sources rights-holder="park">
            &lt;IRI>http://creativecommons.org/licenses/by/4.0/deed.en_US&lt;/IRI>
            &lt;name>This data file is licensed under a Creative Commons Attribution 4.0 International
                License. The license is granted independent of any rights and licenses that may be 
                associated with the source. &lt;/name>
        &lt;/rights-excluding-sources>
        &lt;source>
            &lt;IRI>http://lccn.loc.gov/12032709&lt;/IRI>
            &lt;name>Kate Greenaway, Mother Goose, New York, G. Routledge and sons [1881]&lt;/name>
        &lt;/source>
        &lt;declarations>
            &lt;work>
                &lt;IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses&lt;/IRI>
                &lt;name>"Ring a Ring o' Roses" or "Ring Around the Rosie"&lt;/name>
            &lt;/work>
            &lt;div-type xml:id="line">
                &lt;IRI>http://dbpedia.org/resource/Line_(poetry)&lt;/IRI>
                &lt;name>line of poetry&lt;/name>
            &lt;/div-type>
        &lt;/declarations>
        &lt;agent xml:id="park" roles="creator">
            &lt;IRI>tag:parkj@textalign.net,2015:self&lt;/IRI>
            &lt;name>Jenny Park&lt;/name>
        &lt;/agent>
        &lt;role xml:id="creator">
            &lt;IRI>http://schema.org/creator&lt;/IRI>
            &lt;name xml:lang="eng">creator&lt;/name>
        &lt;/role>
        &lt;change when="2014-08-13" who="park">Started file&lt;/change>
    &lt;/head></programlisting></para>
            <para><code><link linkend="element-name">&lt;name></link></code> is the human-readable
               form of the <code><link linkend="attribute-id">@id</link></code> that is inside the
               root element, <code><link linkend="element-TAN-T">&lt;TAN-T></link></code>. It can be
               anything. And we can supply more than one <code><link linkend="element-name"
                     >&lt;name></link></code>, in case we wish to provide it in different languages
               or variations.</para>
            <para><code><link linkend="element-master-location">&lt;master-location></link></code>
               is mandatory only if we have claimed through <code><link
                     linkend="attribute-in-progress">@in-progress</link></code> that the file is no
               longer in progress. One or more of these elements provide URLs where master versions
               of the file are kept (and updated). Each may be an absolute URL, such as an address on
               the Internet, or a relative URL, in case we are working exclusively on our
               local computer. We provide this as a courtesy to others who might be using our data.
               If someone downloads a copy and starts working with it, then whenever they validate
               the file, if it does not match the one in the master version, a warning is returned,
               along with a message or a location of the elements that were last changed. This
               allows users to find out if changes have been made, and it allows us to make
               corrections and silently notify other users of our alterations. To communicate this,
               we do not have to keep track of who is using the file.</para>
            <para><code><link linkend="element-rights-excluding-sources"
                     >&lt;rights-excluding-sources></link></code> contains information about rights
               to the data we are releasing. This element has nothing to do with the copyright of
               the source we have used (although, having been published in 1881, the book is clearly
               in the public domain). This once again gets to the TAN metadata principle of
               describing our data and not other things. We have the option to describe the license
               of the source we have used (see the rest of the guidelines for guidance), but we
               absolutely must declare whether we have placed additional strictures on the dataset
               we have created. That is, we are declaring the rights attached to the data, not its
               source. In this example, we have released the data under a Creative Commons license.
               The child element <code><link linkend="element-IRI">&lt;IRI></link></code> specifies
               the IRI assigned by Creative Commons, and <code><link linkend="element-name"
                     >&lt;name></link></code> describes it in human-readable format.</para>
            <para>The conjunction of <code><link linkend="element-IRI">&lt;IRI></link></code> and
                     <code><link linkend="element-name">&lt;name></link></code>, the <emphasis>IRI +
                  name pattern</emphasis>, is a recurrent feature of TAN files. We may include any
               number of <code><link linkend="element-IRI">&lt;IRI></link></code> or <code><link
                     linkend="element-name">&lt;name></link></code> elements in an IRI + name
               pattern. But if we do so, we are stating that they all name the same thing, not
               different things.</para>
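            <para>As a small sketch of the pattern, the single <code>&lt;name></code> given to the
               work in the example above could be split into two. The split is our own illustration,
               not part of the actual file. Both names, like the IRI, refer to the same nursery
               rhyme:<programlisting>    &lt;work>
        &lt;IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses&lt;/IRI>
        &lt;!-- two names, one work -->
        &lt;name>Ring a Ring o' Roses&lt;/name>
        &lt;name>Ring Around the Rosie&lt;/name>
    &lt;/work></programlisting></para>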
            <para><code><link linkend="element-source">&lt;source></link></code> points, through its
               IRI + name pattern, to a computer- and human-readable description of the book we have
               chosen. </para>
            <para><code><link linkend="element-declarations">&lt;declarations></link></code>
               contains data that is specific to TAN file types, to declare the assumptions we have
               made relevant to the kind of data we have created. In this case, because we are
               working with transcriptions, we have two major components: <code><link
                     linkend="element-work">&lt;work></link></code> and <code><link
                     linkend="element-div-type">&lt;div-type></link></code>. </para>
            <para><code><link linkend="element-work">&lt;work></link></code> uses the IRI + name
               pattern to name the work we have chosen to transcribe. <code><link
                     linkend="element-div-type">&lt;div-type></link></code> specifies the type of
               divisions we have chosen to use to segment the transcription. In a more complex text,
               there would be several <code><link linkend="element-div-type"
                  >&lt;div-type></link></code>s. Each one has an <code><link
                     linkend="attribute-xmlid">@xml:id</link></code>, which takes as a value some
               nickname that we wish to use for <code><link linkend="attribute-type"
                  >@type</link></code> values of <code><link linkend="element-div"
                  >&lt;div></link></code>s.</para>
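            <para>The connection between the declaration and its use can be sketched as follows. The
               body fragment is an assumption made for illustration: its layout, the value of
                  <code>@n</code>, and the line of text are not taken from the actual file. The
               point is only that <code>@type</code> repeats the nickname declared in
                  <code>@xml:id</code>:<programlisting>    &lt;!-- in the head: declare the nickname "line" -->
    &lt;div-type xml:id="line">
        &lt;IRI>http://dbpedia.org/resource/Line_(poetry)&lt;/IRI>
        &lt;name>line of poetry&lt;/name>
    &lt;/div-type>

    &lt;!-- in the body (hypothetical content): use the nickname -->
    &lt;div type="line" n="1">Ring-a-ring-a-roses,&lt;/div></programlisting></para>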
            <para>The IRI + name pattern is also used for <code><link linkend="element-agent"
                     >&lt;agent></link></code>, which describes who was involved in creating the
               data, and <code><link linkend="element-role">&lt;role></link></code>. We may have as
               many <code><link linkend="element-agent">&lt;agent></link></code>s and <code><link
                     linkend="element-role">&lt;role></link></code>s as we wish. The
                  <code>agent</code> in this case, Jenny Park, has been given a tag URN. The
                     <code><link linkend="element-IRI">&lt;IRI></link></code> value of <code><link
                     linkend="element-role">&lt;role></link></code> comes from the vocabulary of
                  <link xlink:href="http://schema.org">schema.org</link>, which is maintained by
               Bing, Google, and Yahoo! in conjunction with the W3C (the nonprofit organization
               dedicated to universal Internet standards), but we could have used Dublin Core or
               some other IRI vocabulary describing behaviors, responsibilities, and roles.<note>
                  <para>If you decide to modify someone else's TAN file, then you become responsible
                     for changes, not the original person or organization. Your first order of business
                     should be to add an <code><link linkend="element-agent">&lt;agent></link></code>
                     to the head, identifying yourself. You need not change the document's
                           <code><link linkend="attribute-id">@id</link></code>, but you should take
                     responsibility for any changes you make, probably using <code><link
                           linkend="element-change">&lt;change></link></code> or an <code><link
                           linkend="attribute-ed-who">@ed-who</link></code> and an <code><link
                           linkend="attribute-ed-when">@ed-when</link></code>. Otherwise you are
                     incorrectly attributing your changes to someone else.</para>
               </note></para>
            <para>Remember that <code><link linkend="element-head">&lt;head></link></code> is
               focused on the data, not its sources, so the claim that Jenny Park is the creator
               pertains only to the data. No inference should be made about who created the source.
               If someone wants that information, or anything else about the source, they should
               pursue the identifier we have provided under <code><link linkend="element-source"
                     >&lt;source></link></code>.</para>
            <para><code><link linkend="element-change">&lt;change></link></code> has attributes
                     <code><link linkend="attribute-when">@when</link></code> and <code><link
                     linkend="attribute-who">@who</link></code> that specify who made the
               change/comment and when. The value of <code><link linkend="attribute-when"
                     >@when</link></code> is always a date plus optional time formatted according to
               the standard <code>YYYY-MM-DD</code> + time (optional). <code><link
                     linkend="attribute-who">@who</link></code> always carries a value that refers
               to an <code>agent/<link linkend="attribute-xmlid">@xml:id</link></code>. Both
                     <code><link linkend="element-change">&lt;change></link></code> and
                     <code><link linkend="element-comment">&lt;comment></link></code> (missing here)
               lack IRIs, mainly because the likelihood that the data would ever be reused,
               repeated, or linked to is altogether too remote to make a mandated <code><link
                     linkend="element-IRI">&lt;IRI></link></code> useful.</para>
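            <para>A sketch of a date-only value and a date-plus-time value follows. The time portion
               assumes the ISO 8601 syntax used by the XML Schema <code>dateTime</code> type, which
               is our assumption and should be checked against the rest of the guidelines. The
               second entry and its wording are invented for
               illustration:<programlisting>    &lt;!-- date only -->
    &lt;change when="2014-08-13" who="park">Started file&lt;/change>
    &lt;!-- date plus time (hypothetical entry; time syntax assumed) -->
    &lt;change when="2014-08-14T09:30:00-05:00" who="park">Corrected punctuation&lt;/change></programlisting></para>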
            <para>So now we have finished one transcription file's metadata. The other one will look
               similar, but we'll also take a couple of nice
               shortcuts:<programlisting>    &lt;head>
      &lt;name>TAN transcription of Ring around the Rosie&lt;/name>
      &lt;master-location>ring-o-roses.eng.1987.xml&lt;/master-location>
      &lt;rights-excluding-sources which="by-nc-nd_2.0" rights-holder="park"/>
      &lt;source>
         &lt;IRI>http://lccn.loc.gov/87042504&lt;/IRI>
         &lt;name>Mother Goose, from nursery to literature / by Gloria T. Delamar, 1987.&lt;/name>
      &lt;/source>
      &lt;declarations>
         &lt;work>
            &lt;IRI>http://dbpedia.org/resource/Ring_a_Ring_o%27_Roses&lt;/IRI>
            &lt;name>Ring around the Rosie&lt;/name>
         &lt;/work>
         &lt;div-type xml:id="l" which="half-line (verse)"/>
         &lt;filter>
            &lt;normalization which="no hyphens"/>
         &lt;/filter>
      &lt;/declarations>
      &lt;agent xml:id="park" roles="creator">
         &lt;IRI>tag:parkj@textalign.net,2015:self&lt;/IRI>
         &lt;name xml:lang="eng">Jenny Park&lt;/name>
      &lt;/agent>
      &lt;role xml:id="creator" which="creator"/>
      &lt;change when="2014-10-24" who="park">Started file&lt;/change>
      &lt;comment when="2014-10-24" who="park">See p. 39 of source.&lt;/comment>
   &lt;/head></programlisting></para>
            <para>One significant difference is that three of the elements that normally take the
                  <xref linkend="pattern-iri_and_name"/> have been replaced with a simpler form that
               takes merely <link linkend="attribute-which"><code>@which</code></link> and
                     <code><link linkend="attribute-xmlid">@xml:id</link></code>. That is because
               TAN has predefined vocabulary that can be invoked by calling it (through <link
                  linkend="attribute-which"><code>@which</code></link>) and giving it an
               abbreviation to be used elsewhere in the document (<code><link
                     linkend="attribute-xmlid">@xml:id</link></code>).</para>
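            <para>The two forms can be set side by side, using <code>&lt;role></code> as it appears
               in the two transcription files. That the TAN keyword <code>creator</code> resolves to
               the same schema.org IRI is an assumption on our part, made only for the sake of
               comparison:<programlisting>    &lt;!-- full IRI + name pattern (1881 file) -->
    &lt;role xml:id="creator">
        &lt;IRI>http://schema.org/creator&lt;/IRI>
        &lt;name xml:lang="eng">creator&lt;/name>
    &lt;/role>

    &lt;!-- shortcut through @which (1987 file) -->
    &lt;role xml:id="creator" which="creator"/></programlisting></para>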
            <para>
               <code><link linkend="element-declarations">&lt;declarations></link></code> has a new
               child, <code><link linkend="element-filter">&lt;filter></link></code>, which contains
               a <code><link linkend="element-normalization">&lt;normalization></link></code>
               statement that declares, through the name and the IRI in the underlying TAN
               definition, that we have opted to remove word-break line-end hyphenation. This
               provides a cautionary note to users of our data who might value line-end hyphenation.
               Any number of <code><link linkend="element-normalization"
                  >&lt;normalization></link></code>s can be used to describe any alterations we
               might have made in our transcription. In other transcriptions we could use this
               feature to declare other suppressions, such as editorial comments or footnote
               signals.</para>
            <para>Note that the value of <code>div-type/<link linkend="attribute-xmlid"
                     >@xml:id</link></code> here, the letter <code>l</code>, differs from that of our
               previous transcription file, <code>line</code>. Even though we have adopted a
               different nickname, they are treated as equivalent because in each file we have
               defined <code>l</code> or <code>line</code> with the same IRI,
                  <code>http://dbpedia.org/resource/Line_(poetry)</code>. A computer that later
               looks for files with lines of poetry will not care about <code>l</code> and
                  <code>line</code>, but will look at the underlying IRI that defines these terms.
               This exemplifies how linked data (see above) can support our work. We are free to use
               abbreviations and terms that make sense to us, yet we can also tie those
               abbreviations into the larger infrastructure by means of IRIs. It also means that we
               can tether our texts to others on the basis of segments that may be generally rare
               and unfamiliar or common but only to a specific field (e.g., sections of a legal
               document).</para>
            <para>Now that we have created the metadata for our transcriptions, we turn to the
               alignment files. Those <code><link linkend="element-head">&lt;head></link></code>s
               will look slightly different. We start with the TAN-A-div
               file:<programlisting>    &lt;head>
       &lt;name>div-based alignment of multiple versions of Ring o Roses&lt;/name>
       &lt;master-location>ringoroses.div.1.xml&lt;/master-location>
       &lt;rights-excluding-sources which="by-nc-nd_4.0" rights-holder="park"/>
       &lt;source xml:id="eng-uk">
          &lt;IRI>tag:parkj@textalign.net,2015:ring01&lt;/IRI>
          &lt;name>Transcription of ring around the roses in English (UK)&lt;/name>
          &lt;location when-accessed="2015-03-10">../TAN-T/ring-o-roses.eng.1881.xml&lt;/location>
       &lt;/source>
       &lt;source xml:id="eng-us">
          &lt;IRI>tag:parkj@textalign.net,2015:ring02&lt;/IRI>
          &lt;name>Transcription of ring around the roses in English (US)&lt;/name>
          &lt;location when-accessed="2014-08-13">../TAN-T/ring-o-roses.eng.1987.xml&lt;/location>
       &lt;/source>
       &lt;declarations/>
       &lt;agent xml:id="park" roles="creator">
          &lt;IRI>tag:parkj@textalign.net,2015:self&lt;/IRI>
          &lt;name xml:lang="eng">Jenny Park&lt;/name>
       &lt;/agent>
       &lt;role xml:id="creator" which="creator"/>
       &lt;change when="2014-08-14" who="park">Started file&lt;/change>
    &lt;/head></programlisting></para>
            <para>Much of the code above will look similar to the previous two examples. Every
               alignment file has only one kind of source, namely TAN transcription files, nothing
               else. Therefore <code><link linkend="element-source">&lt;source></link></code>'s
                     <code><link linkend="element-IRI">&lt;IRI></link></code> always takes the
                     <code><link linkend="attribute-id">@id</link></code> value of the corresponding
               TAN transcription file. <code><link linkend="element-name">&lt;name></link></code> is
               arbitrary. It may replicate exactly the title found in the transcription file, or it
               may be modified, perhaps to harmonize better with the descriptions of the other texts
               aligned in the file. <code><link linkend="element-source">&lt;source></link></code>
               also has a child element not seen in the earlier two examples, <code><link
                     linkend="element-location">&lt;location></link></code>, which specifies where
               the digital file was accessed and when (through <code><link
                     linkend="attribute-when-accessed">@when-accessed</link></code>). We may include
               as many of these <code><link linkend="element-location">&lt;location></link></code>
               elements as we wish, with the most preferred or reliable location at the top, since
               the validation process will use the first document that is available. The <code><link
                     linkend="attribute-when-accessed">@when-accessed</link></code> value is
               important, because the validator will look for changes in the file, and if there have
               been changes since we last accessed the file, it will return a warning with a summary
               of the number and kind of changes. If such a report is returned, it is up to us to
               determine if the alterations merit any action on our part.</para>
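            <para>A source with more than one location might look like the sketch below. The second
                  <code>&lt;location></code> and its <code>example.com</code> address are
               hypothetical, added only to show the ordering from most to least
               preferred:<programlisting>    &lt;source xml:id="eng-uk">
        &lt;IRI>tag:parkj@textalign.net,2015:ring01&lt;/IRI>
        &lt;name>Transcription of ring around the roses in English (UK)&lt;/name>
        &lt;!-- preferred: local copy, tried first -->
        &lt;location when-accessed="2015-03-10">../TAN-T/ring-o-roses.eng.1881.xml&lt;/location>
        &lt;!-- fallback: hypothetical online copy -->
        &lt;location when-accessed="2015-03-10">http://example.com/TAN-T/ring-o-roses.eng.1881.xml&lt;/location>
    &lt;/source></programlisting></para>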
            <para>Our TAN-A-div file could have any number of <code><link linkend="element-source"
                     >&lt;source></link></code>s, and not necessarily for the same work. It also
               does not matter in which order we put the <code><link linkend="element-source"
                     >&lt;source></link></code>s. <code><link linkend="element-declarations"
                     >&lt;declarations></link></code> is empty, mainly because we have, in this
               case, no working assumptions to declare. In more advanced uses, this element would
               not be empty.</para>
            <para>This <code><link linkend="element-head">&lt;head></link></code> explains why the
                     <code><link linkend="element-body">&lt;body></link></code> of our TAN-A-div
               file is allowed to be empty. We have already specified which sources are to be
               aligned and where they are to be found. All TAN-A-div files assume, by default, that
               every source that is a version of the same work should be aligned upon the basis of
               the <code><link linkend="attribute-n">@n</link></code> value of <code><link
                     linkend="element-div">&lt;div></link></code>s. That is, any user or processor
               of a TAN-A-div file may assume that all implicit alignments should be made unless
               otherwise specified. </para>
            <para>For transcriptions that are already similarly structured and labeled, a TAN-A-div
               file is unnecessary for alignment. But we will see that the options available in a
               TAN-A-div's <code><link linkend="element-declarations"
                  >&lt;declarations></link></code> and <code><link linkend="element-body"
                     >&lt;body></link></code> will allow us not only to deal with inconsistencies in
               source transcriptions but to make important statements, such as indicating where one
               work quotes from another.</para>
            <para>Meanwhile we turn to our fourth file, TAN-A-tok, whose <code><link
                     linkend="element-head">&lt;head></link></code> looks like
               this:<programlisting>    &lt;head>
        &lt;name>token-based alignment of two versions of Ring o Roses&lt;/name>
        &lt;master-location>ringoroses.01+02.token.1.xml&lt;/master-location>
        &lt;rights-excluding-sources which="by-nc-nd_4.0" rights-holder="park"/>
        &lt;source xml:id="ring1881">
            &lt;IRI>tag:parkj@textalign.net,2015:ring01&lt;/IRI>
            &lt;name>Ring o roses 1881&lt;/name>
            &lt;location when-accessed="2015-01-17">../TAN-T/ring-o-roses.eng.1881.xml&lt;/location>
        &lt;/source>
        &lt;source xml:id="ring1987">
            &lt;IRI>tag:parkj@textalign.net,2015:ring02&lt;/IRI>
            &lt;name>Ring o roses 1987&lt;/name>
            &lt;location when-accessed="2015-01-17">../TAN-T/ring-o-roses.eng.1987.xml&lt;/location>
        &lt;/source>
        &lt;declarations>
            &lt;bitext-relation xml:id="B-descends-from-A">
                &lt;IRI>tag:textalign.net,2015:bitext-relation:a/x+/b&lt;/IRI>
                &lt;name>B descends directly from A, unknown number of intermediaries&lt;/name>
                &lt;desc>The 1987 version is hypothesized to descend somehow from the 
                    1881 version, mainly for the sake of illustration.&lt;/desc>
            &lt;/bitext-relation>
            &lt;reuse-type xml:id="adaptationGeneral">
                &lt;IRI>tag:textalign.net,2015:reuse-type:adaptation:general&lt;/IRI>
                &lt;name>general adaptation&lt;/name>
            &lt;/reuse-type>
            &lt;token-definition src="ring1881 ring1987" which="letters"/>
        &lt;/declarations>
        &lt;agent xml:id="park" roles="creator">
            &lt;IRI>tag:parkj@textalign.net,2015:self&lt;/IRI>
            &lt;name xml:lang="eng">Jenny Park&lt;/name>
        &lt;/agent>
        &lt;role xml:id="creator" which="creator"/>
        &lt;change when="2015-01-20" who="park">Started file&lt;/change>
    &lt;/head></programlisting></para>
            <para>The TAN-A-tok <code><link linkend="element-head">&lt;head></link></code> looks
               similar to the previous examples, except that <code><link
                     linkend="element-declarations">&lt;declarations></link></code> has three
               children.</para>
            <para><code><link linkend="element-bitext-relation">&lt;bitext-relation></link></code>
               states through an IRI + name pattern the stemmatic relationship we think holds
               between the two sources. (Stemmatics is the study of the chain of transmission by which a
               single work eventually became the multiple copies, versions, and editions that are
               extant; it frequently involves the creation of genealogical-like trees to illustrate
               the work's version history.) We have used the entire IRI + name pattern, but we could
               have substituted it with <link linkend="attribute-which"><code>@which</code></link>
               and the value <code>a/x+/b</code>.</para>
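            <para>Under that substitution the declaration would shrink to a single line. This sketch
               assumes that the shortcut keeps the same <code>@xml:id</code> and takes the same
               empty-element form seen with <code>&lt;role></code> in the listing
               above:<programlisting>    &lt;!-- shorthand for the full IRI + name pattern (form assumed) -->
    &lt;bitext-relation xml:id="B-descends-from-A" which="a/x+/b"/></programlisting></para>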
            <para>One or more <code><link linkend="element-reuse-type"
               >&lt;reuse-type></link></code>s specify how one text has reused another. The IRI we
               have used shows that we believe that the later text has generally adapted the earlier
               one. If this were a translation or a quotation or some other kind of text reuse, we
               might have used a different IRI.</para>
            <para>A third declaration, <code><link linkend="element-token-definition"
                     >&lt;token-definition></link></code>, specifies how we have defined our word
               tokens. <code><link linkend="attribute-src">@src</link></code> has more than one
               value, specifying that the same tokenization rule should be applied to both
               sources.</para>
            <para>The value for <link linkend="attribute-which"><code>@which</code></link>,
                  <code>letters</code>, is a reserved TAN keyword that specifies that a token is any
               consecutive string of word characters, ignoring spaces and punctuation. Under this
               token definition the phrase <code>"Hush!" said he</code> would have three tokens. Had
               we set the value of <link linkend="attribute-which"><code>@which</code></link> to the
               reserved TAN keyword <code>letters and punctuation</code>, we would have six tokens,
               since each punctuation mark would be defined as a token.</para>
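            <para>The two keywords can be compared in the sketch below. The two declarations are
               alternatives, not meant to appear together in one file, and the comments show how we
               understand the sample phrase to be divided under each
               definition:<programlisting>    &lt;!-- "Hush!" said he  =>  Hush | said | he  (3 tokens) -->
    &lt;token-definition src="ring1881 ring1987" which="letters"/>

    &lt;!-- "Hush!" said he  =>  " | Hush | ! | " | said | he  (6 tokens) -->
    &lt;token-definition src="ring1881 ring1987" which="letters and punctuation"/></programlisting></para>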
            <para><code><link linkend="element-token-definition">&lt;token-definition></link></code>
               is optional. If we leave it out, users are to assume that we mean
                  <code>letters</code>. This is because most often, whenever in ordinary
               conversation we refer to the nth word in a sentence we assume people will skip
               punctuation marks in their counting.</para>
         </section>
         <section>
            <title>Aligning across Projects</title>
            <para>We now have a small, tightly knit corpus of TAN files. Let us imagine what it
               might be like to connect our TAN corpus to another. Let us assume that we have found
               in a German project a TAN transcription of a work that looks quite similar to our
               own:<programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;TAN-T xmlns="tag:textalign.net,2015:ns" id="tag:hans@beispiel.com,2014:ringel">
   &lt;head>
      &lt;name>TAN Transkription, Ringelreihen mit Niederfallen&lt;/name>
      &lt;master-location>http://beispiel.com/TAN-T/ringel.xml&lt;/master-location>
      &lt;rights-excluding-sources rights-holder="schmidt">
         &lt;IRI>http://creativecommons.org/licenses/by/4.0/&lt;/IRI>
         &lt;name>Creative Commons Namensnennung 4.0 International Lizenz.&lt;/name>
         &lt;desc>Dieses Werk ist lizenziert unter einer Creative Commons 
            Namensnennung 4.0 International Lizenz.&lt;/desc>
      &lt;/rights-excluding-sources>
      &lt;source>
         &lt;IRI>http://www.worldcat.org/oclc/4574384&lt;/IRI>
         &lt;name>Franz Magnus Böhme, Deutsches Kinderlied und Kinderspiel: Volksüberlieferungen aus
            allen Landen deutscher Zunge, gesammelt, geordnet und mit Angabe der Quellen. Leipzig,
            1897.&lt;/name>
      &lt;/source>
      &lt;declarations>
         &lt;work>
            &lt;IRI>tag:beispiel.com,2014:texte:holderbusch&lt;/IRI>
            &lt;name>"Die Kinder auf dem Holderbusch"&lt;/name>
         &lt;/work>
         &lt;version>
            &lt;IRI>urn:uuid:31648039-3dbb-49b9-b66e-9bd2cd11630e&lt;/IRI>
            &lt;name>zweite Version&lt;/name>
         &lt;/version>
         &lt;div-type xml:id="Zeile">
            &lt;IRI>http://dbpedia.org/resource/Gedichtzeile&lt;/IRI>
            &lt;name>Gedichtzeile&lt;/name>
         &lt;/div-type>
         &lt;filter>
            &lt;normalization>
               &lt;IRI>tag:kalvesmaki@gmail.com,2014:normalization:hyphens-discretionary-off&lt;/IRI>
               &lt;name>Keine Bindestriche&lt;/name>
            &lt;/normalization>
         &lt;/filter>
      &lt;/declarations>
      &lt;agent xml:id="schmidt" roles="Produzent">
         &lt;IRI>tag:hans@beispiel.com,2014:selbst&lt;/IRI>
         &lt;name xml:lang="eng">Hans Schmidt&lt;/name>
      &lt;/agent>
      &lt;role xml:id="Produzent">
         &lt;IRI>http://schema.org/producer&lt;/IRI>
         &lt;name xml:lang="eng">Produzent&lt;/name>
      &lt;/role>
      &lt;change when="2014-08-13" who="schmidt">Anfang&lt;/change>
      &lt;comment when="2014-08-13" who="schmidt">unten auf der Z. 438, recht&lt;/comment>
   &lt;/head>
   &lt;body xml:lang="deu" in-progress="false">
      &lt;div type="Zeile" n="a">Ringel, Ringel, Reihe!&lt;/div>
      &lt;div type="Zeile" n="b">Sind der Kinder dreie,&lt;/div>
      &lt;div type="Zeile" n="c">Sitzen auf dem Holderbuch,&lt;/div>
      &lt;div type="Zeile" n="e">Schreien alle: husch, husch, husch!&lt;/div>
   &lt;/body>
&lt;/TAN-T></programlisting></para>
            <para>It seems clear to us that this 19th-century German version is quite similar to our
               two English versions. We have some alignment options open to us. Two more sets of
               word-for-word alignments would be interesting, but remember: finding a text that
               aligns nicely with others does not mean that we <emphasis role="italic"
                  >must</emphasis> align them, nor, if we do choose to align them, that we must
               align <emphasis>everything</emphasis>. In this case, we choose not to worry about
               word-for-word alignments, and we focus only on the TAN-A-div alignment, so that,
               for example, we can later generate an HTML report that lets us read the three
               versions in parallel and study their relationships.</para>
            <para>To that end, we first observe some differences between this transcription and our
               other two. First, the value of <code><link linkend="element-work"
                  >&lt;work></link></code> is not the one we have given our two versions. Second,
               the <code><link linkend="element-div-type">&lt;div-type></link></code> is defined as
                  <code>http://dbpedia.org/resource/Gedichtzeile</code> (Gedichtzeile = line of
               poetry). Third, the lines have been lettered instead of numbered. And last, the
               editor seems to have made a typographical error, labeling the last line
                  <code>n="e"</code> instead of <code>n="d"</code>. These four differences typify
               some of the inconsistencies that are commonly found in digital texts.<note>
                  <para>There are a few other differences in this third transcription that do not
                     affect our alignment. <code><link linkend="element-version"
                        >&lt;version></link></code> is used to distinguish different versions of the
                     same work found on the same text-bearing object. That is, if we are
                     transcribing a bilingual edition, we can use <code><link
                           linkend="element-version">&lt;version></link></code> to specify which of
                     the two versions we are encoding. Notice that the <code><link
                           linkend="element-IRI">&lt;IRI></link></code> value is a uuid. In this
                     case the editor was not prepared to deploy a formal IRI naming scheme (perhaps
                     using a tag URN) that would be satisfactory for work-versions.</para>
               </note></para>
            <para>These are points we can easily reconcile in our TAN-A-div file, which we now
               expand to include the German version. We make the following adjustments (in
               boldface):<programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-A-div.rnc" type="application/relax-ng-compact-syntax"?>
&lt;?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-A-div.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;TAN-A-div xmlns="tag:textalign.net,2015:ns" id="tag:parkj@textalign.net,2015:ring-alignment">
    &lt;head>
       &lt;name>div-based alignment of multiple versions of Ring o Roses&lt;/name>
       &lt;master-location>ringoroses.div.1.xml&lt;/master-location>
       &lt;rights-excluding-sources which="by-nc-nd_4.0" rights-holder="park"/>
       &lt;source xml:id="eng-uk">
          &lt;IRI>tag:parkj@textalign.net,2015:ring01&lt;/IRI>
          &lt;name>Transcription of ring around the roses in English (UK)&lt;/name>
          &lt;location when-accessed="2015-03-10">../TAN-T/ring-o-roses.eng.1881.xml&lt;/location>
       &lt;/source>
       &lt;source xml:id="eng-us">
          &lt;IRI>tag:parkj@textalign.net,2015:ring02&lt;/IRI>
          &lt;name>Transcription of ring around the roses in English (US)&lt;/name>
          &lt;location when-accessed="2014-08-13">../TAN-T/ring-o-roses.eng.1987.xml&lt;/location>
       &lt;/source>
       <emphasis role="bold">&lt;source xml:id="ger">
          &lt;IRI>tag:hans@beispiel.com,2014:ringel&lt;/IRI>
          &lt;name>Transcription of an ancestor of Ring around the roses in German&lt;/name>
          &lt;location when-accessed="2014-08-22">http://beispiel.com/TAN-T/ringel.xml&lt;/location>
          &lt;location when-accessed="2014-08-22">../TAN-T/ring-o-roses.deu.1897.xml&lt;/location>
       &lt;/source></emphasis>
       &lt;declarations/>
       &lt;agent xml:id="park" roles="creator">
          &lt;IRI>tag:parkj@textalign.net,2015:self&lt;/IRI>
          &lt;name xml:lang="eng">Jenny Park&lt;/name>
       &lt;/agent>
       &lt;role xml:id="creator" which="creator"/>
       &lt;change when="2014-08-14" who="park">Started file&lt;/change>
       <emphasis role="bold">&lt;change when="2014-08-22" who="park">Added German version.&lt;/change></emphasis>
    &lt;/head>
    &lt;body>
       <emphasis role="bold">&lt;equate-works src="eng-uk ger"/>
       &lt;equate-div-types>
          &lt;div-type-ref src="ger" div-type-ref="Zeile"/>
          &lt;div-type-ref src="eng-uk" div-type-ref="line"/>
       &lt;/equate-div-types>
       &lt;realign>
          &lt;anchor-div-ref source="ger" ref="5"/>
          &lt;div-ref source="eng-us" ref="4"/>
       &lt;/realign></emphasis>
    &lt;/body>
&lt;/TAN-A-div></programlisting></para>
            <para>The first major change is the insertion of a new <code><link
                     linkend="element-source">&lt;source></link></code>, identifying the name and
               location of the third example. Note that two locations have been provided, one for
               the original location and another for the copy saved locally into our project folder.
               Validation will be made against the first document available. If we wanted to work
               primarily off our local copy, we would have put it first. By placing it second, we
               allow the validation engine to look for updates and changes in the master version. If
               that version is unavailable, validation will be made against the second, local
               copy.</para>
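            <para>For comparison, here is a sketch of the alternative ordering, which would make
               the local copy the primary target of validation. Only the order of the two
                  <code>&lt;location></code> elements differs from the listing above; the
                  <code>&lt;IRI></code> and <code>&lt;name></code> are elided because they stay
               the same.<programlisting>&lt;source xml:id="ger">
   &lt;!-- IRI and name as in the listing above -->
   ...
   &lt;!-- local copy first: validation is made against this file when it is available -->
   &lt;location when-accessed="2014-08-22">../TAN-T/ring-o-roses.deu.1897.xml&lt;/location>
   &lt;!-- master copy second: consulted only if the local copy is unavailable -->
   &lt;location when-accessed="2014-08-22">http://beispiel.com/TAN-T/ringel.xml&lt;/location>
&lt;/source></programlisting></para>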
            <para>The second major insertion is a new <code><link linkend="element-change"
                     >&lt;change></link></code>, documenting when we made the alterations. The value
               of <code><link linkend="attribute-when">@when</link></code> effectively updates the
               version of our TAN-A-div file.</para>
            <para>The third major change populates the <code><link linkend="element-body"
                     >&lt;body></link></code> with elements that calibrate the new version to the
               other two. <code><link linkend="element-equate-works">&lt;equate-works></link></code>
               says that, for the sake of this alignment, the works defined in the UK version and
               the German version are to be considered equivalent. We did not mention the US version
               because we do not need to. TAN rules specify that all alignments are transitive
               unless otherwise specified. If A and B are already defined to be the same work, and
               we equate A and C as the same work, then B and C will be equated as well. Note, we
               are not committing ourselves to the proposition that they are in reality the same
               work. We are making this statement only provisionally, to facilitate the
               alignment.</para>
            <para><code><link linkend="element-equate-div-types">&lt;equate-div-types></link></code>
               declares that what the German version calls Zeile is, for the sake of this alignment,
               equivalent to what the UK version calls line. Transitivity means that Zeile is
               inferred to be equivalent to what the US version calls <code>l</code>. This element
               is completely optional. If we left it out, the alignment, which is based upon
               references, not division types, would not be affected. But by creating it, we assist
               users who may care about textual divisions.</para>
            <para>A <code><link linkend="element-realign">&lt;realign></link></code> takes care of
               the apparent typographical error, this time anchoring the German version to the US
               one. Any <code><link linkend="element-div-ref">&lt;div-ref></link></code> in a
                     <code><link linkend="element-realign">&lt;realign></link></code> is wrested
               from automatic alignment and attached to an <code><link
                     linkend="element-anchor-div-ref">&lt;anchor-div-ref></link></code> and, by the
               law of transitivity, anything that aligns to it, in this case the UK version.</para>
            <para>Note that we have used <code>5</code> and not <code>e</code> to point to the stray
               reference in the German version. We could have used <code>e</code>, or even the
               Roman numeral <code>v</code>, had we wished to, but it is best to find a single
               numbering system we are comfortable with for our TAN-A-div file, and stick with it.
               Every TAN file's numeration system is evaluated locally, independent of any companion
               files. That way a single TAN file can use a single kind of numbering to access
               multiple TAN documents that may each use different numerals. Therefore we do not need
               to reconcile the letter labels <code>a</code>, <code>b</code>, and <code>c</code> in
               the <code><link linkend="attribute-n">@n</link></code> values in the German version,
               because these will be automatically treated as equivalent to <code>1</code>,
                  <code>2</code>, and <code>3</code>. The TAN format allows four numeration systems
               other than Arabic numerals: Roman numerals (uppercase or lowercase), alphabetic
               numerals (a, b, c, ..., z, aa, bb, ...), digit-alphabet combinations (e.g., 1a,
               1e, 4g), and alphabet-digit combinations (e.g., a4, a5, b5). The last two systems will
               be converted to hyphen-joined Arabic numerals before comparison (e.g., 1-1, 1-5, 4-7,
               1-4, 1-5, 2-5).</para>
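            <para>To illustrate, the following sketch rewrites the <code>&lt;realign></code> from
               the listing above with alphabetic references throughout. It assumes, as the
               preceding paragraph suggests, that alphabetic numerals are accepted as reference
               values, so that <code>e</code> and <code>d</code> resolve to <code>5</code> and
                  <code>4</code> even though the US version itself is numbered with Arabic
               numerals. The element and attribute names are copied unchanged from that
               listing.<programlisting>&lt;realign>
   &lt;!-- "e" is read as 5: the mislabeled last line of the German version -->
   &lt;anchor-div-ref source="ger" ref="e"/>
   &lt;!-- "d" is read as 4: the fourth line of the US version -->
   &lt;div-ref source="eng-us" ref="d"/>
&lt;/realign></programlisting></para>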
            <para>With these changes, the new version is completely synchronized with the other two.
               Our work might have been simpler had we just modified the German version ourselves.
               But such changes would have affected only our local copy, not the master one.
               Changing only our local copy would not allow us to connect our work to other TAN
               files that may be depending upon the same master file.</para>
            <para>But the format has also been designed to anticipate a living, growing network.
               Perhaps Hans Schmidt, the producer of the German version, can be contacted. We do so,
               and we suggest that he modify the version to make it align better. In the case of
                     <code><link linkend="element-div-type">&lt;div-type></link></code>, he need
               merely add another element: <code><link linkend="element-IRI"
                  >&lt;IRI></link>http://dbpedia.org/resource/Line_(poetry)&lt;/IRI></code>. This
               line, in addition to the preexisting <code><link linkend="element-IRI"
                     >&lt;IRI></link></code>, specifies that the two IRIs are equivalent. Perhaps he
               has reasons for labeling the lines with letters, and perhaps he is reluctant to
               explicitly identify this poem with <emphasis role="italic">Ring around the
                  Rosie</emphasis>. That is within his rights. (Remember, TAN is meant to provide a
               framework within which opinions can be registered, even counterintuitive ones.) But
               the conversation might lead to our pointing out that <code>n="e"</code> should
               probably be <code>n="d"</code> and that there is an apparent discrepancy in the third
               line. (The original printed book has the poem twice on page 438, once with the
               spelling "Holderbuch," and once with "Holderbusch.") If Schmidt chooses to correct his
               master file, he can add a new <code><link linkend="element-change"
                  >&lt;change></link></code>, and thereby tacitly notify anyone else using the file
               that corrections have been made.</para>
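            <para>As a sketch of the first suggestion, Schmidt's revised declaration might read as
               follows. The first <code>&lt;IRI></code> and the <code>&lt;name></code> are copied
               from his file; only the second <code>&lt;IRI></code> is new, and its placement
               directly after the first is an assumption.<programlisting>&lt;div-type xml:id="Zeile">
   &lt;IRI>http://dbpedia.org/resource/Gedichtzeile&lt;/IRI>
   &lt;IRI>http://dbpedia.org/resource/Line_(poetry)&lt;/IRI>
   &lt;name>Gedichtzeile&lt;/name>
&lt;/div-type></programlisting></para>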
            <para>At this point we have a network of five TAN files, four in our corpus and one from
               outside. Although simple, the network could be the basis for some creative and
               complex research questions. Stylesheets could be used to automatically align the
               versions for reading and study, or to perform statistical analysis. Study of the rest
               of these guidelines, as well as example TAN libraries, will suggest numerous ways to
               create, manage, share, and use TAN files.</para>
         </section>
      </chapter>
   </part>
   <part xml:id="detailed_description">
      <title>Detailed Description</title>
      <partintro>
         <para>This part of the guidelines provides a detailed description of the formats of the
            Text Alignment Network. The material is organized according to the structure that
            governs the schema files, so both can be read in tandem.</para>
         <para><xref linkend="concepts_common"/> outlines, in a non-technical way, the principles
            and technical foundations of the TAN format.</para>
         <para><xref linkend="class_common"/>, <xref linkend="class_1"/>, <xref linkend="class_2"/>,
            and <xref linkend="class_3"/> comprehensively describe all the TAN formats. Each chapter
            covers preliminary theoretical or scholarly considerations, discussing how the features
            of each TAN format are meant to be interpreted as a whole. </para>
         <para><xref linkend="elements-attributes-and-patterns"/>, the first of two very long
            chapters, provides a comprehensive, detailed explanation of the rules for every element
            and attribute, as well as the patterns into which they fall. This chapter includes a
            thorough list of relevant validation rules and examples. It has been written using a
            stylesheet that traverses the official TAN schemas, functions, and examples.</para>
         <para><xref linkend="keywords-master-list"/> lists all the vocabulary items that have
            already been defined as a core part of the format. This chapter is, essentially, a
            re-presentation of the TAN-key files that are in the <code>TAN-key</code> folder.</para>
         <para>The chapters in this part of the guidelines should be read selectively, not
            consecutively. They have been written with the assumption that you have already read the
            previous part (<xref linkend="general_overview"/>) and that you have already started to
            create and edit a TAN collection.</para>
         <para>Because readers will come from different specialties, all acronyms, abbreviations,
            and concepts are defined and explained, albeit tersely. Concepts or technologies are
            discussed only insofar as they affect the use of TAN; suggestions for further reading
            are provided for those who want a more thorough introduction to a topic. </para>
      </partintro>
      <chapter xml:id="concepts_common">
         <title>General Underpinnings</title>
         <para>This chapter retains something of the introductory spirit of the previous one by
            providing an overview of the fundamental principles and technologies behind TAN. The
            overall goal of this chapter is to document the definitions, assumptions, and other
            matters that have shaped the design of the format. Although this chapter assumes on your
            part no prior knowledge of any particular technology, it is also not meant to be a
            tutorial. Links to further reading will take you to more adequate introductory
            material.</para>
         <section>
            <title>The Big Picture</title>
            <para>The Text Alignment Network is a modular suite of XML encoding formats. Each TAN
               format is designed for a specific type of textual data, divided into three classes:
               transcriptions (class 1), annotations of transcriptions (class 2), and everything
               else (class 3). </para>
            <para><emphasis role="bold">Class 1</emphasis>, representations of textual objects,
               consists solely of transcription files. Each transcription file contains the text of
               a single work from a single text-bearing object, whether physical or digital (an
               object we sometimes term <emphasis>scriptum</emphasis>). There are two types of
               transcription file: a standard generic format and a TEI extension. Both are TEI
               conformable. These two types are differentiated by the root element, <code><link
                     linkend="element-TAN-T">&lt;TAN-T></link></code> and <code>&lt;TEI></code>
               respectively. In the future, class 1 may expand to include formats intended to
               segment (and therefore align) visual, audio, or audiovisual files; it may also expand
               to include a customized form of HTML. </para>
            <para><emphasis role="bold">Class 2</emphasis>, annotations of class 1 files, encodes
               data concerning alignment, lexico-morphology, and other textual claims. There are two
               types of alignment, one for broad, general alignments and another for granular,
               word-for-word alignments. The former, with <code><link linkend="element-TAN-A-div"
                     >&lt;TAN-A-div></link></code> as the root element, aligns any number (one or
               more) of class 1 files, and permits assorted claims about those files. The latter,
                     <code><link linkend="element-TAN-A-tok">&lt;TAN-A-tok></link></code>, aligns
               only pairs of class 1 files. Lexico-morphology files, <code><link
                     linkend="element-TAN-LM">&lt;TAN-LM></link></code>, are used to encode the
               lexical and morphological (or part of speech) forms of individual words in a single
               class 1 file. In the future, class 2 may expand to include syntax
               (treebanking).</para>
            <para><emphasis role="bold">Class 3</emphasis> covers everything else. <code><link
                     linkend="element-TAN-mor">&lt;TAN-mor></link></code> declares the
               grammatical categories or features of a given language and stipulates rules for
               tagging words. <code><link linkend="element-TAN-key">&lt;TAN-key></link></code>
               collects and defines terms frequently used in other TAN files. <code><link
                     linkend="element-TAN-c">&lt;TAN-c></link></code> supports assertions (in a
               syntax inspired by RDF) to provide context to other TAN files. Class 3 may expand in
               the future to include transliteration, lexicography, and syntax. </para>
            <para><emphasis role="italic">Inclusions</emphasis>: Any TAN file may include any other
               TAN file, no matter the class of either the including or the included files.
               Inclusions in TAN behave differently than other kinds of inclusions in markup
               languages. For example, in XSLT, if file A includes file B, all of B's first-tier
               children are copied into the root element of A before A is processed. In <link
                  xlink:href="https://www.w3.org/TR/2003/WD-xinclude-20031110/">XML
                  Inclusions</link>, inclusion pertains either to the entire file or to a specific
               element, named through <link xlink:href="https://www.w3.org/TR/WD-xptr"
                  >XPointer</link>. In both systems, mutual inclusion is not allowed because of
               its inherent circularity.</para>
            <para>In TAN, inclusion is a two-step process. First the included file B is declared by
               means of an <code><link linkend="element-inclusion">&lt;inclusion></link></code> in
               the <code><link linkend="element-head">&lt;head></link></code> of document A. Second,
               certain elements in document A may include an <link linkend="attribute-include"
                     ><code>@include</code></link>, specifying that the host element should be
               replaced by all elements of the same name found in document B. Because of this
               behavior, the prohibition on circular inclusion pertains only to select element
               names. That is, A and B may validly invoke each other as inclusions, or share
               inclusions, as long as there is no circularity in the elements that are
               included.</para>
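            <para>The two steps might look like the following sketch. Much of it is assumed for
               illustration: the children of <code>&lt;inclusion></code> are modeled on those of
                  <code>&lt;source></code>, the identifier, IRI, name, and file path are
               hypothetical, and <code>&lt;agent></code> is only a guess at an element that
               accepts <code>@include</code>. Consult the entries for the element and the attribute
               for their actual rules.<programlisting>&lt;head>
   ...
   &lt;!-- Step 1: declare the included file B (all values hypothetical) -->
   &lt;inclusion xml:id="file-b">
      &lt;IRI>tag:example.com,2015:file-b&lt;/IRI>
      &lt;name>Shared declarations (hypothetical file B)&lt;/name>
      &lt;location when-accessed="2017-05-24">file-b.xml&lt;/location>
   &lt;/inclusion>
   ...
   &lt;!-- Step 2: this host element is replaced by every agent element found in file B -->
   &lt;agent include="file-b"/>
   ...
&lt;/head></programlisting></para>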
            <para>TAN files that refer to or are referred to by other TAN files form a kind of
               network. Alignment files become the principal point of connection. Below is an
               illustration of how an ecosystem of independently curated TAN files might
               interrelate, with arrows showing lines of dependency.</para>
            <para><inlinemediaobject>
                  <imageobject>
                     <imagedata fileref="img/TAN%20Sample%20Ecosystem.jpeg"/>
                  </imageobject>
               </inlinemediaobject></para>
            <para>In this hypothetical example, Editor 1 has transcriptions of four different high
               medieval works. She simply wants to make them available to anyone who wants to use
               them, and so posts them on Server 1. Editor 2 (= Server 2), interested primarily in Old
               French morphology, finds three versions on Server 1 that are in that language and
               publishes a morphological analysis of them. Editor 3 has provided a small collection
               of two early interrelated medieval Latin works. Editor 4 has found an Old English
               version missing from Editor 1's collection, and has decided to provide not only a
               word-for-word correspondence between it and a key Old French version, but to create a
               morphological analysis of that Old English version, as a counterpart to Editor 2's
               work on the Old French version. (He is interested in computing the morphological
               differences between the Old French and Old English versions.) Editor 5 is interested
               primarily in showing where Server 1's collection quotes from the works on Server 3,
               and so merely puts together an alignment of quotations.</para>
            <para>This approach adopts what is sometimes called <emphasis role="italic">stand-off
                  annotation</emphasis> (or <emphasis role="italic">stand-off markup</emphasis>), in
               contrast to <emphasis role="italic">in-line annotation</emphasis>, in which a
               transcription and its alignments, morphology, and other annotations are placed in a
               single file. (Most TEI and HTML files rely upon in-line annotation.) In the TAN
               format, stand-off annotation has been extended into a modular design, with each
               module designed to be simple and to complement the other modules. (In fact, the
               combined number of elements and attributes in all TAN modules is roughly equal to
               the number of elements in HTML.) Modular stand-off annotation has been adopted for
               several reasons: <itemizedlist>
                  <listitem>
                     <para>An editor can work on a file with minimal distraction, focusing on a
                        limited set of closely related questions. (Editors 2 and 5 can work off the
                        same master files provided by Server 1, even though they have very different
                        research interests.)</para>
                  </listitem>
                  <listitem>
                     <para>Complementary or competing annotations can be made, even if those
                        annotations overlap (a major problem for in-line annotation, where according
                        to XML rules no element may interlock or overlap with another). (Editor 5
                        may choose to incorporate or ignore the alignments that Editor 3 has made of
                        her collection.)</para>
                  </listitem>
                  <listitem>
                     <para>Annotations can be made concurrent to any others that may already exist,
                        allowing for rich and complex analyses. </para>
                  </listitem>
                  <listitem>
                     <para>After a TAN collection is published, any other TAN files that it refers
                        to, or any TAN files referring to it, can be aggregated into much larger and
                        more complex datasets, which can then be queried to answer questions that
                        might not have been anticipated.</para>
                  </listitem>
                  <listitem>
                     <para>Editorial labor can be conducted without central coordination, as
                        individuals work at their own pace, independently, on separate files.</para>
                  </listitem>
                  <listitem>
                     <para>When errors are found, they can be corrected in master files. Anyone
                        depending upon that master file as a source will be notified of changes that
                        have been made and they can deal with them accordingly. (Editor 1 can post
                        typographical corrections, and if she logs the change with a time-date
                        stamp, anyone using the file, upon validating their files, will be sent
                        information or a warning about the change. Similarly, Editors 2 and 4 can
                        let Editor 1 know about their work, and Editor 1 can update the Old French
                        versions with cross-references.)</para>
                  </listitem>
                  <listitem>
                     <para>Any data file can be released, circulated, and used independent of any
                        other that points to it, or to which it points.</para>
                  </listitem>
                  <listitem>
                     <para>Connected files can be combined and transformed in any number of ways to
                        produce a wide variety of derivative documents (e.g., collated versions,
                        statistical analysis). A transformation created for one set of TAN documents
                        will work identically on other TAN documents of the same format. (If someone
                        creates a tool to synthesize a transcription and an associated TAN-LM file,
                        it can be applied to both Editor 2's and Editor 4's work.)</para>
                  </listitem>
                  <listitem>
                     <para>The TAN family of formats can be expanded to allow other types of
                        linguistic data, and therefore other lines of research.</para>
                  </listitem>
               </itemizedlist></para>
            <para>Stand-off annotation is not without its liabilities. Files might be altered or
               altogether deleted, rendering dependent files meaningless. An editor may find that
               not having the annotated text in the same place as the annotation is an
               inconvenience. These are significant challenges, but TAN validation rules have been
               designed to mitigate them somewhat. </para>
         </section>
         <section>
            <title>Assumptions in the Creation of TAN Data</title>
            <para>All creators and users of TAN files are expected to share a few basic
               assumptions.</para>
            <para>First, all TAN-compliant data is to be understood as largely
                  <emphasis>derivative</emphasis>. That is, data files have no originality or
               creativity independent of their sources (but see below about interpretation).
               TAN-compliant data is to be created with the intent of adhering as closely as possible to
               some model or archetype. For example, a transcription should replicate faithfully
               some earlier digital edition or text-bearing material object (e.g., stone, papyrus,
               manuscript, printed book for written text; audiovisual media for oral or performative
               texts). Morphological files and alignment files should describe as clearly and as
               reliably as possible their source transcriptions. <emphasis>In creating and
                  publishing a TAN file you claim to have offered a good-faith representation or
                  description of something; in using a TAN file, you hold the creator to that
                  expectation.</emphasis></para>
            <para>Second, all core TAN files are <emphasis>interpretive</emphasis>. That is, they
               are permeated by editorial assumptions and opinions that might not be shared by
               everyone. If there is any originality or creativity in a TAN file, it is in that
               interpretive outlook. For example, if you edit a transcription file you must decide
               how to handle unusual letterforms and other visible marks. Your decisions will be
               informed by how you view the original text and its native writing system, and how you
               interpret and use Unicode. If you write an alignment file, you must make decisions
               about what factors caused one text to be transformed into another.
               Lexicomorphological files require you to commit to one or more grammars and
               dictionaries, and you must discern how best to handle cases of vagueness and
               ambiguity. As a general rule, the TAN classes go from least interpretive (class 1) to
               most (class 3). But no matter which class, no TAN data file ever stands completely
               outside the interpretive act. <emphasis>In creating and publishing a TAN file you
                  claim to have disclosed as best you can the assumptions behind your interpretive
                  outlook; in using a TAN file, you hold the creator to that
               expectation.</emphasis></para>
            <para>Third, all core TAN files are <emphasis>useful</emphasis>. That is, the
               interpretive impulse is assumed to be coupled with an equally strong desire to make
               the data as useful to as many users as possible, even those who may not share your
               assumptions or interpretation. A creator of a transcription file, for example, should
               normalize and segment texts with a minimum of idiosyncrasies, adopting when possible
               reference systems that are widely used so as to optimize the alignment process.
               Morphological files should depend whenever possible upon commonly accepted grammars
               and lexica. Alignment files should work with comprehensible categories of text reuse.
               No TAN file will always be useful to everyone, but it should be as useful to as many
               as possible, as frequently as possible. <emphasis>In creating a TAN file you claim to
                  use common, shared conventions whenever possible, and to note any departures; in
                  using a TAN file, you hold the creator to that expectation.</emphasis></para>
         </section>
         <section>
            <title>Core Technology</title>
            <para>TAN depends upon a core set of relatively stable technologies. Those technologies
               and the underlying terminology are very briefly defined and explained below, as far
               as they affect the TAN format. References to further reading will lead you to better
               and more thorough introductions. The central goal of this section is to highlight any
               decisions made in the design of TAN that significantly affect how anyone might create
               or interpret TAN-compliant data.</para>
            <section xml:id="unicode">
               <title>Unicode</title>
               <section>
                  <title>What is it?</title>
                  <para>Unicode is the worldwide standard for the consistent encoding,
                     representation, and exchange of digital texts. Stable but still growing,
                     Unicode is intended to represent all the world's writing systems, living and
                     historical. Maintained by a nonprofit organization, Unicode is the basis upon
                     which we can create and edit text in mixed alphabets and reliably share that
                     data with other people, independent of individual fonts. Any Unicode-compliant
                     text is in general semantically interoperable on the character level and can be
                     exchanged between users and systems, no matter what font might be used to
                     display the text. If some software tries to display some Unicode-compliant text
                     in a particular font that does not support a particular alphabet, and ends up
                     displaying boxes, the underlying data is still intact and valid. Styling the
                     text with a font that does support the alphabet will reveal this to be the
                     case.</para>
                  <para>With more than 128,000 characters, Unicode is almost as complex as human
                     writing itself. The entire sequence of characters is divided into blocks, each
                     one reserved, more or less, for a particular alphabet or a set of characters
                     that share something in common. Within each block, characters may be grouped
                     further. Each character is assigned a single codepoint.</para>
                  <para>Because computers work on the binary system, it was considered ideal to
                     number the characters in Unicode with a related numeration system.
                     Codepoints are therefore numbered according to a hexadecimal system (base 16),
                     which uses the digits 0 through 9 and the letters A through F. (The number 10
                     in decimal is A in hexadecimal; decimal 11 = hex B; decimal 16 = hex 10;
                     decimal 79 = hex 4F.) To find Unicode codepoint values it is therefore helpful
                     to think of the corpus of characters as a very long ribbon sixteen squares
                     wide. This is illustrated nicely <link
                        xlink:href="http://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF"
                        >in this article</link>. Each position along the width is labeled with a
                     hexadecimal digit (0-9, A-F) that always identifies the last digit of a
                     character's codepoint value.</para>
                  <para>It is common to refer to Unicode characters by their value or their name.
                     The value customarily starts "U+" and continues with the hexadecimal value,
                     usually at least four digits. The official Unicode name is usually given fully
                     in uppercase. Examples:</para>
                  <para>
                     <table frame="all">
                        <title>Unicode characters</title>
                        <tgroup cols="3">
                           <colspec colname="c1" colnum="1" colwidth="1.0*"/>
                           <colspec colname="c2" colnum="2" colwidth="1.0*"/>
                           <colspec colname="c3" colnum="3" colwidth="1.0*"/>
                           <thead>
                              <row>
                                 <entry>Character</entry>
                                 <entry>Unicode value</entry>
                                 <entry>Unicode name</entry>
                              </row>
                           </thead>
                           <tbody>
                              <row>
                                 <entry>" " (space)</entry>
                                 <entry>U+0020</entry>
                                 <entry>SPACE</entry>
                              </row>
                              <row>
                                 <entry>®</entry>
                                 <entry>U+00AE</entry>
                                 <entry>REGISTERED SIGN</entry>
                              </row>
                              <row>
                                 <entry>ю</entry>
                                 <entry>U+044E</entry>
                                 <entry>CYRILLIC SMALL LETTER YU</entry>
                              </row>
                           </tbody>
                        </tgroup>
                     </table>
                  </para>
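                  <para>As an illustrative aside (not drawn from the TAN schemas), the link between
                     hexadecimal and decimal numbering can be seen in XML numeric character
                     references. The element names <code>&lt;chars></code> and <code>&lt;c></code>
                     below are hypothetical. Each pair of references resolves to the same character
                     from the table above, because hex AE = (10 × 16) + 14 = decimal 174, and hex
                     44E = (4 × 256) + (4 × 16) + 14 = decimal 1102.</para>
                  <programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;chars>
  &lt;!-- U+00AE REGISTERED SIGN: hexadecimal reference, then decimal reference -->
  &lt;c>&amp;#xAE;&lt;/c>
  &lt;c>&amp;#174;&lt;/c>
  &lt;!-- U+044E CYRILLIC SMALL LETTER YU: hexadecimal reference, then decimal reference -->
  &lt;c>&amp;#x44E;&lt;/c>
  &lt;c>&amp;#1102;&lt;/c>
&lt;/chars></programlisting>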
               </section>
               <section xml:id="normalization">
                  <title>Normalization</title>
                  <para>TAN validation rules require all data to be normalized according to the
                     Unicode NFC algorithm. Any text in a TAN body that does not comply will be
                     marked as invalid. Validation engines that support Schematron Quick Fixes will
                     allow users to easily convert non-normalized to normalized Unicode.</para>
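                  <para>The kind of check involved can be sketched in XSLT 2.0. The stylesheet below
                     is only an illustration, assuming an XSLT 2.0 processor; it is not the rule
                     used by the TAN Schematron files. It relies on the standard functions
                        <code>normalize-unicode()</code> and <code>codepoint-equal()</code> to
                     report every text node that differs from its own NFC form.</para>
                  <programlisting>&lt;xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  &lt;xsl:output method="text"/>
  &lt;xsl:template match="/">
    &lt;!-- Select text nodes whose codepoints change when converted to NFC -->
    &lt;xsl:for-each select="//text()[not(codepoint-equal(., normalize-unicode(., 'NFC')))]">
      &lt;xsl:text>Not NFC: &lt;/xsl:text>
      &lt;xsl:value-of select="."/>
      &lt;xsl:text>&amp;#10;&lt;/xsl:text>
    &lt;/xsl:for-each>
  &lt;/xsl:template>
&lt;/xsl:stylesheet></programlisting>
                  <para>A text node containing the letter e followed by U+0301 COMBINING ACUTE
                     ACCENT would be reported, because NFC replaces that sequence with the single
                     character U+00E9.</para>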
               </section>
               <section xml:id="unicode-characters-with-special-interpretation">
                  <title>Unicode characters with special interpretation</title>
                  <para>The TAN format allows the following characters anywhere, but assigns them
                     special meaning in certain contexts:</para>
                  <para>
                     <itemizedlist>
                        <listitem>
                           <para>U+200D ZERO WIDTH JOINER</para>
                        </listitem>
                        <listitem>
                           <para>U+00AD SOFT HYPHEN</para>
                        </listitem>
                     </itemizedlist>
                  </para>
                  <para>When one of these characters occurs at the end of a leaf <link linkend="element-div"
                           ><code>&lt;div></code></link>, perhaps followed by white space that will
                     be ignored (see below), processors will assume that the character is to be
                     deleted and that, when the text is combined with that of the next leaf
                        <code>&lt;div></code>, no intervening space should be allowed. Furthermore,
                     because these characters are difficult to discern from spaces and hyphens, any
                     output based on the character mapping of the core functions should replace
                     these characters with their XML numeric character references,
                        <code>&amp;#x200d;</code> and <code>&amp;#xad;</code>.</para>
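                  <para>A minimal sketch of the rule follows. The values of <code>type</code> and
                        <code>n</code> are invented for illustration and the surrounding file is
                     omitted. The first leaf <code>&lt;div></code> ends in a soft hyphen followed by
                     ignorable white space, so a processor combining the two would be expected to
                     read <emphasis>a long example of hyphenation</emphasis>, with the soft hyphen
                     deleted and no space inside the word.</para>
                  <programlisting>&lt;div type="line" n="1">a long exam&amp;#xad;
&lt;/div>
&lt;div type="line" n="2">ple of hyphenation&lt;/div></programlisting>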
               </section>
               <section xml:id="combining_characters">
                  <title>Combining characters</title>
                  <para>At the core level of conformance, Unicode does not dictate whether combining
                     characters (accents, modifying symbols) should be counted independently or as
                     part of a base character, nor does the family of XML languages. In most
                     circumstances, this point is negligible. But it affects regular expressions and
                     XPath expressions (see below). </para>
                  <para>Two of the class 2 formats allow the counting of characters. Such counting
                     is assumed to be made exclusively of non-combining characters, defined as the
                     regular expression <code>[^\p{M}]</code>. Any numerical reference made in a TAN
                     file to an individual character will be found by counting only non-combining
                     characters, and will return that base character combined with all combining
                     characters that immediately follow. Any <link linkend="element-div"
                           ><code>&lt;div></code></link> that starts with a combining character will
                     be marked as invalid. See also <xref linkend="reg_exp_and_comb_chars"/>.</para>
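                  <para>The counting rule can be sketched in XSLT 2.0. The stylesheet below is an
                     illustration only, assuming an XSLT 2.0 processor, and is not taken from the
                     TAN function library. The sample string is hypothetical: the letter q, U+0303
                     COMBINING TILDE, and the letter a, that is, three codepoints but two countable
                     characters. The first instruction counts only non-combining characters; the
                     second emits each base character together with the combining characters that
                     immediately follow it, one unit per line.</para>
                  <programlisting>&lt;xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  &lt;xsl:output method="text"/>
  &lt;!-- Hypothetical sample: q + U+0303 COMBINING TILDE + a -->
  &lt;xsl:variable name="sample" select="'q&amp;#x303;a'"/>
  &lt;xsl:template match="/">
    &lt;!-- Count of characters once combining characters are removed -->
    &lt;xsl:value-of select="string-length(replace($sample, '\p{M}', ''))"/>
    &lt;xsl:text>&amp;#10;&lt;/xsl:text>
    &lt;!-- Braces are doubled because the regex attribute is an attribute value template -->
    &lt;xsl:analyze-string select="$sample" regex="[^\p{{M}}]\p{{M}}*">
      &lt;xsl:matching-substring>
        &lt;xsl:value-of select="."/>
        &lt;xsl:text>&amp;#10;&lt;/xsl:text>
      &lt;/xsl:matching-substring>
    &lt;/xsl:analyze-string>
  &lt;/xsl:template>
&lt;/xsl:stylesheet></programlisting>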
               </section>
               <section>
                  <title>Deprecated Unicode points</title>
                  <para>Because TAN is not concerned with appearance, the following characters
                     will generate an error if found in a TAN file:</para>
                  <para>
                     <itemizedlist>
                        <listitem>
                           <para>U+00A0 NO-BREAK SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2000 EN QUAD</para>
                        </listitem>
                        <listitem>
                           <para>U+2001 EM QUAD</para>
                        </listitem>
                        <listitem>
                           <para>U+2002 EN SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2003 EM SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2004 THREE-PER-EM SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2005 FOUR-PER-EM SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2006 SIX-PER-EM SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2007 FIGURE SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2008 PUNCTUATION SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+2009 THIN SPACE</para>
                        </listitem>
                        <listitem>
                           <para>U+200A HAIR SPACE</para>
                        </listitem>
                     </itemizedlist>
                  </para>
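                  <para>One way to look for these characters before validation is a regular
                     expression with a character class covering U+00A0 and the range U+2000 through
                     U+200A. The XSLT 2.0 stylesheet below is a sketch under that assumption, not
                     the test used by the TAN validation files.</para>
                  <programlisting>&lt;xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  &lt;xsl:output method="text"/>
  &lt;xsl:template match="/">
    &lt;!-- Report every text node containing one of the twelve listed space characters -->
    &lt;xsl:for-each select="//text()[matches(., '[&amp;#xA0;&amp;#x2000;-&amp;#x200A;]')]">
      &lt;xsl:text>Disallowed space character in: &lt;/xsl:text>
      &lt;xsl:value-of select="."/>
      &lt;xsl:text>&amp;#10;&lt;/xsl:text>
    &lt;/xsl:for-each>
  &lt;/xsl:template>
&lt;/xsl:stylesheet></programlisting>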
               </section>
               <section>
                  <title>Further Reading</title>
                  <para>
                     <itemizedlist>
                        <listitem>
                           <para><link xlink:href="http://unicode.org">Unicode
                              Consortium</link></para>
                        </listitem>
                        <listitem>
                           <para><link xlink:href="http://en.wikipedia.org/wiki/Unicode"
                                 >Unicode</link> (Wikipedia)</para>
                        </listitem>
                     </itemizedlist>
                  </para>
               </section>
            </section>
            <section xml:id="xml">
               <title>eXtensible Markup Language (XML)</title>
               <section>
                  <title>What is it?</title>
                  <para>Defined by the W3C, the eXtensible Markup Language (XML) is a
                     machine-actionable markup language that facilitates human readability.</para>
                  <para>At its heart, XML is rather simple. It begins with an opening line that
                     declares that what otherwise would look just like plain text is an XML file. It
                     then proceeds to the data, which must be marked by one or more pairs of tags. An
                     opening tag looks like <code>&lt;tag></code> and a closing like
                        <code>&lt;/tag></code> (or if the tags contain no data, this can be
                     collapsed into one: <code>&lt;tag/></code>). A pair of matching tags is called
                     an <emphasis role="bold">element</emphasis>. Elements must nest within each
                     other. They cannot overlap. For
                     example:<programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;p>A paragraph about 
  &lt;name>
    &lt;first>Mary&lt;/first> 
    &lt;last>Lee&lt;/last>&lt;/name>.&lt;/p></programlisting></para>
                  <para>This nesting relationship of elements means that an XML document can be
                     pictured as a tree, a metaphor that provides a host of technical names for the
                     relationships that hold between elements: <emphasis>root</emphasis>,
                        <emphasis>parent</emphasis>, <emphasis>child</emphasis>,
                        <emphasis>sibling</emphasis>, <emphasis>ancestor</emphasis>, and
                        <emphasis>descendant</emphasis>. In the example above, the root element
                        <code>&lt;p></code> is the parent of <code>&lt;name></code> and the ancestor
                     of <code>&lt;name></code>, <code>&lt;first></code>, and <code>&lt;last></code>.
                     The element <code>&lt;first></code> is a child of <code>&lt;name></code> and a
                     descendant of both <code>&lt;name></code> and <code>&lt;p></code>. <code>&lt;first></code>
                     and <code>&lt;last></code> are siblings to each other.</para>
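                  <para>These relationships correspond to the axes of XPath (see below). As an
                     illustration, the following XPath expressions, each annotated with an XPath 2.0
                     comment, could be evaluated against the example above.</para>
                  <programlisting>/p                                      (: the root element :)
/p/name                                 (: a child of p :)
/p//first                               (: a descendant of p :)
/p/name/first/parent::name              (: the parent of first :)
/p/name/first/ancestor::*               (: the ancestors of first: name and p :)
/p/name/first/following-sibling::last   (: a sibling of first :)</programlisting>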
                  <para>The opening tag of an element might have additional nodes called <emphasis
                        role="bold">attributes</emphasis>, recognized by a word, an equals sign, and
                     then some text within quotation marks (single or double), e.g.,
                        <code>id="self"</code>. An element may have many attributes, and those
                     attributes can appear in any order. Attributes can be thought of as leaves on
                     an XML tree. They are intended to carry simple data (usually metadata about the
                     data contained by the element), because they cannot govern anything
                     else.</para>
                  <programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;p n="1" id="example">A paragraph about &lt;name>&lt;first>Mary&lt;/first> &lt;last>Lee&lt;/last>&lt;/name>.&lt;/p></programlisting>
                  <para>Apart from the two attributes added to <code>&lt;p></code>, the two examples
                     above are functionally equivalent, even though the first takes up several lines
                     and the second only two. That is because in most XML projects extra lines,
                     spaces, and indentation are effectively ignored by processors, to give human
                     editors the flexibility they need to optimize indentation for readability.
                     Therefore, continuous strings of multiple spaces, tabs, and newlines or
                     carriage returns are to be treated as a single space. (See below.)</para>
                  <para>XML allows for other rules to be added, if an individual or group so wishes.
                     These rules, called schemas, can allow great flexibility or be very strict. The
                     TAN schemas tend to the latter.</para>
               </section>
               <section>
                  <title>Schemas and validation</title>
                  <para>Validation files are found here: <code><link
                           xlink:href="http://textalign.net/release/TAN-1-dev/schemas/"
                           >http://textalign.net/release/TAN-1-dev/schemas/</link></code>.</para>
                  <para>Each TAN file is validated by two types of schema files, one dealing with
                     major rules concerning structure and data type (written in RELAX-NG), the other
                     with very detailed rules (written in Schematron). </para>
                  <para>The RELAX-NG rules are written primarily in compact syntax
                        (<code>.rnc</code>), and converted to the XML syntax (<code>.rng</code>).
                     For TAN-TEI, the special format One Document Does it all (<code>.odd</code>) is
                     used to alter the rules for TEI All.</para>
                  <para>The Schematron files are generally quite simple, acting as a conduit to a
                     large function library written in XSLT. For more on this process, see <xref
                        linkend="tan-stylesheets-and-function-library"/>.</para>
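                  <para>A file is commonly associated with both kinds of schema through
                        <code>xml-model</code> processing instructions placed before the root
                     element. The sketch below shows the general pattern only. The schema file names
                        <code>TAN-T.rnc</code> and <code>TAN-T.sch</code> and the root element name
                     are assumptions made for illustration; consult the schema directory for the
                     actual file names.</para>
                  <programlisting>&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;!-- Structure and data types: RELAX-NG, compact syntax (file name assumed) -->
&lt;?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
&lt;!-- Detailed rules: Schematron (file name assumed) -->
&lt;?xml-model href="http://textalign.net/release/TAN-1-dev/schemas/TAN-T.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
&lt;TAN-T xmlns="tag:textalign.net,2015:ns">
   ...
&lt;/TAN-T></programlisting>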
                  <para>Some validation engines that process a valid TAN-compliant TEI file may
                     return an error something like <code>conflicting ID-types for attribute "who"
                        of element "comment" from namespace "tag:textalign.net,2015:ns"</code>. Such
                     a message alerts you to the fact that by mixing TEI and TAN namespaces, you
                     open yourself up to the possibility of conflicting <code>xml:id</code> values.
                     It is your responsibility to ensure that you have not assigned duplicate
                     identifiers. Very often, it is possible for you to configure an XML editor to
                     ignore this discrepancy. (In oXygen XML editor go to Options > Preferences... >
                     XML > XML Parser > RELAX NG and uncheck the box ID/IDREF.)</para>
               </section>
               <section xml:id="whitespace">
                  <title>White space</title>
                  <para>In any XML file, unless otherwise specified, consecutive space characters
                     (space, tab, newline, and carriage return) are considered equivalent to a
                     single space. This gives editors the freedom they need to format XML documents
                     as they like, for either human readability or compactness. </para>
                  <para>All TAN formats assume data will be pre-processed with space normalization,
                     as defined by the standard XPath function <code>fn:normalize-space()</code>,
                     which trims space from the beginning and end of a text node or string, and
                     replaces consecutive space marks with a single space. Some space is assumed to
                     exist between adjacent leaf <code><link linkend="element-div"
                        >&lt;div></link></code>s, even if no space intervenes (unless the first
                           <code><link linkend="element-div">&lt;div></link></code> ends in the soft
                     hyphen or the zero width joiner; see <xref
                        linkend="unicode-characters-with-special-interpretation"/>). What type of
                     space is not dictated by the TAN format. It is up to processors to analyze the
                     relevant <code><link linkend="element-div-type">&lt;div-type></link></code> to
                     interpret what kind of white-space separator is appropriate. </para>
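                  <para>As a brief illustration, the two leaf <code>&lt;div></code>s below (whose
                     attribute values are invented for the example) carry the same text once space
                     is normalized. For either one, the XPath expression
                        <code>normalize-space(.)</code> returns the string <code>A paragraph about
                        Mary</code>.</para>
                  <programlisting>&lt;div type="line" n="1">
   A   paragraph
   about     Mary
&lt;/div>

&lt;div type="line" n="1">A paragraph about Mary&lt;/div></programlisting>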
                  <para>If retention of multiple spaces is important for your research, then TAN
                     formats may not be an appropriate format, since TAN is not intended to
                     replicate the appearance of a <emphasis>scriptum</emphasis>. Pure TEI (and not
                     TAN-TEI) might be a practical alternative, since it allows for a literal use of
                     space, and encourages XML files that try to replicate the appearance of a
                        <emphasis>scriptum</emphasis>.</para>
                  <para>For more on white space see <link
                        xlink:href="https://www.w3.org/TR/REC-xml/#sec-white-space">the W3C
                        recommendation</link>.</para>
               </section>
               <section>
                  <title>Non-mixed content</title>
                  <para>Many familiar text formats such as TEI, HTML, and Docbook allow what is
                     called mixed content, i.e., elements and nonspace text nodes may be combined as
                     siblings. The TAN formats, aside from TAN-TEI, are committed to a non-mixed
                     content model. Nonspace text nodes and elements are never siblings. The
                     practical effect of this policy is that indentation may be applied to a TAN
                     file as one wishes, and space text nodes may be inserted between any two
                     adjacent elements, without affecting the meaning. </para>
                  <para>To specify in a class 1 file that two adjacent leaf <code><link
                           linkend="element-div">&lt;div></link></code>s should have no intervening
                     space, see <xref linkend="unicode-characters-with-special-interpretation"
                     />.</para>
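                  <para>The difference between the two content models can be sketched as follows.
                     The element names in the first fragment and the attribute values in the second
                     are chosen only for illustration.</para>
                  <programlisting>&lt;!-- Mixed content: text and an element are siblings inside p -->
&lt;p>A paragraph about &lt;name>Mary Lee&lt;/name>.&lt;/p>

&lt;!-- Non-mixed content: an element holds either elements or text, never both -->
&lt;div type="poem" n="1">
   &lt;div type="line" n="1">First line of text&lt;/div>
   &lt;div type="line" n="2">Second line of text&lt;/div>
&lt;/div></programlisting>
                  <para>In the second fragment the indentation and line breaks between elements can
                     be changed freely, because no nonspace text sits alongside an element.</para>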
               </section>
            </section>
            <section xml:id="namespace">
               <title>Namespaces</title>
               <section>
                  <title>What are they?</title>
                  <para>XML allows users to develop vocabularies of elements as they wish. One person
                     may wish to use the element <code>&lt;bank></code> to refer to financial
                     institutions, another to rivers. Perhaps someone wishes to mention both rivers
                     and financial institutions in the same document. XML was designed to allow
                     users to mix vocabularies, even when those vocabularies use identical element
                     names. This means that anyone using <code>&lt;bank></code> must be able to
                     specify exactly whose vocabulary is being used. Disambiguation is accomplished
                     by associating IRIs (see <xref linkend="IRIs_and_linked_data"/> below) with the
                     element names. The actual full name of an element is the local name plus the
                     IRI that qualifies its meaning, e.g.,
                        <code>bank{http://example1.com/terms/}</code> and
                        <code>bank{http://example2.com/terms/}</code>. </para>
                  <para>The relationship between the element name and the IRI is analogous to that
                     between a person's given name and family name. The IRI—the family name—is
                     called the <emphasis>namespace</emphasis>. If the term sounds like meaningless
                     jargon, you may find it easier to think of it as the name of a group of
                     elements. </para>
                  <para>Namespace declarations look a lot like attributes (they aren't). They take
                     the form
                        <code>&lt;bank xmlns="http://example1.com/terms/">...&lt;/bank></code>,
                     which states, in effect, not only which namespace governs
                        <code>&lt;bank></code>, but also what the default namespace will be for any
                     descendants. </para>
                  <para>If we wish to combine the two types of <code>&lt;bank></code> elements, we
                     can assign abbreviations to select namespaces, then prefix those abbreviations
                     to the element names, separated by a colon. Here are three ways to say the
                     same thing, showing the use of prefix abbreviations and default
                     namespaces:</para>
                  <programlisting>&lt;bank xmlns="http://example1.com/terms/">
    &lt;bank xmlns="http://example2.com/terms/">
        ...
    &lt;/bank>
&lt;/bank>

&lt;bank xmlns="http://example1.com/terms/" xmlns:e2="http://example2.com/terms/">
    &lt;e2:bank>
        ...
    &lt;/e2:bank>
&lt;/bank>

&lt;e1:bank xmlns:e1="http://example1.com/terms/" xmlns:e2="http://example2.com/terms/">
    &lt;e2:bank>
        ...
    &lt;/e2:bank>
&lt;/e1:bank></programlisting>
               </section>
               <section>
                  <title>TAN namespace and prefix</title>
                  <para>The TAN namespace is <emphasis role="bold"
                           ><code>tag:textalign.net,2015:ns</code></emphasis>. The recommended
                     prefix is <emphasis role="bold"><emphasis>tan</emphasis></emphasis>. The
                     namespace is expected to remain the same from one version to the next.</para>
                  <para>The TAN-TEI format uses as its default the TEI namespace, <link
                        xlink:href="http://www.tei-c.org/ns/1.0"/>, normally given the prefix
                           <emphasis><emphasis role="bold">tei</emphasis></emphasis>.</para>
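                  <para>In practice these namespaces are declared on the root element. The fragments
                     below are sketches only: the root element name <code>&lt;TAN-T></code> is
                     assumed for illustration, and all content is elided.</para>
                  <programlisting>&lt;!-- TAN namespace as the default namespace -->
&lt;TAN-T xmlns="tag:textalign.net,2015:ns">
   ...
&lt;/TAN-T>

&lt;!-- The same element, using the recommended prefix -->
&lt;tan:TAN-T xmlns:tan="tag:textalign.net,2015:ns">
   ...
&lt;/tan:TAN-T>

&lt;!-- TAN-TEI: TEI namespace as the default, TAN namespace available by prefix -->
&lt;TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:tan="tag:textalign.net,2015:ns">
   ...
&lt;/TEI></programlisting>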
               </section>
            </section>
            <section xml:id="TEI">
               <title>The Text Encoding Initiative</title>
               <section>
                  <title>What is it?</title>
                  <para>The Text Encoding Initiative (TEI) is a consortium which collectively
                     develops and maintains a standard for the representation of texts in digital
                     form. Its chief deliverable is a set of Guidelines which specify encoding
                     methods for machine-readable texts, chiefly in the humanities, social sciences
                     and linguistics. Since 1994, the TEI Guidelines have been widely used by
                     libraries, museums, publishers, and individual scholars to present texts for
                     online research, teaching, and preservation. In addition to the Guidelines
                     themselves, the Consortium provides a variety of <link
                        xlink:href="http://www.tei-c.org/Support/Learn/">resources</link> and <link
                        xlink:href="http://members.tei-c.org/Events">training events</link> for
                     learning TEI, information on <link
                        xlink:href="http://www.tei-c.org/Activities/Projects/">projects using the
                        TEI</link>, a <link
                        xlink:href="http://www.tei-c.org/Activities/SIG/Education/tei_bibliography.xml"
                        >bibliography of TEI-related publications</link>, and <link
                        xlink:href="http://www.tei-c.org/Tools/">software</link> developed for or
                     adapted to the TEI.<note>
                        <para>Taken from the TEI website <link
                              xlink:href="http://www.tei-c.org/index.xml"/>, accessed
                           2017-05-21.</para>
                     </note></para>
                  <para>Any TAN-T module can be easily cast into a TEI file, although much of the
                     computer-actionable semantics will be lost in the process. Likewise, a TEI file
                     can be converted to TAN-T, but there is a greater risk of loss of content,
                     particularly in the header, since the TAN format is intentionally restricted to
                     an important but small subset of TEI tags. </para>
                  <para>The TAN-TEI module is a TEI extension to the format, based on an ODD file
                     that is in the same directory as the rest of the schemas. TAN-TEI schemas are
                     generated on the basis of the official TEI All schema that is available at the
                     time of release. </para>
                  <para>For more about the strictures placed upon the TEI All schema see <xref
                        linkend="tan-tei"/>. See also <xref linkend="class_common"/> and <xref
                        linkend="class_1"/>.</para>
               </section>
               <section>
                  <title>Further reading</title>
                  <para>
                     <itemizedlist>
                        <listitem>
                           <para><link xlink:href="http://www.tei-c.org/">Text Encoding
                                 Initiative</link></para>
                        </listitem>
                     </itemizedlist>
                  </para>
               </section>
            </section>
            <section xml:id="data_types">
               <title>Data types</title>
               <para>Being written purely in XML technologies, TAN adopts its data types, e.g.,
                  strings, booleans, and so forth, from the <link
                     xlink:href="https://www.w3.org/TR/xmlschema-2/">official specifications</link>
                  made by the W3C. The following data types require some special comments.</para>
               <section xml:id="language">
                  <title>Languages</title>
                  <para>TAN adopts for language identification Best Current Practice (BCP) 47, which
                     standardizes with high precision the way languages are identified. For most
                     users of TAN, this will be a simple three-letter abbreviation, sometimes
                     supplemented with hyphens and subtags designating a script or a region, in
                     that order. For example, <code>eng</code>, <code>eng-GB</code>, and
                        <code>eng-Cyrl-GB</code> refer, respectively, to English generally, English
                     from the United Kingdom, and English from the United Kingdom written in the
                     Cyrillic script. As a general rule, values of this type should begin with a
                     three-letter language code, preferably lowercase.</para>
                  <para>ISO codes for human languages appear in <code><link
                           linkend="attribute-xmllang">@xml:lang</link></code> and <code><link
                           linkend="element-for-lang">&lt;for-lang></link></code>. The first
                     indicates the principal language of the text enclosed by the parent element.
                     The second indicates that some statement or claim is being made about a
                     specific language. For example, <code><link linkend="element-for-lang"
                           >&lt;for-lang></link></code> in the context of a TAN-mor file indicates
                     languages for which the encoded morphological rules are appropriate.</para>
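                  <para>As a minimal sketch, assuming a transcription whose text is in English and a
                     TAN-mor file whose rules are meant for ancient Greek and Latin, the two
                     constructs might look as follows. The placement of the elements and the choice
                     of languages are illustrative assumptions, not requirements of the
                     format:<programlisting>&lt;!-- hypothetical: the enclosed text is principally in English -->
&lt;body xml:lang="eng">...&lt;/body>

&lt;!-- hypothetical: the morphological rules are claimed to suit Greek and Latin -->
&lt;for-lang>grc&lt;/for-lang>
&lt;for-lang>lat&lt;/for-lang></programlisting></para>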
                  <para>For more information, see one of the following:<itemizedlist>
                        <listitem>
                           <para>BCP 47 <link xlink:href="http://tools.ietf.org/rfc/bcp/bcp47"
                                 >official specifications</link></para>
                        </listitem>
                        <listitem>
                           <para>BCP 47 <link
                                 xlink:href="http://www.w3.org/TR/xmlschema11-2/#language">technical
                                 details</link></para>
                        </listitem>
                     </itemizedlist></para>
               </section>
               <section xml:id="date_and_datetime">
                  <title>Dates and times</title>

                  <para>TAN adopts the standardized ISO form of dates and date-times, as interpreted
                     by XML data types. These begin with years (the largest unit) and end with
                     days, seconds, or fractions of seconds (the smallest). This standard allows for
                     easy sorting.</para>
                  <para>The simplest date takes this form: <code>YYYY-MM-DD</code>. If a time is
                     included, it is specified by continuing the string, first with a <code>T</code>
                     (for time) then the form <code>hh:mm:ss.sss(Z|[-+]hh:mm)</code>. For example,
                     <code>2016-09-20T20:38:27.141-04:00</code> is an ISO date-time
                     for Tuesday, September 20, 2016 at 8:38 p.m. in the Eastern Time Zone.</para>
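                  <para>The following stylesheet is a minimal sketch of how such values behave once
                     they are treated as XML data types. It assumes an XSLT 2.0 (or later) processor
                     that is invoked with the named template <code>main</code> as its entry point;
                     the template name and the variable names are hypothetical. The output should be
                        <code>true true</code>, because the two date-times denote the same instant
                     in different time zones, and because the first date precedes the
                     second:<programlisting>&lt;xsl:stylesheet version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema">
   &lt;xsl:output method="text"/>
   &lt;!-- hypothetical entry point; call it as the initial template -->
   &lt;xsl:template name="main">
      &lt;!-- 8:38 p.m. at four hours behind UTC... -->
      &lt;xsl:variable name="local" select="xs:dateTime('2016-09-20T20:38:27.141-04:00')"/>
      &lt;!-- ...is 12:38 a.m. on the next day in UTC -->
      &lt;xsl:variable name="utc" select="xs:dateTime('2016-09-21T00:38:27.141Z')"/>
      &lt;!-- first test: same instant; second test: chronological order of two dates -->
      &lt;xsl:value-of
         select="($local eq $utc, xs:date('2016-09-20') lt xs:date('2016-10-01'))"/>
   &lt;/xsl:template>
&lt;/xsl:stylesheet></programlisting></para>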
                  <para>More reading:<itemizedlist>
                        <listitem>
                           <para><link xlink:href="https://www.w3.org/TR/xmlschema-2/#dateTime">W3C
                                 specification</link></para>
                        </listitem>
                        <listitem>
                           <para><link xlink:href="https://en.wikipedia.org/wiki/ISO_8601">Wikipedia
                                 entry on ISO 8601</link></para>
                        </listitem>
                     </itemizedlist></para>

               </section>
            </section>
            <section xml:id="IRIs_and_linked_data">
               <title>Identifiers and Their Use</title>
               <para>The acronyms for identifiers, and the meanings of those acronyms, can be
                  mystifying. Here is a synopsis:</para>
               <para>
                  <itemizedlist>
                     <listitem>
                        <para><emphasis>IRI</emphasis>: Internationalized Resource Identifier, a
                           generalization of the URI system, allowing the use of Unicode; <link
                              xlink:href="http://www.ietf.org/rfc/rfc3987.txt">defined by RFC
                              3987</link></para>
                     </listitem>
                     <listitem>
                        <para><emphasis>URI</emphasis>: Uniform Resource Identifier, a string of
                           characters used to identify a name or a resource; <link
                              xlink:href="https://tools.ietf.org/html/rfc3986">defined by RFC
                              3986</link></para>
                     </listitem>
                     <listitem>
                        <para><emphasis>URL</emphasis>: Uniform Resource Locator, a URI that
                           identifies a Web resource and the communication protocol for retrieving
                           the resource.</para>
                     </listitem>
                     <listitem>
                        <para><emphasis>URN</emphasis>: Uniform Resource Name, a term that
                           originally referred to persistent names using the <code>urn:</code>
                           scheme, but is now applied to a variety of systems that have registered
                           with the IANA. URNs are generally best thought of as a subset of
                           URIs.</para>
                     </listitem>
                     <listitem>
                        <para><emphasis>UUID</emphasis>: Universally Unique Identifier, a
                           computer-generated 128-bit number used to assign identifiers to any
                           entity. UUIDs can be built into a URN by prefixing them with
                              <code>urn:uuid:</code>.</para>
                     </listitem>
                  </itemizedlist>
               </para>
               <para>The TAN format generally prefers to refer to IRIs.</para>
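               <para>To make the distinctions concrete, the following sketch gives one hypothetical
                  person three identifiers of different kinds. The <code>&lt;person></code> element
                  serves here only as a convenient container, the tag URN is invented, and the UUID
                  is the sample value printed in RFC 4122; nothing here implies that any of these
                  names is actually in
                  use:<programlisting>&lt;person xml:id="dsmith">
   &lt;!-- a URL used as a name -->
   &lt;IRI>http://biglynx.co.uk/people/dave-smith&lt;/IRI>
   &lt;!-- a tag URN; the authority, date, and name are hypothetical -->
   &lt;IRI>tag:example.com,2001-04:person:dave-smith&lt;/IRI>
   &lt;!-- a UUID expressed as a URN -->
   &lt;IRI>urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6&lt;/IRI>
   &lt;name>Dave Smith&lt;/name>
&lt;/person></programlisting></para>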
               <para>See also <xref xlink:href="#tag_urn"/>.</para>
               <section xml:id="rdf_and_lod">
                  <title>Resource Description Framework (RDF) and Linked Open Data</title>
                  <section>
                     <title>What are they?</title>
                     <para>Identifiers are used in many contexts for many purposes. One of the key
                        purposes close to those of TAN involves what is called variously Linked Open
                        Data (LOD) or the Semantic Web. These technologies rely upon a very simple
                        data model called Resource Description Framework (RDF), a family of World
                        Wide Web Consortium (W3C) specifications originally designed as a data model
                        for metadata. The foundation of the model is the concept of a statement,
                        made of three parts: subject, predicate, and object. Subjects and predicates
                        take identifiers that act as names of things. The object may be either such
                        an identifier or a literal value, optionally qualified by a data type. The
                        practical impetus to LOD is that if we use URLs as
                        identifiers for things, then we can create web pages at those URLs that
                        provide humans and computers with related, linked information. And as we
                        begin to use the same URLs for the same concepts, then independently created
                        datasets can be combined and compared into a whole that admits inferences
                        not possible with the parts alone.</para>
                     <para>These URL identifiers look like a web page address (e.g.,
                           <code>http://...</code>), but are first and foremost names for things
                        (the "Resource" behind RDF is a clumsy term pointing to person, place,
                        concept—anything at all). Ideally, those URLs will still name those things
                        after the domain name expires and the web resource cannot be found. But
                        ordinary users may be forgiven for not knowing whether the URL is a web page
                        or a name for something else.</para>
                  </section>
                  <section>
                     <title>TAN and RDF</title>
                     <para>Many parts of TAN map nicely onto RDF and vice versa. In fact, TAN tends
                        to be easier for humans to read and write than does RDF, even in its most
                        straightforward syntax. Compare, for example, this snippet (taken from <link
                           xlink:href="http://linkeddatabook.com/editions/1.0/"/>), written in
                        Turtle syntax,
                        ...<programlisting>@prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: &lt;http://xmlns.com/foaf/0.1/> .

&lt;http://biglynx.co.uk/people/dave-smith>
    rdf:type foaf:Person ;
    foaf:name "Dave Smith" .</programlisting></para>
                     <para>...with the TAN
                        equivalent:<programlisting>&lt;person xml:id="dsmith">
   &lt;IRI>http://biglynx.co.uk/people/dave-smith&lt;/IRI>
   &lt;name>Dave Smith&lt;/name>
&lt;/person></programlisting></para>
                     <para>In this case TAN and RDF are converted losslessly. But in many cases, TAN
                        statements cannot be reduced to the RDF model. This happens most often in
                        the context of <code><link linkend="element-claim">&lt;claim></link></code>,
                        which is designed to allow scholarly assertions and claims that are
                        difficult or impossible to express in RDF. For example, RDF does not allow
                        one to say "Person X is not the author of text Y." TAN claims have been
                        designed specifically to cater to such common scholarly expressions. For
                        more details see <xref linkend="tan-c"/>.</para>
                  </section>
                  <section>
                     <title>Further reading</title>
                     <para>
                        <itemizedlist>
                           <listitem>
                              <para><link xlink:href="https://www.w3.org/RDF/">W3C
                                    recommendation</link></para>
                           </listitem>
                           <listitem>
                              <para><link xlink:href="http://linkeddata.org/">Linked
                                 Data</link></para>
                           </listitem>
                           <listitem>
                              <para><link xlink:href="http://lov.okfn.org/dataset/lov/">Linked Open
                                    Vocabularies</link></para>
                           </listitem>
                        </itemizedlist>
                     </para>
                  </section>
               </section>
               <section xml:id="tag_urn">
                  <title>Tag URNs</title>


                  <para>TAN files make extensive use of tag URNs (see <xref
                        xlink:href="#IRIs_and_linked_data"/>). In fact, TAN's namespace is a tag URN
                        (<xref linkend="namespace"/>). A <link xlink:href="http://www.taguri.org"
                        >tag URN</link> has two parts:</para>
                  <para>
                     <orderedlist>
                        <listitem>
                           <para><emphasis role="bold">Namespace.</emphasis>
                              <code>tag:</code> + an e-mail address or domain name owned by the
                              person or organization that has authorized the creation of the TAN
                              file + <code>,</code> + an arbitrary day on which that address or
                              domain name was owned. The day is expressed in the form
                                 <code>YYYY-MM-DD</code>, <code>YYYY-MM</code>, or
                              <code>YYYY</code>. A missing <code>MM</code> or <code>DD</code> is
                              implicitly assigned the value of <code>01</code>.</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Name of the TAN file.</emphasis>
                              <code>:</code> + an arbitrary string (unique to the namespace chosen)
                              chosen by the namespace owner as a label for the entire file and
                              related versions. It need not be the same as the filename stored on a
                              local directory. You should pick a name that is at least somewhat
                              intelligible to human readers.</para>
                        </listitem>
                     </orderedlist>
                  </para>
                  <para>Great care must be taken in choosing the IRI name, because you are the sole
                     guarantor of its uniqueness. <emphasis role="italic">It is permissible for
                        something to have multiple IRIs, but never acceptable for an IRI to name
                        more than one thing.</emphasis> It is a good practice to keep a master
                     checklist of IRI names you have created. If you find yourself forgetting, or
                     think you run the risk of creating duplicate IRI names, you should start afresh
                     by creating a new namespace for your tag URNs, easily done just by changing the
                     date in the tag URN namespace. That is, if
                        <code>tag:textalign.net,2015:...</code> seems to be overly cluttered, you
                     may start a new set of names with something else, e.g.,
                        <code>tag:textalign.net,2015-01-02:...</code>.</para>
                  <para>
                     <example>
                        <title>TAN IRI names</title>
                        <programlisting>tag:jan@example.com,1999-01-31:TAN-T001
tag:example.com,2001-04:hamlet-tan-t
tag:evagriusponticus.net,2014:tan-lm:Evagrius_Praktikos_grc_Guillaumonts
tag:bbrb@example.org,1995-04-01:pos-grc</programlisting>
                        <para>The first example comes from someone who owned the email address
                              <code>jan@example.com</code> on January 31, 1999 (at the stroke of
                           midnight, Coordinated Universal Time). The other examples follow a
                           similar logic. The namespaces of the second and third examples are tied to
                           the owners of specific domain names, not those of email addresses. The
                              <code>2014</code> in the third example is shorthand for the first
                           second of January 1, 2014.</para>
                     </example>
                  </para>
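                  <para>The two-part structure described above can be checked mechanically. The
                     following stylesheet is a minimal sketch, assuming an XSLT 2.0 (or later)
                     processor invoked with the hypothetical named template <code>main</code>. The
                     pattern in <code>$tag-pattern</code> is a deliberate simplification written for
                     this illustration; it is not the full grammar of RFC 4151 and is not part of
                     the TAN validation rules. The first two names follow the structure and the
                     third lacks a date, so the output should be <code>true true
                     false</code>:<programlisting>&lt;xsl:stylesheet version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   &lt;xsl:output method="text"/>
   &lt;!-- simplified: tag: + authority + , + YYYY[-MM[-DD]] + : + name -->
   &lt;xsl:variable name="tag-pattern"
      select="'^tag:[^,:]+,\d{4}(-\d{2}(-\d{2})?)?:.+$'"/>
   &lt;!-- hypothetical entry point; call it as the initial template -->
   &lt;xsl:template name="main">
      &lt;xsl:for-each select="('tag:jan@example.com,1999-01-31:TAN-T001',
         'tag:evagriusponticus.net,2014:tan-lm:Evagrius_Praktikos_grc_Guillaumonts',
         'tag:example.com:hamlet-tan-t')">
         &lt;!-- report whether each candidate name fits the simplified pattern -->
         &lt;xsl:value-of select="matches(., $tag-pattern)"/>
         &lt;xsl:text> &lt;/xsl:text>
      &lt;/xsl:for-each>
   &lt;/xsl:template>
&lt;/xsl:stylesheet></programlisting></para>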
                  <para>The TAN encoding format has chosen tag URNs over URLs for several
                     reasons:</para>
                  <para>
                     <itemizedlist>
                        <listitem>
                           <para><emphasis role="bold">Permanence.</emphasis> Authors of TAN data
                              are creating files that are meant to be relevant for decades and
                              centuries in the future, well after specific domain names have changed
                              ownership or fallen into obsolescence, and well after the creators are
                              dead. To mint names according to URLs is inadequate for long-term use,
                              since it has no built-in mechanism to identify who owned the domain
                              name in question when the name was minted. </para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Responsibility.</emphasis> The TAN format
                              requires every piece of data to be attributable to someone (a person,
                              organization, or some other agent). Tag URNs attach the
                              responsibility for naming objects to a particular person or
                              organization that owned the tag namespace at the specified time.
                           </para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Accessibility.</emphasis> Tag URNs are
                              available to anyone who has an email address. No one has to register
                              with any central authority. You can begin naming anything you want,
                              any time you want, without seeking anyone's approval.</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Ease</emphasis>. Tag URNs are easier to use
                              than, say, http-form URLs, as recommended by RDF (see <xref
                                 xlink:href="#rdf_and_lod"/>). Many potential TAN authors never have
                              owned a domain name, and never will. Further, many of those who do own
                              domain names cannot or do not wish to configure and maintain servers
                              that will administer the referral mechanisms upon which the semantic
                              web depends.</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Disambiguation of name and
                                 location</emphasis>. In the semantic web, conflation of name with a
                              location to resolve it is considered a virtue because a single string
                              answers two questions: what is the resource and where can I find out
                              more about it. But this conflation is unhelpful for those who use the
                              TAN formats, who are encouraged to distribute their TAN files widely,
                              and not rely upon a single location. And URLs are in common parlance
                              interpreted as locations for data, not as names for things.
                              TAN-compliant tag URNs ensure that the names of concepts and objects
                              do not look like locations, maintaining a distinction that has always
                              been a foundational principle in scholarly citation, namely, that one
                              should always distinguish the name of a resource from where it might
                              be found.</para>
                        </listitem>
                     </itemizedlist>
                  </para>
                  <para>Further reading:<itemizedlist>
                        <listitem>
                           <para><link xlink:href="https://tools.ietf.org/html/rfc4151">RFC
                                 4151</link>, the official definition of tag URNs</para>
                        </listitem>
                     </itemizedlist></para>

               </section>
            </section>
            <section xml:id="regular_expressions">
               <title>Regular Expressions</title>
               <para>Regular expressions are patterns for searching text. The term <emphasis
                     role="italic">regular</emphasis> here does not mean ordinary. Rather, it means
                     <emphasis>rules</emphasis> (Latin <emphasis role="italic">regula</emphasis>),
                  and points to a rule-based syntax that provides expressive power in algorithms
                  that search and replace text. Regular expressions come in different flavors, and
                  have several layers of complexity. So these guidelines are restricted to a
                  synopsis that illustrates very common uses that conform to the definition of
                  regular expressions found in the <link
                     xlink:href="http://www.w3.org/TR/xslt-30/#regular-expressions">recommendation
                     of XSLT 3.0</link> (XML Schema Datatypes plus some extensions), and outlined in
                     <link xlink:href="http://www.w3.org/TR/xpath-functions-30/#regex-syntax">XPath
                     Functions 3.0</link>. <caution>
                     <para>XML Schema Datatypes define regular expressions differently than does Perl,
                        whose flavor of regular expression is one of the most common. For example, the
                        pipe symbol, |, is treated as a word character in XML regular expressions
                           (<code>\w</code>), but the opposite is true for Perl. For convenience,
                        here is how codepoints U+0020..U+00FF are categorized according to XML
                        (and therefore TAN):</para>
                     <para><emphasis role="bold">Word characters </emphasis>(<code>\w</code>):
                           <code>$ + 0 1 2 3 4 5 6 7 8 9 &lt; = > A B C D E F G H I J K L M N O P Q
                           R S T U V W X Y Z ^ ` a b c d e f g h i j k l m n o p q r s t u v w x y z
                           | ~ ¢ £ ¤ ¥ ¦ ¨ © ª ¬ ® ¯ ° ± ² ³ ´ µ ¸ ¹ º ¼ ½ ¾ À Á Â Ã Ä Å Æ Ç È É Ê Ë
                           Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð
                           ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ</code>
                     </para>
                     <para><emphasis role="bold">Non-word characters </emphasis>(<code>\W</code>):
                           <code>! " # % &amp; ' ( ) * , - . / : ; ? @ [ \ ] _ { } ¡ § « ­ ¶ · »
                           ¿</code></para>
                     <para>Some of these choices may seem counterintuitive or wrong. But at this
                        point it does not matter. The distinction is a legacy that will remain in
                        place. It is advisable to familiarize yourself with decisions that, in some
                        respect, are arbitrary.</para>
                  </caution></para>
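               <para>The difference from Perl can be observed directly. The following stylesheet is
                  a minimal sketch, assuming an XSLT 2.0 (or later) processor invoked with the
                  hypothetical named template <code>main</code>. Following the categories listed
                  above, the pipe should test as a word character and the underscore should not, so
                  the output should be <code>true false</code>; a Perl-style engine would give the
                  reverse:<programlisting>&lt;xsl:stylesheet version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   &lt;xsl:output method="text"/>
   &lt;!-- hypothetical entry point; call it as the initial template -->
   &lt;xsl:template name="main">
      &lt;!-- the pipe is a symbol, hence a word character in XML regular expressions -->
      &lt;xsl:value-of select="matches('|', '^\w$')"/>
      &lt;xsl:text> &lt;/xsl:text>
      &lt;!-- the underscore is punctuation, hence a non-word character -->
      &lt;xsl:value-of select="matches('_', '^\w$')"/>
   &lt;/xsl:template>
&lt;/xsl:stylesheet></programlisting></para>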
               <para>A regular expression search pattern is treated just like a conventional search
                  pattern until the computer reaches a special escape character: <code>. [ ] \ | - ^
                     $ ? * + { } ( )</code>. Here is a brief key to how characters behave in regular
                  expressions, provided they are not in square brackets (on which see the
                  recommended reading below):</para>
               <para>
                  <table frame="all">
                     <title>Special characters in regular expressions</title>
                     <tgroup cols="2">
                        <colspec colname="c1" colnum="1" colwidth="1*"/>
                        <colspec colname="c2" colnum="2" colwidth="12.33*"/>
                        <thead>
                           <row>
                              <entry>Symbol</entry>
                              <entry>Meaning</entry>
                           </row>
                        </thead>
                        <tbody>
                           <row>
                              <entry>$</entry>
                              <entry>end of line</entry>
                           </row>
                           <row>
                              <entry>.</entry>
                              <entry>any character</entry>
                           </row>
                           <row>
                              <entry>|</entry>
                              <entry>or (union)</entry>
                           </row>
                           <row>
                              <entry>^</entry>
                              <entry>start of line</entry>
                           </row>
                           <row>
                              <entry>?</entry>
                              <entry>zero or one</entry>
                           </row>
                           <row>
                              <entry>*</entry>
                              <entry>zero or more</entry>
                           </row>
                           <row>
                              <entry>+</entry>
                              <entry>one or more</entry>
                           </row>
                           <row>
                              <entry>[ ]</entry>
                              <entry>a class of characters</entry>
                           </row>
                           <row>
                              <entry>( )</entry>
                              <entry>a group</entry>
                           </row>
                           <row>
                              <entry>\w</entry>
                              <entry>any word character</entry>
                           </row>
                           <row>
                              <entry>\W</entry>
                              <entry>any nonword character</entry>
                           </row>
                           <row>
                              <entry>\s</entry>
                              <entry>any of the four standard spacing characters: space (U+0020),
                                 tab (U+0009), newline (U+000A), carriage return (U+000D)</entry>
                           </row>
                           <row>
                              <entry>\S</entry>
                              <entry>anything not a spacing character</entry>
                           </row>
                           <row>
                              <entry>\d</entry>
                              <entry>any digit (0-9, as well as decimal digits in other scripts)</entry>
                           </row>
                           <row>
                              <entry>\D</entry>
                              <entry>anything not a digit</entry>
                           </row>
                           <row>
                              <entry>\p{IsGujarati}</entry>
                              <entry>any character from the Unicode block named Gujarati</entry>
                           </row>
                           <row>
                              <entry>\\</entry>
                              <entry>backslash (the backslash alone suggests that the next character
                                 is a special character)</entry>
                           </row>
                           <row>
                              <entry>\$</entry>
                              <entry>dollar sign</entry>
                           </row>
                           <row>
                              <entry>\(</entry>
                              <entry>opening parenthesis</entry>
                           </row>
                           <row>
                              <entry>\[</entry>
                              <entry>opening square bracket</entry>
                           </row>
                        </tbody>
                     </tgroup>
                  </table>
               </para>
               <para>Some examples:</para>
               <table frame="all">
                  <title>Examples of Regular Expressions</title>
                  <tgroup cols="3">
                     <colspec colname="newCol1" colnum="1" colwidth="1*"/>
                     <colspec colname="c1" colnum="2" colwidth="1.48*"/>
                     <colspec colname="c2" colnum="3" colwidth="6.59*"/>
                     <thead>
                        <row>
                           <entry>Expression</entry>
                           <entry>Meaning</entry>
                           <entry>What the expression matches when applied to "Wi-fi, good. A_hem*
                              isn't!"</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry><code>^.+$</code></entry>
                           <entry>one whole line of characters</entry>
                           <entry>"Wi-fi, good. A_hem* isn't!"</entry>
                        </row>
                        <row>
                           <entry><code>[ae]</code></entry>
                           <entry>a or e</entry>
                           <entry>"e"</entry>
                        </row>
                        <row>
                           <entry><code>[a-e]</code></entry>
                           <entry>a, b, c, d, or e</entry>
                           <entry>"d", "e"</entry>
                        </row>
                        <row>
                           <entry><code>[^ae]+</code></entry>
                           <entry>one or more characters that are anything except a or e</entry>
                           <entry>"Wi-fi, good. A_h", "m* isn't!"</entry>
                        </row>
                        <row>
                           <entry><code>.i</code></entry>
                           <entry>any character followed by i.</entry>
                           <entry>"Wi", "fi", " i"</entry>
                        </row>
                        <row>
                           <entry><code>(.i)</code></entry>
                           <entry>when a character followed by an i is found treat it as a capture
                              group (used only in a search pattern)</entry>
                           <entry>"Wi", "fi", " i"</entry>
                        </row>
                        <row>
                           <entry><code>$1</code></entry>
                           <entry>first capture group (used only in a replacement pattern, and
                              corresponds to the sequence of capture groups in the search
                              pattern)</entry>
                           <entry>In the example above, each match corresponds to $1</entry>
                        </row>
                        <row>
                           <entry><code>[aeiou]\w*</code></entry>
                           <entry>any lowercase vowel along with every word character that
                              follows</entry>
                           <entry>"i", "i", "ood", "em", "isn"</entry>
                        </row>
                        <row>
                           <entry><code>[t*].</code></entry>
                           <entry>any t or * and the following character</entry>
                           <entry>"* ", "t!" Note that the asterisk, if inside a character class,
                              acts as itself.</entry>
                        </row>
                        <row>
                           <entry><code>\s+</code></entry>
                           <entry>match one or more space characters</entry>
                           <entry>" ", " ", " "</entry>
                        </row>
                        <row>
                           <entry><code>\w+</code></entry>
                           <entry>match one or more word characters</entry>
                           <entry>"Wi", "fi", "good", "A_hem", "isn", "t"</entry>
                        </row>
                        <row>
                           <entry><code>\W+</code></entry>
                           <entry>match one or more nonword characters</entry>
                           <entry>"-", ", ", ". ", "* ", "'", "!"</entry>
                        </row>
                        <row>
                           <entry><code>[^q]+</code></entry>
                           <entry>one or more characters that are not a q</entry>
                           <entry>"Wi-fi, good. A_hem* isn't!"</entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table>
               <para>The examples above provide a taste of how regular expressions are constructed
                  and read. For further examples especially relevant to TAN see <code><link
                        linkend="element-filter">&lt;filter></link></code>.</para>
               <warning xml:id="reg_exp_and_comb_chars">
                  <title>Regular Expressions and Combining Characters</title>
                  <para>Regular expressions come in many different flavors, and each deals with
                     some of the more complex issues in Unicode in its own manner. This ambiguity
                     is most keenly felt in the use of combining characters. Given a
                     string <code>&amp;#x61;&amp;#x301;&amp;#x62;</code> = áb (i.e., an acute
                     accent over the a), the search pattern <code>a.</code> will in some regular
                     expression engines match through the b, and in others stop at the accent.</para>
                  <para>Unicode distinguishes three levels of support for regular expressions
                     (see the <link xlink:href="http://www.unicode.org/reports/tr18/">official
                        report</link>). TAN guarantees only level one conformance. Combining
                     characters fall in level two. If you need to count characters, and you
                     are working with a language that uses combining characters, you should count
                     only base characters, not combining ones. In fact, TAN assumes that in cases
                     where characters are identified with a numeral, the numeral excludes combining
                     characters. See <xref linkend="combining_characters"/>. Further, regular
                     expressions with wildcard characters cannot be expected to behave uniformly
                     across processors.</para>
               </warning>
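               <para>As a minimal sketch of the counting advice above (an illustration only, not a
                  description of how TAN's own functions work), the following XSLT 2.0 stylesheet
                  measures the sample string twice: once counting every character, and once after
                  stripping all characters in the Unicode category M (marks). The variable name
                     <code>sample</code> is hypothetical.</para>
               <programlisting language="xml"><![CDATA[<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
   <xsl:output method="text"/>
   <!-- Hypothetical sample: a, combining acute accent, b -->
   <xsl:variable name="sample" select="'&#x61;&#x301;&#x62;'"/>
   <xsl:template match="/">
      <!-- Every character, including the combining accent -->
      <xsl:value-of select="string-length($sample)"/>
      <xsl:text> </xsl:text>
      <!-- Base characters only: remove all marks (\p{M}) before counting -->
      <xsl:value-of select="string-length(replace($sample, '\p{M}', ''))"/>
   </xsl:template>
</xsl:stylesheet>]]></programlisting>
               <para>Run against any input document, the first value counts three characters (a, the
                  combining accent, and b) and the second counts two (a and b).</para>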
               <para>TAN includes several functions that usefully extend XML regular expressions.
                  See <code><link linkend="function-regex">tan:regex</link></code>, <code><link
                        linkend="function-matches">tan:matches</link></code>(), <code><link
                        linkend="function-replace">tan:replace</link></code>(), and <code><link
                        linkend="function-tokenize">tan:tokenize</link></code>().</para>
               <para>Further reading:<itemizedlist>
                     <listitem>
                        <para>Various <link
                              xlink:href="http://www.google.com/search?q=tutorial+regular+expressions"
                              >tutorials on Regular Expressions</link></para>
                     </listitem>
                     <listitem>
                        <para>Wikipedia, <link
                              xlink:href="http://en.wikipedia.org/wiki/Regular_expression">Regular
                              Expressions</link></para>
                     </listitem>
                     <listitem>
                        <para><link xlink:href="http://www.w3.org/TR/xslt-30/#regular-expressions"
                              >Regular Expressions in XSLT 3.0</link></para>
                     </listitem>
                     <listitem>
                        <para><link xlink:href="http://www.unicode.org/reports/tr18/">Unicode and
                              Regular Expressions</link></para>
                     </listitem>
                     <listitem>
                        <para><link xlink:href="http://www.w3.org/TR/xmlschema-2/#regexs">XML Schema
                              Datatypes</link></para>
                     </listitem>
                  </itemizedlist></para>
            </section>
         </section>
         <section xml:id="multiple-values">
            <title>Interpretation of multiple values</title>
            <para>The interpretation of an element with multiple child elements (a frequent
               occurrence in TAN files) or of an attribute with multiple values can be quite unclear.
               Do those multiple values represent intersection, union, or distribution? For example,
                  <code>attribute="A B"</code> could be interpreted to mean, using the diagram
               below, one instance in y (intersection), one instance in the region of x or y or z
               (union), or one instance in x or y and one instance in y or z (distribution).</para>
            <figure>
               <title>Venn diagram</title>
               <mediaobject>
                  <imageobject>
                     <imagedata fileref="img/Venn%20diagram.jpeg"/>
                  </imageobject>
               </mediaobject>
            </figure>
            <para>The interpretation of multiple values in any TAN element or
               attribute is based upon perceived common usage in ordinary English. For
               example, any element that takes the <xref linkend="pattern-iri_and_name"/> allows
               multiple <code><link linkend="namespace">&lt;IRI></link></code>s. If entity j has
                     <code><link linkend="namespace">&lt;IRI></link></code>s A and B, and entity k
               has <code><link linkend="namespace">&lt;IRI></link></code>s B and C, can j be
               inferred to be the same entity as k? Because people commonly use the same term while
               meaning different things, TAN can answer only the first half of this question. The
               IRI + name pattern is to be interpreted as union. But the TAN schemas cannot predict
               how people will interpret the extent of those two unions, or for that matter how they
               will interpret a single IRI.</para>
            <para>The TAN schemas interpret the meaning of multiple values in an element or
               attribute in one of three ways:</para>
            <para><emphasis role="bold">Intersection.</emphasis> Qualifications of claims, e.g.,
                     <code><link linkend="attribute-adverb">@adverb</link></code>, <code><link
                     linkend="attribute-claimant">@claimant</link></code>. For example, "...probably
               not..." does not mean "...probably..." and "...not...". This property is not
               transitive (for j = A, B; k = B, C, nothing can be inferred about the relationship
               between j and k).</para>
            <para><emphasis role="bold">Union (default).</emphasis> Anything that takes the <xref
                  linkend="pattern-iri_and_name"/>, <code><link linkend="element-equate-works"
                     >&lt;equate-works></link></code>, <code><link linkend="attribute-when"
                     >@when</link></code>,
               <code><link linkend="element-when">&lt;when&gt;</link></code>, <code><link
                     linkend="attribute-where">@where</link></code>. For example, "entity j is
               [urn:A], [urn:B]" means that entity j is urn A, urn B, or both. TAN interprets this
               property as being transitive (for j = A, B; k = B, C; l = C, D, one may infer j = k =
               l). <warning>
                  <para>The interpretation of union as being transitive may result in inferences you
                     disagree with. It is your responsibility to interrogate inferences in the TAN
                     files you are using.</para>
               </warning></para>
            <para><emphasis role="bold">Distribution.</emphasis>
               <code><link linkend="attribute-affects-element">@affects-element</link></code>,
                     <code><link linkend="attribute-object">@object</link></code>, <code><link
                     linkend="element-object">&lt;object&gt;</link></code>, <code><link
                     linkend="attribute-src">@src</link></code>, <code><link
                     linkend="attribute-subject">@subject</link></code>, <code><link
                     linkend="element-subject">&lt;subject&gt;</link></code>, <code><link
                     linkend="attribute-verb">@verb</link></code>. For example, "[source A],
               [source B] are Z" means "source A is Z" and "source B is Z." This property is not
               transitive.</para>
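            <para>To make the union interpretation concrete, here is a hedged sketch of the j and k
               scenario described above. The wrapper element <code>&lt;agent></code> is chosen only
               for illustration, and the IRIs and names are hypothetical.</para>
            <programlisting language="xml"><![CDATA[<!-- entity j: IRIs A and B -->
<agent>
   <IRI>http://example.org/A</IRI>
   <IRI>http://example.org/B</IRI>
   <name>j</name>
</agent>
<!-- entity k: IRIs B and C -->
<agent>
   <IRI>http://example.org/B</IRI>
   <IRI>http://example.org/C</IRI>
   <name>k</name>
</agent>]]></programlisting>
            <para>Because union is treated as transitive, the shared IRI B allows the inference that
               j and k are the same entity, whether or not the two authors intended it.</para>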
            <para>The above has ignored the important question of range. If entity x is said to be
               A, does it mean that it is true for all of x and all of A, or just some part of each?
               If the entity is one or more word tokens, then the statement is assumed to hold over
               the entire entity. If the claim is being made of a range of text, that assumption
               cannot be made. For example, to say that passage x quotes from passage y should not
               be interpreted to mean that the entirety of x quotes the entirety of y.</para>
            <para>At present, TAN does not address this ambiguity, and leaves judgment, based on
               common sense, to you. </para>
         </section>
      </chapter>
      <chapter xml:id="class_common">
         <title>Patterns and Structures Common to All TAN Encoding Formats</title>
         <para>This chapter provides general background to the elements and attributes that are
            common to all TAN files. For detailed discussion see <xref
               linkend="elements-attributes-and-patterns"/>.</para>
         <section xml:id="patterns">
            <title>Common Patterns</title>
            <section xml:id="pattern-iri_and_name">
               <title>IRI + name Pattern</title>
               <para>Both humans and computers need to read and write TAN metadata. Very often what
                  is readable to humans is unreadable to computers, and vice versa. So the TAN
                  format requires that all metadata be provided whenever possible in both forms.
                  Although this rule may appear to introduce redundancy and therefore new
                  opportunities for error, the clarity is critical. It is the only way at present to
                  ensure that anyone who approaches the data—computer or human—can parse and use it.
                  In addition, doubly expressed metadata provides a safeguard much like a checksum:
                  human- and computer-readable descriptions should correspond. Any discrepancy is a
                  signal that an error should be diagnosed and fixed.</para>
               <para>Some metadata, such as comments, are neither easily nor profitably translated
                  into a computer-actionable string. In such cases only the human-readable form is
                  required. Other metadata involve regular expressions or ISO-compliant dates, both
                  of which are well formed and are usually human-legible. In those cases the data is
                  not repeated. In cases where a datum is not understandable to humans, such as a
                  complex regular expression, a <code><link linkend="element-comment"
                        >&lt;comment></link></code> may be provided.</para>
               <para>Those exceptions aside, all other metadata takes what is called the <emphasis
                     role="italic">IRI + name</emphasis> pattern: one or more <code><link
                        linkend="namespace">&lt;IRI></link></code>s and <code><link
                        linkend="element-name">&lt;name></link></code>s, and zero or more <code><link
                        linkend="element-desc">&lt;desc></link></code>s. If the thing being
                  described is a digital file, then the IRI + name pattern is part of a larger
                  pattern, the <xref linkend="digital_entity_metadata"/>.</para>
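               <para>A minimal sketch of the pattern might look like the following. The wrapper
                  element and every value are hypothetical; only the sequence of
                     <code>&lt;IRI></code>, <code>&lt;name></code>, and <code>&lt;desc></code>
                  reflects the pattern described above.</para>
               <programlisting language="xml"><![CDATA[<agent>
   <!-- computer-readable identifiers: one or more -->
   <IRI>http://example.org/people/jane-doe</IRI>
   <IRI>tag:example.org,2017:jane-doe</IRI>
   <!-- human-readable name: one or more -->
   <name>Jane Doe</name>
   <!-- optional description -->
   <desc>A hypothetical editor invented for this example.</desc>
</agent>]]></programlisting>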
            </section>
            <section xml:id="digital_entity_metadata">
               <title>Digital Entity Metadata Pattern</title>
               <para>Some entities identified by the <xref linkend="pattern-iri_and_name"/> will be
                  digital resources. In those cases, the IRI + name Pattern is extended in two
                  different ways, according to whether the entity is a TAN file or not. </para>
               <para>If the entity is a TAN file, then <code><link linkend="namespace"
                        >&lt;IRI></link></code> (one and only one) must be a valid tag URN that
                  matches the <code><link linkend="attribute-id">@id</link></code> value of the TAN
                  file being referred to. This may seem excessive, since in other contexts (HTML,
                  TEI), one needs only the <code>@href</code> or <code>@src</code>. This extra
                  measure has been introduced because TAN files are meant to be valid long after
                  their creation, when they may be separated from their original context, or when a
                  server no longer has the files referred to. Without the <code><link
                        linkend="attribute-id">@id</link></code> value, recovering the file referred
                  to would be difficult or impossible; with it, recovery is easier and more
                  likely.</para>
               <para>If the entity is not a TAN file, then any IRI may be used. If you choose to use
                  the digital resource's URL as its name (and as its location; see below), then it
                  will be inferred that you mean to identify the digital resource that appeared at
                  that URL at the date or time you accessed it.</para>
               <para>In either case, the pattern adds to the IRI + name pattern one or more
                        <code><link linkend="element-location">&lt;location></link></code>s and an
                  optional <code><link linkend="element-checksum"
                  >&lt;checksum></link></code>.</para>
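               <para>As a rough sketch, a reference to another TAN file might take the following
                  shape. The <code>&lt;see-also></code> wrapper is used only as one plausible
                  context, and the <code>href</code> and <code>when-accessed</code> attributes on
                     <code>&lt;location></code>, like all the values, are assumptions made for this
                  illustration; consult the element documentation for the actual content
                  models.</para>
               <programlisting language="xml"><![CDATA[<see-also>
   <!-- a <relationship> would normally be specified as well; omitted here -->
   <!-- exactly one IRI, a tag URN matching the @id of the file referred to -->
   <IRI>tag:example.com,2017:sample-transcription</IRI>
   <name>Hypothetical sample transcription, earlier version</name>
   <!-- one or more locations; attribute names are assumptions -->
   <location href="http://example.com/tan/sample-transcription.xml"
      when-accessed="2017-05-24"/>
   <!-- the optional <checksum> is omitted from this sketch -->
</see-also>]]></programlisting>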
            </section>
            <section xml:id="edit_stamp">
               <title>Edit Stamp</title>
               <para>Most TAN elements allow for an optional edit stamp, an <code><link
                        linkend="attribute-ed-who">@ed-who</link></code> and an <code><link
                        linkend="attribute-ed-when">@ed-when</link></code>, stating who created or
                  edited the enclosed data and when. Neither attribute is allowed without the other. </para>
               <para><code><link linkend="attribute-ed-when">@ed-when</link></code>,
                        <code><link linkend="attribute-when">@when</link></code>, and <code><link
                        linkend="attribute-when-accessed">@when-accessed</link></code> are the
                  attributes through which a TAN file's version is calculated. The latest date
                  among them serves as the version number.</para>
               <para>An edit stamp performs the same function as <code><link
                        linkend="element-change">&lt;change></link></code>, except that no
                  description can be provided, and it points precisely to the element where a change
                  has been made. If a description of the alteration is necessary, <code><link
                        linkend="element-change">&lt;change></link></code> should be used.</para>
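               <para>For illustration, an edit stamp on a single element might look like this. The
                  value <code>jd</code> is a hypothetical identifier for the person responsible
                  (how such identifiers are declared is assumed, not shown), and the date is
                  likewise invented.</para>
               <programlisting language="xml"><![CDATA[<!-- both attributes must appear together -->
<name ed-who="jd" ed-when="2017-05-24">Jane Doe</name>]]></programlisting>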
            </section>
         </section>
         <section xml:id="structure">
            <title>Overall Structure (root)</title>
            <para>All TAN-compliant files, no matter the type or class, follow a common basic
               structure: (1) at least three processing instruction nodes, (2) a namespace node, and
               (3) a root element.</para>
            <para><emphasis role="italic">Processing instruction nodes</emphasis>: The first of
               three required processing instruction nodes is the standard declaration made in every
               XML file's prolog: <code>&lt;?xml version="1.0" encoding="UTF-8"?></code>. After that
               come two more processing instruction nodes specifying the two schema files required
               for validation:<itemizedlist>
                  <listitem>
                     <para><code>&lt;?xml-model href="[PATH]/[ROOT-ELEMENT-NAME].rn[g OR c]"
                           type="application/relax-ng-compact-syntax"?></code></para>
                  </listitem>
                  <listitem>
                     <para><code>&lt;?xml-model href="[PATH]/[ROOT-ELEMENT-NAME].sch"
                           type="application/xml"
                           schematypens="http://purl.oclc.org/dsdl/schematron"?></code></para>
                  </listitem>
               </itemizedlist></para>
            <para>The first processing instruction node points to the RELAX-NG schema that declares
               the major, structural rules. The second points to the finely tuned rules, written in
               Schematron. Both processing instructions are required. <code>[PATH]</code> represents
               the pathname to the schema file, whether local or on a server, and
                  <code>[ROOT-ELEMENT-NAME]</code> stands for the name of the root element (the
               element that is the ancestor of all other elements in the document and the descendant
               of none).<note>
                  <para>An exception to this rule is that a TAN-LM file may alternatively point to
                        <code>TAN-LM-lang.rng</code>, <code>TAN-LM-lang.rnc</code>, and
                        <code>TAN-LM-lang.sch</code>. These are cases where the TAN-LM file is not
                     based on a particular source but on a language in general. See <xref
                        xlink:href="#tan-lm"/>.</para>
               </note> It is your choice whether you use <code>.rnc</code> or <code>.rng</code> as
               the extension for the RELAX-NG schema. The former is the compact syntax and the
               latter, the XML format. They are equivalent. The schemas are written primarily in the
               compact syntax, then converted to the XML format.</para>
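            <para>Filled in for a TAN-T file, the prolog described above might read as follows. The
               relative path <code>../schemas/</code> is a hypothetical location; substitute
               wherever your copies of the schemas are kept.</para>
            <programlisting language="xml"><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<!-- structural rules: RELAX-NG, compact syntax -->
<?xml-model href="../schemas/TAN-T.rnc" type="application/relax-ng-compact-syntax"?>
<!-- finely tuned rules: Schematron -->
<?xml-model href="../schemas/TAN-T.sch" type="application/xml"
   schematypens="http://purl.oclc.org/dsdl/schematron"?>]]></programlisting>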
            <para>Some files admit different levels of validation, sorted into what Schematron calls
               phases. TAN-A-div phases are termed <code>basic</code> and <code>verbose</code>, and
               are chosen by specifying the phase in the prolog, e.g., <code>&lt;?xml-model
                  href="TAN-A-div.sch" phase="basic" type="application/xml"
                  schematypens="http://purl.oclc.org/dsdl/schematron"?></code>. The verbose version
               makes extra calculations that go beyond mere validation and analyzes the differences
               between source files. In most cases, if you have not specified which phase you prefer
               in the prolog, you will be prompted for a choice when you validate your file. </para>
            <para>Master files are kept at the TAN git repository and website, but anyone may cache,
               save, serve, and use copies of the TAN schema files anywhere. </para>
            <para><emphasis role="italic">Namespace node</emphasis>: All TAN elements take the
               namespace <code>tag:textalign.net,2015:ns</code>. In most cases, this value is placed
               in the root element. (The only exceptions are TAN-TEI transcription files, which take
               as a default namespace <code>http://www.tei-c.org/ns/1.0</code> everywhere but in
                  <code>/TEI/head</code>, which takes the TAN namespace.) For more about namespaces,
               see <xref linkend="namespace"/>.</para>
            <para><emphasis role="italic">Root element</emphasis>: The name of the root element
               identifies the type of TAN file:<table frame="all">
                  <title>Root TAN elements</title>
                  <tgroup cols="3">
                     <colspec colname="c1" colnum="1" colwidth="1.19*"/>
                     <colspec colname="c2" colnum="2" colwidth="1.19*"/>
                     <colspec colname="newCol3" colnum="3" colwidth="1*"/>
                     <thead>
                        <row>
                           <entry>Root element name</entry>
                           <entry>Type of data</entry>
                           <entry>TAN class</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry><code><link linkend="element-TAN-T"
                              >&lt;TAN-T></link></code></entry>
                           <entry>plain text transcriptions</entry>
                           <entry><link linkend="class_1">1</link></entry>
                        </row>
                        <row>
                           <entry><code>&lt;TEI></code></entry>
                           <entry>TEI transcriptions</entry>
                           <entry><link linkend="class_1">1</link></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-TAN-A-tok"
                                 >&lt;TAN-A-tok></link></code></entry>
                           <entry>token-based alignments</entry>
                           <entry><link linkend="class_2">2</link></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-TAN-A-div"
                                 >&lt;TAN-A-div></link></code></entry>
                           <entry>division-based alignments</entry>
                           <entry><link linkend="class_2">2</link></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-TAN-LM"
                              >&lt;TAN-LM></link></code></entry>
                           <entry>lexico-morphological analysis</entry>
                           <entry><link linkend="class_2">2</link></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-TAN-mor"
                              >&lt;TAN-mor></link></code></entry>
                           <entry>part of speech / morphology patterns</entry>
                           <entry><link linkend="class_3">3</link></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-TAN-key"
                              >&lt;TAN-key></link></code></entry>
                           <entry>glossaries</entry>
                           <entry><link linkend="class_3">3</link></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-TAN-c"
                              >&lt;TAN-c></link></code></entry>
                           <entry>claims</entry>
                           <entry><link linkend="class_3">3</link></entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table></para>
            <para>Each root element takes a mandatory <code><link linkend="attribute-id"
                  >@id</link></code> and <code><link linkend="attribute-TAN-version"
                     >@TAN-version</link></code>. </para>
            <para>The root element takes only two mandatory children: <code><link
                     linkend="element-head">&lt;head></link></code> and <code><link
                     linkend="element-body">&lt;body></link></code>, the latter containing data and
               the former, metadata (data about the data). The only exceptions to this rule are
               TAN-TEI files, which take three children: <code>&lt;teiHeader></code>, <code><link
                     linkend="element-head">&lt;head></link></code>, and <code>&lt;text></code>,
               because the TEI header is inadequate for TAN purposes. See <xref linkend="tan-tei"/>. </para>
            <para>All TAN files may take one final optional child, <link linkend="element-tail"
                     ><code>&lt;tail></code></link>, a private use element that allows any
               well-formed XML. Nothing in a TAN file should be dependent upon the <link
                  linkend="element-tail"><code>&lt;tail></code></link>. That is, if you are editing
               a TAN file and you add a <link linkend="element-tail"><code>&lt;tail></code></link>,
               assume that it will be disregarded by other users. Similarly, you may delete any TAN
               file's <link linkend="element-tail"><code>&lt;tail></code></link> without
               consequence.</para>
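            <para>Putting these pieces together, the skeleton of a TAN-T file might be sketched as
               follows. The <code>@id</code> value is a hypothetical tag URN, and the contents of
               each child are elided.</para>
            <programlisting language="xml"><![CDATA[<TAN-T xmlns="tag:textalign.net,2015:ns"
   id="tag:example.com,2017:sample-transcription" TAN-version="1 dev">
   <head>
      <!-- metadata (data about the data) -->
   </head>
   <body>
      <!-- data -->
   </body>
   <tail>
      <!-- optional, private use; other users may disregard or delete it -->
   </tail>
</TAN-T>]]></programlisting>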
            <section xml:id="iri_name">
               <title><code><link linkend="attribute-id">@id</link></code> and a TAN file's IRI
                  Name</title>
               <para>Every TAN file requires in its root element an <code><link
                        linkend="attribute-id">@id</link></code>. Its value, termed the TAN file's
                     <emphasis>IRI name</emphasis>, must take the form of a tag URN (see <xref
                     linkend="tag_urn"/> for syntax). The file's IRI name is the primary way other
                  TAN files will refer to it. </para>
               <para>The namespace of the current file's IRI name must match at least one namespace
                  in one <code><link linkend="element-agent">&lt;agent></link></code>'s <code><link
                        linkend="element-IRI">&lt;IRI></link></code> value. This helps tie the
                  responsibility for the TAN file to at least one person. The first such <code><link
                        linkend="element-agent">&lt;agent></link></code> is called the key
                  agent.</para>
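               <para>The following fragment sketches that requirement, assuming that the namespace
                  of a tag URN is the part preceding its final, specific component (here
                     <code>tag:example.com,2017</code>). All values are hypothetical, and other
                  required metadata is omitted.</para>
               <programlisting language="xml"><![CDATA[<TAN-T xmlns="tag:textalign.net,2015:ns"
   id="tag:example.com,2017:sample-transcription" TAN-version="1 dev">
   <head>
      <!-- other metadata omitted -->
      <!-- the first agent whose IRI shares the namespace of @id is the key agent -->
      <agent>
         <IRI>tag:example.com,2017:self</IRI>
         <name>Jane Doe</name>
      </agent>
   </head>
   <body>
      <!-- data -->
   </body>
</TAN-T>]]></programlisting>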
               <para>In choosing a value for <code><link linkend="attribute-id">@id</link></code>
                  you might borrow the filename, but you do not have to. Indeed, it is probably not
                  a good idea, since files are frequently renamed, often with good reason. A TAN
                  file's IRI name should not be changed, especially after publication, because the
                  name is supposed to be permanent and stable. </para>
               <para>On occasion during editing, it will become clear that revisions are so deep
                  that the file is substantially different from how it began. If a previous version
                  has been published, then coining a new IRI name <emphasis>is</emphasis> advised,
                  to dissociate the file from its ancestry. You may always document the connection
                  by supplying a <code><link linkend="element-see-also">&lt;see-also></link></code>
                  element in the <code><link linkend="element-head">&lt;head></link></code>,
                  specifying the <code><link linkend="element-relationship"
                     >&lt;relationship></link></code> between the two.</para>
               <para>If you take someone else's data and alter it, then you should <emphasis
                     role="italic">not</emphasis> change the IRI name, not even the namespace. To avoid
                  suggesting that the owner of that namespace is responsible for the revised file,
                  you should add yourself as an <link linkend="element-agent"
                        ><code>&lt;agent></code></link> and then document your alterations through
                     <link linkend="element-change"><code>&lt;change></code></link> or <link
                     linkend="attribute-ed-when"><code>@ed-when</code></link> and <link
                     linkend="attribute-ed-who"><code>@ed-who</code></link>. You should also
                  probably add a <code><link linkend="element-see-also">&lt;see-also></link></code>
                  element, pointing to a version of the file that predates your intervention.</para>
               <para>The name of the version of a TAN file is identified by the most recent date in
                  a file's <code><link linkend="attribute-when">@when</link></code>, <code><link
                        linkend="attribute-ed-when">@ed-when</link></code>, or <code><link
                        linkend="attribute-when-accessed">@when-accessed</link></code>. It is
                  important, therefore, whenever you change a TAN file that has already been
                  published to provide at least an edit stamp (<xref linkend="edit_stamp"/>) in the
                  part of the file you changed or in a <code><link linkend="element-comment"
                        >&lt;comment></link></code> or <code><link linkend="element-change"
                        >&lt;change></link></code>, so that anyone validating a TAN file dependent
                  upon yours will be warned that changes have been made. The user may then either
                  continue to process the file (the changes may be minor or inconsequential) or
                  investigate the changes before deciding what to do. </para>
               <para>Because the IRI name is stable, it is suitable for use outside of TAN, in, for
                  example, RDFa, JSON-LD, and linked open data (see <xref
                     linkend="IRIs_and_linked_data"/>).</para>
               <para>The IRI name kept at <code><link linkend="attribute-id">@id</link></code> is
                  the only metadatum positioned outside <code><link linkend="element-head"
                        >&lt;head></link></code>. It is placed as rootward in the document as
                  possible to emphasize that it names the entire document.</para>
               <para><code><link linkend="attribute-TAN-version">@TAN-version</link></code> must be
                     <code>1 dev</code>, indicating that the files have been made in light of the
                  development files of version one.</para>
            </section>
         </section>
         <section xml:id="metadata_head">
            <title>Metadata (<code><link linkend="element-head">&lt;head></link></code>)</title>
            <para>No matter how much one TAN format differs from another, the metadata are quite
               similar. Anyone getting a TAN file, no matter its class or type, is assumed to want
               to know, and therefore find easily and predictably, the following:<orderedlist>
                  <listitem>
                     <para>the stable name of the file;</para>
                  </listitem>
                  <listitem>
                     <para>its version;</para>
                  </listitem>
                  <listitem>
                     <para>its sources;</para>
                  </listitem>
                  <listitem>
                     <para>other files upon which it depends or with which it otherwise has an
                        important relationship;</para>
                  </listitem>
                  <listitem>
                     <para>the most significant parts of the editorial history;</para>
                  </listitem>
                  <listitem>
                     <para>the linguistic or scholarly conventions that have been adopted in
                        creating and editing the data;</para>
                  </listitem>
                  <listitem>
                     <para>the license, i.e., who holds what rights to the data, and what kind of
                        reuse is allowed;</para>
                  </listitem>
                  <listitem>
                     <para>the persons, organizations, or entities that helped create the data, and
                        the roles played by each.</para>
                  </listitem>
               </orderedlist></para>
            <para>To answer these questions completely, consistently, and predictably, the
                     <code><link linkend="element-head">&lt;head></link></code>, a mandatory child
               of the root element, takes a common pattern across <emphasis>all</emphasis> TAN
               formats, thus allowing anyone to work easily and predictably across large numbers and
               types of TAN files. The TAN <code><link linkend="element-head"
                  >&lt;head></link></code>, intended to be concise and focused, compels you to
               provide metadata for the data that is governed by <code><link linkend="element-body"
                     >&lt;body></link></code>, but it does not accommodate metadata for the
               metadata. That is, your metadata should focus on the data itself and not other
               things. For example, <code><link linkend="element-head">&lt;head></link></code>
               requires you to name the people who helped create or edit the data, but you are not
               expected to tell us about them. You merely refer through <code><link
                     linkend="element-IRI">&lt;IRI></link></code> to other authoritative sources
               that can provide background information.<note>
                  <para>The principles above explain why the TEI extension of TAN requires two
                     heads, one for TEI and the other for TAN. Because of its design principles, the
                        <code>&lt;teiHeader></code> is impossible to map onto a TAN <code><link
                           linkend="element-head">&lt;head></link></code>. But that
                        <code>&lt;teiHeader></code> has valuable, sometimes critically important,
                     information, and should be retained. Or it may be left empty.</para>
               </note></para>
            <para>Detailed descriptions of <code><link linkend="element-head"
                  >&lt;head></link></code> and its components are in <xref
                  linkend="elements-attributes-and-patterns"/>. Here we provide a summary, general
               description of TAN metadata. </para>
            <para>To <emphasis role="bold">describe the current file</emphasis>, <code><link
                     linkend="element-head">&lt;head></link></code> takes one or more <code><link
                     linkend="element-name">&lt;name></link></code>s, zero or more <code><link
                     linkend="element-desc">&lt;desc></link></code>s and <code><link
                     linkend="element-master-location">&lt;master-location></link></code>s, and one
                     <code><link linkend="element-rights-excluding-sources"
                     >&lt;rights-excluding-sources></link></code>.</para>
            <para>Next comes a list of <emphasis role="bold">files upon which the file
                  depends</emphasis>: zero or more <code><link linkend="element-inclusion"
                     >&lt;inclusion></link></code>s, zero or more <code><link linkend="element-key"
                     >&lt;key></link></code>s, zero or more <code><link linkend="element-source"
                     >&lt;source></link></code>s, and zero or more <code><link
                     linkend="element-see-also">&lt;see-also></link></code>s.</para>
            <para>All <emphasis role="bold">editorial assumptions</emphasis> are placed in
                     <code><link linkend="element-declarations">&lt;declarations></link></code>,
               whose contents differ from one TAN format to the next.</para>
            <para>Finally comes the <emphasis role="bold">responsibility</emphasis> section stating
               who did what when: one or more <code><link linkend="element-agent"
                  >&lt;agent></link></code>s, <code><link linkend="element-role"
                  >&lt;role></link></code>s, and <code><link linkend="element-change"
                     >&lt;change></link></code>s, and zero or more <code><link
                     linkend="element-agentrole">&lt;agentrole></link></code>s.</para>
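            <para>The four groups described above can be pictured with the following outline. It is
               only an illustrative sketch, not a valid file: each ellipsis stands in for attributes
               and content that depend on the TAN format and on your project, and the order of
               elements within each group simply follows the prose above, so it should be checked
               against <xref linkend="elements-attributes-and-patterns"/>.</para>
            <para>
               <programlisting>&lt;head>
   &lt;!-- 1. description of the current file -->
   &lt;name>...&lt;/name>
   &lt;desc>...&lt;/desc>
   &lt;master-location>...&lt;/master-location>
   &lt;rights-excluding-sources>...&lt;/rights-excluding-sources>
   &lt;!-- 2. files upon which the file depends -->
   &lt;inclusion>...&lt;/inclusion>
   &lt;key>...&lt;/key>
   &lt;source>...&lt;/source>
   &lt;see-also>...&lt;/see-also>
   &lt;!-- 3. editorial assumptions -->
   &lt;declarations>...&lt;/declarations>
   &lt;!-- 4. responsibility: who did what when -->
   &lt;agent>...&lt;/agent>
   &lt;role>...&lt;/role>
   &lt;change>...&lt;/change>
   &lt;agentrole>...&lt;/agentrole>
&lt;/head></programlisting>
            </para>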
            <section xml:id="license">
               <title>Rights and Licenses</title>
               <para>Two TAN elements cover rights and licenses: <code><link
                        linkend="element-rights-excluding-sources"
                        >&lt;rights-excluding-sources></link></code> (mandatory in every TAN file)
                  and <code><link linkend="element-rights-source-only"
                        >&lt;rights-source-only></link></code> (optional, and never allowed in class
                  2 files, because a statement on rights is required in each source). The first
                  element covers the work specific to a given TAN file. The second pertains to the
                  rights for the sources. The distinction is important, and helpful. It is much
                  easier for you to decide and state the rights and license behind your own work
                  than to do so for that of others. Declaring who holds what rights over your
                  source(s) may be not only difficult but risky, and is therefore optional (see
                  below).</para>
               <para>As an editor, you are strongly encouraged in the <code><link
                        linkend="element-desc">&lt;desc></link></code> element of <code><link
                        linkend="element-rights-excluding-sources"
                        >&lt;rights-excluding-sources></link></code> to emphasize the distinction
                  between the rights you have over your data and the rights held by others over your
                  source, for the benefit of those who may not be familiar with the TAN format. A
                  statement something like this is recommended: <code><link linkend="element-desc"
                        >&lt;desc></link>The data in this file, only insofar as it constitutes an
                     independent work, is licensed exclusive of any licenses held by parties over
                     the source or sources listed below.&lt;/desc></code></para>
               <para>When using a TAN file, you should investigate the entire chain of rights. If
                  you find a discrepancy between the two licenses—that of a TAN file and that of its
                  sources—you should respect the more restrictive license. If a TAN file has a very
                  liberal, open license for the data, this does not necessarily mean that the
                  material upon which it depends is in the public domain. The TAN file's source may
                  be under tight restrictions.</para>
               <para>It is recommended that you not declare who owns what rights over your source
                  unless you are quite certain. Copyright laws differ from one country to another,
                  and they change. A source may be protected by copyright in one place and
                  simultaneously be in the public domain in another. (At the time of this writing,
                  dozens of scholarly editions of ancient texts are in the public domain in Germany,
                  where copyright of a new edition lasts forty years, but not in the U.S. or Canada,
                  where there is no explicit legislation on this issue.) Some copyright statements
                  in books are false, or cannot be proven. Some persons or entities who claim rights
                  over a source may have no legal basis for the claim, at least in some
                  jurisdictions. Furthermore, if you mischaracterize the rights that are held over a
                  source, you may be held liable by a putative rights holder. It is safer to use the
                        <code><link linkend="element-IRI">&lt;IRI></link></code> of <code><link
                        linkend="element-source">&lt;source></link></code> (described below) to
                  point the user to a publisher or some other entity that has greater authority and
                  specificity about who owns what rights. </para>
               <para>TAN adopts the Creative Commons licenses as its default key vocabulary. See
                     <xref linkend="keywords-rights-excluding-sources"/>.</para>
               <para>
                  <note xml:id="copyright_vs_contract">
                     <title>Copyright Law versus Contract Law</title>
                     <para>Some third-party services, such as the Thesaurus Linguae Graecae for
                        Greek texts, require users to agree not to copy and reuse the texts in
                        the service's databases. Such agreements fall under the area of contract law and
                        not copyright law. That is, many of these third parties have no intellectual
                        property rights (or only derivative rights) over the texts they store.
                        Therefore, they should normally not be credited in any <code><link
                              linkend="element-rights-source-only"
                           >&lt;rights-source-only></link></code>.</para>
                  </note>
               </para>
            </section>
            <section xml:id="inclusions-and-keys">
               <title>Inclusions and Keys</title>
               <para>Many if not most TAN files are created alongside or in the context of a
                  project, where certain elements will be repeated. Such repetition makes the files
                  prone to errors, where editorial corrections made in one place are mistakenly not
                  made everywhere. TAN has two features that help avoid duplication, reduce the
                  likelihood of incomplete editing, and lead to cleaner, smaller files.</para>
               <section>
                  <title>Keys</title>
                  <para>Most often, an editor wants a simple, shorthand reference to an entity
                     commonly referred to from one file to the next in a single project, e.g., the
                     person who is the principal editor. Writing individual <link
                        linkend="pattern-iri_and_name">IRI + name pattern</link>s can be
                     time-consuming, and if a change needs to be made, it is easy to be inconsistent
                     or incomplete.</para>
                  <para>Vocabulary commonly used in a project may be kept in a <code><link
                           linkend="element-TAN-key">&lt;TAN-key&gt;</link></code> file. This file
                     is made accessible to any other TAN file via <code><link linkend="element-key"
                           >&lt;key&gt;</link></code>. The key vocabulary is then invoked by using
                        <link linkend="attribute-which"><code>@which</code></link>, whose value
                     should match a <code><link linkend="element-name">&lt;name&gt;</link></code>
                     value in the TAN-key file.</para>
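                  <para>The two steps might look like the following sketch. The keyword
                        <code>Jane Editor</code>, the identifier <code>editor</code>, and the choice
                     of <code>&lt;agent></code> as the element that invokes the keyword are
                     hypothetical, chosen only for illustration. The comment inside
                        <code>&lt;key></code> stands in for whatever content identifies and locates
                     the TAN-key file.</para>
                  <para>
                     <programlisting>&lt;head>
   ...
   &lt;!-- step 1: make the project's TAN-key file available -->
   &lt;key>
      &lt;!-- identifies and locates the TAN-key file -->
   &lt;/key>
   ...
   &lt;!-- step 2: invoke a keyword; the value should match a name in the TAN-key file -->
   &lt;agent xml:id="editor" which="Jane Editor"/>
   ...
&lt;/head></programlisting>
                  </para>
                  <para>If the editor's IRI or name later changes, only the TAN-key file needs to
                     be corrected; files that invoke the keyword are left untouched.</para>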
                  <para>A number of standard keys have already been predefined, documented in <xref
                        linkend="keywords-master-list"/>. It is strongly recommended that you not
                     depend upon the supplementary TAN-key files of a different project. Rather you
                     should develop your own. You may also wish to create a workflow where the
                     TAN-key is used for private editing, but the published versions have their
                     keywords resolved to their full value.</para>
               </section>
               <section>
                  <title>Inclusions</title>
                  <para>More powerful than TAN-keys are inclusions. Unlike other forms of inclusion
                     you may be familiar with, TAN inclusion involves only select elements, never an
                     entire file. </para>
                  <para>As with keys, TAN inclusion is a two-step process. First, a TAN file is made
                     available for inclusion by invoking <code><link linkend="element-inclusion"
                           >&lt;inclusion></link></code>s (inside <link linkend="element-head"
                           ><code>&lt;head></code></link>). Like <code><link linkend="element-key"
                           >&lt;key&gt;</link></code>, an <code><link linkend="element-inclusion"
                           >&lt;inclusion></link></code> does nothing on its own. It merely
                     indicates a file that may be used for patterned inclusions. </para>
                  <para>Inclusions are acted upon only in the second step. Many elements allow
                           <code><link linkend="attribute-include">@include</link></code>, which
                     points to the <code><link linkend="attribute-xmlid">@xml:id</link></code>
                     reference of an included file. In the validation process, those elements will
                     be replaced with every element of that name found in the inclusion file,
                     checked recursively (see below), and ignoring duplicated elements.</para>
                  <para><code><link linkend="element-inclusion">&lt;inclusion></link></code>s are
                     critically important to the content of the TAN file, so any file with
                           <code><link linkend="element-inclusion">&lt;inclusion></link></code>s
                     that cannot be located will be regarded as being in fatal error. Because of the
                     importance of access to included files, it is strongly recommended that
                     inclusions be limited to files locally available, in the same project.</para>
                  <para>Inclusions are recursive. If a TAN file A has <code>&lt;x
                        include='B'></code> and file B has <code>&lt;x include='C D E'></code> then
                     the validator for file A will replace the element with all <code>&lt;x></code>s
                     found in B, C, D, and E. </para>
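                  <para>The recursion can be pictured with the following schematic fragments, which
                     reuse the placeholder element <code>&lt;x></code> from the paragraph above. The
                     use of the letters as <code>@xml:id</code> values, and the ellipses standing in
                     for the content of each <code>&lt;inclusion></code>, are assumptions made only
                     for illustration.</para>
                  <para>
                     <programlisting>&lt;!-- file A -->
&lt;head>
   &lt;inclusion xml:id="B">...&lt;/inclusion>
   ...
   &lt;x include="B"/>
&lt;/head>

&lt;!-- file B -->
&lt;head>
   &lt;inclusion xml:id="C">...&lt;/inclusion>
   &lt;inclusion xml:id="D">...&lt;/inclusion>
   &lt;inclusion xml:id="E">...&lt;/inclusion>
   ...
   &lt;x include="C D E"/>
   &lt;x>an x defined directly in B&lt;/x>
&lt;/head></programlisting>
                  </para>
                  <para>When file A is validated, its single <code>&lt;x include="B"/></code> is
                     replaced by the <code>&lt;x></code> defined directly in B together with every
                        <code>&lt;x></code> found in C, D, and E, duplicates being ignored.</para>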
                  <para>In any recursive activity, circularity is fatal. That is true for TAN
                     inclusion as well, but only within the domain of a given element name. It is
                     perfectly legal for two files to include each other, as long as they do not try
                     to include elements of the same name. </para>
                  <para>TAN inclusion removes elements from their original context, which means that
                     values that must be interpreted locally are converted before the elements are
                     included. For example, <link linkend="attribute-which"
                        ><code>@which</code></link> must be interpreted in light of the included
                     document's keys, not those of the including document. Similarly, different
                     numeration systems, e.g., Roman numerals, must be interpreted locally and
                     converted, before inclusion (see <xref linkend="reference_system"/>).</para>
               </section>
            </section>
            <section xml:id="source_and_see-also">
               <title>Distinguishing <code><link linkend="element-source">&lt;source></link></code>s
                  and <code><link linkend="element-see-also">&lt;see-also></link></code>s</title>
               <para>Creating and editing a class 1 TAN file frequently involves working with
                  non-TAN digital files. In the course of editing, and making the material
                  TAN-compatible, you will likely start to correct errors, to normalize conventions,
                  or to bring the transcription closer to an earlier version. At such times it may be
                  unclear how to credit the digital files.</para>
               <para>To answer this, first determine a class 1 file's <code><link
                        linkend="element-source">&lt;source></link></code>s. Everything else is then
                  a <code><link linkend="element-see-also">&lt;see-also></link></code>. </para>
               <para>If you find that you are changing the material to go back to the source of your
                  source, then that earlier version should be the <code><link
                        linkend="element-source">&lt;source></link></code> and the file you were
                  using should be credited under a <code><link linkend="element-see-also"
                        >&lt;see-also></link></code>. But beware, lest using a particular source
                  (such as the TLG) puts you in violation of contract law (see <xref
                     linkend="license"/>).</para>
            </section>
            <section xml:id="inheritable_attributes">
               <title>Interpretation of inheritable attributes</title>
               <para>Some attributes are inheritable attributes, in that they affect not only the
                  host element but all descendants as well. Some inheritable attributes in
                  co-occurrence fall into an interpretive sequence. That is, in any given element,
                  some attributes must be interpreted before others.</para>
               <para><code><link linkend="attribute-claimant">@claimant</link></code> falls first in
                  the sequence, and <code><link linkend="attribute-cert">@cert</link></code> second.
                  Each attribute qualifies the data governed by the element it modifies. Put
                  another way, the two attributes are to be interpreted to mean: "<code><link
                        linkend="attribute-claimant">@claimant</link></code> has <code><link
                        linkend="attribute-cert">@cert</link></code> confidence about the following
                  data:...."</para>
               <para>Suppose you are encoding claims made by someone else, and you are not certain
                  whether you are faithfully representing their point of view. In those cases, your
                  doubt should be registered in a <code><link linkend="attribute-claimant"
                     >@claimant</link></code> and <code><link linkend="attribute-cert"
                     >@cert</link></code> on an element that is a parent to the secondary claim you
                  are representing.</para>
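               <para>A schematic sketch of such nesting follows. The element <code>&lt;x></code> is
                  a placeholder, the agent identifiers <code>ed1</code> and <code>smith</code> are
                  hypothetical, and the decimal values given to <code><link
                        linkend="attribute-cert">@cert</link></code> are an assumption about its
                  format that should be checked against the attribute's documentation.</para>
               <para>
                  <programlisting>&lt;!-- ed1 is only half confident that what follows represents smith's view -->
&lt;x claimant="ed1" cert="0.5">
   &lt;!-- smith has 0.9 confidence about the following data -->
   &lt;x claimant="smith" cert="0.9">...&lt;/x>
&lt;/x></programlisting>
               </para>
               <para>The outer pair of attributes is interpreted first and governs everything
                  inside it, including the inner claim.</para>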
               <para>If <code><link linkend="attribute-claimant">@claimant</link></code> is missing,
                  it is to be assumed that the assertion is being made by the key <code><link
                        linkend="element-agent">&lt;agent&gt;</link></code> (see <xref
                     linkend="iri_name"/>).</para>
               <para>If <code><link linkend="attribute-cert">@cert</link></code> is missing, it is
                  to be assumed that the data is asserted with full confidence.</para>
            </section>
            <section xml:id="defining_tokens">
               <title>Defining Words and Tokens</title>
               <para>At the heart of interaction between class 1 and class 2 files is a reference
                  system that counts or names words. This poses a problem at the outset. The term
                     <emphasis role="italic">word</emphasis> is notoriously difficult to define, no
                  matter the language. In different contexts, for example, "New York" and "didn't"
                  can each be justifiably defined as one or two words. Furthermore, some scholars
                  consider punctuation to be words (e.g., commas in modern prose, representing
                  "and"), whereas others ignore them as being anachronistic or capricious (e.g.,
                  ancient Greek and Latin). In the end, the number of meanings for "word" reflects
                  the rich variety of scholarly disciplines.</para>
               <para>TAN adopts the proximate term <emphasis role="italic">token</emphasis>—a word
                  that is defined not linguistically but computationally, according to a regular
                  expression (see <xref linkend="regular_expressions"/>). </para>
               <para>A TAN token is a reference pointer, not a linguistic marker. To define a token
                  in TAN does not entail any linguistic commitments. Neither editors nor users of
                  TAN data should infer that a <link linkend="element-tok"
                     ><code>&lt;tok></code></link> points to a morpheme, a lexeme, or any other
                  linguistic entity. There will frequently be a fortuitous correlation between the
                  two, but it is not guaranteed. In TAN, a token is purely a method of
                  reference.</para>
               <para>TAN requires all class 2 files that handle tokens to define them, either
                  implicitly through TAN defaults, or explicitly by using <link
                     linkend="element-token-definition"><code>&lt;token-definition></code></link>.
                  TAN was developed in service of ancient literature, where punctuation is
                  anomalous, or of little use. Furthermore, even in contemporary use, most people
                  ignore punctuation when they count words. Therefore the default <link
                     linkend="element-token-definition"><code>&lt;token-definition></code></link>
                  defines a token as being any continuous string of word characters, the soft
                  hyphen, the zero-width space, or the zero-width joiner, formally defined:</para>
               <para>
                  <programlisting>&lt;token-definition regex="[\w&amp;#xad;&amp;#x200b;&amp;#x200d;]+"/></programlisting>
               </para>
               <para>This pattern will result in a close resemblance to what is ordinarily thought
                  of as words, but perhaps with some surprises (see above, <xref
                     linkend="regular_expressions"/>). If no <link
                     linkend="element-token-definition"><code>&lt;token-definition></code></link> is
                  invoked for a particular source, the pattern above will be assumed. It may also be
                  explicitly called through <link linkend="attribute-which"
                     ><code>@which</code></link> (see <xref linkend="keywords-token-definition"
                  />).</para>
               <para>If you are working with modern texts, where punctuation might be important to
                  name and number, try the built-in keyword <code>general</code> (or <code>letters
                     and punctuation</code>):</para>
               <para>
                  <programlisting>&lt;token-definition regex="\w+|[^\w\s]"/></programlisting>
               </para>
               <para>This expression defines a token as a sequence of word characters or any single
                  character that is neither a word character nor a space. The string "<code>(I go!)</code>"
                  (the text inside the quotation marks) would have five tokens: <code>( I go !
                     )</code>.</para>
               <para>Above are the two built-in, TAN-defined <link
                     linkend="element-token-definition"><code>&lt;token-definition></code></link>s.
                  You may customize your own <link linkend="element-token-definition"
                        ><code>&lt;token-definition></code></link> to suit your needs. But keep in
                  mind that TAN files were meant to be shared across fields and disciplines. You are
                  encouraged to define tokens in a manner customary to users of the text.
                  Specialized definitions make it less likely that your TAN file will be able to
                  mesh well with other TAN files. Two class-2 files annotating the same class-1 file
                  cannot be easily compared or synthesized if they use different definitions of
                  token.</para>
               <para>Given those caveats, consider a specialized case, where you wish to prepare
                  your transcriptions such that certain Unicode characters precisely delimit tokens
                  that are synonymous with a particular linguistic category, say lexeme. Say, for
                  example, you use specialized control characters (e.g., U+200C ZERO WIDTH
                  NON-JOINER and U+200D ZERO WIDTH JOINER) to mark word boundaries within the text
                  of your class 1 file. You might then create a <link
                     linkend="element-token-definition"><code>&lt;token-definition></code></link>
                  like this:</para>
               <para>
                  <programlisting>&lt;token-definition regex="[^\p{Cf}\s]+"/></programlisting>
               </para>
               <para>The statement defines a token as any consecutive sequence of characters that
                  are neither white space nor format control characters (Unicode category
                  Cf).</para>
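               <para>To see the effect, consider the invented sample text below, which assumes a
                  transcription in which U+200C has been inserted to mark a boundary inside a
                  written word. The text is hypothetical and is given here as a character reference
                  so that the control character is visible.</para>
               <para>
                  <programlisting>can&amp;#x200C;not go</programlisting>
               </para>
               <para>Under the customized definition, this text would resolve into three tokens,
                     <code>can</code>, <code>not</code>, and <code>go</code>, because both the space
                  and U+200C (a format character) fall outside the pattern and so separate
                  tokens.</para>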
               <para>Such customized approaches may make the technique unwieldy or impossible to
                  use, thereby limiting your TAN file's interoperability and utility. If you use
                  format control characters or other invisible special characters, it is
                  recommended that you write them as XML character references, e.g.,
                     <code>&amp;#x200D;</code>, so they can be seen in your file.</para>
            </section>
         </section>
      </chapter>
      <chapter xml:id="class_1">
         <title>Class-1 TAN Files, Representations of Textual Objects (Scripta)</title>
         <para>This chapter provides general background to the elements and attributes that are
            common to all class 1 TAN files. For detailed discussion see <xref
               linkend="elements-attributes-and-patterns"/>.</para>
         <para>Class 1 TAN files preserve segmented transcriptions of books, manuscripts, papyri,
            stones, or any other objects with writing on them—collectively termed here
               <emphasis>scripta</emphasis> (sg. <emphasis>scriptum</emphasis>). Files of this class
            are the foundation of any project. No class 2 files (e.g., alignment, morphology) can be
            created without class 1 files. </para>
         <para>Transcriptions come in two different formats, identified by the root element.
                  <code><link linkend="element-TAN-T">&lt;TAN-T></link></code> is a simple, generic
            format, as close as one can get to plain text. <code>&lt;TEI></code> (also referred to
            in this manual as TAN-TEI), on the other hand, can be complex and highly expressive.
            Because the two types function almost identically, the generic TAN-T format is described
            first, followed by supplemental comments on TAN-TEI.</para>
         <section xml:id="transcription_principles">
            <title>Principles and Assumptions</title>
            <section>
               <title>General</title>
               <para>(For more general principles and assumptions applying to all TAN files, not
                  just class 1, see <xref linkend="design_principles"/>.)</para>
               <para>Class 1 formats are designed for faithful but judiciously edited digital
                  transcriptions. Each TAN-T(EI) file is devoted exclusively to a single version of
                  a single work found in a single scriptum (text-bearing object), segmented and
                  uniquely labeled with a common reference system. Editors of TAN-T(EI) files should
                  be able to read, write, and proofread texts in the languages of the
                  transcriptions. They should understand the texts well enough to segment them and
                  label them according to the conventions used for those works. They should be able
                  to distinguish the primary source from its editorial apparatus. They should be
                  familiar with normalizing conventions for texts from the period, language, and
                  culture. They should know how users of the transcription might use it in other
                  contexts, especially translation studies or a study of quotations.</para>
               <para>Editors need not understand everything about their texts, and they need not
                  have any specialized skill in grammar or lexicology. They need not know the
                  morphology of individual words, or how individual parts of the text have been
                  translated. Those skills are better used in other TAN formats. </para>
               <para>TAN-T(EI) editors stand at the beginning of a larger workflow for text
                  alignment. It is critical that work not be published hastily, but only after
                  careful proofreading, especially of white space. Many transcriptions, especially
                  those of long texts, have typographical errors. Eliminating as many as possible
                  before publication will maximize the utility of a TAN-T(EI) file. On the other
                  hand, TAN has been designed with the assumption that all our files have
                  typographical errors that we need to correct as they are found.</para>
               <para>If you are creating a TAN-T(EI) file, you are doing so primarily to service
                  text alignment. To align is to correlate texts that are similar because of
                  copying, translating, paraphrasing, revising, quoting, summarizing, and so forth.
                  In all these processes, one or more texts, usually called the
                     <emphasis>source</emphasis> (or <emphasis>sources</emphasis>), serves as the
                  basis for a new text, oftentimes called the <emphasis>target</emphasis>. In many
                  cases, the target and source bear little resemblance to each other. Therefore the
                  best transcription files are those whose structures look to an archetype, not a
                  particular version. Editors of TAN transcriptions should not worry about
                  preserving the appearance of their source (i.e., the transcription should not be a diplomatic
                  edition), and they should structure the text, when possible, by the most familiar
                  reference system for that work. If possible, semantic mileposts (clauses,
                  sentences, paragraphs, chapters) should be prioritized over visual (lines,
                  columns, pages, volumes). See below on <link linkend="reference_system">reference
                     systems</link>.</para>
            </section>
            <section xml:id="domain_model">
               <title>Domain model</title>
               <para>Contributors to and users of TAN files must assume a firm distinction between a
                  scriptum (text-bearing object) and a conceptual work, e.g., a specific printed
                  copy of the <emphasis>Iliad</emphasis> versus the <emphasis>Iliad</emphasis>
                  conceived generally. The former has materiality (digital files are treated as
                  having materiality) and the latter does not. Even though both are constitutively
                  necessary for any transcription, the two are sharply differentiated in the TAN
                  format: <code><link linkend="element-source">&lt;source&gt;</link></code> and
                        <code><link linkend="attribute-src">@src</link></code> point to physical
                  exemplars; <code><link linkend="element-work">&lt;work&gt;</link></code> and
                        <code><link linkend="attribute-work">@work</link></code> to the conceptual. </para>
               <para>The distinction may remind some readers of the domain model defined by the
                  Functional Requirements for Bibliographic Records (FRBR), which identifies four
                  types of entities for what it calls Group 1 (Products of intellectual &amp;
                  artistic endeavor): <emphasis>Work</emphasis>, <emphasis>Expression</emphasis>,
                     <emphasis>Manifestation</emphasis>, and <emphasis>Item</emphasis>, the first
                  pair being conceptual, non-material entities and the latter pair material ones. </para>
               <para>TAN has been designed with a slightly different domain model in mind. FRBR
                  Items are equivalent to what TAN calls <emphasis>scripta</emphasis>. Multiple
                  scripta that for all intents and purposes are indistinguishable (i.e., items
                  reproduced mechanically) are equivalent to FRBR Manifestations, but in TAN no
                  corresponding entity has been defined. It is best to think of TAN scripta as being
                  equivalent to FRBR Items, with FRBR Manifestations being sets of indistinguishable
                  TAN scripta. </para>
               <para>As for conceptual entities, TAN has been designed with the assumption that most
                  users will find the distinction between Works and Expressions to be unhelpful or
                  false. What one person calls a FRBR Expression another may legitimately call a
                  Work (e.g., the King James Version is more than just a translation of the Bible).
                  TAN assumes that any derivation of a Work (or Works) is itself a Work, which is
                  really shorthand for <emphasis role="italic">work-version</emphasis>. Thus, in
                  this manual the term <emphasis>version</emphasis> indicates merely a type of work
                  that is known either to derive from another work or to be the basis for other
                  versions of a work. </para>
               <para>TAN avoids altogether the term <emphasis>Expression</emphasis>. Aside from the
                  issues mentioned above, the term implies a medium (without which nothing can be
                  expressed) and therefore materiality. </para>
            </section>
            <section>
               <title>One version, one work, one object, one reference system</title>
               <para><emphasis>Every TAN-T(EI) file must be restricted to a transcription of a
                     single version of a single conceptual work found on a single scriptum,
                     segmented and labeled according to a single reference system</emphasis>. </para>
               <para>This restrictive principle is critical to the success of the network. It
                  reduces the risk of confusion, simplifies the files, and shifts markup complexity
                  from an individual transcription file to the network in which that file
                  participates.</para>
               <section xml:id="textual_objects">
                  <title>One scriptum</title>
                  <para>Each TAN-T(EI) file transcribes one and only one text-bearing object or
                     scriptum. It may be a digital file, a book, a manuscript, a stone, a sign, or a
                     bottlecap. If the object you've chosen has been made mechanically and is
                     virtually indistinguishable from other objects created in the same process
                     (e.g., copies of a printed book or copies of a digital file), then the entire
                     set of copies is to be treated as a single object (an entity some librarians
                     call a manifestation). </para>
                  <para>The definition of some scripta requires an editor's discernment and judgment.
                     For example, some manuscripts have been split up, their parts now residing in
                     multiple libraries around the world; other manuscripts have been physically
                     altered. In such cases, you may need to define your scriptum in a way that
                     might not match the way others define it. But the decision is your prerogative,
                     not theirs. You have both the right and responsibility to define your object in
                     the way that you think will most benefit users of your files.</para>
                  <para>It is a good idea to name your scriptum in <code><link
                           linkend="element-source">&lt;source></link></code> with an <code><link
                           linkend="element-IRI">&lt;IRI></link></code> value in the form of an
                        <code>http</code> URL provided by a library catalogue. This way you provide
                     a way for others, perhaps through an algorithm, to retrieve extensive,
                     structured bibliographical information. You also save yourself the hassle of
                     writing a detailed bibliographical description that your users would have to
                     tailor to suit their distinctive purposes. If a URL cannot be found for
                           <code><link linkend="element-IRI">&lt;IRI></link></code>, you may simply
                     coin a tag URN or a UUID. Alternatively, if you find another TAN file that uses
                     the same source, it would be a good idea to adopt that name.</para>
               </section>
               <section xml:id="conceptual_works">
                  <title>One work</title>
                  <para>The transcription must be restricted to a single creative work, identified
                     by <code><link linkend="element-work">&lt;work></link></code>. </para>
                  <para>Many scripta have more than one work. Identifying and defining the creative
                     work you transcribe is, once again, your prerogative. Suppose the scriptum you
                     have is a Bible. The work you choose from that object can take whatever
                     contours you wish. Perhaps you wish to encode the entire Bible and treat it as
                     a single work. Or maybe you wish to treat only the New Testament as the work,
                     or the Tetraevangelion, or the Gospel of Matthew, or a specific episode in that
                     gospel, or simply the Beatitudes. Any reasonable definition of a work is
                     permitted, but a TAN-T(EI) file must contain nothing but the work you have
                     defined. It should be a complete representation of what is found on the object
                     (even if only partially preserved), and respect as far as is practical the
                     order found in the scriptum.</para>
                  <para>Well-known works may have a suitable IRI name already assigned to them, say
                     by means of a <link xlink:href="http://wiki.dbpedia.org/About">DBpedia</link>
                     entry. Most works have not been assigned IRIs or are named in IRI vocabularies
                     that are not well known. You may assign any work your own URN, through a UUID
                     or a tag URN. Any IRIs that you mint are free to be used by other people
                     writing TAN files about the same work. Similarly, if you find that another
                     TAN-T file has transcribed a version of your work, you may also use that URN
                     (you don't need to ask permission, since no URN can be copyrighted). As with
                     other parts of the metadata, multiple <code><link linkend="element-IRI"
                           >&lt;IRI></link></code>s and <code><link linkend="element-name"
                           >&lt;name></link></code>s are names for the same work, not individual
                     names for different works. </para>
               </section>
               <section xml:id="work-versions">
                  <title>One version</title>
                  <para>The transcription must be restricted to a single version of the creative
                     work, identified by <code><link linkend="element-version"
                        >&lt;version></link></code> (optional). In most cases, <code><link
                           linkend="element-version">&lt;version></link></code> is unnecessary,
                     because <code><link linkend="element-work">&lt;work></link></code> in
                     conjunction with <code><link linkend="element-source">&lt;source></link></code>
                     is sufficient to identify a particular work-version. But if the source carries
                     multiple versions (e.g., a bilingual edition of a text), then <code><link
                           linkend="element-version">&lt;version></link></code> must be
                     included.</para>
                  <para>If you wish to include other versions from a source, each one should have
                     its own separate TAN-T(EI) file. </para>
                  <para>Notes should be included only if they are an integral part of the primary
                     work (i.e., by the same author). Otherwise, you should ask yourself whether the
                     notes are of any real interest. If they are not, ignore them. If they are
                     important, put them in their own TAN-T(EI) file, or convert them to claims in a
                     TAN-A-div file.</para>
                  <para>If you need to specify exactly where on a scriptum a version appears,
                           <code><link linkend="element-desc">&lt;desc></link></code> or <code><link
                           linkend="element-comment">&lt;comment></link></code> should be
                     used.</para>
                  <para>Very few work-versions have their own URN names. It is advisable to assign a
                     tag URN or a UUID. If the IRI you have used for <code><link
                           linkend="element-work">&lt;work></link></code> is in a namespace that you
                     own or control, then you are entitled to modify it, and you may wish merely to
                     add a suffix to the work IRI to name the version. </para>
               </section>
               <section xml:id="reference_system">
                  <title>One reference system</title>
                  <para>Every TAN transcription must be segmented into a hierarchy of uniquely
                     labeled divisions, defined in the <code><link linkend="element-body"
                           >&lt;body></link></code> through <code><link linkend="element-div"
                           >&lt;div></link></code>s and their <code><link linkend="attribute-type"
                           >@type</link></code> and <code><link linkend="attribute-n"
                        >@n</link></code> values. </para>
                  <para>Those divisions, whenever possible, should align with the reference system
                     that prevails for the work across versions or translations, what is sometimes
                     called a canonical reference system. Because even the most familiar reference
                     system admits of degrees and dispute, the term <emphasis>canonical</emphasis> is
                     problematic, so <emphasis role="italic">reference system</emphasis> is
                     preferred in these guidelines. </para>
                  <para>If you have your choice, preference should be given to systems that follow
                     the semantic contours of the work, not the physical features of a particular
                     object. Chapter, paragraph, and sentence numbers are preferable to volume,
                     page, and line numbers, because other derivative versions of a work (e.g.,
                     translations, paraphrases) will only roughly, if at all, follow an
                     object-oriented reference system. </para>
                  <para>Sometimes an object-based reference system is inescapable, or is the most
                     common reference system for a work (e.g., Porphyry's commentary on the
                        <emphasis>Categories</emphasis>). It is perfectly acceptable to adopt that
                     scheme, but it may eventually entail more labor for the alignment process. </para>
                  <para>If a given work has multiple systems (e.g., the works of Plato and
                     Aristotle, which have two reference systems—semantic- and object-oriented—both
                     of which are standard and important), then the recommended practice is to
                     encode the same text twice, placing in each file a <code><link
                           linkend="element-see-also">&lt;see-also></link></code> pointing to the
                     other and a <code><link linkend="element-relationship"
                        >&lt;relationship></link></code> with the keyword <code>alternatively
                        divided edition</code> as the value of <link linkend="attribute-which"
                           ><code>@which</code></link>. A pair of alternatively divided editions can
                     usefully serve as the basis for concordances. In fact, the pair can be used as
                     the first step in converting another version of the same work from one
                     reference system to the other.</para>
                  <para>If there is a good reference system, but the divisions are overly lengthy,
                     you may introduce subdivisions. Such subdivided texts are compatible with
                     references to the older system. But there is no guarantee that the provisional
                     subdivisions you introduce will be adopted by other editors who create or edit
                     TAN versions of the same work, and in the end editors working independently
                     upon the same text may produce discordant schemes. The TAN-A-div format was
                     designed to reconcile such differences.</para>
                  <para>If there is no reference system, or if you think that the ones that exist
                     are inadequate or misguided, create one of your own. If you develop your own
                     reference system, be sure to optimize for all versions of the work, whether
                     known or not. </para>
                  <para>In the <code><link linkend="element-declarations"
                        >&lt;declarations></link></code>, at least one <code><link
                           linkend="element-div-type">&lt;div-type></link></code> must be supplied,
                     declaring the types of divisions into which the text has been segmented, to be
                     referred to by <code><link linkend="attribute-type">@type</link></code> in
                           <code><link linkend="element-div">&lt;div></link></code>s. To declare a
                           <code><link linkend="element-div-type">&lt;div-type></link></code> does
                     not require you to use it in the transcription. It is advisable to keep the
                     abbreviation coined in <code><link linkend="attribute-xmlid"
                        >@xml:id</link></code> brief but meaningful. </para>
                  <para>Well-known division types already have suitable IRI names. See <xref
                        linkend="keywords-div-type"/> for a list of core TAN vocabulary for division
                     types, both common and uncommon. If you encounter a rare division type, or one
                     that needs specificity not provided for in a well-known URN, you should mint
                     your own, either in the declarations or in a separate TAN-key file.</para>
                  <para>Numeration systems are a central component of reference systems. TAN
                     supports five numeration systems:<orderedlist>
                        <listitem>
                           <para><emphasis role="bold">Arabic numerals</emphasis>. 1, 2, 3,
                              etc.</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Roman numerals</emphasis>. Values up to 5000,
                              utilizing i, v, x, l, c, d, and m, uppercase or lowercase, with
                              liberal syntactic rules (within a roman numeral, any digit preceding
                              one of a higher value is assumed to be a subtraction from the total
                              value; all others are positive values).</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Alphabetic sequences</emphasis>. The
                              26-letter Roman alphabet, with numbers higher than 26 (or any multiple
                              of 26) beginning with the letter a incrementally repeated, e.g., y
                              (25), z (26), aa (27), bb (28), … aaa (53). Uppercase or lowercase
                              allowed.</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Arabic numerals + alphabetic
                                 sequences</emphasis>. Arabic numerals followed immediately by an
                              alphabetic sequence. The second item is to be calculated as a
                              subsequence of the first item, with the lack of a second item taking
                              highest priority. E.g., 4, 4a, 4b, 4c....</para>
                        </listitem>
                        <listitem>
                           <para><emphasis role="bold">Alphabetic sequences + Arabic
                                 numerals</emphasis>. As above, but with the alphabetic sequence
                              preceding the Arabic numerals.</para>
                        </listitem>
                     </orderedlist></para>
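                  <para>The conversions described in the list above can be sketched in Python. This is
                     only an illustration, not the TAN validation code: the function names are
                     hypothetical, the Roman numeral rule is read as applying to the digit that
                     immediately follows, and the limit of 5000 is not enforced.</para>
                  <programlisting language="python"><![CDATA[
```python
import re

ROMAN_DIGITS = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(label):
    """Liberal reading: a digit directly before a larger one is subtracted."""
    values = [ROMAN_DIGITS[ch] for ch in label.lower()]
    total = 0
    for pos, value in enumerate(values):
        following = values[pos + 1] if pos + 1 < len(values) else 0
        total += -value if value < following else value
    return total


def alphabetic_to_int(label):
    """a..z count as 1..26; each extra repetition of the letter adds 26."""
    text = label.lower()
    if not re.fullmatch(r"([a-z])\1*", text):
        raise ValueError("not an alphabetic sequence: %r" % label)
    return (ord(text[0]) - ord("a") + 1) + 26 * (len(text) - 1)


def arabic_alpha_key(label):
    """Sort key for labels such as 4, 4a, 4b; a bare numeral sorts first."""
    match = re.fullmatch(r"(\d+)([a-z]*)", label.lower())
    if match is None:
        raise ValueError("not an Arabic + alphabetic label: %r" % label)
    number, suffix = match.groups()
    return (int(number), alphabetic_to_int(suffix) if suffix else 0)
```
]]></programlisting>
                  <para>Under these assumptions, <code>roman_to_int("xiv")</code> gives 14,
                        <code>alphabetic_to_int("aaa")</code> gives 53, and sorting <code>4b</code>,
                        <code>4</code>, <code>4a</code> with <code>arabic_alpha_key</code> yields 4, 4a,
                     4b.</para>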
                  <para>TAN file processors will attempt to convert all values of <code><link
                           linkend="attribute-n">@n</link></code> to Arabic numerals. Some values
                     are ambiguously Roman numerals or alphabetic sequences, e.g., <code>c</code> (=
                     3 or 100), so this conversion takes place within the context of a single
                     document, without reference to any associated files. You may not mix Roman
                     numerals and alphabetic sequences in the same div type. You should also avoid
                     any string labels that would be misinterpreted as a Roman numeral. For example,
                     if you are labeling a book whose title is "Civilizations," you should not use
                        <code>n="Civ"</code>, since all values of <code><link linkend="attribute-n"
                           >@n</link></code> are treated as lowercase, and <code>civ</code> would be
                     read as the Roman numeral 104. </para>
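                  <para>A quick way to spot troublesome labels before validation is to list every
                     numeral reading a value could take. The following Python sketch is a
                     hypothetical helper, not part of the TAN function library, and it assumes the
                     liberal Roman numeral rule and the repeated-letter alphabetic scheme described
                     above.</para>
                  <programlisting language="python"><![CDATA[
```python
ROMAN_DIGITS = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def readings(label):
    """Return every numeral reading that a lowercased @n value could take."""
    text = label.lower()
    found = {}
    if text and all(ch in ROMAN_DIGITS for ch in text):
        values = [ROMAN_DIGITS[ch] for ch in text]
        total = 0
        for pos, value in enumerate(values):
            following = values[pos + 1] if pos + 1 < len(values) else 0
            # A digit directly before a larger digit is subtracted.
            total += -value if value < following else value
        found["roman"] = total
    if text and len(set(text)) == 1 and "a" <= text[0] <= "z":
        # One letter, possibly repeated: a=1 ... z=26, aa=27, and so on.
        found["alphabetic"] = (ord(text[0]) - ord("a") + 1) + 26 * (len(text) - 1)
    return found
```
]]></programlisting>
                  <para>A value with two readings, such as <code>c</code>, is ambiguous. A value meant
                     as a plain string label but returning a Roman reading, such as <code>Civ</code>,
                     is one to rename.</para>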
                  <para>There are also tools for other numeration systems, but they have not been
                     implemented in the validation process. See <code><link
                           linkend="function-arabic-numerals">tan:arabic-numerals</link></code>(),
                           <code><link linkend="function-grc-to-int">tan:grc-to-int</link></code>(),
                     and <code><link linkend="function-syc-to-int"
                     >tan:syc-to-int</link></code>().</para>
               </section>
            </section>
            <section xml:id="normalizing_transcriptions">
               <title>Normalizing transcriptions</title>
               <para>You should declare how you have normalized the transcription via <code><link
                        linkend="element-filter">&lt;filter></link></code> and its children,
                        <code><link linkend="element-normalization"
                  >&lt;normalization></link></code>, <code><link linkend="element-transliteration"
                        >&lt;transliteration></link></code>, and <code><link
                        linkend="element-replace">&lt;replace></link></code>. (For suggestions on
                  values of <code><link linkend="element-IRI">&lt;IRI></link></code> for <code><link
                        linkend="element-normalization">&lt;normalization></link></code> see <xref
                     linkend="keywords-normalization"/>.)</para>
               <para>Generally speaking, normalization entails the suppression of things extraneous
                  to or separable from the work you have chosen. You are encouraged to omit
                  parenthetical editorial insertions, stray handwritten remarks, discretionary
                  word-breaking hyphens, editorial comments, inserted cross-references, and
                  reference numerals (page numbers, section numbers, etc.). The goal is a
                  transcription whose text is free of the interpretive voice of later editors. In
                  addition, you should resolve ligatures and correct unintended typographical
                  errors. (Such orthographic corrections are useful to those users who want to
                  generate lexico-morphological data automatically or semiautomatically.)</para>
               <para>In a digital source, variable lengths of spacing marks (e.g., General
                  Punctuation U+2000..U+200B) should be converted to ordinary spaces, and
                  superscript combining Roman letters (U+0363..U+036F) should probably be converted
                  to their non-combining counterparts. All Unicode must be normalized to NFC forms
                  (see <xref linkend="normalization"/>). </para>
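               <para>The character-level steps just described might be scripted as follows. This is a
                  minimal Python sketch using only the standard library; the function name
                     <code>clean_text</code> is hypothetical, and whether the superscript combining
                  letters should be converted at all remains an editorial decision.</para>
               <programlisting language="python"><![CDATA[
```python
import re
import unicodedata

# U+2000..U+200B, the range of spacing marks named in the guidelines.
SPACING_MARKS = re.compile("[\u2000-\u200b]")

# U+0363..U+036F are the combining Latin small letters a e i o u c d h m r t v x.
COMBINING_LETTERS = dict(zip(range(0x0363, 0x0370), "aeioucdhmrtvx"))


def clean_text(text):
    """Apply the spacing, combining-letter, and NFC steps in that order."""
    text = SPACING_MARKS.sub(" ", text)
    text = text.translate(COMBINING_LETTERS)
    return unicodedata.normalize("NFC", text)
```
]]></programlisting>
               <para>For instance, <code>clean_text</code> turns an em space (U+2003) into an ordinary
                  space, and a base letter followed by a combining acute accent into the single
                  precomposed character.</para>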
               <para>Keep in mind that your transcriptions will be used by other people doing, e.g.,
                  word-for-word translation alignments, quotation checking, syntactical analysis,
                  and they will want transcriptions that are as clean as possible. You should remove
                  from the text anything that is not part of the work proper and would interfere
                  with detailed word-for-word alignment, or would require extra preprocessing or
                  postprocessing work for later users. If you are segmenting a source by lines,
                  and you are required to break a word between divisions, you should use
                  either the soft hyphen (<code>&amp;#xad;</code>) or the zero-width joiner
                     (<code>&amp;#x200d;</code>) at the end of the first <code><link
                        linkend="element-div">&lt;div></link></code>. TAN processors that handle a
                        <code><link linkend="element-div">&lt;div></link></code> will automatically
                  normalize the space in the element, then place a space between that <code><link
                        linkend="element-div">&lt;div></link></code> and the next, unless one of
                  those two characters is present, in which case the character will be deleted and
                  the two <code><link linkend="element-div">&lt;div></link></code>s will be joined
                  with no intervening space. For more on issues regarding whitespace, see <xref
                     linkend="whitespace"/>.</para>
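               <para>The joining behavior described above can be illustrated with a short Python
                  sketch. It is not the TAN processor itself: the function name
                     <code>join_leaf_divs</code> is hypothetical, and the input is assumed to be the
                  text of consecutive leaf divs in document order.</para>
               <programlisting language="python"><![CDATA[
```python
SOFT_HYPHEN = "\u00ad"
ZERO_WIDTH_JOINER = "\u200d"


def join_leaf_divs(texts):
    """Join div texts with a space, or with nothing after a joining character."""
    result = ""
    for raw in texts:
        text = " ".join(raw.split())  # space-normalize the div's text
        if result.endswith((SOFT_HYPHEN, ZERO_WIDTH_JOINER)):
            # Delete the special character and join with no intervening space.
            result = result[:-1] + text
        elif result:
            result = result + " " + text
        else:
            result = text
    return result
```
]]></programlisting>
               <para>Given a first div ending in a soft hyphen, as in <code>begin&amp;#xad;</code>
                  followed by <code>ning</code>, the sketch returns the unbroken word
                     <code>beginning</code>; two ordinary divs are separated by a single space.</para>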
               <para>If you are working with a text with notes, distinguish those written by
                  the same person who wrote the work you're transcribing from those that are not.
                  Treat the former as part of the work proper and give each note a <code><link
                        linkend="element-div">&lt;div></link></code> with a suitable <code><link
                        linkend="attribute-type">@type</link></code> and place it after the
                        <code><link linkend="element-div">&lt;div></link></code> it annotates. It
                  will be assumed by processors of the data that, absent more specific information,
                  any <code><link linkend="element-div">&lt;div></link></code> of an annotating
                        <code><link linkend="attribute-type">@type</link></code> is an annotation of
                  the last <code><link linkend="element-div">&lt;div></link></code> that is not an
                  annotation. (Alternatively, you may use the <code>&lt;note></code> feature of
                  TAN-TEI, but bear in mind that this element will be treated by users as part of
                  the leaf div to which it belongs, not separate from it.) </para>
               <para>If the notes are not part of the work per se—for example, translator's notes in
                  a translation of a primary source—you should treat them as a separate work
                  altogether, and put them in a separate TAN-T(EI) file, perhaps linking the two
                  through <code><link linkend="element-see-also">&lt;see-also></link></code>. You
                  may wish to structure that file so that it mirrors the reference system of the
                  primary source, in which case further alignment between the two is not needed. Or
                  you may wish to use a reference system that reflects how you would cite the note,
                  e.g., page and note number. In this latter case, you would then create a companion
                  TAN-A-div file that establishes links between the primary source and its
                  annotations.</para>
               <para>Remember that the note signals in the main text and in the footnote area are
                  metadata meant to help readers link corresponding passages of texts, and should be
                  deleted. If the connective function served by the note signal is important, use a
                  TAN-A-div file to link the notes to the main text.</para>
               <para>This principle holds true for transcribing texts that have variants to the work
                  integrated into the document. For example, a manuscript may have correctors'
                  marks. Or a set of footnotes (or apparatus criticus) might comment on how and why
                  the main text differs from previous readings. In those cases, each set of
                  corrections might be wholly incorporated into the <code><link
                        linkend="element-claim">&lt;claim></link></code>s of a TAN-A-div file,
                  perhaps also with a separate TAN-T file.</para>
               <para>Overall, normalization is a difficult topic, and it is not well studied. Not
                  all decisions will be clear-cut. You may justly hesitate before normalizing
                  orthography, punctuation, accentuation, or capitalization. Some aspects of Unicode
                  that lend themselves to varying conventions may need special consideration. You
                  may need to consider whether an unusual or rarely used Unicode character might be
                  misinterpreted, or a hindrance to other users (especially for parsing word
                  tokens). Describe any decisions that might not be agreeable to everyone who uses
                  the file in the <code><link linkend="element-filter">&lt;filter></link></code>. </para>
               <para>In some ambiguous areas, you can use TAN-TEI to your advantage. Suppose, for
                  example, a manuscript has reference numerals that are sui generis. That is, these
                  reference numbers do not correspond to the "canonical" reference scheme. On the
                  one hand, they are metadata, and should arguably be deleted; on the other, they
                  are part of the text, and witness to how a text was read and changed over time. A
                  middle-ground approach would move these references to TAN-TEI's
                     <code>&lt;milestone rend=""></code>. In that way, the numerals are removed from
                  the main text, yet the information is retained. Generally speaking,
                  TEI's <code>@rend</code> is an excellent way to remove something from the main
                  text without removing it from the file altogether.</para>
            </section>
         </section>
         <section xml:id="tan-t_data">
            <title>Transcriptions</title>
            <para>The sole purpose of the <code><link linkend="element-body">&lt;body></link></code>
               of a class 1 file is to contain a segmented transcription of a single version of a
               single work from a scriptum. <code><link linkend="element-body"
                  >&lt;body></link></code> may take <code><link linkend="attribute-in-progress"
                     >@in-progress</link></code> and must take <code><link
                     linkend="attribute-xmllang">@xml:lang</link></code>, whose value names the
               language of the majority of the text. If a change in language occurs in a descendant <code><link
                     linkend="element-div">&lt;div></link></code>, ensure that its <code><link
                     linkend="attribute-xmllang">@xml:lang</link></code> value (explicitly or by
               inheritance) indicates the language that is used.</para>
            <para><code><link linkend="element-body">&lt;body></link></code> takes one or more
                     <code><link linkend="element-div">&lt;div></link></code> elements, each of
               which governs either other <code><link linkend="element-div">&lt;div></link></code>
               elements or text (or TEI elements).</para>
            <para>The term <emphasis>leaf div</emphasis> refers to those <code><link
                     linkend="element-div">&lt;div></link></code>s that contain text and therefore
               no other <code><link linkend="element-div">&lt;div></link></code>s.</para>
            <para>Within this treelike structure of <code><link linkend="element-div"
                     >&lt;div></link></code>s, the concatenation of <code><link
                     linkend="attribute-n">@n</link></code> values, starting from the most ancestral
                     <code><link linkend="element-div">&lt;div></link></code>, provides the
                  <emphasis>flat ref</emphasis>, the reference system used by class 2 files to refer
               to parts of TAN-T(EI) files. </para>

            <section xml:id="leaf_div_uniqueness_rule">
               <title>Flattened References, and the Leaf Div Uniqueness Rule</title>
               <para>One of the most important validation rules is the <emphasis>Leaf Div Uniqueness
                     Rule</emphasis>, which states that the flat ref for each leaf <code><link
                        linkend="element-div">&lt;div></link></code> must be unique.</para>
               <para>This rule applies only to leaf <code><link linkend="element-div"
                        >&lt;div></link></code>s and not to <code><link linkend="element-div"
                        >&lt;div></link></code>s in general, since on occasion a major textual unit
                  will be broken by another. For example, chapters 24 and 30 in the book of Proverbs
                  of the Septuagint are split and interleaved (24.1–22e [22a–e are verses not extant
                  in the Hebrew]; 30.1–14; 24.23–34; and 30.15–33).</para>
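               <para>The following fragment is a minimal sketch of how such an interleaved structure
                  might be encoded. The <code><link linkend="attribute-type">@type</link></code>
                  values, the <code><link linkend="attribute-xmllang">@xml:lang</link></code> value,
                  and the placeholder comments are assumptions made for illustration; they are not
                  taken from an actual TAN-T file. The chapter <code><link linkend="element-div"
                        >&lt;div></link></code> with <code>n="24"</code> occurs twice, which is
                  permitted because it is not a leaf, whereas each leaf <code><link
                        linkend="element-div">&lt;div></link></code> resolves to a distinct
                  combination of <code><link linkend="attribute-n">@n</link></code> values (24 and
                  22, 30 and 1, 24 and 23).</para>
               <programlisting language="xml">&lt;body xml:lang="grc">
   &lt;div type="chapter" n="24">
      &lt;div type="verse" n="22">&lt;!-- text of 24.22 -->&lt;/div>
   &lt;/div>
   &lt;div type="chapter" n="30">
      &lt;div type="verse" n="1">&lt;!-- text of 30.1 -->&lt;/div>
   &lt;/div>
   &lt;div type="chapter" n="24">
      &lt;div type="verse" n="23">&lt;!-- text of 24.23 -->&lt;/div>
   &lt;/div>
&lt;/body></programlisting>
               <para>Had the third chapter <code><link linkend="element-div">&lt;div></link></code>
                  contained a second verse with <code>n="22"</code>, two leaf <code><link
                        linkend="element-div">&lt;div></link></code>s would share the same flat ref,
                  and the file would violate the rule.</para>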
            </section>
         </section>
         <section xml:id="tan-tei">
            <title>Transcriptions Using the Text Encoding Initiative (<code>&lt;TEI></code>)</title>
            <para>
               <note>
                  <para>This section is to be read in conjunction with <xref linkend="class_1"/> and
                        <xref linkend="TEI"/>, which address some technical issues that relate to
                     TAN-compliant TEI, to XML, and to validation generally.</para>
               </note>
            </para>
            <para>Some creators and editors of transcriptions will find the rather stripped-down
               TAN-T format inadequate. Some may wish to mark up the text further, or already have a
               library of transcriptions whose annotations are desirable to keep, even if some users
               may not be interested. To serve these needs, you should use TAN-TEI, an extension to
               the Text Encoding Initiative (TEI) format, which is well known for its expressiveness,
               its stability, its flexibility, and its widespread use in scholarship.</para>
            <para>TEI was designed to be maximally expressive and flexible, to serve the detailed
               needs of humanities scholars. In serving this mission, TEI has come to define more
               than five hundred different element names, and more than two hundred attributes
               (roughly six times more than are defined in TAN). Of course, any given TEI file uses
               only a small subset of those elements and attributes, and TEI itself comes in
               different flavors, from TEI Lite, which uses only 75 attributes and 140 elements, to
               TEI All, which opens up almost the entire library. </para>
            <para>Although the TEI format is often seen as a standard, it lacks some of the
               characteristics expected in a standard. It is highly flexible, admits flavors and
               interpretation, and has been designed to encourage customization. Individuals and
               projects may define their own subset of TEI elements, to constrict or expand the
               allowable rules as they see fit. TAN-TEI is one of those customizations. The major
               difference is that TAN-TEI attempts to impose extra strictures not defined in TEI, to
               ensure that transcriptions are maximally likely to be interchangeable with other TAN
               files.</para>
            <para>TAN's customization of the TEI can be summarized as follows (the default namespace
               in this section is the TEI namespace,
               <code>http://www.tei-c.org/ns/1.0</code>):</para>
            <para>
               <table frame="all">
                  <title>Synopsis of TAN-TEI customization</title>
                  <tgroup cols="2">
                     <colspec colname="c1" colnum="1" colwidth="1*"/>
                     <colspec colname="c3" colnum="2" colwidth="3.21*"/>
                     <thead>
                        <row>
                           <entry>TEI element</entry>
                           <entry>summary of alteration</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry><code>&lt;TEI></code></entry>
                           <entry>
                              <itemizedlist>
                                 <listitem>
                                    <para>must have <code><link linkend="attribute-id"
                                          >@id</link></code> with IRI name</para>
                                 </listitem>
                                 <listitem>
                                    <para>should take new namespace declaration,
                                          <code>xmlns:tan="tag:textalign.net,2015:ns"</code>
                                    </para>
                                 </listitem>
                                 <listitem>
                                    <para>takes a new child element, <code><link
                                             linkend="element-head">&lt;head></link></code>, placed
                                       between <code>&lt;teiHeader></code> and
                                          <code>&lt;text></code></para>
                                 </listitem>
                              </itemizedlist>
                           </entry>
                        </row>
                        <row>
                           <entry><code>&lt;text></code></entry>
                           <entry>
                              <itemizedlist>
                                 <listitem>
                                    <para>Only the child <code><link linkend="element-body"
                                             >&lt;body></link></code> will be regarded by other TAN
                                       users. <code>&lt;front></code> and <code>&lt;back></code>
                                       will be ignored.</para>
                                 </listitem>
                              </itemizedlist>
                           </entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-body">&lt;body></link></code></entry>
                           <entry>
                              <itemizedlist>
                                 <listitem>
                                    <para>must take <code><link linkend="attribute-xmllang"
                                             >@xml:lang</link></code></para>
                                 </listitem>
                                 <listitem>
                                    <para>may take <code><link linkend="attribute-in-progress"
                                             >@in-progress</link></code></para>
                                 </listitem>
                                 <listitem>
                                    <para>must take exclusively one or more <code><link
                                             linkend="element-div">&lt;div></link></code>s</para>
                                 </listitem>
                                 <listitem>
                                    <para>any elements or text between <code><link
                                             linkend="element-div">&lt;div></link></code>s will be
                                       ignored</para>
                                 </listitem>
                                 <listitem>
                                    <para>overall contents must be restricted to a single
                                       work</para>
                                 </listitem>
                                 <listitem>
                                    <para>any and all text nodes will be treated as part of the
                                       transcription</para>
                                 </listitem>
                              </itemizedlist>
                           </entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-div">&lt;div></link></code></entry>
                           <entry>
                              <itemizedlist>
                                 <listitem>
                                    <para>must take either only <code><link linkend="element-div"
                                             >&lt;div></link></code>s or no <code><link
                                             linkend="element-div">&lt;div></link></code>s at
                                       all</para>
                                 </listitem>
                                 <listitem>
                                    <para>must take <code><link linkend="attribute-type"
                                             >@type</link></code> and <code><link
                                             linkend="attribute-n">@n</link></code> (<link
                                          linkend="attribute-include"><code>@include</code></link>
                                       is not allowed in TAN-TEI, but is allowed in TAN-T)</para>
                                 </listitem>
                              </itemizedlist>
                           </entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table>
            </para>
            <para>Like all other TAN files, the root elements of TAN-TEI files must take an
                     <code><link linkend="attribute-id">@id</link></code>, the IRI name. See above,
                  <xref linkend="tag_urn"/>.</para>
            <para>TAN-TEI files have two heads, which may strike you as odd. The TEI head and the
               TAN head were designed for different purposes. Whereas the TAN <link
                  linkend="element-head"><code>&lt;head></code></link> is meant to be brief and
               keyed to both IRIs and human-readable data, the <code>&lt;teiHeader></code> has been
               designed principally for human readability, and permits an expansive range of
               metadata, including matters that bear on the transcription only indirectly (e.g.,
               manuscript descriptions). </para>
            <para>Processors of TAN-TEI files will in general ignore the contents of
                  <code>&lt;teiHeader></code>, since the contents are unpredictable. If your
                  <code>&lt;teiHeader></code> has any kind of metadata relevant to TAN users, you
               will need to adapt it for the standard TAN <link linkend="element-head"
                     ><code>&lt;head></code></link> (see <xref linkend="metadata_head"/> and <xref
                  linkend="transcription_principles"/>). You may find that some of the material you
               put in <code>&lt;teiHeader></code> is not suitable for <code><link
                     linkend="element-head">&lt;head></link></code> and vice versa. This conversion
               needs to be performed manually, since the two headers are incommensurate, and writing
               each one requires a different kind of outlook.</para>
            <para>In a TAN-TEI file, the TAN <code><link linkend="element-head"
                  >&lt;head></link></code> must declare the TAN namespace to be its default, i.e.,
                  <code>&lt;head xmlns="tag:textalign.net,2015:ns"></code> or
                  <code>&lt;tan:head></code> if the prefix <code>tan:</code> has been defined in the
               root element.</para>
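            <para>The customizations described so far can be pictured in a skeleton such as the
               following. It is only a sketch: the <code><link linkend="attribute-id"
                  >@id</link></code> value is a hypothetical tag URN, the <code><link
                     linkend="attribute-type">@type</link></code> values and the language code are
               invented for illustration, and the contents of both headers are left as placeholder
               comments rather than filled in.</para>
            <programlisting language="xml">&lt;TEI xmlns="http://www.tei-c.org/ns/1.0"
   xmlns:tan="tag:textalign.net,2015:ns"
   id="tag:example.com,2017:sample-transcription">
   &lt;teiHeader>
      &lt;!-- ordinary TEI metadata; generally ignored by TAN processors -->
   &lt;/teiHeader>
   &lt;tan:head>
      &lt;!-- TAN metadata, in the TAN namespace -->
   &lt;/tan:head>
   &lt;text>
      &lt;body xml:lang="eng">
         &lt;div type="chapter" n="1">
            &lt;div type="section" n="1">
               &lt;p>&lt;!-- transcription, with any TEI markup you wish -->&lt;/p>
            &lt;/div>
         &lt;/div>
      &lt;/body>
   &lt;/text>
&lt;/TEI></programlisting>
            <para>The same TAN head could instead be written as <code>&lt;head
                  xmlns="tag:textalign.net,2015:ns"></code>, in which case the prefix declaration on
               the root element would not be needed for the head itself.</para>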
            <para>Within any leaf <code><link linkend="element-div">&lt;div></link></code>, you may
               use whatever TEI markup you wish, to whatever level of depth or complexity. All users
               of your TAN-TEI file will be interested in the text; only a subset will care about
               any markup within leaf <code><link linkend="element-div">&lt;div></link></code>s. For
               this reason, even if you change the value of <code><link linkend="attribute-xmllang"
                     >@xml:lang</link></code> within a leaf <code><link linkend="element-div"
                     >&lt;div></link></code>, there is no guarantee that readers or processors of
               your data will take it into account. </para>
            <para>TAN-TEI should not be used to try to represent the physical appearance of the text
               on the object. Write a separate TEI (non-TAN) file first, and then use TAN-TEI to
               create a more normalized version.</para>
            <para>You may need to prepare a TEI file to be TAN compliant. As a matter of
               practicality, it is helpful to envision the conversion process as falling into three
               steps:</para>
            <para>
               <orderedlist>
                  <listitem>
                     <para>Structure: insert new processing instructions (TAN-TEI validation files);
                        adjust root element by supplying IRI name to <code><link
                              linkend="attribute-id">@id</link></code>, TAN namespace to
                           <code>@xmlns:tan</code>.</para>
                  </listitem>
                  <listitem>
                     <para>Metadata: create new <code><link linkend="element-head"
                           >&lt;head></link></code> and populate it.</para>
                  </listitem>
                  <listitem>
                     <para>Data: edit <code><link linkend="element-body">&lt;body></link></code> to
                        restrict the content to a single work; restructure <code><link
                              linkend="element-body">&lt;body></link></code> content into nesting
                              <code><link linkend="element-div">&lt;div></link></code>s with correct
                              <code><link linkend="attribute-type">@type</link></code> and
                              <code><link linkend="attribute-n">@n</link></code> values.</para>
                  </listitem>
               </orderedlist>
            </para>
            <para>It has been the experience of those who have made TEI to TAN-TEI conversions that
               step 2 is the most time-consuming. The TAN <code><link linkend="element-head"
                     >&lt;head></link></code> requires one to more carefully curate the metadata
               than does <code>&lt;teiHeader></code>. But step 3 should not be overlooked, either.
               Many people write TEI files with a focus on the original textual object, and they
               make editorial decisions that look toward the scriptum and not the intertextual
               ecosystem that TAN supports. It is advisable to trim from the body of your TEI file
               any elements that would interfere with direct comparison with other versions of the
               text in the TAN format.</para>
         </section>
      </chapter>
      <chapter xml:id="class_2">
         <title>Class-2 TAN Files, Annotations of Texts</title>
         <para>This chapter provides general background to the elements and attributes that are
            common to class 2 TAN files. For detailed discussion see <xref
               linkend="elements-attributes-and-patterns"/>.</para>
         <para>At present, class 2 files are restricted to alignment or lexico-morphology. </para>
         <para>Alignment files come in two different formats, identified by the root element.
            TAN-A-div provides macroscopic alignment; TAN-A-tok, microscopic. TAN-A-div aligns one
            or more class 1 files. It is intended for broad, general alignments of any number of
            versions of any number of works. The scope of TAN-A-tok is more restricted, to two class
            1 files, allowing one to declare alignments with detailed specificity, certainty, and
            type between words (tokens). TAN-A-div focuses on works, regardless of version;
            TAN-A-tok focuses on individual versions.</para>
         <para>Lexico-morphology files (also called part-of-speech files), TAN-LM, are used to
            encode the lexical headwords and morphological forms of individual words in class 1
            files.</para>
         <section xml:id="class_2_common">
            <title>Common Elements</title>
            <para>The class 2 formats have been designed to be human readable, particularly
               references to class 1 files. In ordinary conversation, when referring to specific
               parts of a work, we like to cite pages, paragraphs, sentences, lines, words, letters,
               and so forth. We use relational words (e.g., "first"), and the very text itself. We
               might say, for example, "See page 4, second paragraph, the last four words." Or, "See
               page 4, second paragraph, first sentence, second occurrence of 'pull'." </para>
            <para>The TAN pointer syntax differs from other pointer systems (e.g., URLs, XPath, and
               XPointer) in that it depends upon a hierarchy of four features: works, divisions,
               word tokens, and characters. <emphasis>Works</emphasis>, defined above (see <xref
                  linkend="conceptual_works"/>), are defined by the <emphasis>source</emphasis>
               (which may not have more than one work). <emphasis>Divisions</emphasis> are defined
               by the <code><link linkend="element-div">&lt;div></link></code> structure of each
               source. <emphasis>Tokens</emphasis> are words of those divisions, defined according
               to one or more tokenization rules. And <emphasis>characters</emphasis> are defined as
               non-modifying codepoints in a word token. (A modifying character is treated as a
               unit with the non-modifying base character it modifies.)</para>
            <para>Parts of this fourfold hierarchy—works, divisions, tokens, and characters—are
               named with vocabulary that the editor of a class 2 file finds most useful. Sources
               are given a nickname (e.g., <code><link linkend="attribute-xmlid">xml:id</link> =
                  "hamlet-1741"</code>); divisions are named using the values for <code><link
                     linkend="attribute-n">@n</link></code>; tokens are referred to by position, by
               their actual values, or both (e.g., <code><link linkend="attribute-pos">pos</link> =
                  "1 - 5", <link linkend="attribute-pos">pos</link> = "last-1 - last", <link
                     linkend="attribute-val">val</link> = "hath"</code>; see <xref
                  linkend="attr_pos_and_val"/>). Characters are always identified by number (e.g.,
                     <code><link linkend="attribute-chars">chars</link> = "2, 7"</code>).</para>
            <para>This approach not only makes the syntax human readable, it also mitigates any
               disruptions that corrections or alterations might incur. For example, if an
               incorrectly duplicated <code><link linkend="element-div">&lt;div></link></code> is
               deleted, disruption to the reference system is isolated and does not affect the rest
               of the document.</para>
            <section xml:id="class_2_validation">
               <title>Class 2 Validation</title>
               <para>Some Class 2 files may be time-consuming to validate fully. The length of the
                     <link linkend="element-body"><code>&lt;body></code></link> could be enormous.
                  Or the number and length of sources may be taxing. Or validation may depend upon
                  time-consuming transformations of the source documents. Most often, this
                  problem affects TAN-A-div files, so to facilitate editing within an XML editor,
                  where regular validation is essential, Schematron validation falls into one of two
                  phases:</para>
               <para>
                  <orderedlist>
                     <listitem>
                        <para><emphasis role="bold">basic</emphasis>: All regular Schematron tests
                           are suspended, and reports are devoted exclusively to assisting in
                           looking for and checking the validity of references in <link
                              linkend="element-div-ref"><code>&lt;div-ref></code></link> and <link
                              linkend="element-tok"><code>&lt;tok></code></link>.</para>
                     </listitem>
                     <listitem>
                        <para><emphasis role="bold">verbose</emphasis>: complete testing of class-2
                           files, including checks on source files to determine whether they adhere
                           to the LDUR (see <xref linkend="leaf_div_uniqueness_rule"/>). In
                           addition, information is given on where there are discrepancies in the
                           numeration system across versions of the same work.</para>
                     </listitem>
                  </orderedlist>
               </para>
               <para>If you do not specify in the prolog which phase you intend to be the default,
                  you will be prompted for the phase you wish to use whenever you validate the
                  file.</para>
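               <para>As a sketch, a default phase might be declared through the <code>phase</code>
                  pseudo-attribute of the <code>xml-model</code> processing instruction in the
                  prolog. The <code>href</code> value below is a placeholder, not the actual
                  location of the TAN-A-div Schematron file, and whether the declaration is honored
                  depends upon your XML editor.</para>
               <programlisting language="xml">&lt;?xml-model href="TAN-A-div.sch" type="application/xml"
   schematypens="http://purl.oclc.org/dsdl/schematron" phase="basic"?></programlisting>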
            </section>
            <section xml:id="class_2_metadata">
               <title>Class 2 Metadata (<code><link linkend="element-head"
                  >&lt;head></link></code>)</title>
               <para>Class 2 files share a few common features in their metadata, mostly to
                  facilitate the human-friendly reference system outlined above.</para>
               <para>All class 2 files have as their sources nothing other than class 1 files.
                  Therefore each <code><link linkend="element-source">&lt;source></link></code> must
                  take the <xref xlink:href="#digital_entity_metadata"/>. Because the rights have
                  already been declared in the source files, <code><link
                        linkend="element-rights-source-only">&lt;rights-source-only></link></code>
                  is disallowed. </para>
               <para>Editors of class 2 files must be able to name or number word-tokens in a
                  transcription, via an optional <code><link linkend="element-token-definition"
                        >&lt;token-definition></link></code>. See <xref linkend="defining_tokens"
                  />.</para>
               <para>There may be some cases where a source has a div type that is unnecessary, is
                  confusing, or should be ignored. One or more optional <code><link
                        linkend="element-suppress-div-types">&lt;suppress-div-types></link></code>s
                  may be used to specify division types that you wish to suppress in
                  references.</para>
               <para>Optional <code><link linkend="element-rename-div-ns"
                     >&lt;rename-div-ns></link></code> elements provide a convenient way to provisionally
                  rename <code><link linkend="attribute-n">@n</link></code> values. This is useful
                  for cases where you wish to use division labels that are more familiar to users of the
                  class 2 files, or are easier to edit and read. It can also be used to harmonize
                  discordant <code><link linkend="attribute-n">@n</link></code> values, especially
                  helpful for divs that are named, not numbered, such as the books of the
                  Bible.</para>



            </section>
            <section xml:id="class_2_body">
               <title>Class 2 Data Patterns (<code><link linkend="element-body"
                     >&lt;body></link></code>)</title>
               <para>The three types of class 2 files treat different kinds of phenomena, so their
                  data structures look quite different. Nevertheless, a few elements and attributes
                  are shared by at least two class 2 formats.</para>
               <para>Many class 2 elements take <code><link linkend="attribute-src"
                     >@src</link></code> and <code><link linkend="attribute-ref">@ref</link></code>.
                        <code><link linkend="attribute-src">@src</link></code> points via ID
                  reference to one or more <code><link linkend="element-source"
                     >&lt;source></link></code>s and <code><link linkend="attribute-ref"
                     >@ref</link></code> points to one or more <code><link linkend="element-div"
                        >&lt;div></link></code>s through their <emphasis>flat ref</emphasis>
                  (perhaps substituted with their new values if <code><link
                        linkend="element-rename-div-ns">&lt;rename-div-ns></link></code> have been
                  invoked; see <xref linkend="metadata_head"/>).</para>
               <para>In the example <code><link linkend="attribute-ref">ref</link> = "1.2-4,
                     1.5"</code>, the periods are arbitrary (but the hyphen and comma, which have
                  special meanings here, are not). You may use any punctuation you wish, or even
                  space, but it is recommended you use what will be most familiar to users. You may
                  use non-Arabic numerals, regardless of the numbering system used by your sources.  </para>
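               <para>By way of illustration, the three references below would be read identically
                  under the rule just described, because only the hyphen and the comma carry special
                  meaning. The source nickname <code>hamlet-1741</code> is borrowed from the example
                  earlier in this chapter; that <code><link linkend="element-div-ref"
                        >&lt;div-ref></link></code> is written with exactly these attributes is an
                  assumption of this sketch.</para>
               <programlisting language="xml">&lt;div-ref src="hamlet-1741" ref="1.2-4, 1.5"/>
&lt;div-ref src="hamlet-1741" ref="1 2-4, 1 5"/>
&lt;div-ref src="hamlet-1741" ref="1:2-4, 1:5"/></programlisting>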
               <para><code><link linkend="attribute-chars">@chars</link></code> and <code><link
                        linkend="attribute-pos">@pos</link></code> follow a useful compact syntax,
                  described below (<xref linkend="attr_pos_and_val"/>).</para>
            </section>
            <section xml:id="attr_pos_and_val">
               <title><code><link linkend="attribute-pos">@pos</link></code> and <code><link
                        linkend="attribute-val">@val</link></code></title>
               <para>To point to a token, one of three methods may be used.</para>
               <para>
                  <orderedlist>
                     <listitem>
                        <para><emphasis role="italic"><code><link linkend="attribute-pos"
                                    >@pos</link></code> alone</emphasis>. Under this method, one or
                           more digits, or the phrase <code>last</code> or <code>last-</code> plus a
                           digit, joined by hyphens or commas indicate one or more token numbers.
                           For example, <code>2, 4-6, last-2 - last</code> refers to the second,
                           fourth, fifth, sixth, antepenult, penult, and final tokens in a sequence
                           of word tokens. The numerical value to which the keyword
                              <code>last</code> resolves depends upon the context of each source and
                           ref.</para>
                     </listitem>
                     <listitem>
                        <para><emphasis role="italic"><code><link linkend="attribute-val"
                                    >@val</link></code> alone</emphasis>. Under this method, a
                           single token is picked by means of a string value equivalent to the
                           token. For example, <code><link linkend="attribute-val">@val</link> =
                              "bird"</code> points to the first occurrence of the token
                              <code>bird</code>.</para>
                     </listitem>
                     <listitem>
                        <para><emphasis role="italic"><code><link linkend="attribute-pos"
                                    >@pos</link></code> and <emphasis role="italic"><code><link
                                       linkend="attribute-val">@val</link></code></emphasis>
                              together.</emphasis> Under this method, specific occurrences of a token
                           are picked. For example, <code><link linkend="attribute-val"
                              >@val</link>="bird" <link linkend="attribute-pos">@pos</link>="2,
                              4"</code> picks the second and fourth occurrences of the token
                              <code>bird</code>.</para>
                     </listitem>
                  </orderedlist>
               </para>
               <para>Any time <code><link linkend="attribute-pos">@pos</link></code> appears in an
                  element, and <code><link linkend="attribute-val">@val</link></code> doesn't,
                        <code><link linkend="attribute-val">@val</link></code> is assumed to allow
                  matches to any word. Vice versa, if <code><link linkend="attribute-val"
                        >@val</link></code> appears but <code><link linkend="attribute-pos"
                        >@pos</link></code> doesn't, the latter is assumed to equal <code>1</code>. </para>
               <para><code><link linkend="attribute-pos">@pos</link></code> and <code><link
                        linkend="attribute-val">@val</link></code> must be used carefully. For
                  example, the attribute combination <code>val="bird" pos="last-5"</code> will
                  produce an error if the word token <code>bird</code> does not occur at least six
                  times.</para>
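               <para>The three methods might look as follows in a class 2 file. This is a sketch
                  only: the source nickname <code>hamlet-1741</code> and the reference
                  <code>1.2</code> are illustrative, and the use of <code><link
                        linkend="attribute-src">@src</link></code> and <code><link
                        linkend="attribute-ref">@ref</link></code> on <code><link
                        linkend="element-tok">&lt;tok></link></code> is assumed from the general
                  description of class 2 elements above.</para>
               <programlisting language="xml">&lt;!-- @pos alone: tokens 2, 4, 5, 6, and the last three tokens of the div -->
&lt;tok src="hamlet-1741" ref="1.2" pos="2, 4-6, last-2 - last"/>

&lt;!-- @val alone: the first occurrence of "bird" (@pos is assumed to be 1) -->
&lt;tok src="hamlet-1741" ref="1.2" val="bird"/>

&lt;!-- @pos and @val together: the second and fourth occurrences of "bird" -->
&lt;tok src="hamlet-1741" ref="1.2" val="bird" pos="2, 4"/>

&lt;!-- an error unless "bird" occurs at least six times in the div -->
&lt;tok src="hamlet-1741" ref="1.2" val="bird" pos="last-5"/></programlisting>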
            </section>
         </section>
         <section xml:id="alignment_principles">
            <title>Alignments: Principles and Assumptions</title>
            <para>TAN alignments attest to acts of translating, paraphrasing, revising, quoting,
               summarizing, and so forth. All these are treated as types of text reuse, where one or
               more texts, usually called in translation studies the <emphasis>source</emphasis> (or
                  <emphasis>sources</emphasis>), are transformed into a new text, customarily called
               the <emphasis>target</emphasis>. Text reuse has chronological directionality and is
               asymmetrical (a quoted text affects a quoting text but not vice versa). But many
               times we deal with texts where the original lines of direction are contested or
               unknown. In those cases, it is hasty or misleading to refer to either of the texts as
               a source or a target. Indeed, the two texts may in fact derive from a common source,
               or be only indirectly related, the result of multiple generations of copying and
               translating. In these guidelines, therefore, we avoid the term <emphasis
                  role="italic">target</emphasis> altogether, and when we use the word <emphasis
                  role="italic">source</emphasis>, we are referring only to one of the class 1 files
               upon which a class 2 alignment depends.</para>
            <para>Thus, the order of <code><link linkend="element-source">&lt;source></link></code>s
               in an alignment file's <code><link linkend="element-head">&lt;head></link></code>
               does not imply chronological precedence. The only implication is that of processing
               order: the first will be the foundation or base against which subsequent sources will
               be aligned. It is usually a good idea to list as the first <code><link
                     linkend="element-source">&lt;source></link></code> the version that is most
               complete or most important to a given alignment.</para>
         </section>
         <section xml:id="tan-a-div">
            <title>Division-Based Alignments (<code><link linkend="element-TAN-A-div"
                     >&lt;TAN-A-div></link></code>)</title>
            <para>TAN-A-div is the format for macroscopic, division-based alignment, and is
               dedicated to aligning any number of versions of any number of works on the basis of
                     <code><link linkend="element-div">&lt;div></link></code>s, or even smaller, ad
               hoc segments in the sources invoked. </para>
            <para>A TAN-A-div file provides two major services. </para>
            <para><emphasis role="bold">Reconciling structural differences between versions of the
                  same text</emphasis>. Some independently created transcriptions of the same work
               will, no matter the good intentions of the transcribers, fail to correspond exactly
               to related versions. Perhaps works or div types were not defined with the same IRIs,
               or perhaps one version follows a reference system at odds with the majority of other
               versions. Perhaps a version is interpolated or lacunose. TAN-A-div is used to
               reconcile such inconsistencies, to make special alignments that a computer might not
               be able to make accurately, and to refine the alignment of parallel sources, even
               down to the word level. </para>
            <para><emphasis role="bold">Making general claims about a work, or a particular version of
                  a work</emphasis>. Scholars working with texts regularly wish to make claims about
               those texts, e.g., work A passage b quotes from work X passage c; work A passage b
               deals with topic M; work A passage b word 7 has a variant reading b' in version
               A1.</para>
            <para>For the first purpose, the motivations of an aligner are opaque. A TAN-A-div file
               says, in essence, "Please align the following sources," but it does not say why the
               alignment is requested, and it does not indicate what relationship holds between the
               various sources. In fact, a TAN-A-div file could be used to align texts that have no
               apparent relationship (to what end would be unclear). </para>
            <para>For the second purpose, the aligner makes claims about the texts, and motivations
               and assumptions are made as clear as possible. </para>
            <para>Processors of a TAN-A-div file will assume greedy alignment. Alignments will be
               inferred wherever possible, when not explicitly overridden. Alignments are also
               transitive. If passage A is declared to align with B, then, barring any exceptions,
               anything that aligns with A will be assumed to align with anything that aligns with B
               (see <xref xlink:href="#multiple-values"/>).</para>
            <section>
               <title>Root Element and Header</title>
               <para>The root element of a TAN division-based alignment file is <code><link
                        linkend="element-TAN-A-div">&lt;TAN-A-div></link></code>.</para>
               <para>TAN-A-div's <code><link linkend="element-head">&lt;head></link></code> has some
                  special rules. </para>
               <para>One or more <code><link linkend="element-source">&lt;source></link></code>s
                  must be declared (<xref linkend="source_and_see-also"/>). That an alignment file
                  would have only a single source may seem strange, but such a scenario could be
                  useful for self-alignment (i.e., to indicate places where a source reuses itself),
                  or to make claims about that text. </para>
               <para><code><link linkend="element-declarations">&lt;declarations></link></code>
                  takes zero or more of the declarations common to class 2 files: <code><link
                        linkend="element-token-definition">&lt;token-definition></link></code>,
                        <code><link linkend="element-suppress-div-types"
                        >&lt;suppress-div-types></link></code>, <code><link
                        linkend="element-rename-div-ns">&lt;rename-div-ns></link></code>. See <xref
                     linkend="class_2_common"/>. TAN-A-div also allows declarations unique to <xref
                     linkend="pattern-TAN-c-decl-core"/>.</para>
            </section>
            <section>
               <title>Data (<code><link linkend="element-body">&lt;body></link></code>)</title>
               <para>A TAN-A-div may have an empty <code><link linkend="element-body"
                        >&lt;body></link></code> because the format by default demands greedy
                  alignment. That is, it effectively states, "Take the list of sources in the
                  header. First group (align) them by work, then by <code><link
                        linkend="element-div">&lt;div></link></code>s according to flat refs." </para>
               <para>A processor will create groups of works according to the <code><link
                        linkend="element-IRI">&lt;IRI></link></code> values under <code><link
                        linkend="element-work">&lt;work></link></code> in each source. To those
                  matches will be added any sources that you declare to contain the same work. Then
                  within each group of versions of the same work, the processor will align (group)
                        <code><link linkend="element-div">&lt;div></link></code>s based on their
                  flat ref (based on <code><link linkend="attribute-n">@n</link></code>), after
                  normalization and after taking into account exceptions declared in the TAN-A-div
                  file.</para>
               <para>If sources representing different versions of the same work already have
                        <code><link linkend="element-div">&lt;div></link></code>s whose flat refs
                  match well, then nothing needs to be declared in a TAN-A-div <code><link
                        linkend="element-body">&lt;body></link></code>. A TAN-conformant processor
                  will perform the alignment. </para>
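               <para>A minimal sketch of such a file follows. The source ids are hypothetical, and
                  the root attributes and header metadata that a real file requires are elided in
                  comments, so the fragment is illustrative rather than valid as it stands:</para>
               <para>
                  <programlisting>&lt;TAN-A-div>
   &lt;!-- root attributes omitted -->
   &lt;head>
      &lt;!-- core metadata omitted -->
      &lt;source xml:id="version-a">
         &lt;!-- pointer to the first class 1 file omitted -->
      &lt;/source>
      &lt;source xml:id="version-b">
         &lt;!-- pointer to the second class 1 file omitted -->
      &lt;/source>
      &lt;!-- declarations and remaining metadata omitted -->
   &lt;/head>
   &lt;body/>
&lt;/TAN-A-div></programlisting>
               </para>
               <para>Because the body is empty, everything is left to greedy alignment: the two
                  versions are grouped by work, and their divisions by flat ref.</para>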
               <para>Within the <code><link linkend="element-body">&lt;body></link></code> of a
                  TAN-A-div file, the first optional procedure, reconciliation, is an
                  up-to-four-step process. Each step is optional and sequence-specific. That is,
                  each statement assumes actions specified by previous siblings have already been
                  implemented.</para>
               <para>After reconciliation, the second optional procedure, making claims, is
                  handled.</para>
               <section>
                  <title>Process 1, Step 1: Correlate Works</title>
                  <para>In the first step you may declare an ad hoc equivalence between sources that
                     do not already share an <code><link linkend="element-IRI"
                        >&lt;IRI></link></code> value for <code><link linkend="element-work"
                           >&lt;work></link></code>. Each equivalence is made through an <code><link
                           linkend="element-equate-works">&lt;equate-works></link></code>, which
                     groups together under <code><link linkend="attribute-src">@src</link></code>
                     the ids of sources that should be treated as containing the same work. </para>
                  <para>Transitive alignment holds: <code>&lt;<link linkend="element-equate-works"
                           >equate-works</link>
                     </code><code><link linkend="attribute-work">work</link></code><code>="a
                        b"/></code> means that any sources that share the same works as
                        <code>a</code> and <code>b</code> will also be treated as equivalent.</para>
                  <para>This declaration does not imply that the works are, in reality, one and the
                     same. It merely states that, for the purposes of this alignment, they should be
                     treated as equivalent.</para>
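                  <para>As a sketch, suppose three hypothetical sources with the ids
                        <code>a</code>, <code>b</code>, and <code>c</code>, where <code>b</code> and
                        <code>c</code> already share a work IRI but <code>a</code> does not. The
                     attribute name below simply follows the inline form shown above; consult the
                     entry for the element for the definitive syntax.</para>
                  <para>
                     <programlisting>&lt;!-- hypothetical ids; c is not named, but joins the group transitively -->
&lt;equate-works work="a b"/></programlisting>
                  </para>
                  <para>A single declaration is enough here, since <code>c</code> already shares
                     its work with <code>b</code>.</para>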
               </section>
               <section>
                  <title>Process 1, Step 2: Correlate Division Types</title>


                  <para>The second step does for div types what the first step did for works, with
                           <code><link linkend="element-equate-div-types"
                           >&lt;equate-div-types></link></code>. Across all sources, every
                           <code><link linkend="element-div-type">&lt;div-type></link></code> that
                     shares an <code><link linkend="element-IRI">&lt;IRI></link></code> value will
                     be treated as equivalent. But you may augment that automated alignment through
                     an <code><link linkend="element-equate-div-types"
                        >&lt;equate-div-types></link></code>, which takes one or more <code><link
                           linkend="element-div-type-ref">&lt;div-type-ref></link></code>s, each of
                     which takes a mandatory <code><link linkend="attribute-src">@src</link></code>
                     and <code><link linkend="attribute-div-type-ref">@div-type-ref</link></code>,
                     to point to one or more sources and division types. You must use the
                           <code><link linkend="attribute-xmlid">@xml:id</link></code> assigned by
                     the source to that div type.</para>
                  <para>As with <code><link linkend="element-equate-works"
                        >&lt;equate-works></link></code>, <code><link
                           linkend="element-equate-div-types">&lt;equate-div-types></link></code>
                     assumes a greedy, transitive alignment. The ad hoc declaration does not imply
                     that the two types of division are in reality one and the same; it just
                     correlates them for the sake of the alignment.</para>
                  <para>This step is not likely to be used in most TAN-A-div files, because it has
                     no impact on the steps that follow, or even on alignment proper, since it does
                     not affect the reconciliation of flat refs. It is useful mainly in those cases
                     where you expect users of your file to be interested in comparing division
                     types (e.g., calculating ratios of paragraphs to chapters per version per
                     work).</para>
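                  <para>For illustration, a declaration that correlates two division types might
                     look like the following sketch. The source ids (<code>version-a</code>,
                        <code>version-b</code>) and the div type ids (<code>ch</code>,
                        <code>chapter</code>) are hypothetical; in a real file they would be the ids
                     assigned in the header and in each source.</para>
                  <para>
                     <programlisting>&lt;equate-div-types>
   &lt;!-- "ch" in one source is treated as equivalent to "chapter" in the other -->
   &lt;div-type-ref src="version-a" div-type-ref="ch"/>
   &lt;div-type-ref src="version-b" div-type-ref="chapter"/>
&lt;/equate-div-types></programlisting>
                  </para>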
               </section>
               <section>
                  <title>Process 1, Step 3: Refine Segmentation</title>


                  <para>Suppose you have two transcriptions where a phrase ending one leaf
                           <code><link linkend="element-div">&lt;div></link></code> in source A
                     actually corresponds to the beginning phrase of the next leaf <code><link
                           linkend="element-div">&lt;div></link></code> in source B. Or suppose that
                     you wish to break down a leaf <code><link linkend="element-div"
                        >&lt;div></link></code> into smaller constituent parts, to facilitate more
                     exact alignment against another version that is divided more granularly. Before
                     these refined alignments can occur, you must first segment specific leaf
                           <code><link linkend="element-div">&lt;div></link></code>s through
                           <code><link linkend="element-split-leaf-div-at"
                           >&lt;split-leaf-div-at></link></code>, which contains one or more
                           <code><link linkend="element-tok">&lt;tok></link></code>s pointing to
                     individual words (see <xref linkend="attr_pos_and_val"/>) that should begin a
                     new segment in each reference in each source.</para>
                  <para><code><link linkend="attribute-ref">@ref</link></code> must refer only to
                     leaf <code><link linkend="element-div">&lt;div></link></code>s. Any leaf
                           <code><link linkend="element-div">&lt;div></link></code> may be split as
                     many times as one wishes, but never at the first token.</para>
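                  <para>The following sketch assumes a source with the hypothetical id
                        <code>nt-grc</code> whose leaf div Mk 10:6 reads ἀπὸ δὲ ἀρχῆς κτίσεως ἄρσεν
                     καὶ θῆλυ ἐποίησεν αὐτούς. It also assumes that each <code><link
                           linkend="element-tok">&lt;tok></link></code> names its source through
                           <code><link linkend="attribute-src">@src</link></code>. Splitting at the
                     fifth token starts a new segment at ἄρσεν:</para>
                  <para>
                     <programlisting>&lt;split-leaf-div-at>
   &lt;!-- token 5 (ἄρσεν) begins a new segment; token 1 may never be used -->
   &lt;tok src="nt-grc" ref="Mk 10:6" pos="5"/>
&lt;/split-leaf-div-at></programlisting>
                  </para>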
               </section>
               <section>
                  <title>Process 1, Step 4: Realign Versions of the Same Work</title>
                  <para>After step 3, some of the divisions and segments of a work may not be
                     properly aligned. Segments newly created by <code><link
                           linkend="element-split-leaf-div-at">&lt;split-leaf-div-at></link></code>s
                     may need to be realigned. Or perhaps one of the sources uses a reference system
                     that is out of step with the others. <code><link linkend="element-realign"
                           >&lt;realign></link></code> is used to reconcile differences. It is not
                     used for aligning across works. </para>
                  <para>There are two types of realignment: anchored and unanchored, discussed in
                     detail at <code><link linkend="element-realign"
                     >&lt;realign></link></code>.</para>
               </section>
               <section xml:id="tan-a-div_align">
                  <title>Process 2: Make Claims</title>
                  <para>At this point, each work should have its versions properly aligned. You are
                     now in a position to indicate other places where one work quotes from another,
                     or make other comments on specific textual passages. In this process,
                           <code><link linkend="element-claim">&lt;claim></link></code> may be used
                     to indicate such things as:</para>
                  <para>
                     <itemizedlist>
                        <listitem>
                           <para>textual passages where one work quotes or alludes to another work
                              or itself (index of quotations and allusions);</para>
                        </listitem>
                        <listitem>
                           <para>textual passages that deal with a certain topic (general index);</para>
                        </listitem>
                        <listitem>
                           <para>passages where notes in one source correspond to main text in
                              another (tethering separated notes to main text);</para>
                        </listitem>
                        <listitem>
                           <para>alternative readings of a textual passage (apparatus
                              criticus).</para>
                        </listitem>
                     </itemizedlist>
                  </para>
                  <para>These alignments occur through <code><link linkend="element-claim"
                           >&lt;claim></link></code>s whose <code><link linkend="element-subject"
                           >&lt;subject></link></code> or <code><link linkend="element-object"
                           >&lt;object></link></code> points to passages of text.</para>
                  <para>Any textual <code><link linkend="element-subject">&lt;subject></link></code>
                     or <code><link linkend="element-object">&lt;object></link></code> may take
                           <code><link linkend="attribute-work">@work</link></code> or <code><link
                           linkend="attribute-src">@src</link></code>. The former takes a single
                     reference to a <code><link linkend="element-source"
                        >&lt;source&gt;</link></code>, but adopts the reference as a proxy to make a
                     claim applicable to all versions of the same work. <code><link
                           linkend="attribute-src">@src</link></code> restricts the claim to
                     specific versions, not to the work as a whole.</para>
                  <para><code><link linkend="element-claim">&lt;claim></link></code> is most
                     commonly used to create an interoperable index, indicating where one work
                     quotes from another. Such claims should not be taken to apply to the whole (see
                        <xref xlink:href="#multiple-values"/>). A claim that passage b quotes
                     passage y means only that some part of b quotes from some part of y, not that
                     the whole of b quotes from the whole of y. Specificity must be made on the level
                     of <code><link linkend="element-tok">&lt;tok></link></code>, a child of a
                     textual <code><link linkend="element-subject">&lt;subject></link></code> or
                           <code><link linkend="element-object">&lt;object></link></code>. </para>
                  <para>Furthermore, if that <code><link linkend="element-tok"
                        >&lt;tok></link></code> is governed by <code><link linkend="attribute-work"
                           >@work</link></code> and not <code><link linkend="attribute-src"
                           >@src</link></code>, then two statements are implied, first that the
                     claim pertains to such-and-such a particular range of tokens in a particular
                     source, and second that the claim pertains to other versions of the same work,
                     but at unspecified ranges of words. For example:</para>
                  <para>
                     <programlisting>&lt;claim verb="quotes">
   &lt;subject work="nt-grc">
      &lt;tok ref="Mk 10:6" pos="last-4 - last"/>
   &lt;/subject>
   &lt;object work="lxx">
      &lt;tok ref="Gen 1:27" pos="last-4 - last"/>
   &lt;/object>
&lt;/claim></programlisting>
                  </para>
                  <para>might correlate the following leaf divs (matches in
                     bold):<programlisting>&lt;div n="27" type="v">καὶ ἐποίησεν ὁ θεὸς τὸν ἄνθρωπον κατ' εἰκόνα 
θεοῦ ἐποίησεν αὐτόν <emphasis role="bold">ἄρσεν καὶ θῆλυ ἐποίησεν αὐτούς</emphasis>&lt;/div>
. . . . . 
&lt;div type="v" n="6">ἀπὸ δὲ ἀρχῆς κτίσεως <emphasis role="bold">ἄρσεν καὶ θῆλυ ἐποίησεν 
αὐτούς</emphasis>·&lt;/div></programlisting></para>
                  <para>Even though the claim is about the work in general, the statement provides
                     specificity to only two sources. The claim will be regarded as holding over
                     other versions of the same works, but only on the leaf div level. On the token
                     level, it is up to a processor to determine if and where the relative position
                     of the quote might be found. </para>
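                  <para>By contrast, a sketch of the same claim restricted to the two named
                     versions would replace <code><link linkend="attribute-work"
                        >@work</link></code> with <code><link linkend="attribute-src"
                        >@src</link></code>, as described above. The ids are those of the example;
                     no inference is then made about other versions of either work.</para>
                  <para>
                     <programlisting>&lt;claim verb="quotes">
   &lt;!-- src, not work: the claim holds only for these two versions -->
   &lt;subject src="nt-grc">
      &lt;tok ref="Mk 10:6" pos="last-4 - last"/>
   &lt;/subject>
   &lt;object src="lxx">
      &lt;tok ref="Gen 1:27" pos="last-4 - last"/>
   &lt;/object>
&lt;/claim></programlisting>
                  </para>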
               </section>
            </section>
         </section>
         <section xml:id="tan-a-tok">
            <title>Token-Based Alignments (<code><link linkend="element-TAN-A-tok"
                     >&lt;TAN-A-tok></link></code>)</title>
            <para>TAN-A-tok files provide a microscopic view of how two sources relate to each
               other. The format is intended to allow you to specify exactly where, how, and why two
               transcriptions align, and to do so on the most granular level possible. TAN-A-tok
               files also allow you to express levels of confidence or alternative opinions.</para>
            <para>Creators and editors of TAN-A-tok files should be able to read the languages of
               their sources and to explain as precisely as possible the relationship between the
               two sources. You should be prepared to think about and specify types of textual
               reuse. TAN-A-tok files tend to be more demanding to create and edit than TAN-A-div
               files are because they reflect work that is more detailed, and therefore more
               time-consuming, than simple en masse alignment of sources.</para>
            <para>Because of the detailed nature of the inquiry, token alignment is restricted to
               two texts, referred to jointly as a <emphasis role="italic">bitext</emphasis>. Each
               half of the bitext must be a TAN-T(EI) file. It is assumed that those two sources
               share some special relationship, direct or indirect, and relate through one or more
               types of textual reuse: translation, paraphrase, commentary, and so forth. Some of
               these bitexts, such as literal translations, may line up quite nicely word for word.
               Others, such as paraphrases, may line up sporadically, vaguely, ambiguously, or, in
               places, not at all. So alignment of a bitext is oftentimes not easy, and requires you
               to think hard about assumptions you have made in two key areas: the relationship that
               holds from one source's scriptum to the other and the types of reuse that were
               involved in turning one version into the other (or a common ancestor into
               both).</para>
            <para><emphasis role="bold">Relationship of sources' scripta</emphasis>. What is the
               physical relationship or history that connects the two sources' scripta? Is one a
               direct descendant (copy) of the other? If not, where is their common ancestor? Here
               you consider the material aspect of the bitext: you are trying to answer how
               object A's text relates to object B's, which goes a long way toward explaining the
               relationship that holds between the immaterial texts.</para>
            <para><emphasis role="bold">Types of reuse</emphasis>. What categories of text reuse do
               you hold to? Such a declaration tells users of your data what paradigm you bring to
               your analysis. You may wish to keep your categories nondescript and somewhat vague,
               using loosely defined concepts such as <emphasis>translation</emphasis>,
                  <emphasis>paraphrase</emphasis>, <emphasis>quotation</emphasis>, and so forth
               without offering a specific definition. On the other hand, you may have a specific
               and detailed view of text reuse. Perhaps you have adopted field-specific categories
               such as <emphasis>obligatory explicitation</emphasis>, <emphasis>optional
                  explicitation</emphasis>, <emphasis>pragmatic explicitation</emphasis>, or
                  <emphasis>translation-inherent explicitation</emphasis>. You may also wish to
               declare secondary types of reuse that may have intervened, such as <emphasis
                  role="italic">scribal omission</emphasis> or <emphasis role="italic"
                  >dittography</emphasis>. You must declare at least one type of reuse, either one
               you define yourself or one of those built into the TAN format. See <xref
                  xlink:href="#keywords-reuse-type"/>.</para>
            <section>
               <title>Root Element and Header</title>
               <para>The root element of a token-based alignment file is <code><link
                        linkend="element-TAN-A-tok">&lt;TAN-A-tok></link></code>.</para>
               <para>The TAN-A-tok header builds upon the core and class 2 headers (see <xref
                     linkend="metadata_head"/> and <xref linkend="class_2_metadata"/>).</para>
               <para>TAN-A-tok files take exactly two <code><link linkend="element-source"
                        >&lt;source></link></code>s. The sequence is arbitrary. Each <code><link
                        linkend="element-source">&lt;source></link></code> must take an <code><link
                        linkend="attribute-xmlid">@xml:id</link></code>.</para>
               <para><code><link linkend="element-declarations">&lt;declarations></link></code>
                  takes, in addition to all the elements allowed in class 2 files (see <xref
                     linkend="class_2_metadata"/>), two elements unique to TAN-A-tok: <code><link
                        linkend="element-bitext-relation">&lt;bitext-relation></link></code> and
                        <code><link linkend="element-reuse-type">&lt;reuse-type></link></code>. The
                  former describes the genealogical relationship between the sources' scripta. The
                  latter attends to the qualitative aspect of the bitext relationship.</para>
            </section>
            <section>
               <title>Data (<code><link linkend="element-body">&lt;body></link></code>)</title>
               <para>The <code><link linkend="element-body">&lt;body></link></code> of a TAN-A-tok
                  file takes, in addition to the customary optional attributes (see <code><link
                        linkend="attribute-in-progress">@in-progress</link></code> and <xref
                     linkend="edit_stamp"/>), required <code><link
                        linkend="attribute-bitext-relation">@bitext-relation</link></code> and
                        <code><link linkend="attribute-reuse-type">@reuse-type</link></code>, which
                  take one or more id references from <code><link linkend="element-bitext-relation"
                        >&lt;bitext-relation></link></code> and <code><link
                        linkend="element-reuse-type">&lt;reuse-type></link></code>, indicating the
                  default values that govern the alignment. </para>
               <para><code><link linkend="element-body">&lt;body></link></code> has only one type of
                  child: one or more <code><link linkend="element-align">&lt;align></link></code>s,
                  each of which collects sets of <code><link linkend="element-tok"
                     >&lt;tok></link></code>s from one or both sources, known collectively as a
                     <emphasis role="italic">token cluster</emphasis>. Each token cluster in a given
                  TAN-A-tok file is valid independent of any other token cluster. Clusters may
                  overlap, to handle translations in which words fall in one-to-one, one-to-many,
                  many-to-one, and many-to-many relationships. The independence of token clusters
                  allows you to register differences of opinion about the same set of tokens. An
                        <code><link linkend="element-align">&lt;align></link></code> may take an
                        <code><link linkend="attribute-xmlid">@xml:id</link></code>, to facilitate
                  external discussions about an assertion.</para>
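               <para>A sketch of a small body with two overlapping token clusters follows. All ids
                  and references are hypothetical: <code>rel</code> and <code>trans</code> stand for
                  ids declared in the header, <code>grc</code> and <code>eng</code> for the two
                  sources. The use of <code><link linkend="attribute-src">@src</link></code> on each
                        <code><link linkend="element-tok">&lt;tok></link></code> is an assumption
                  made for the illustration.</para>
               <para>
                  <programlisting>&lt;body bitext-relation="rel" reuse-type="trans">
   &lt;!-- one-to-many: one Greek token rendered by two English tokens -->
   &lt;align xml:id="a1">
      &lt;tok src="grc" ref="1:1" pos="1"/>
      &lt;tok src="eng" ref="1:1" pos="1 - 2"/>
   &lt;/align>
   &lt;!-- an overlapping, independent cluster that reuses the same Greek token -->
   &lt;align>
      &lt;tok src="grc" ref="1:1" pos="1 - 2"/>
      &lt;tok src="eng" ref="1:1" pos="1 - 3"/>
   &lt;/align>
&lt;/body></programlisting>
               </para>
               <para>Neither cluster invalidates the other; each stands as a separate
                  assertion.</para>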
               <para>Nothing should be inferred from silence in a TAN-A-tok file. Unmentioned tokens
                  in either source do not represent gaps in a translation. All that can be inferred
                  is that the creators and editors of the TAN-A-tok file have said nothing about the
                  tokens. </para>
               <para>If you wish to declare that one or more words in one source were left out of a
                  translation or inserted into one—that is, words in one source have no match in the
                  other—you must do so through a <emphasis role="italic">half-null
                     alignment</emphasis>, i.e., a token cluster that has tokens from only one
                  source. A half-null alignment corresponds—to draw from the terminology of
                  translation studies—to implicitation or explicitation of entire words or
                  phrases.</para>
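               <para>As a sketch, a half-null alignment stating that the fourth token of a passage
                  in a hypothetical source <code>eng</code> has no counterpart in the other source
                  might look like this (again assuming <code><link linkend="attribute-src"
                        >@src</link></code> on <code><link linkend="element-tok"
                     >&lt;tok></link></code>):</para>
               <para>
                  <programlisting>&lt;align>
   &lt;!-- tokens from one source only: nothing in the other source matches -->
   &lt;tok src="eng" ref="1:1" pos="4"/>
&lt;/align></programlisting>
               </para>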
               <para>A fully aligned bitext may result in a TAN-A-tok file with a very long
                        <code><link linkend="element-body">&lt;body></link></code> (in contrast to
                  the typical TAN-A-div file). That does not mean, however, that everything in a
                  source <emphasis>must</emphasis> be encoded or described. In writing and editing a
                  TAN-A-tok file you do not commit yourself to saying everything possible about the
                  bitext. You might choose to encode only a few token clusters.</para>
               <para>If there are multiple IDs in <code><link linkend="attribute-reuse-type"
                        >@reuse-type</link></code> or <code><link
                        linkend="attribute-bitext-relation">@bitext-relation</link></code>, the
                  intersection, not the union, of those values is to be understood. For example,
                     <code>reuse-type="trans para"</code> would indicate that the token cluster
                  results from both translation and paraphrase. If you wish to claim that the token
                  cluster might be a translation or it might be a paraphrase, then you should create
                  two separate alignments, and add <code><link linkend="attribute-cert"
                     >@cert</link></code> to each.</para>
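               <para>The two-alignment approach might be sketched as follows. This assumes that an
                        <code><link linkend="element-align">&lt;align></link></code> may override
                  the default <code><link linkend="attribute-reuse-type"
                     >@reuse-type</link></code> set on <code><link linkend="element-body"
                        >&lt;body></link></code>, and that the certainty value is a decimal from 0
                  to 1; both points, like the ids and references, are assumptions made for the
                  illustration.</para>
               <para>
                  <programlisting>&lt;!-- opinion 1: the cluster is a translation -->
&lt;align reuse-type="trans" cert="0.6">
   &lt;tok src="grc" ref="1:1" pos="3"/>
   &lt;tok src="eng" ref="1:1" pos="4 - 5"/>
&lt;/align>
&lt;!-- opinion 2: the same cluster is a paraphrase -->
&lt;align reuse-type="para" cert="0.4">
   &lt;tok src="grc" ref="1:1" pos="3"/>
   &lt;tok src="eng" ref="1:1" pos="4 - 5"/>
&lt;/align></programlisting>
               </para>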
            </section>
         </section>
         <section xml:id="tan-lm">
            <title>Lexico-Morphology</title>
            <para>TAN-LM files are used to associate words or word fragments with lexemes and
               morphological categories. They are intended primarily to facilitate research that
               depends upon alignments, but they can be valuable on their own, whether or not there
               are other versions or alignments.</para>
            <para>These files rely upon the grammatical rules defined for a given language in a
               TAN-mor file. Therefore this section should be read in close conjunction with its
               companion: <xref linkend="TAN-mor"/>.</para>
            <section>
               <title>Principles and Assumptions</title>
               <para>TAN-LM files are assumed to be applicable to texts in languages whose
                  vocabulary lends itself to grammatical and lexicographical analysis. The two areas
                  are interrelated but independent. If you wish, your TAN-LM file may contain only
                  lexemes or only morphological analyses.</para>
               <para>As an editor of a TAN-LM file you should understand the vocabulary and grammar
                  of the languages you have picked. You should have a good sense of the rules
                  established by the lexical and grammatical authorities you have chosen to follow.
                  You should be familiar with the conventions and assumptions of the TAN-mor files
                  you have adopted.</para>
               <para>Although you must assume the point of view of a particular grammar and lexicon,
                  you need not define those authorities, nor hold to a single one. In addition, you
                  may bring to lexical analysis your own expertise and supply lexical headwords
                  unattested in printed authorities.</para>
               <para>Although TAN-LM files are simple, they can be laborious to write and edit, more
                  than other types of TAN files. They can also be hard to read if the underlying
                  TAN-mor files use cryptic codes. It is customary for an editor of a TAN-LM file to
                  use tools to help create and edit the data.</para>
            </section>
            <section>
               <title>Root Element and Header</title>
               <para>The root element of a lexico-morphological file is TAN-LM.</para>
               <para>TAN-LM files are either source-specific or language-specific. In the case of
                  the former, <code><link linkend="element-source">&lt;source></link></code> points
                  to the one and only TAN-T(EI) file that is the object of analysis. In the case of
                  the latter, <code><link linkend="element-for-lang">&lt;for-lang></link></code> is
                  used to indicate the languages that are covered.<note>
                     <para>If the language-specific option is exercised, the file must point to
                        TAN-LM-lang schema files. See <xref xlink:href="#structure"/>.</para>
                  </note></para>
               <para><code><link linkend="element-declarations">&lt;declarations></link></code>
                  takes the elements common to class 2 files (see <xref linkend="class_2_metadata"
                  />). It also takes two elements unique to TAN-LM: <code><link
                        linkend="element-lexicon">&lt;lexicon></link></code> (optional) and
                        <code><link linkend="element-morphology">&lt;morphology></link></code>
                  (mandatory). Any number of lexica and morphologies may be declared; the order is
                  inconsequential. </para>
               <para>There is, at present, no TAN format for lexica and dictionaries, although this
                  may change in the future. So even if a digital form of a dictionary is identified
                  through the <xref linkend="digital_entity_metadata"/>, no validation tests will be
                  performed. </para>
               <para>You may find a non-TAN lexical model to be a suitable supplement to any TAN
                  collections you develop. The <link
                     xlink:href="http://www.tei-c.org/release/doc/tei-p5-doc/en/html/DI.html">TEI
                     supports dictionary encoding</link>, and the <link
                     xlink:href="http://www.lexicalmarkupframework.org/">Lexical Markup
                     Framework</link>, an ISO standard (ISO-24613:2008), has defined a data model
                  for lexicons and dictionaries. The former is geared toward philology and the
                  latter toward linguistics. You may also devise your own format if neither of these
                  supports aspects of lexicology that you find important.</para>
               <para>Because you or other TAN-LM editors are likely to be authorities in your own
                  right, <code><link linkend="element-agent">&lt;agent&gt;</link></code> can be
                  treated as if a <code><link linkend="element-lexicon">&lt;lexicon></link></code>,
                  and be referred to by <code><link linkend="attribute-lexicon"
                     >@lexicon</link></code> in the <code><link linkend="element-body"
                        >&lt;body&gt;</link></code>.</para>
            </section>
            <section>
               <title>Data (<code><link linkend="element-body">&lt;body></link></code>)</title>
               <para>The <code><link linkend="element-body">&lt;body></link></code> of a TAN-LM file
                  takes, in addition to the customary optional attributes found in other TAN files
                  (see <code><link linkend="attribute-in-progress">@in-progress</link></code> and
                     <xref linkend="edit_stamp"/>), <code><link linkend="attribute-lexicon"
                        >@lexicon</link></code> and <code><link linkend="attribute-morphology"
                        >@morphology</link></code>, to specify the default lexicon and grammar for
                  the file. <code><link linkend="attribute-lexicon">@lexicon</link></code> may point
                  either to a <code><link linkend="element-lexicon">&lt;lexicon></link></code> id or
                  to an <code><link linkend="element-agent">&lt;agent></link></code> id (when
                  someone editing the TAN file is an authority).</para>
               <para><code><link linkend="element-body">&lt;body></link></code> has only one type of
                  child: one or more <code><link linkend="element-ana">&lt;ana></link></code>s
                  (short for analysis), each of which matches one or more tokens (<code><link
                        linkend="element-tok">&lt;tok&gt;</link></code>) to one or more lexemes or
                  morphological assertions (<code><link linkend="element-lm"
                     >&lt;lm&gt;</link></code>, which takes <code><link linkend="element-l"
                        >&lt;l&gt;</link></code>s and <code><link linkend="element-m"
                        >&lt;m&gt;</link></code>s). </para>
               <para>If due to tokenization a linguistic token must occupy more than one <code><link
                        linkend="element-tok">&lt;tok></link></code>, you may use <code><link
                        linkend="attribute-cont">@cont</link></code> to group <code><link
                        linkend="element-tok">&lt;tok></link></code>s together. </para>
               <para>Elements within an <code><link linkend="element-ana">&lt;ana&gt;</link></code>
                  are distributed, to allow economically sized files. That is, every combination of
                        <code><link linkend="element-l">&lt;l&gt;</link></code> and <code><link
                        linkend="element-m">&lt;m&gt;</link></code> (governed by <code><link
                        linkend="element-lm">&lt;lm&gt;</link></code>) is asserted to be true for
                  every <code><link linkend="element-tok">&lt;tok></link></code>. </para>
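               <para>The distribution described above can be pictured with a minimal sketch. The
                  attributes on <code><link linkend="element-tok">&lt;tok></link></code> and the
                  codes inside <code><link linkend="element-m">&lt;m></link></code> are illustrative
                  assumptions, not values prescribed by these guidelines; in practice the codes
                  would be defined by whatever TAN-mor file the TAN-LM file declares.</para>
               <programlisting><![CDATA[<ana>
   <!-- Two tokens share the analyses below. The @ref and @pos values are hypothetical. -->
   <tok ref="1" pos="3"/>
   <tok ref="2" pos="7"/>
   <lm>
      <l>down</l>
      <!-- Hypothetical codes standing for "adverb" and "noun". -->
      <m>adv</m>
      <m>n</m>
   </lm>
</ana>]]></programlisting>
               <para>Read distributively, this sketch makes four assertions: each of the two tokens
                  is the lexeme "down" as an adverb, and each is the same lexeme as a noun.</para>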
               <para>Many TAN-LM files will be populated by a stylesheet or other algorithm that
                  automatically calculates the possible morphological values of each token, for
                  example, "down" being marked as an adjective, an adverb, a noun, and a verb. In
                  this case, you do not wish to claim that a word really is every combination
                  generated. But you do wish to leave open the possibility for cases where such
                  ambiguity must be expressed (e.g., "down" in "Get down off a duck." being equally
                  a noun and adverb). It is advised that automatically calculated results always
                  include <code><link linkend="attribute-cert">@cert</link></code> with weighted
                  values that sum to 1 for each token.</para>
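               <para>A minimal sketch of such weighting follows. It assumes, for illustration only,
                  that <code><link linkend="attribute-cert">@cert</link></code> is placed on each
                     <code><link linkend="element-lm">&lt;lm></link></code>; the token reference and
                  the morphological codes are likewise hypothetical.</para>
               <programlisting><![CDATA[<ana>
   <!-- One token, two competing analyses produced by an algorithm. -->
   <tok ref="1" pos="2"/>
   <!-- The weights are hypothetical; 0.5 + 0.5 = 1 for this token. -->
   <lm cert="0.5">
      <l>down</l>
      <m>adv</m>
   </lm>
   <lm cert="0.5">
      <l>down</l>
      <m>n</m>
   </lm>
</ana>]]></programlisting>
               <para>An editor who later reviews the token can raise one weight and lower or remove
                  the other, so that the file distinguishes machine-generated possibilities from
                  considered judgments.</para>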
            </section>
         </section>
      </chapter>
      <chapter xml:id="class_3">
         <title>Class-3 TAN Files, Varia</title>
         <para>This chapter provides general background to the elements and attributes that are
            unique to all class 3 TAN files. For detailed discussion see <xref
               linkend="elements-attributes-and-patterns"/>.</para>
         <para>Class 3 TAN formats are those that do not fit either class 1 or 2. This class, at
            present, consists of keywords, of RDF-like claims, and of rules pertaining to
            morphology.</para>
         <section xml:id="tan-key">
            <title>Keyword Vocabulary (<code>TAN-key</code>)</title>
            <para>All too often, a project has a set of vocabulary it draws from time and again. To
               repeat the <xref xlink:href="#pattern-iri_and_name"/> can not only be tedious, it can
               be treacherous, especially when a project decides to change or augment its
               vocabulary, and does so inconsistently or incompletely.</para>
            <para>The TAN-key format is intended to allow a project to define the IRI + name
               patterns for things that it regularly names, to be applied to any element that takes
                  <link linkend="attribute-which"><code>@which</code></link>. For example, it is a
               suitable way to gather the IRI + name patterns for the people who worked on a
               project, or to define special kinds of div types. </para>
            <para>TAN-key files are a core part of the TAN schema, defining commonly used concepts
               in <code><link linkend="element-token-definition"
               >&lt;token-definition></link></code>, <link linkend="element-div-type"
                     ><code>&lt;div-type></code></link>s, and so forth. For a complete list of
               predefined TAN keywords, see <xref linkend="keywords-master-list"/>.</para>
            <para>For more details on how this format relates to other TAN formats, see <xref
                  linkend="inclusions-and-keys"/>.</para>
            <section>
               <title>Root Element and Head</title>
               <para>A TAN-key file has <code><link linkend="element-TAN-key"
                     >&lt;TAN-key&gt;</link></code> as the root element.</para>
               <para>The <code><link linkend="element-declarations"
                     >&lt;declarations&gt;</link></code> of a TAN-key file will be empty, or have
                        <code><link linkend="element-group-type">&lt;group-type&gt;</link></code>s. </para>
            </section>
            <section>
               <title>Data (<code><link linkend="element-body">&lt;body></link></code>)</title>
               <para>The <code><link linkend="element-body">&lt;body></link></code> of a TAN-key
                  file consists simply of <code><link linkend="element-item"
                     >&lt;item&gt;</link></code>s, perhaps gathered into groups via <code><link
                        linkend="element-group">&lt;group&gt;</link></code> or <code><link
                        linkend="attribute-group">@group</link></code>. These groups have, at
                  present, no effect upon other TAN files that import them. They have been useful,
                  however, in more advanced uses of the format, particularly in the case of the
                  standard TAN-key file for <code><link linkend="element-div-type"
                        >&lt;div-type&gt;</link></code> (<link
                     xlink:href="../TAN-key/div-types.TAN-key.xml"/>), where common types of
                  divisions have been given a rudimentary typology suitable for transformations into
                  other formats.</para>
               <para>Most frequently, a TAN-key file will contain items that have the IRI + name
                  pattern. The only exception is when it contains <code><link
                        linkend="element-token-definition"
                  >&lt;token-definition&gt;</link></code>s.</para>
            </section>
         </section>
         <section xml:id="TAN-mor">
            <title>Morphological Concepts and Patterns (TAN-mor)</title>
            <para>TAN-mor files are used to describe the grammatical morphological features of a
               given language, to assign codes to those features, and to define rules governing the
               application of those codes. The format allows specificity, flexibility, and
               responsiveness. Assertions in the format may be doubted, rules may be expressed as
               contingent upon other conditions, and warnings and error messages may be sent to
               users who have used a pattern incorrectly, or not in accordance with best
               practices.</para>
            <para>The TAN-mor format is like Schematron for the grammar of human languages. You
               specify the categories and codes for a given language, then you may create tests to
               define invalid uses of those codes. Those tests are attached to reports and
               assertions allowing editors of TAN-LM files to see not only if the rules have been
               violated, but why, and exactly where.</para>
            <para>This chapter should be read in close conjunction with <xref linkend="tan-lm"
               />.</para>
            <section>
               <title>Principles and Assumptions</title>
               <para>Certain assumptions and recommendations are made regarding morphology files,
                  complementing the more general ones; see <xref linkend="design_principles"
                  />.</para>
               <para>TAN-mor files are restricted exclusively to describing the categories and rules
                  for the grammar of a natural language. Editors of these files should be well
                  versed in the grammar of the languages they are describing.</para>
               <para>The TAN-mor format has been designed with the assumption that patterns of word
                  inflection and formation can be categorized, classified, named, and described. It
                  has also been assumed that scholars may reasonably differ, perhaps radically, on
                  those descriptions. TAN-mor is meant to allow those differences to be declared. It
                  is up to other users to decide whether or not to adopt them.</para>
               <para>The TAN-mor format has also been designed to cater to two different approaches
                  to morphological codes: structured or unstructured. </para>
               <para>Structured codes begin with a set of major categories used to group morphological
                  features. Structured codes tend to have a set number of code elements, and usually
                  require gaps in the code. For example, the Perseus approach to the morphological
                  categories of Greek, Latin, and other highly inflected languages dictates ten
                  categories, with the first two being the major and minor parts of speech, and the
                  subsequent categories devoted to person, number, tense, and so forth. Each word
                  that is analyzed must have a value, even if null.</para>
               <para>Unstructured codes do not attempt to categorize grammatical features, but
                  simply give each one a unique code, to be applied in any permitted sequence and
                  combination. This approach is viable for any language (including highly inflected
                  ones such as Greek or Latin), but it is most often found in tagging sets for
                  languages that have little inflection, e.g., the Brown and Penn sets for
                  English.</para>
            </section>
            <section>
               <title>Root Element and Header</title>
               <para>The root element of a morphological rule file is <code><link
                        linkend="element-TAN-mor">&lt;TAN-mor></link></code>.</para>
               <para>Zero or more <code><link linkend="element-source">&lt;source></link></code>
                  elements describe the grammars or related works that account for the rules
                  declared in the TAN file. If the rules are not based upon any published work, then
                        <code><link linkend="element-source">&lt;source></link></code> may be
                  omitted. Any TAN-mor file without a source will be assumed to be based upon the
                  personal knowledge of the <code><link linkend="element-agent"
                     >&lt;agent></link></code>s who edited the file.</para>
               <para><code><link linkend="element-declarations">&lt;declarations></link></code> is
                  empty. </para>
            </section>
            <section>
               <title>Data (<code><link linkend="element-body">&lt;body></link></code>)</title>
               <para>The <code><link linkend="element-body">&lt;body></link></code> of a TAN-mor
                  file takes the customary optional attributes found in other TAN files (see
                        <code><link linkend="attribute-in-progress">@in-progress</link></code> and
                     <xref linkend="edit_stamp"/>). </para>
               <para>The children of <code><link linkend="element-body">&lt;body></link></code>
                  begin with one or more <code><link linkend="element-for-lang"
                     >&lt;for-lang></link></code>s, followed by any number of <code><link
                        linkend="element-assert">&lt;assert></link></code>s, <code><link
                        linkend="element-report">&lt;report></link></code>s, <code><link
                        linkend="element-feature">&lt;feature></link></code>s (for unstructured
                  codes), or <code><link linkend="element-category">&lt;category></link></code>s (if
                  relying upon structured codes). </para>
               <para><code><link linkend="element-category">&lt;category></link></code>, used for
                  structured codes, sorts <link linkend="element-feature"
                     ><code>&lt;feature></code></link>s into groups. <code><link
                        linkend="attribute-code">@code</link></code> values must be unique within a
                        <code><link linkend="element-category">&lt;category></link></code>, but may
                  duplicate the <code><link linkend="attribute-code">@code</link></code> values of
                     <link linkend="element-feature"><code>&lt;feature></code></link>s from other
                        <code><link linkend="element-category">&lt;category></link></code>s. The
                  first <link linkend="element-feature"><code>&lt;feature></code></link> in a
                        <code><link linkend="element-category">&lt;category></link></code> describes
                  the category itself, and is not a <link linkend="element-feature"
                        ><code>&lt;feature></code></link> like the others.</para>
               <para>The values and combinations of <link linkend="element-feature"
                        ><code>&lt;feature></code></link>s (or rather of the <code><link
                        linkend="attribute-code">@code</link></code>s of <link
                     linkend="element-feature"><code>&lt;feature></code></link>s) can be constrained
                  through <code><link linkend="element-assert">&lt;assert></link></code>s and
                        <code><link linkend="element-report">&lt;report></link></code>s, which are
                  used to declare rules that must be followed, or must never be followed, by any
                  dependent TAN-LM file. </para>
               <para>An <code><link linkend="element-assert">&lt;assert></link></code> and
                        <code><link linkend="element-report">&lt;report></link></code> may be
                  restricted to specific features through <code><link linkend="attribute-context"
                        >@context</link></code>. If <code><link linkend="attribute-context"
                        >@context</link></code> is present, then <code><link
                        linkend="element-assert">&lt;assert></link></code> and <code><link
                        linkend="element-report">&lt;report></link></code> declarations will be
                  checked in a TAN-LM file only against values of <code><link linkend="element-m"
                        >&lt;m></link></code> that invoke the feature; otherwise, all <code><link
                        linkend="element-m">&lt;m></link></code>s will be tested. Four kinds of
                  tests are allowed:</para>
               <para>
                  <itemizedlist>
                     <listitem>
                        <para><code><link linkend="attribute-matches-m">@matches-m</link></code>:
                           indicates a regular expression pattern to be checked against the code in
                           an <code><link linkend="element-m">&lt;m></link></code>.</para>
                     </listitem>
                     <listitem>
                        <para><code><link linkend="attribute-matches-tok"
                           >@matches-tok</link></code>: indicates a regular expression pattern to be
                           checked against the tokens picked by the values of <code><link
                                 linkend="element-tok">&lt;tok></link></code> in a dependent TAN-LM
                           file.</para>
                     </listitem>
                     <listitem>
                        <para><code><link linkend="attribute-feature-test"
                              >@feature-test</link></code>: indicates features to be checked in the
                           content of <code><link linkend="element-m">&lt;m></link></code>s.</para>
                     </listitem>
                     <listitem>
                        <para><code><link linkend="attribute-feature-qty-test"
                                 >@feature-qty-test</link></code>: indicates the number of features
                           to be checked in the content of <code><link linkend="element-m"
                                 >&lt;m></link></code>s.</para>
                     </listitem>
                  </itemizedlist>
               </para>
               <para>An <code><link linkend="element-assert">&lt;assert></link></code> indicates
                  that for any <code><link linkend="element-m">&lt;m></link></code> in any dependent
                  TAN-LM file, if the test proves false, and if the <code><link linkend="element-m"
                        >&lt;m></link></code> has a feature declared in <code><link
                        linkend="attribute-context">@context</link></code>, then the <code><link
                        linkend="element-m">&lt;m></link></code> should be marked as erroneous (or
                  merely a warning should be returned, if <code><link linkend="attribute-cert"
                        >@cert</link></code> is present) and the message included by the <code><link
                        linkend="element-assert">&lt;assert></link></code> should be
                  returned.</para>
               <para><code><link linkend="element-report">&lt;report></link></code> has the same
                  effect, but the role of the test is the opposite: the error and message will be
                  returned only if the test proves true.</para>
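               <para>The following sketch suggests how features and rules might sit together in the
                     <code><link linkend="element-body">&lt;body></link></code> of a TAN-mor file
                  that uses unstructured codes. It is an illustration under stated assumptions, not
                  a normative example: the codes, the IRIs, the use of <code>&lt;IRI></code> and
                     <code>&lt;name></code> children, the form of the <code><link
                        linkend="attribute-context">@context</link></code> value, and the placement
                  of the message as element content are all hypothetical.</para>
               <programlisting><![CDATA[<body>
   <for-lang>eng</for-lang>
   <!-- Hypothetical features, each given a unique code. -->
   <feature code="n">
      <IRI>tag:example.org,2017:feature:noun</IRI>
      <name>noun</name>
   </feature>
   <feature code="sg">
      <IRI>tag:example.org,2017:feature:singular</IRI>
      <name>singular</name>
   </feature>
   <feature code="pl">
      <IRI>tag:example.org,2017:feature:plural</IRI>
      <name>plural</name>
   </feature>
   <!-- assert: the message is returned if the test proves false.
        Here the test applies only to an <m> that invokes the noun feature. -->
   <assert context="n" matches-m="sg|pl">A noun should be marked for number.</assert>
   <!-- report: the message is returned if the test proves true. -->
   <report matches-m="sg.*pl|pl.*sg">A form cannot be both singular and plural.</report>
</body>]]></programlisting>
               <para>Under these assumptions, an <code><link linkend="element-m"
                     >&lt;m></link></code> containing only <code>n</code> would trigger the
                  assertion's message, and one containing <code>n sg pl</code> would trigger the
                  report's.</para>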
            </section>
         </section>
         <section xml:id="tan-c">
            <title>Claims and assertions (<code>TAN-c</code>)</title>
            <para>Many projects using the TAN format will need to include in their workflow general
               declarations that do not fit one of the TAN formats. In many cases, there are
               adequate formats that are available. At other times, you may want to encode your
               information in a format much like your other TAN files. For those cases, an
               experimental format, TAN-c, is provided.</para>
            <para>The model is inspired by the Resource Description Framework (RDF; see <xref
                  linkend="rdf_and_lod"/>). RDF depends upon a simple data model, where each datum
               consists of three items termed a subject, a predicate, and an object. The first and
               third are thought of as nodes, and the second as a connector between the nodes.<note>
                  <para>A connector, our preferred term, is frequently elsewhere called an edge, but
                     that metaphor is confusing and misleading. A cylinder, for example, has two
                     edges, but they don't connect anything we might think of as nodes. Furthermore,
                     "edge" implies that what's really of interest is the surface of a
                     three-dimensional object and the void beyond.</para>
               </note></para>
            <para>TAN was designed to serve scholars, who normally find simple declarative
               sentences—the strength of RDF—highly restrictive, absent any context or qualifiers.
               Claims always have a claimant. They are made at certain times, and are subject to
               doubt and nuance. Sometimes our claims are bare negation, e.g., "Aristotle was not
               the author of <emphasis>De mundo</emphasis>"—an assertion not possible to express in
               RDF.</para>
            <para>TAN-c is conceived as a slightly more complex version of RDF. Every claim must be
               assigned to a claimant. The RDF terminology subject + predicate + object is adjusted
                by TAN-c to subject + verb + object. Furthermore, claims may be tempered by
                certainty, and verbs may be modified by modals. The entire claim may be restricted to
                a particular time or place. If the object is data, the data can be restricted by
                type and lexical form. Despite being somewhat more complex than RDF, TAN-c syntax is
               more human readable. </para>
            <section>
               <title>Root Element and Header</title>
               <para>The root element of a TAN-c file is <code><link linkend="element-TAN-c"
                        >&lt;TAN-c&gt;</link></code>.</para>
               <para>The <code><link linkend="element-declarations"
                     >&lt;declarations&gt;</link></code> takes <code><link linkend="element-modal"
                        >&lt;modal&gt;</link></code>, <code><link linkend="element-person"
                        >&lt;person&gt;</link></code>, <code><link linkend="element-place"
                        >&lt;place&gt;</link></code>, <code><link linkend="element-unit"
                        >&lt;unit&gt;</link></code>, <code><link linkend="element-verb"
                        >&lt;verb&gt;</link></code>, and <code><link linkend="element-version"
                        >&lt;version&gt;</link></code>, all of which are described more thoroughly
                  at <xref xlink:href="#elements-attributes-and-patterns"/>. Collectively, they
                  provide the vocabulary that can be used in the <code><link linkend="element-body"
                        >&lt;body></link></code> of the file.</para>
            </section>
            <section>
               <title>Data (<code><link linkend="element-body">&lt;body></link></code>)</title>
               <para>The <code><link linkend="element-body">&lt;body></link></code> takes a required
                        <code><link linkend="attribute-claimant">@claimant</link></code> and
                        <code><link linkend="attribute-subject">@subject</link></code>, which define
                  the default values for the rest of the data.</para>
               <para>The rest of <code><link linkend="element-body">&lt;body></link></code> consists
                  of a series of <code><link linkend="element-claim">&lt;claim&gt;</link></code>s. </para>
               <para><code><link linkend="element-claim">&lt;claim&gt;</link></code>s are allowed to
                  nest. That is, it is possible to claim that X claims that Y claims that Z claims
                  that… by nesting <code><link linkend="element-claim">&lt;claim&gt;</link></code>s
                  within each other.</para>
            </section>
         </section>
      </chapter>
      <xi:include href="inclusions/elements-attributes-and-patterns.xml"/>
      <xi:include href="inclusions/keywords.xml"/>
   </part>
   <part xml:id="working_with_tan">
      <title>Working with the Text Alignment Network</title>
      <chapter>
         <title>Best Practices in Working with TAN Files</title>
         <para>In this chapter we discuss ways to manage, create, edit, and share TAN files. The
            material discussed here is non-normative. That is, these are suggestions based upon the
            experience, particularly the mistakes, of TAN users. The material is written for
            intermediate or advanced users of XML technology.</para>
         <section>
            <title>Local Setup</title>
            <para>TAN files may be set up in any kind of structure one wishes, but because those
               files are meant to be shared, it is beneficial to use similar conventions, to
               minimize the possibility of breaking relative URLs in shared TAN files.</para>
            <para>Below is one way to organize the subdirectories of a typical local TAN
               project:</para>
            <para>
               <itemizedlist>
                  <listitem>
                     <para><code>library</code></para>
                     <para>
                        <itemizedlist>
                           <listitem>
                              <para>[collection 1]—TAN-T(EI) files here</para>
                              <para>
                                 <itemizedlist>
                                    <listitem>
                                       <para><code>TAN-A-div</code> —TAN-A-div files here</para>
                                    </listitem>
                                    <listitem>
                                       <para><code>TAN-A-tok</code>—TAN-A-tok files here</para>
                                    </listitem>
                                    <listitem>
                                       <para>[etc.]</para>
                                    </listitem>
                                 </itemizedlist>
                              </para>
                           </listitem>
                           <listitem>
                              <para>[collection 2]</para>
                           </listitem>
                           <listitem>
                              <para>[etc.]</para>
                           </listitem>
                        </itemizedlist>
                     </para>
                  </listitem>
                  <listitem>
                     <para><code>output</code>—saved results from transformations, tests</para>
                  </listitem>
                  <listitem>
                     <para><code>pre-TAN</code>—third-party files to be used to populate TAN
                        files</para>
                  </listitem>
                  <listitem>
                     <para><code>TAN-1-dev</code> —the core TAN files, downloaded from the website
                        or the Git repository</para>
                  </listitem>
                  <listitem>
                     <para><code>stylesheets</code>—stylesheets you have created</para>
                  </listitem>
                  <listitem>
                     <para><code>tools</code>—third-party tools</para>
                  </listitem>
               </itemizedlist>
            </para>
            <para>Under this model, any time you decide to develop a collection of TAN files, you
               create a subdirectory within the library. It is a good idea to try to keep these
               collections to a manageable size, although it cannot be predicted what the limits
               might be. If you use Git, each of these collections could be its own Git repository.
               This is also where you would put other people's TAN collections. Collections
               inevitably need to "talk" to each other, so it is a good idea to name collection
               subdirectories as predictably and briefly as possible, preferably a single word in
               lowercase. For example, scriptural collections could be named simply
                  <code>bible</code> or <code>quran</code>, although you may find a need to add a
               suffix if you are working with overlapping TAN collections.</para>
            <para>When you name class 1 files (the filename, not the IRI name; see <xref
                  xlink:href="#iri_name"/>), it is a good idea to start with an acronym for the
               work, followed by the language code, the editor's last name, and perhaps the date
               when the underlying scriptum was created or published. Class 2 files are tougher.
               Because they bring two or more files or concepts together, filenames could become
               very long or unpredictably structured. At this time, the best recommendation is to
               make sure that each class 2 file is put into a subdirectory, separate from class 1
               files, given a brief but meaningful name that points to the research question that
               motivated its creation. Class 3 files are a bit easier. It is recommended that TAN-mor
               files begin with the language code, then an acronym for the person or group
               responsible for creating the features. TAN-key and TAN-c files are written generally
               to serve a specific collection, so the collection name and the TAN type should
               suffice.</para>
            <para>If you have a local copy of someone else's TAN collection, and you wish to
               create TAN files that depend on it, you are in all likelihood going to depend upon
               relative URLs to those files. It is recommended that you also include absolute URLs
               through secondary <code><link linkend="element-location"
               >&lt;location></link></code>s. The validation routine checks only the first document
               available. From time to time, you might comment out the first <code><link
                     linkend="element-location">&lt;location></link></code> and run the validation
               process again. If you share your dependent TAN file with someone else who does not
               have a local copy of the collection, the second <code><link
                     linkend="element-location">&lt;location></link></code>, with the absolute URL,
               will furnish a copy of the document.</para>
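            <para>The arrangement just described might look something like the following sketch. The
               file paths, the domain <code>example.org</code>, and the use of an
                  <code>@href</code> attribute are illustrative assumptions, not taken from any real
               collection; consult the definition of <code><link linkend="element-location"
                     >&lt;location></link></code> for the form your version of the schema
               requires.</para>
            <programlisting language="xml"><![CDATA[<source xml:id="example-source">
   <!-- The IRI and name of the source are omitted from this sketch -->
   <!-- Listed first, so used whenever the local copy exists (hypothetical relative path) -->
   <location href="../../other-collection/example.xml"/>
   <!-- Fallback for readers who lack the local copy (hypothetical absolute URL) -->
   <location href="http://example.org/tan/other-collection/example.xml"/>
</source>]]></programlisting>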
         </section>
         <section>
            <title>Creating and maintaining TAN collections</title>
            <para>As noted in the previous section, it is ideal to group your TAN files through
               subdirectories in a master library. Those collections should contain files that
               cohere in some way, but this could be for any number of reasons. TAN is designed to
               encourage cross-linguistic and intertextual research, so what might hold various TAN
               files together is unpredictable.</para>
            <para>In a given project, you are likely to repeat basic information, particularly
               in elements with the <xref linkend="pattern-iri_and_name"/>, such as <code><link
                     linkend="element-agent">&lt;agent></link></code>, <code><link
                     linkend="element-role">&lt;role></link></code>, and <code><link
                     linkend="element-work">&lt;work&gt;</link></code>. Consider moving those to a TAN-key file.
               It is almost always preferable to develop TAN-keys before resorting to <code><link
                     linkend="element-inclusion">&lt;inclusion></link></code>s. Sorting out lines of
               inclusion can be confusing.</para>
         </section>
         <section>
            <title>Creating and editing TAN files</title>
            <para>Converting to TAN from an irregular format can be a chore. Suppose you have a
               Word file, a web page, or plain text that you intend to serve as the basis for a TAN
               file. A common first impulse is to copy the desired content, paste it into the body
               of your TAN file, and then begin to manually correct and change things. Although this
               is the most common approach, it means that if there are changes made to your source,
               you may have an enormous task ahead of you to figure out exactly what was changed
               where. Further, some transformations involve complex processes, and you may find, in
               the course of correcting the intermediary, that you made a major mistake that cannot,
               at that point, be undone. Perhaps you have accidentally deleted all punctuation when
               you didn't mean to. Or you eliminated line breaks that were useful signals about
               where <code><link linkend="element-div">&lt;div></link></code>s should be separated.
               Even if all goes well, after all that hard work you might find out that the
               pre-TAN data source has been updated, with errors corrected. If any significant time
               has elapsed since the last transformation, you may have forgotten what procedure you
               followed to convert the data. And if you remember, you have to repeat the steps
               again, and plan for the next time when the pre-TAN source is updated.</para>
            <para>For all these reasons, it is recommended that data be converted to a TAN file by
               means of an XSLT stylesheet that analyzes and transforms the digital source into data
               that is TAN compliant. When you find mistakes such as those described above, no harm is
               done. You can adjust your algorithm and re-run the process as many times as you need,
               each time getting better and better results. This approach requires extra initial
               work. That is, you will need to get to know XSLT (or an alternative) well.
               Establishing a good transformation process can be time consuming. But the investment
               pays off in the long run. All or part of what you write for one set of files may work
               for the next.</para>
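            <para>As a rough illustration of this approach, the following XSLT 2.0 sketch reads a
               plain-text source and wraps each non-empty line in a numbered <code><link
                     linkend="element-div">&lt;div></link></code>. The parameter name, the file name
                  <code>source.txt</code>, the <code>line</code> value for <code>@type</code>, and
               the choice of one division per line are all assumptions made for the example. The
               default namespace should match the one used in your TAN template, and a real
               conversion would write its output into the body of that template rather than produce
               a bare fragment.</para>
            <programlisting language="xml"><![CDATA[<xsl:stylesheet version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns="tag:textalign.net,2015:ns">
   <xsl:output method="xml" indent="yes"/>

   <!-- Hypothetical location of the pre-TAN source, resolved against the stylesheet -->
   <xsl:param name="source-uri" select="'source.txt'"/>

   <!-- Run with "main" as the initial named template -->
   <xsl:template name="main">
      <xsl:variable name="raw"
         select="unparsed-text(resolve-uri($source-uri, static-base-uri()))"/>
      <!-- Split on line endings; drop lines that are empty or only white space -->
      <xsl:variable name="lines" select="tokenize($raw, '\r?\n')[normalize-space(.)]"/>
      <body>
         <xsl:for-each select="$lines">
            <!-- position() is the ordinal among the retained lines -->
            <div type="line" n="{position()}">
               <xsl:value-of select="normalize-space(.)"/>
            </div>
         </xsl:for-each>
      </body>
   </xsl:template>
</xsl:stylesheet>]]></programlisting>
            <para>Because the source is never edited by hand, a mistake in the splitting or
               numbering logic is corrected in the stylesheet, and the conversion is simply run
               again.</para>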
            <para>Whether or not you use stylesheets to create or populate your TAN files, it is
               almost always best to begin the process with a sample TAN file that resembles, even
               if skeletally, your desired output, then populate it with the proper content. If you
               feed the TAN template along with the pre-TAN data into a stylesheet, the stylesheet
               becomes an <code><link linkend="element-agent">&lt;agent></link></code> in its own
               right. You are encouraged to give your XSLT file a unique identifier, and to stamp
               the resultant TAN file with an <code><link linkend="element-agent"
                  >&lt;agent></link></code>, a <code><link linkend="element-role"
                  >&lt;role></link></code>, and a <code><link linkend="element-change"
                     >&lt;change></link></code> that documents the changes that were made. </para>
            <para>The XSLT approach to creating and populating TAN files, described above, has been
               used successfully to handle not only historical documents but living ones as well,
               e.g., a working, evolving scholarly translation of ancient texts. In those
               situations, where updates are made very frequently, the traditional
               cut-paste-and-edit method is not only unproductive; it is foolish.</para>
            <para>Writing transformations may seem laborious at first, because of how difficult it
               is to think about how best to handle and manipulate a TAN file. But there is a good
               chance that the labor you have in mind has already been done for you in the built-in
               TAN functions (see <xref xlink:href="#variables-keys-functions-and-templates"
               />).</para>
         </section>
         <section>
            <title>Sharing TAN files</title>
            <para>TAN files have been designed to be shared. Although individual TAN files are
               likely to be valuable on their own, even when removed from their context (e.g., via
               an email attachment), they may be critically crippled without their dependencies. As
               a result, TAN files are most likely to be distributed or published in groups, as
               collections.</para>
            <para>One way to distribute a collection is by making it available as a repository via
               Git or some other version control software (VCS). This approach has many advantages.
               The files become available to whoever wants them, and the editorial history is
               preserved, so that a change one person makes to TAN files used by another need not
               necessarily be written in stone. VCS features and tools are extremely fast and
               useful.</para>
            <para>Collections may also be distributed through shared syncing services (e.g., Drive,
               Box, or Dropbox) or put on a server. In the latter case, it may be difficult for
               users to browse a collection, so you may wish to expose the collection as
               a compressed ZIP archive. This saves on your own bandwidth, and it still exposes the
               files for XML processing. But a ZIP archive is not suitable for linking from one TAN
               file to another, nor is it appropriate as a <code><link
                     linkend="element-master-location">&lt;master-location></link></code>. Unpacking
               a compressed file requires writing to the disk, which is a security risk, and so is
               disallowed during validation. Such zipped archives are excellent ways to distribute
               collections, but they should not substitute for a primary repository.</para>
         </section>
         <section xml:id="tan-stylesheets-and-function-library">
            <title>Doing Things with TAN Files (Stylesheets and the Function Library)</title>
            <para>The TAN format is not an end in itself. Indeed, there is no point to any file
               format, unless you can do things with it. TAN was designed primarily so that users
               could do unusual and interesting things. <code>/do things</code>, a major
               subdirectory in the project file, is populated with folders named with actions you
               might want to perform on a TAN file, and they contain XSLT stylesheets that fall into
               that area of activity.</para>
            <para>Those stylesheets are the front end of a long process that begins with TAN
               validation. Whenever you validate a TAN file, the Schematron validation file (the
               companion to the RELAX-NG validation file) is invoked. But that Schematron file is
               very small, and does very little work during validation, other than to look for
               errors, information, and help in a second version of the file being validated. That
               second version of the file is created through a very large library of XSLT
               stylesheets that resolve, normalize, and expand the document, and mark its errors. </para>
            <para>That extensive library of XSLT we call here the <emphasis>function
                  library</emphasis> (we use both words, to distinguish the collection from
               individual, generic functions). The function library provides definitive
               interpretations of the TAN format, marking parts that are in error. The function
               library is also an important step to creating your own tools or stylesheets,
               anticipating, as it does, many things you might want to do with a TAN file. Certain
               considerations that have been put into the design of the function library are worth
               noting.</para>
            <para>First, the function library has a structure similar to that of the RELAX-NG
               schemas. That is, the primary access point is through one of the eight XSLT files
               named after the primary TAN formats. Access deeper into the function library structure
               is possible, but you might be missing out on some important features useful to the
               particular TAN format you are working with.</para>
            <para>Before executing any validation, an engine computes all global variables, even
               those that might, in the end, not be required. Therefore the function library defines
               only those global variables that are central to the validation process. Functions,
               templates, and keys, on the other hand, are used by a validation engine only when
               needed, so some of them provide functionality that looks beyond the validation
               process.</para>
            <para>The most complex and important global variables are the two principal
               transformations to the TAN file itself, <code><link linkend="variable-self-resolved"
                     >$self-resolved</link></code> and <code><link linkend="variable-self-prepped"
                     >$self-prepped</link></code>. </para>
            <para><code><link linkend="variable-self-resolved">$self-resolved</link></code> is the
               result of changing the TAN file through some key steps, including (1) stamping the
               original URI of the file as <code>@base-uri</code><note>
                  <para>This attribute is one of a number of new attributes and elements that are
                     introduced in the validation process, and are not defined by the TAN
                     schema.</para>
               </note> in the root element, (2) converting all numeration systems to Arabic
               numerals, (3) replacing all elements that have <link linkend="attribute-include"
                     ><code>@include</code></link> with resolved forms of the element, (4) replacing
               elements with <link linkend="attribute-which"><code>@which</code></link> with their
               resolved IRI + name form, and (5) stamping elements with <code>@q</code> and a number
               representing the nth place of that element relative to its original siblings
               (included elements are given the <code>@q</code> of their host element).</para>
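            <para>To give a feel for steps (1) and (5), here is a minimal XSLT 2.0 sketch. It is not
               the function library's own code, and it rests on one assumed reading of step (5),
               namely that <code>@q</code> holds the element's ordinal position among its sibling
               elements. Steps (2) through (4) are left out, as is the special handling of included
               elements.</para>
            <programlisting language="xml"><![CDATA[<xsl:stylesheet version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <!-- Root element: record the URI the file was read from (step 1) -->
   <xsl:template match="/*">
      <xsl:copy>
         <xsl:copy-of select="@*"/>
         <xsl:attribute name="base-uri" select="base-uri(.)"/>
         <xsl:apply-templates select="node()"/>
      </xsl:copy>
   </xsl:template>

   <!-- Every other element: stamp its place among sibling elements (step 5, as assumed) -->
   <xsl:template match="*">
      <xsl:copy>
         <xsl:copy-of select="@*"/>
         <xsl:attribute name="q" select="count(preceding-sibling::*) + 1"/>
         <xsl:apply-templates select="node()"/>
      </xsl:copy>
   </xsl:template>

   <!-- Text, comments, and processing instructions pass through unchanged -->
   <xsl:template match="text() | comment() | processing-instruction()">
      <xsl:copy/>
   </xsl:template>
</xsl:stylesheet>]]></programlisting>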
            <para><code><link linkend="variable-self-prepped">$self-prepped</link></code> is the
               result of combing through the file and looking for errors that have been defined in
               the <link xlink:href="../functions/errors/TAN-errors.xml">master list of
                  errors</link>. The process differs from one TAN file type to the next.</para>
            <para>The next most important global variables have to do with the other TAN files the
               self refers to:</para>
            <para>
               <table frame="all">
                  <title>Global variables for referred files</title>
                  <tgroup cols="4">
                     <colspec colname="c1" colnum="1" colwidth="1.0*"/>
                     <colspec colname="c2" colnum="2" colwidth="1.0*"/>
                     <colspec colname="c3" colnum="3" colwidth="1.0*"/>
                     <colspec colname="newCol4" colnum="4" colwidth="1*"/>
                     <thead>
                        <row>
                           <entry/>
                           <entry>Raw (first document available)</entry>
                           <entry>Resolved</entry>
                           <entry>Prepped</entry>
                        </row>
                     </thead>
                     <tbody>
                        <row>
                           <entry><code><link linkend="element-inclusion"
                                 >&lt;inclusion></link></code></entry>
                           <entry><code><link linkend="variable-inclusions-1st-da"
                                    >$inclusions-1st-da</link></code></entry>
                           <entry><code><link linkend="variable-inclusions-resolved"
                                    >$inclusions-resolved</link></code></entry>
                           <entry>—</entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-key">&lt;key></link></code></entry>
                           <entry><code><link linkend="variable-keys-1st-da"
                                 >$keys-1st-da</link></code></entry>
                           <entry><code><link linkend="variable-keys-resolved"
                                 >$keys-resolved</link></code></entry>
                           <entry><code><link linkend="variable-keys-prepped"
                                 >$keys-prepped</link></code></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-source"
                              >&lt;source></link></code></entry>
                           <entry><code><link linkend="variable-sources-1st-da"
                                    >$sources-1st-da</link></code></entry>
                           <entry><code><link linkend="variable-sources-resolved"
                                    >$sources-resolved</link></code></entry>
                           <entry><code><link linkend="variable-sources-prepped"
                                    >$sources-prepped</link></code></entry>
                        </row>
                        <row>
                           <entry><code><link linkend="element-see-also"
                              >&lt;see-also></link></code></entry>
                           <entry><code><link linkend="variable-see-alsos-1st-da"
                                    >$see-alsos-1st-da</link></code></entry>
                           <entry><code><link linkend="variable-see-alsos-resolved"
                                    >$see-alsos-resolved</link></code></entry>
                           <entry>—</entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table>
            </para>
            <para>The Raw column lists variables that hold the first documents available, without
               alteration. Variables in the Resolved column hold the resolved form of the
                  <code>-1st-da</code> variables, following the same process described above for
                  <code>$self-resolved</code>. Once <code>$self-resolved</code> has been determined,
               neither <code><link linkend="element-inclusion">&lt;inclusion></link></code> nor
                     <code><link linkend="element-key">&lt;key></link></code> is needed for further
               validation, so they do not have prepped versions. Any bearing <code><link
                     linkend="element-see-also">&lt;see-also></link></code> has on validation of the
               original TAN file can be determined from the resolved form. But it frequently
               happens, mainly with class 2 files, that the sources need to go through some
               preparation before determining whether or not the original is valid, so a similar
               process of preparation is applied.</para>
            <para>These global variables have been described above very generally. To know more
               precisely how their values are calculated, please consult the function
               library.</para>
            <para>The other components of the function library—the functions, keys, and
               templates—cannot be described conveniently or succinctly here. But they are critical
               parts of building successful stylesheets that transform TAN files. The next chapter
               provides a comprehensive view of how they work.</para>
         </section>
      </chapter>
      <xi:include href="inclusions/variables-keys-functions-and-templates.xml"/>
      <xi:include href="inclusions/errors.xml"/>
   </part>
</book>
