Uncertainties in metadata
Introduction
Uncertainties are a common challenge in scholarly work. Pointing them out is fundamentally important for making the research process and its results transparent and comprehensible. In contrast to the formal and natural sciences (e.g. mathematics), the humanities often miss the opportunity to model uncertainties exactly. The handling of uncertainties, especially when working with historical sources, is therefore usually explained descriptively; so far, however, there is no best practice regarding the retention of uncertainties or decision-making in digital editions.[1]
Editors essentially face three challenges: first, uncertainties must be identified; second, they must be modelled or otherwise made comprehensible; and third, they must be commented on or interpreted by the editor. These steps must be taken into account in the creation of editions, not only in the main text but also in the metadata. Metadata are significant for the verifiability and transparency of research, for findability, distribution and interlinking, and for their obvious potential in evaluation methods (e.g. network visualizations).
The core idea of metadata lies in having a simple and accurate data set that summarizes the most important information about a particular information resource. Metadata has been playing an increasingly important role in the research process for some time and has also become a subject of research in its own right.
Good metadata facilitates interoperability. Describing or mapping uncertainties in metadata does not necessarily impede this. The goal must therefore be to identify uncertainties without falling back on project- or case-specific solutions or data models; such solutions threaten the accuracy of scholarly research as well as the prospect of interoperability. Precision of the metadata with regard to the disclosure of uncertainties is therefore urgently needed.
Modelling uncertainties in the TEI
In principle, the TEI Guidelines take into account two types of uncertainty: on the one hand, uncertainty regarding the use of the appropriate TEI element and, on the other hand, uncertain readings of text passages or words and the uncertain identification of entities such as persons or places. Both areas offer opportunities to highlight uncertainty regarding entities.
This article concentrates on uncertain readings and on dealing with various forms of supplementary information.[2] For the encoding of uncertain readings and related information, section 11.3.3.2 and chapter 21 of the TEI Guidelines are relevant (section 11.3.3.2 "Use of the <gap>, <del>, <damage>, <unclear>, and <supplied> Elements in Combination";[3] chapter 21 "Certainty, Precision, and Responsibility"[4]). When modelling uncertainty, it is essential to specify the responsibility (@resp or <respons>), the degree of uncertainty (@cert, @precision) and the reason (@reason or via <note>) in order to make the decisions comprehensible for readers. For this purpose, a graded encoding of uncertainty is possible using @cert with fixed values (high, medium, low, unknown). This area seems to be well covered by the existing possibilities.
However, problems arise when uncertainties in metadata are to be modelled; these relate primarily to entities and dates. We observe a certain arbitrariness in the encodings, which may signal inadequate documentation or uncertainty in dealing with the current guidelines. The characteristics of metadata described in the introduction certainly also play a role.
This impression was reinforced by the discussions and examples presented in the workshop on encoding correspondence that preceded this handbook. The aim of this contribution is to describe existing possibilities, to give examples, to address some existing gaps and to develop encoding suggestions.
General challenges
The challenges of modelling uncertainties in metadata are an integral part of every modelling process in the humanities and accompany editorial projects as well as database-driven development or cataloguing projects. The fundamental dilemma lies between the requirement that metadata be as granular, structured and standardized as possible and the widespread practice among researchers of entering information that merely circumscribes uncertainty or of restricting themselves to one of several possible entities. Typical forms are "around 1700", "Late Middle Ages", or "located between Cologne and Dusseldorf". Thus, regardless of the following recommendations, it is necessary to understand the relevance of metadata for automated processing steps, i.e. for further processing of the data in search portals or interfaces. Only if researchers become aware of this and are willing to enter structured and standardized information in the metadata, also with respect to uncertainties, can the quality of metadata dealing with uncertainties be improved sustainably.
Another challenge is the often unsatisfactory state of authority data. The use of authority data is a key factor in ensuring that metadata is reusable. In particular, authority data can be used to identify entities in the metadata with added value, but the authority data often lack differentiating information. To give one example, authority datasets do not include identifiers for all territorial units. While there is a GeoNames identifier for today's state of Bavaria (2951839),[5] identifiers describing the electorate, the duchy, or the extent of early modern Upper and Lower Bavaria are missing. Authority data are used to uniquely identify an entity; they cannot be treated like content-rich biographical reference works or lexicons, and an authority record combined with incomplete or non-scholarly information poses a problem here. The only remedy is the insight that authority data serve primarily for identification and only secondarily for description.
Thus, in order to model uncertainties in metadata, one needs to be aware, firstly, of the benefits of metadata and of the fact that they are not suitable for mapping all possible interpretive variants and, secondly, of the role of authority data as a reliable means of identification rather than a resource for further information.
Recommendations for coding uncertainties
In the metadata, uncertainties occur at almost any place where information from the source is recorded. If a source document is undated, carries multiple dates, or is ambiguously dated (for example with regard to the underlying calendar), editors feel compelled to identify the most plausible date and assign it to the source. Likewise, location information may be missing or may not be specified with sufficient accuracy. For instance, the resolution of an abbreviation may not be unambiguous, the same place name may occur in different areas, or historical place or field names may no longer be in use. Similar problems arise with regard to the identification of senders and recipients.
For the coding of uncertainties in metadata, we suggest the following general rules:
- Always identify uncertainties in the metadata; the recognition, identification and interpretation of uncertainties is a scientific achievement,
- Render the process of coding transparent by indicating who is responsible for encoding the uncertainty,
- Distinguish whether the uncertainty is source-related (intrinsic) or based on inaccurate or ambiguous identification options (extrinsic),
- Name methods (@evidence with recommended values "internal", "external", and "conjecture") to make transparent in which way the uncertainty was resolved,
- If possible, refer to external references (authority data, prosopographic databases, etc.) in order to give researchers the opportunity to familiarize themselves with the phenomenon at hand,
- If possible, add indications on how to process the information for analytical and indexing purposes.
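Some of these rules lend themselves to automated checks. The following is a minimal sketch in Python, not part of the recommendations themselves. It assumes that responsibility is recorded in @resp, that an element counts as flagged when it carries @evidence or a @cert value other than "high", and that the function name, the pointer "#ed1" and the sample date are purely hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Values recommended for @evidence in the rules above.
RECOMMENDED_EVIDENCE = {"internal", "external", "conjecture"}


def check_uncertainty_markup(element):
    """Return a list of problems found on one metadata element (hypothetical helper)."""
    problems = []
    evidence = element.get("evidence")
    if evidence is not None and evidence not in RECOMMENDED_EVIDENCE:
        problems.append(f"unexpected @evidence value: {evidence}")
    # Assumption: an element is "flagged" if it has @evidence or a non-high @cert.
    flagged = evidence is not None or element.get("cert") in {"low", "medium", "unknown"}
    if flagged and element.get("resp") is None:
        problems.append("missing @resp")
    return problems


# Hypothetical fragment: the place is conjectured by editor #ed1,
# the date is marked as uncertain but names no responsible person.
sample = ET.fromstring(
    f'<correspAction xmlns="{TEI_NS}" type="sent">'
    '<placeName evidence="conjecture" resp="#ed1">Hanover</placeName>'
    '<date when="1700-05-12" cert="low"/>'
    '</correspAction>'
)

# Strip the namespace from each tag and collect the problems per element.
report = {child.tag.split("}")[1]: check_uncertainty_markup(child) for child in sample}
print(report)
```

In this sketch the conjectured place passes, while the uncertain date is reported because nobody is named as responsible for the assessment.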
What about degrees of uncertainty?
Apart from these general recommendations, the question of whether the degree of uncertainty should be encoded must be answered anew for each edition project, as there are so far no common guidelines or best practice models. The most straightforward answer is that an indication is either certain or uncertain and that establishing a degree of uncertainty is therefore far too subjective an undertaking.[6] In our opinion, it would be advisable to dispense with the specification of granularity in the metadata and to state only generally whether the information is certain or uncertain. With @cert="true" as the reasonable (implicit) default, @cert="false" would only need to be set in cases of uncertainty. However, this would require a corresponding change in the content model of teidata.certainty that introduces these Boolean values as alternatives to the existing options (in order to preserve backward compatibility). To remain in line with the current options offered by the TEI, projects might indicate whether @cert="medium" should be evaluated as certain or uncertain by applications that distinguish only these two statuses (with "unknown" and "low" defaulting to uncertainty and "high" defaulting to certainty).
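The reduction of the existing @cert values to two statuses could be sketched as follows. This is an illustration in Python under stated assumptions, not an established routine: the function name and the parameter medium_counts_as_certain are hypothetical, and an absent @cert is treated as certain, in line with the implicit default proposed above.

```python
def is_certain(cert, medium_counts_as_certain=False):
    """Reduce a TEI @cert value to a Boolean status.

    `cert` is the attribute value, or None if @cert is absent.
    `medium_counts_as_certain` stands for the project-level decision
    on how "medium" should be evaluated (hypothetical parameter).
    """
    if cert is None or cert == "high":
        return True   # absent @cert is treated as the implicit default "certain"
    if cert in ("low", "unknown"):
        return False
    if cert == "medium":
        return medium_counts_as_certain
    raise ValueError(f"unsupported @cert value: {cert!r}")


# A project that evaluates "medium" as uncertain (the default in this sketch):
print(is_certain("medium"))
# A project that evaluates "medium" as certain:
print(is_certain("medium", medium_counts_as_certain=True))
```

Making the treatment of "medium" an explicit parameter keeps the project-specific decision visible to applications that reuse the metadata.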
Examples of uncertain information and proposed solutions
Strategies for dealing with uncertainty often include recording different options and evaluating them for plausibility and, possibly, mutual compatibility. Possible alternatives need to be recorded and at least one of them assigned a priority, with the goal of maintaining transparency while selecting a "best option". This prioritised option may then be considered by default in the processing of metadata (such as inclusion in registers, search indices, etc.). In Chapter 16.8 "Alternation", the TEI offers several basic approaches to capturing and prioritising alternatives.[7]
One such mechanism is the @select attribute (see the following explication), which allows one or more of the recorded alternatives to be specified from a hierarchically superior element (in the form of space-separated pointers). Because @select mostly refers to identifiers (@xml:id), this procedure is well suited to named entities such as places or persons. For time units such as dates or periods of time, we are more reluctant to use @xml:id and prefer the use of @select in combination with @n or @ana, which offers the possibility of providing more detailed descriptions. The evaluation of @select would thus rely not on one but on two different reference schemes, which is not very elegant and presents additional challenges for processing routines. The appropriate scheme would either need to depend on a @type or be derived from the syntactic structure of the pointer.
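The additional burden on processing routines can be illustrated with a minimal sketch in Python. It assumes a hypothetical, namespace-free fragment in which places are referenced via @xml:id and dates via @n; the identifiers, place names, dates and the function name resolve_select are invented for illustration and do not reproduce any example of this article.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def resolve_select(action):
    """Return the children of a correspAction that its @select points to."""
    targets = [p.lstrip("#") for p in action.get("select", "").split()]
    selected = []
    for child in action:
        # Two reference schemes have to be checked: @xml:id for named
        # entities, and @n or @ana for dates.
        keys = {
            value.lstrip("#")
            for value in (child.get(XML_ID), child.get("n"), child.get("ana"))
            if value
        }
        if any(target in keys for target in targets):
            selected.append(child)
    return selected


# Hypothetical flat encoding with alternatives as siblings.
action = ET.fromstring(
    '<correspAction type="sent" select="#place_pref #date_pref">'
    '<placeName xml:id="place_pref">Hanover</placeName>'
    '<placeName xml:id="place_alt1">Hamburg</placeName>'
    '<date n="date_pref" when="1700-05-12"/>'
    '<date n="date_alt1" when="1700-05-22"/>'
    '</correspAction>'
)

preferred = [(el.tag, el.text or el.get("when")) for el in resolve_select(action)]
print(preferred)
```

The resolver has to inspect three attributes per element because the pointer alone does not reveal which scheme it follows, which is the inelegance described above.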
An alternative approach to referencing specific metadata is local tagging directly at each element. This approach seems more consistent, and it facilitates both clarity and data processing. This kind of marking can be done explicitly (e.g. via @ana), but it can also be done implicitly by creating just one element of each type, carrying the information that is deemed most precise or probable, directly within <correspAction> and specifying all alternative places, persons and dates only at a deeper level. Using this approach, only one place of dispatch, address, recipient and sender as well as one date of dispatch and one date of receipt would be specified directly within <correspAction>, while more detailed and possibly alternative information might be located in <note> elements, (partially) structured or in prose. Given that letters may be written and sent by more than one person, from and to more than one place and on more than one date, it is hardly possible to narrow the schema to allow only one element of each type directly within <correspAction>.[8] Thus we can only suggest a best practice but not enforce it formally.
Dates
The following examples of uncertain dating show how this kind of hierarchical prioritisation might be encoded. For each example, we present the nested encoding, which demotes alternatives to a lower level, as well as a flat encoding in which alternatives are distinguished using @select.
Dating by the editor
Phenomenon: Differing information in secondary literature
Implementation:
Explication: A different date from secondary literature is rated as implausible in a comment. Because there is no direct connection with the source, the alternative date is not recorded in a structured form.
Alternative implementation (several <date> elements in <correspAction>):
Here and in the following examples, @ana="date_alt1" might be used instead of @n="date_alt1".
Phenomena: Uncertain reading and correction by the editor; multiple dating (two calendars)
Implementation:
Explication: The letter carries a Roman date that was incorrectly converted in a contemporary note. The editor has determined the appropriate date and entered it as regular metadata. A possible but incorrect reading of the Roman date is assessed in an additional remark.
Alternative implementation (several <date> elements in <correspAction>):
Phenomenon: Date determination from the letter content, intrinsic and extrinsic; multiple dating (Julian and Gregorian calendar)
Implementation:
Explication: The date of the incompletely dated letter was determined by the editor from the letter content and in accordance with an old signature; the derivation from the content is described relatively extensively for better traceability. The absence of @cert indicates that the editor is confident in the dating.
Alternative implementation (several <date> elements in <correspAction>):
Factual corrections by the editor
Implementation:
Explication: The example describes the editor's correction of a factually wrong date. The evidence emerging from the source and the historical context is described in a note.
Alternative implementation (several <date> elements in <correspAction>):
Alternative calendars
Phenomenon: Unclear conversion of a Roman letter date
Implementation:
Explication: The editor identifies two ways to interpret the date. The date calculated after Grotefend is given more credibility and is entered as the main metadata, while the alternative possibility is also described.
Alternative implementation (several <date> elements in <correspAction>):
For all examples shown here, both variants allow processing routines to select the preferred date and, depending on the application, to include it in a register or search index or to provide it via an interface. The nested approach seems better suited to interoperability, as the prioritised date is reliably identified with a simple XPath expression (//correspAction[@type="sent" or @type="received"]/date).
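How a processing routine might pick up the prioritised date in the nested encoding can be sketched as follows. This is a minimal illustration in Python using the standard library, not a reference implementation; the fragment, its dates and the function name preferred_dates are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def preferred_dates(root):
    """Map each correspAction type to the date that is its direct child."""
    result = {}
    for action_type in ("sent", "received"):
        # Only a <date> directly within <correspAction> is matched;
        # alternatives nested in <note> are deliberately ignored.
        date = root.find(f".//tei:correspAction[@type='{action_type}']/tei:date", NS)
        if date is not None:
            result[action_type] = date.get("when")
    return result


# Hypothetical nested encoding: the alternative date is demoted into a note.
doc = ET.fromstring(
    '<correspDesc xmlns="http://www.tei-c.org/ns/1.0">'
    '<correspAction type="sent">'
    '<date when="1700-05-12"/>'
    '<note>An alternative reading yields <date when="1700-05-22"/>.</note>'
    '</correspAction>'
    '<correspAction type="received"><date when="1700-05-20"/></correspAction>'
    '</correspDesc>'
)

print(preferred_dates(doc))
```

Because the selection depends only on the position of the element, no project-specific identifiers or attribute conventions need to be known to the consuming application.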
It should be noted that this article covers the selection of a date element but not the intended processing of the date attributes from the datable and duration classes, such as @notBefore, @notAfter, @from, @to, or @dur (with their respective ISO, W3C and custom variants), which are often used to encode time spans and dates that cannot be defined precisely.[9] The interpretation and handling of such attribute values by processing routines lies beyond the scope of this text.[10]
Locations
Phenomenon: The location is missing in the original and is assigned by the editor
Implementation:
Explication: While no place of dispatch is given in the original, the editor specifies Hanover. Specifying "conjecture" in @evidence makes the editorial intervention transparent. The @cert value given here corresponds to the default and is therefore, strictly speaking, not necessary.
Alternative implementation:
This example shows two differing reference schemes used in @select to define which date and which location should be considered for further (technical or analytical) processing. This could be remedied relatively easily by adding @ana or @n with the value "place_pref" to the preferred <placeName> element (or, perhaps better still, by introducing a @selection attribute to the TEI Guidelines as a counterpart to @select), but all of this falls short of the less ambiguous hierarchical approach.
Phenomenon: The location is unclear and is assigned by the editor
Implementation:
Explication: While no place of dispatch is given in the original, the editor specifies Vienna. The fact that this conjecture is based on another letter becomes apparent from @source, and its specifics are explained in a <note> element. Although @cert is omitted, this information is considered certain (default value "true").
Alternative implementation:
Persons
Phenomenon: Sender and/or recipient are uncertain
The two strategies discussed and shown above are equally applicable to uncertain senders and recipients. Again, a <correspAction> might contain the most likely senders and recipients in <persName> elements for the sent and received actions and record more precise information in a note.
Alternatively, all possible options could be encoded in <persName> elements directly within <correspAction> and prioritised using @select on <correspAction> (using pointers such as #sender_pref or #recipient_pref).
Notes
- [1]Regarding graph-based modeling of uncertainties: Kuczera, Andreas, Wübbena, Thorsten, and Thomas Kollatz (eds.). 2019. Die Modellierung des Zweifels — Schlüsselideen und -konzepte zur graphbasierten Modellierung von Unsicherheiten. Wolfenbüttel. (Zeitschrift für digitale Geisteswissenschaften / Sonderbände, 4). DOI: 10.17175/sb004.
- [2]The use of the correct TEI element will need to be discussed elsewhere.
- [3]https://www.tei-c.org/release/doc/tei-p5-doc/en/html/PH.html#PHST
- [4]https://www.tei-c.org/release/doc/tei-p5-doc/en/html/CE.html
- [5]http://www.geonames.org/2951839
- [6]It is possible to encode the degree of uncertainty using the values "low", "medium" or "high". Some edition projects employ a more specific indication using percentage.
- [7]https://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SAAT
- [8]For dates, it has been suggested to narrow the CMIF schema to allow just one date. Cf. https://github.com/TEI-Correspondence-SIG/CMIF/issues/19.
- [9]Cf. e.g. section 5.1 ("Dates") of the encoding guidelines of the edition Letters and Texts. Intellectual Berlin around 1800, https://www.berliner-intellektuelle.eu/encoding-guidelines.pdf.
- [10]Some of the challenges regarding the processing of date attributes may be avoided by encouraging the use of specific dates with @precision="medium". See https://www.stoa.org/epidoc/gl/latest/supp-historigdate.html and https://lsv.uky.edu/scripts/wa.exe?A2=MARKUP;7412fa2e.1901 for recommendations of the EpiDoc community. Another interesting vantage point is the Extended Date/Time Format (EDTF), cf. https://www.loc.gov/standards/datetime/edtf.html.
Bibliography
- Baillot, Anne, et al. 2013ff. Edition-specific TEI encoding guidelines for the digital edition "Letters and texts. Intellectual Berlin around 1800" ["Briefe und Texte aus dem intellektuellen Berlin um 1800"] https://www.berliner-intellektuelle.eu/encoding-guidelines.pdf (last accessed: 7 August 2020).
- Berbig, Roland. 2010. Theodor Fontane Chronik. Berlin/New York.
- Kuczera, Andreas, Wübbena, Thorsten, and Thomas Kollatz (eds.). 2019. Die Modellierung des Zweifels — Schlüsselideen und -konzepte zur graphbasierten Modellierung von Unsicherheiten. Wolfenbüttel. (Zeitschrift für digitale Geisteswissenschaften / Sonderbände, 4). DOI: 10.17175/sb004.
- TEI Consortium, eds. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 3.6.0. Last updated on 16th July 2019. TEI Consortium. https://www.tei-c.org/Vault/P5/3.6.0/doc/tei-p5-doc/en/html (last accessed: 7 August 2020).