DCMI DCSV

Title:

DCMI DCSV: A syntax for writing a list of labelled values in a text string

Creator:
Creator:
Date Issued:
2000-07-11
Identifier:
Replaces:
Is Replaced By:
Latest version:
Status of document:
This is a DCMI Recommendation.
Description of document: We describe a method for recording lists of labelled values in a text string, called Dublin Core™ Structured Values, with the label DCSV. The notation is intended for structured information within attribute values in markup-languages such as HTML and XML. This is likely to be useful in recording complex element values in metadata systems based on the qualified Dublin Core model.
NOTICE TO IMPLEMENTORS:
The syntax examples included in this document are provisional, and are currently under review as part of the DCMI work on coordinated syntax recommendations for HTML, XML, and RDF. Those recommendations, together with minor editorial changes to this document, can be expected in the near future.

Table of Contents

  • Introduction
  • Structured Values - the DCSV scheme
  • Parsing DCSV
  • Examples
  • Sample Code for parsing DCSV coded values
  • Acknowledgments
  • References

Introduction

It is highly desirable to be able to encode or serialise structured values within a plain-text string. Some generic methods are in common use. Inheriting conventions from natural languages, commas (,) and semi-colons (;) are frequently used as list separators. Similarly, comma-separated-values (CSV) and tab-separated-values (TSV) are common export formats from spreadsheet and database software, with line-feeds separating rows or tuples. Dots (.) and dashes (-) are sometimes used to imply hierarchies, particularly in thesaurus applications. The eXtensible Markup Language [XML] provides a general solution, using tags contained within angle brackets (<, >) to indicate the structure.

A number of named encoding schemes use punctuation characters within a text string to indicate specific components. For example, a colon (:) terminates the protocol label, and slashes (/), question-marks (?), ampersands (&) and hashes (#) are used to separate other fields in identifiers coded as URIs [URI]. Colons (:) separate specified labels from values within a field, and semi-colons (;) separate fields within a personal description according to a common implementation of vCard [vCard]. Hyphens are used to separate fields in a date in accordance with the W3C profile of ISO8601 [W3C-DTF]. For some schemes - vCard and W3C-DTF, for example - the punctuation indicates a very formal structure to the value, and is expected to be parsed automatically.

Element attributes in markup languages, such as HTML [HTML4] and XML [XML], provide a position for recording data. For some "empty" elements - such as the <IMG> and <META> elements in HTML - attributes are the only place to hold data. In other cases there may be good reasons to store data in element attributes rather than element content. For example, fragments of XML can be included in the <HEAD> of an HTML document, and will be safely ignored by most client software (e.g. browsers) provided the elements have no content. This syntax trick can be used to embed XML-RDF encoded data safely in current versions of HTML [RDF-in-HTML].

Future versions of HTML are expected to overcome these limitations by allowing general XML documents to be included [XHTML]. Nevertheless, there is strong interest in using HTML <META> elements to record data with more structure than normally implied by a plain-text string, in particular to record metadata according to the qualified Dublin Core™ model [Q-DC-HTML].

However, the use of element attributes for storing data has some technical limitations:

  1. attributes may occur no more than once
  2. values are constrained to a set of types which restricts the permissible character-strings [HTML4] in some contexts. Use of XML's angle-bracket delimiters (<, >) and various other punctuation characters is only valid in certain cases (i.e. when the content type is CDATA), and is only generally reliable using escape-mechanisms (i.e. as character entities). In general, strings containing these characters are prone to misinterpretation by some user-agents (e.g. browsers).

Note that there is no intrinsic way to indicate structure within the values of attributes of HTML elements.

Our intention in this recommendation is to define a compact human-readable data-structuring method for HTML attribute values of content type CDATA, avoiding certain punctuation characters which are prone to cause difficulties in some encoding environments. The notation should normally be used only when no other suitable scheme is available. It is based on methods used and found successful elsewhere, but is more generalised than the preceding standards. It may be used as the basis of profiles designed to encode particular data types [Profiles].

Structured Values - the DCSV scheme

To allow the recording of generic Structured Values, we introduce the Dublin Core™ Structured Values (DCSV) scheme.

We distinguish between two types of substring - labels and values, where a label is the name of the type of a value, and a value is the data itself. Furthermore, we allow a complete value to be disaggregated into a set of components, each of which has its own label and value. A value that is composed of components in this way is called a structured value.

Punctuation characters are used in recording a structured value as follows:

  • colons (:) separate plain-text labels of structured value-components from the values themselves
  • semi-colons (;) separate (optionally labelled) value-components within a list
  • dots (.) indicate hierarchical structure in labels, if required.

The labels and the component values themselves each consist of a text-string. The intention is that the label will be a word or code corresponding to the name of the value-component. Labels may be absent, in which case the entire sub-string delimited by semi-colons (;) or the end of the string comprises a component value.
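
As an illustration of the dot convention for labels, the following Python sketch folds a list of (label, value) pairs into nested dictionaries by splitting each label on dots. It is not part of this recommendation: the function name nest_components and the choice of nested dictionaries are assumptions made for the example, and it assumes that every component is labelled and that no label is used both as a leaf and as a parent.

```python
def nest_components(components):
    """Fold (label, value) pairs into nested dicts, splitting labels on dots."""
    tree = {}
    for label, value in components:
        node = tree
        # Everything before the last dot names a parent; the last part is the leaf.
        *parents, leaf = label.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return tree


pairs = [("name.given", "Renato"), ("name.family", "Iannella"), ("employer", "DSTC")]
print(nest_components(pairs))
# {'name': {'given': 'Renato', 'family': 'Iannella'}, 'employer': 'DSTC'}
```

The regrouping happens entirely on the consumer's side; the recorded string itself remains a flat list of components.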

The following patterns show how structured values may be recorded in strings using DCSV:

"u1; u2; u3"
"cA:v1"
"cA:v1; cB.part1:v2; cB.part2:v3"
"cA:v1; u2; u3"

where u1, u2 and u3 are unlabelled components, cA and cB are the labels of Structured Value components, part1 and part2 are sub-components of cB, and v1, v2 and v3 are values of the components.

The use of specific punctuation characters in DCSV coded values means that care must be exercised if these characters are to be used directly within strings which comprise the content (either labels or values) of the components. For DCSV, therefore, when a colon (:) or a semi-colon (;) is required within the value, the characters are escaped using a backslash, appearing as \: and \;, and the backslash itself is escaped similarly, as \\. There should be no ambiguity regarding the dot, full-stop or period (.) within strings: when it is part of a label, a dot indicates some hierarchy; when part of a value, it has the conventional meaning for the context. This method of escaping special characters largely preserves readability and the ability to enter DCSV coded metadata values easily using a text-editor if required. Software written to process DCSV coded values must make the necessary substitutions.
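
The escaping rule can be sketched in Python as follows. This is an illustrative sketch only, not part of this recommendation: the function names dcsv_escape and dcsv_encode are hypothetical, an empty label is used here to mean an unlabelled component, and labels are assumed to contain none of the special characters, so only values are escaped.

```python
def dcsv_escape(value):
    """Backslash-escape the backslash, colon and semi-colon in a value."""
    # Escape the backslash first so the backslashes added below are not doubled.
    value = value.replace("\\", "\\\\")
    return value.replace(":", "\\:").replace(";", "\\;")


def dcsv_encode(components):
    """Join (label, value) pairs into one DCSV string.

    An empty label marks an unlabelled component (an assumption of this sketch).
    """
    parts = []
    for label, value in components:
        escaped = dcsv_escape(value)
        parts.append(f"{label}:{escaped}" if label else escaped)
    return "; ".join(parts)


print(dcsv_encode([("name", "Dr Who; Time Lord"), ("ratio", "4:3")]))
# name:Dr Who\; Time Lord; ratio:4\:3
```

Escaping the backslash before the other two characters matters: done in the opposite order, the backslashes introduced for colons and semi-colons would themselves be doubled.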

Note that in HTML the double-quote (") character can be used directly within a CDATA attribute value if the full string is delimited by single-quotes ('), but in XML the double-quote must be encoded as a character entity in element attributes.

As there is no explicit grouping mechanism, DCSV can only be used to record a list. DCSV is only intended to be used for relatively simple structured values, probably as an interim approach, pending more general support for syntaxes such as XML which allow recording of more complex hierarchical structures. However, it is more compact than the XML equivalent, and is more easily read and constructed in some common contexts, such as within HTML elements.

Parsing DCSV

A simple method can be used to parse metadata values recorded according to the DCSV scheme. For a single value recorded using the DCSV scheme:

  1. split the text-string into a list of substrings on any unescaped semi-colons (;);
    if no semi-colon is present, there is a single substring
  2. split each substring into its (label,value) on any unescaped colons (:);
    if no colon is present, the label is empty
  3. within each value replace the escaped characters with the actual character required.

A short Perl program which performs this parsing operation is included at the end of this recommendation.
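
For readers who prefer Python, the three steps above can also be sketched as below. This is a minimal sketch under stated assumptions rather than a reference implementation: the names split_unescaped, unescape and parse_dcsv are hypothetical, each component is split on its first unescaped colon only, surrounding whitespace is trimmed from labels and values, and empty components (for example after a trailing semi-colon) are skipped.

```python
import re


def split_unescaped(text, sep, limit=None):
    """Split text on sep, ignoring any sep that follows a backslash."""
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            current.append(text[i:i + 2])  # keep the escape pair intact for now
            i += 2
            continue
        if ch == sep and (limit is None or len(parts) < limit):
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    parts.append("".join(current))
    return parts


def unescape(text):
    """Replace each backslash-escaped character with the character itself."""
    return re.sub(r"\\(.)", r"\1", text, flags=re.DOTALL)


def parse_dcsv(text):
    """Return a list of (label, value) pairs; unlabelled components get label ''."""
    components = []
    for chunk in split_unescaped(text, ";"):          # step 1
        if not chunk.strip():
            continue                                  # skip empty components
        pieces = split_unescaped(chunk, ":", limit=1)  # step 2
        if len(pieces) == 2:
            label, value = pieces
        else:
            label, value = "", pieces[0]
        components.append((label.strip(), unescape(value.strip())))  # step 3
    return components


print(parse_dcsv("rows:200; cols:450"))
# [('rows', '200'), ('cols', '450')]
```

Unlike the substitution approach, which temporarily rewrites escaped characters before splitting, this sketch scans the string once and treats a backslash and the character after it as a single unit, so no placeholder encoding is needed.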

Examples

"name.given:Renato;name.family:Iannella; employer:DSTC; Contact:Level 7, Gehrmann Labs, The University of Queensland, Qld. 4072, Australia"
"rows:200; cols:450"

DCSV does not support the full structure available using more complete notations such as XML, but nevertheless relatively rich information may be stored in DCSV and then migrated into fully structured notations when appropriate. The DCSV scheme provides useful support for the representation of complex values for metadata elements in HTML, while remaining fully compatible with all commonly used tools (browsers, editors, metadata harvesters). When used in this way, "DCSV" or the name of one of its derivatives can be noted as the value of the SCHEME attribute of the HTML <META> element, as shown in the following examples of qualified Dublin Core™ metadata:

<META NAME="DC.Creator" SCHEME="DCSV" CONTENT="name.given:Simon; name.family:Cox; employer:CSIRO; height:177 cm">
<META NAME="DC.Contributor" SCHEME="vCard" CONTENT="fn:Eric Miller; org:OCLC">
<META NAME="DC.Format.extent" SCHEME="DCSV" CONTENT="rows:200; cols:450">
<META NAME="DC.Coverage.spatial" SCHEME="BOX" CONTENT="name:Western Australia; northlimit:-13.5; southlimit:-35.5; westlimit:112.5; eastlimit:129">
<META NAME="DC.Coverage.spatial" SCHEME="POINT" CONTENT="name:Bridgnorth, Shropshire, U.K.; east:372000; north:293000; units:m; projection:U.K. National Grid">
<META NAME="DC.Date" SCHEME="PERIOD" CONTENT="name:Perth International Arts Festival, 2000; start:2000-01-26; end:2000-02-20;">
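
A harvester might locate such elements along the following lines. This Python sketch uses the standard library's html.parser module and is an illustration only: the names MetaCollector and collect_meta are hypothetical, and the default list of scheme names (DCSV, BOX, POINT, PERIOD) is an assumption about which SCHEME values a given application treats as DCSV-based. It collects the raw CONTENT strings and leaves their parsing to a separate step.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect (name, scheme, content) from META elements using selected schemes."""

    def __init__(self, schemes):
        super().__init__()
        self.schemes = set(schemes)
        self.records = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if attributes.get("scheme") in self.schemes:
            self.records.append(
                (attributes.get("name"), attributes.get("scheme"), attributes.get("content"))
            )


def collect_meta(html, schemes=("DCSV", "BOX", "POINT", "PERIOD")):
    """Return the matching META records found in an HTML string."""
    collector = MetaCollector(schemes)
    collector.feed(html)
    collector.close()
    return collector.records


page = '<META NAME="DC.Format.extent" SCHEME="DCSV" CONTENT="rows:200; cols:450">'
print(collect_meta(page))
# [('DC.Format.extent', 'DCSV', 'rows:200; cols:450')]
```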

Sample Code for parsing DCSV coded values

The following Perl program reads a DCSV coded string entered on stdin, and prints a formatted version of the structured result. This code is provided for demonstration purposes only and contains no error-checking.

#!/usr/local/bin/perl
use strict;

print "Enter string to be parsed:\n";

# Read everything on stdin into a single string and drop the final newline
my $string = join('', <STDIN>);
chomp $string;

print "\nString to be parsed is [$string]\n";

# First escape % characters
$string =~ s/%/"%".unpack('C',"%")."%"/eg;

# Next change \ escaped characters to %d% where d is the character's ascii code
$string =~ s/\\(.)/"%".unpack('C',$1)."%"/eg;

print "\nEscaped string is [$string]\n";

# Now split the string into components
my @components = split(/;/, $string);

print "\nComponents:\n";
foreach my $component (@components) {
    my ($name, $value) = split(/:/, $component, 2);

    # if there is no : then $value is undefined, so copy $name into $value and empty $name
    if (!defined $value) {
        $value = $name;
        $name = '';
    }

    # strip leading and trailing whitespace from name string
    $name =~ s/^\s+|\s+$//g;

    # convert % escaped characters back in value string
    $value =~ s/%(\d+)%/pack('C',$1)/eg;

    print "Name [$name] has value [$value]\n";
}

Acknowledgments

John Kunze encouraged us to write up this proposal formally. Kim Covil wrote the Perl code. Eric Miller nagged regarding the overlap with XML.

References

[DCMI]
Dublin Core™ Metadata Initiative, OCLC, Dublin Ohio.
http://dublincore.org/

[HTML4]
Dave Raggett, Arnaud Le Hors, Ian Jacobs, 1999, HTML 4.01 Specification
http://www.w3.org/TR/html4/

[Profiles]
DCMI Box - specification of the spatial limits of a place, and methods for encoding this in a text string
http://dublincore.org/specifications/dublin-core/dcmi-box/2000-07-11/

DCMI Point - a point location in space, and methods for encoding this in a text string
http://dublincore.org/specifications/dublin-core/dcmi-point/2000-07-11/

DCMI Period - specification of the limits of a time interval, and methods for encoding this in a text string
http://dublincore.org/specifications/dublin-core/dcmi-period/2000-07-11/

[Q-DC-HTML]
S. Cox, 2000, Recording qualified Dublin Core™ metadata in HTML
http://dublincore.org/specifications/dublin-core/dcq-html/

[RDF-in-HTML]
This uses the most compact form of XML-RDF [RDF-syntax], in which all the data occurs as attribute values. In this form several important capabilities are not available, such as multiple (repeated) values. For an example, see Figure 5 in S.J.D. Cox and K.D. Covil, "A web-based geological information system using metadata", Proc. 3rd IEEE META-DATA Conference, http://computer.org/conferen/proceed/meta/1999/papers/7/cox_covil.html

[URI]
T. Berners-Lee, R. Fielding, L. Masinter, 1998, Uniform Resource Identifiers (URI): Generic Syntax, RFC2396
http://www.ietf.org/rfc/rfc2396.txt

T. Berners-Lee, L. Masinter, and M. McCahill, 1994, Uniform Resource Locators, RFC1738
http://www.ietf.org/rfc/rfc1738.txt

T. Berners-Lee, 1994, Universal Resource Identifiers in WWW: A Unifying Syntax for the Expression of Names and Addresses of Objects on the Network as used in the World-Wide Web, RFC1630
http://www.ietf.org/rfc/rfc1630.txt

[vCard]
F. Dawson, T. Howes, vCard MIME Directory Profile, RFC2426
http://www.ietf.org/rfc/rfc2426.txt

[W3C-DTF]
M. Wolf, C. Wicksteed, 1997, Date and Time Formats
http://www.w3.org/TR/NOTE-datetime

[XHTML]
Steven Pemberton and many others, 1999 XHTML 1.0: The Extensible HyperText Markup Language
http://www.w3.org/TR/xhtml1/

See also Dave Raggett, HyperText Markup Language Activity Statement
http://www.w3.org/MarkUp/Activity.html

[XML]
Extensible Markup Language
http://www.w3.org/XML/