
CLARIN Standards Information System

Piotr Banski edited this page Jan 23, 2017 · 21 revisions

Clarin Standard Guidance Documentation

Eliza Margaretha and Antonina Werthmann

May 2014

Latest update: September 2016


Introduction

Clarin Standard Guidance [1] is a website providing general information about standards used particularly in linguistics and computational linguistics. Since many standards have been developed for various purposes by many different parties, such a portal helps users choose a standard suitable for their needs and compare standards with similar purposes or topics.

The website has been developed at the Institut für Deutsche Sprache (IDS) within the CLARIN project since 2011. For more information about CLARIN and this website, please contact Andreas Witt (witt@ids-mannheim.de). For more information regarding the content of the website, please contact Antonina Werthmann (werthmann@ids-mannheim.de).

This document summarizes information about the system running the website and its content. It describes the installation steps, the system architecture and features, the definition of the schema used and the data contained in the system. In this documentation, the terms “standard” and “specification” are used interchangeably.

Installation

The website is built on XML technologies. The data are represented in XML and stored in an XML database engine, eXist-db, which can be downloaded from the eXist-db home page [2]. The current website uses eXist-db version 2.0. To install the system, you need a dump file, which can be requested from Antonina Werthmann (werthmann@ids-mannheim.de).

Please follow the instructions below.

  1. Install eXist-db 2.0 according to the instructions in the eXist documentation [3].
  2. Configure the Jetty port in
{eXist-db-installation-folder}/tools/jetty/etc/jetty.xml
in the Set connectors section (currently the website uses port 8889, and port 8444 for SSL).
  3. Run the eXist database:
./{eXist-db-installation-folder}/bin/startup.sh
  4. Activate the admin client from the eXist-db Dashboard or run
./{eXist-db-installation-folder}/bin/client.sh
  5. Log in to the eXist-db admin client and restore the dump file.
  6. In the admin client or Oxygen XML Editor, go to /apps/clarin/modules/app.xql and change the $app:base and $app-securebase variables to the actual website URL.

The installation is now complete, and you can open the website URL in a browser. To synchronize eXist-db and Oxygen XML Editor, refer to the documentation on the Oxygen website [4].

System

The system is mainly written in XQuery. The architecture of the system is described in the Architecture section. The system provides features for different kinds of users, which are explained in the Users and Features section.

Architecture

A single XQuery file could contain all the functions for accessing and processing the XML data, as well as the user interface. However, separating the functions from the user interface gives the code more structure and makes the user interface and the application layer independent of each other. Changes in the application layer then do not affect the user interface.

XQuery is a functional programming language in which functions can be defined and grouped into XQuery modules. To separate the user interface from the application layer, the functions in the application layer are defined so that they only manage and generate content for the user interface and do not deal with the interface design. The user interface, in turn, contains an abstraction of what should appear on the web page and, in practice, calls the application layer functions.

The system is designed in an MVC-like (Model-View-Controller) architecture. The MVC components have different file extensions representing their functions: xqm (Model), xq (View), and xql (Controller).

Since the data is written in XML, it is already modeled as a tree structure that can be navigated with XPath. Thus, the Model component is not responsible for modeling the data, but only for direct interactions with the XML data comparable to database queries, such as selecting, storing and updating nodes. It also defines the paths to the data.

The View component is the user interface layer. It describes what the user interface should look like and what it should contain.

The Controller component deals with the nodes selected from the XML files by the Model component, navigates through them, and selects some more detailed information from them. The Controller is also a mediator performing all the operations between the Model and the View. It manages and generates contents for the View.
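The division of labour between the three components can be illustrated with a minimal sketch. The real system is written in XQuery; the Python version below is only an analogy, and the function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real files live in the eXist-db collection.
SAMPLE = """
<specs>
  <spec id="SpecExample">
    <titleStmt><title>Example Standard</title><abbr>EX</abbr></titleStmt>
  </spec>
</specs>
"""


def model_select_specs(root):
    """Model: only selects nodes, comparable to a database query."""
    return root.findall("spec")


def controller_list_titles(root):
    """Controller: navigates the selected nodes and prepares content."""
    return [
        {"id": spec.get("id"), "title": spec.findtext("titleStmt/title")}
        for spec in model_select_specs(root)
    ]


def view_render(items):
    """View: decides only how the prepared content is presented."""
    rows = "".join(f"<li>{item['title']} ({item['id']})</li>" for item in items)
    return f"<ul>{rows}</ul>"


root = ET.fromstring(SAMPLE)
html = view_render(controller_list_titles(root))
```

Because the view only receives prepared content, the way nodes are selected can change without touching the rendering function.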

Users and Features

Users of the website fall into three categories based on their roles: guest, registered user and web-admin. Guests can perform basic operations on the website, such as browsing the standard descriptions or searching for standards by topic, standard body and so on. Registered users and web-admins have the privilege to submit a standard description, which is then generated into an XML file. A standard description submitted by a registered user is stored in the review/ folder (see the Review section) to be reviewed by a web-admin. A standard description submitted by a web-admin, however, is stored directly in the specifications/ folder. Since the generated XML file is not validated against its schema (see Schema for the Specifications), the web-admin should check it personally. Additionally, web-admins can edit the standard descriptions.

The following functions have been implemented in the system:

  1. User registration
  2. Login
  3. Browsing (standards, standard bodies, standard topics)
  4. Searching for standards
  5. Submitting a standard (including parts and versions)
  6. Editing a standard description (including the standard parts and versions)

Additional features are:

  • Tag clouds of standards on the homepage
  • Tag clouds of relevant keywords on the standard pages
  • Standard relation graphs on the standard pages and the standard list page
  • Standard body relation graphs

The standards are grouped by two categories: topics and standard bodies. Topics are the various areas or fields in which the standards are used. Standard bodies are organizations that have developed or maintain the standards.

Database Clarin

The system database contains all the code running the system, the XML data collection, and the XML schemas structuring the XML data. The database is run by eXist-db. Figure 1 shows the directory tree of the database opened in Oxygen XML Editor. The root directory is /db/apps/clarin.

Data

The data/ folder contains the collection of documents describing the standards, standard bodies, topics, user information and other documents referred to in the website contents, such as examples of standard applications. The standards, standard bodies, topics and user information are written in XML. The schemas for the XML data are described in the Schemas section.

Doc

The doc/ folder contains any kind of document that adds extra information to the standard descriptions. For instance, it contains examples of standard applications, such as annotation in MAF, as well as journal articles, conference papers and reviews about the standards. All the documents in this folder must be legally available for public sharing. Thus, it is important to check the licence or copyright of the documents beforehand.

Review

The review/ folder contains the standard descriptions submitted by users. These documents are to be reviewed by an administrator. The administrator should verify the document content before it can be moved to the specifications/ folder.


Figure 1: Database Structure

Specifications

The specifications/ folder contains various standard descriptions written in XML. The schema for the XML is described in detail in the section Schema for the Specifications.

Standard Body XML Data

The standard body XML data provides information about each organization or group of experts that develops the standards listed in the specifications/ folder. The description of standard bodies in sbs.xml follows the XML Schema Definition described in the section Schema for Standard Body.

Topic XML Data

A topic describes the conceptual subject or area of interest in which a standard was or is being developed, or in which its development and use are particularly important. Like keywords, topics help to find similar standards or standards in the same area.

The topics are listed in topics.xml according to the schema described in the section Schema for Topic.

User XML Data

The users.xml file contains information about registered users.

Edit

The edit/ folder contains the XQuery controller code for editing standard descriptions. Editing is done via AJAX. JavaScript on the client side handles the editing request and response: the controller receives the editing request from the JavaScript and sends the results back, and the JavaScript then updates the web page according to the results.

Model

The model/ folder contains the XQuery model code that performs direct interactions (e.g. select, insert, update) with the XML data.

Modules

The modules/ folder contains the XQuery controller code for manipulating XML data and generating the contents of the web pages.

Resources

The resources/ folder contains the additional files the system needs, such as images, CSS stylesheets, JavaScript code and XQuery libraries.

CSS

The css/ folder contains Cascading Style Sheets for designing the web-pages and the tag-clouds.

Images

The images/ folder contains all the image files used in the system.

Libraries

The lib/ folder contains the libraries used by the XQuery codes.

Scripts 

The scripts/ folder contains the JavaScript files for the tag clouds, the graph visualizations and some general functions used in editing standard descriptions. The tag clouds use the tagcanvas library [5], and the graph visualizations use the D3 library [6]. In addition, TinyMCE [7] is used as the XML editor for writing the description element of the standards, and Dijit ComboBox [8] is used for choosing an existing standard body or adding a new organization for the responsible statement element of a standard.

Schemas

The schemas/ folder contains all the XML schemas structuring the XML data collection. There are three schemas used in the system.

XML XSD Data

For more information about the XML XSD Data, please visit the website of the W3C guidelines [9].

Catalog XSD Data

For more information about the Catalog XSD Data, please visit the website of the OASIS guidelines [10].

Spec XSD Data

The specification schema (spec.xsd) defines the general structure and the elements of the standard XML files, the list of standard topics (topics.xml), and the list of standard bodies (sbs.xml).

Schema for the Specifications

This part of the XML Schema defines the structure of all the standard XML files in the specifications/ folder. The root node of a standard XML file is <spec>.

Spec

A standard XML file has the root node <spec>, which has four attributes: @id, @display, @topic and @standardSettingBody. The @id defines the identifier of the standard file and must start with “Spec”. The @topic designates the topic ids of the standard topics. Multiple topic ids are separated with a space. The @standardSettingBody designates the standard body currently managing the standard.

The <spec> must contain a <titleStmt>, a <scope>, at least one <info> and at least one <part> or <version>. Optionally, it can also contain the following elements: <keyword>, <features>, <address>, <relation> and <asset>. The elements in <spec> must follow a fixed order, namely <titleStmt>, <scope>, <keyword>, <info>, <features>, <address>, <relation>, <part>, <version>, <asset>.
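Since submitted files are not validated against the schema automatically, the content rules for <spec> can be checked by hand or with a small script. The following Python sketch covers only the rules stated above; it assumes that the elements carry no XML namespace, and the sample documents and the function name check_spec are hypothetical. It does not replace validation against spec.xsd.

```python
import xml.etree.ElementTree as ET

# Required child order as described in the text above.
SPEC_ORDER = ["titleStmt", "scope", "keyword", "info", "features",
              "address", "relation", "part", "version", "asset"]


def check_spec(spec):
    """Return a list of problems found in a <spec> element (empty if none)."""
    problems = []
    if spec.tag != "spec":
        problems.append("root element is not <spec>")
    if not (spec.get("id") or "").startswith("Spec"):
        problems.append('@id must start with "Spec"')
    tags = [child.tag for child in spec]
    for required in ("titleStmt", "scope", "info"):
        if required not in tags:
            problems.append(f"missing <{required}>")
    if "part" not in tags and "version" not in tags:
        problems.append("at least one <part> or <version> is required")
    unknown = sorted({tag for tag in tags if tag not in SPEC_ORDER})
    if unknown:
        problems.append("unexpected elements: " + ", ".join(unknown))
    positions = [SPEC_ORDER.index(tag) for tag in tags if tag in SPEC_ORDER]
    if positions != sorted(positions):
        problems.append("child elements are out of order")
    return problems


# Hypothetical sample documents, for illustration only.
valid_spec = ET.fromstring("""
<spec id="SpecExample" topic="TopicExample" standardSettingBody="SBExample">
  <titleStmt><title>Example Standard</title><abbr>EX</abbr></titleStmt>
  <scope>Illustration only.</scope>
  <info type="description">A made-up standard.</info>
  <version id="SpecExample-2014"/>
</spec>
""")

invalid_spec = ET.fromstring("""
<spec id="Example">
  <titleStmt><title>Example Standard</title><abbr>EX</abbr></titleStmt>
  <version id="SpecExample-2014"/>
  <scope>Out of place.</scope>
</spec>
""")

valid_problems = check_spec(valid_spec)
invalid_problems = check_spec(invalid_spec)
```

For the second sample, the check reports the wrong @id prefix, the missing <info> and the misplaced <scope>.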

TitleStmt, Title, Abbr, RespStmt    

The element <titleStmt> stands for title statement. The node <titleStmt> contains information about the title <title>, abbreviation <abbr>, and responsible statement <respStmt>. The title and abbreviation are obligatory, whereas the responsible statement is optional. The <title> node is obligatory in the <spec> and <part>, but optional in <version>, because a standard version does not always have a title.

The node <titleStmt> also appears in the <sbs> (see Schema for Standard Body) and <topic> (see Schema for Topic).

The tag clouds and relation graphs require abbreviations of standards and versions. Therefore the <abbr> is obligatory in <spec> and <version> (but not in <part>). However, an official abbreviation of a standard or a version is not always available. In this case, an abbreviation must be created for use in the website and must be marked by setting the attribute @internal to “yes”. A created version abbreviation should have the format [part-abbr]-[version-year].
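As an illustration of the [part-abbr]-[version-year] convention, the following Python sketch builds an internal <abbr> element. The helper name make_internal_abbr and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET


def make_internal_abbr(part_abbr, version_year):
    """Build an <abbr> marked as created for internal use on the website."""
    abbr = ET.Element("abbr", {"internal": "yes"})
    abbr.text = f"{part_abbr}-{version_year}"
    return abbr


# "EX-1" and 2014 are made-up values for illustration.
abbr = make_internal_abbr("EX-1", 2014)
xml_text = ET.tostring(abbr, encoding="unicode")
```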

A <titleStmt> can have more than one <respStmt>, therefore each <respStmt> must have an @id attribute. The @id is needed, for instance, to select which <respStmt> to update or remove. A <respStmt> contains a <resp> and at least one <name>. The <resp> designates the type of responsibility, whose value is restricted to editor, author, publisher, convenor, chair and secretary. The <name> gives the name of the responsible entity or entities and has a type attribute that distinguishes an organization (element <org>) from a person (element <person>). If the responsible entity is a standard body listed in sbs.xml, the @id of the <name> must be the id of that standard body. This is necessary to create a reference or a link to the standard body page.
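The rules for <respStmt> can be checked in the same spirit. The Python sketch below is a hypothetical helper, not part of the system; it assumes elements without a namespace, and it additionally assumes that @id values must be unique within a <titleStmt>, which the text implies but does not state.

```python
import xml.etree.ElementTree as ET

# Responsibility types allowed according to the text above.
ALLOWED_RESP = {"editor", "author", "publisher", "convenor", "chair", "secretary"}


def check_resp_stmts(title_stmt):
    """Return a list of problems found in the <respStmt> children."""
    problems = []
    seen_ids = set()
    for position, stmt in enumerate(title_stmt.findall("respStmt"), start=1):
        stmt_id = stmt.get("id")
        if not stmt_id:
            problems.append(f"respStmt {position}: missing @id")
        elif stmt_id in seen_ids:
            # Assumption: ids must be unique so that one statement can be selected.
            problems.append(f"respStmt {position}: duplicate @id {stmt_id!r}")
        else:
            seen_ids.add(stmt_id)
        resp = (stmt.findtext("resp") or "").strip()
        if resp not in ALLOWED_RESP:
            problems.append(f"respStmt {position}: unsupported <resp> value {resp!r}")
        if stmt.find("name") is None:
            problems.append(f"respStmt {position}: at least one <name> is required")
    return problems


# Hypothetical sample: the first statement is fine, the second breaks three rules.
title_stmt = ET.fromstring("""
<titleStmt>
  <title>Example Standard</title>
  <abbr>EX</abbr>
  <respStmt id="resp1"><resp>editor</resp><name id="SBExample">Example Body</name></respStmt>
  <respStmt><resp>sponsor</resp></respStmt>
</titleStmt>
""")

resp_problems = check_resp_stmts(title_stmt)
```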

Scope

A <scope> describes the purpose of a standard and to what extent it is useful. It may be similar to a standard topic, but it is not limited to a pre-defined set of areas.    

Keyword

The <keyword> nodes signify important hints about a standard. The standard abbreviation is not allowed to be a keyword because it creates redundancy in the standard tag cloud.

Info, BiblStruct

An <info> has different functions based on its @type. For instance:

  • <info type="description"> contains general textual information about the standard.
  • <info type="recReading"> contains a bibliography or references to related papers that are recommended reading, defined by the element <biblStruct>.

The <biblStruct> node defines a reference to a paper or a book about the standard. The bibliographic structure of the node is adopted from the TEI P5 Guidelines [11].

Features

A <features> node describes technical and formal aspects of a standard. It can specify, for example, which metalanguage is used (SGML vs. XML), the respective grammar class, the notation (inline vs. standoff), which constraint language defines the markup language, or other information relevant to the standard. The description of the feature set adopts the principles of the TEI feature structure representation [12].

The node <features> has an optional @name attribute for its features name and can contain the elements <fs> or <vColl>.

The <fs> stands for “feature structure” and can be used to represent different kinds of information.  The <fs> element has an optional @type attribute, which indicates the type of feature structure it represents. An <fs> element groups a sequence of feature-value pairs together. A feature is defined as an element <f> with a @name attribute indicating the name of the feature and any number of associated values, such as <binary>, <symbol>, <numeric>, and <string>.

The <vColl> element stands for “collection of values”. It allows the encoding of lists, sets and bags (i.e., multisets) of the values.
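To make the feature structure more concrete, the following Python sketch reads a small <features> fragment into a dictionary. The fragment is hypothetical: the feature names are invented, and storing a value in a value attribute of <symbol> follows TEI usage, which spec.xsd may or may not adopt unchanged.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; feature names and value encoding are assumptions.
SAMPLE = """
<features name="technical">
  <fs type="markup">
    <f name="metaLanguage"><symbol value="XML"/></f>
    <f name="notation"><symbol value="standoff"/></f>
  </fs>
</features>
"""


def read_features(features):
    """Map each feature name to the list of its values."""
    result = {}
    for fs in features.findall("fs"):
        for f in fs.findall("f"):
            # A value is taken from @value if present, otherwise from the text.
            values = [v.get("value") or (v.text or "").strip() for v in f]
            result[f.get("name")] = values
    return result


feature_map = read_features(ET.fromstring(SAMPLE))
```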

Address

An <address> element refers to a URL or a postal address.   

Relation

A <relation> node describes an association between two standards or standard versions. A <relation> has two attributes: @type, signifying the kind of relation, such as isVersionOf, and @target, signifying the target of the relation. The relation types are defined in the XML Schema (see the Schemas section). The <relation> nodes are used to create the standard relation graphs.
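A relation graph can be derived from these nodes by treating each <relation> as a labelled edge from the enclosing element to the target. The Python sketch below shows the idea only; the website itself draws the graphs with D3, and the sample document, including the use of isVersionOf on a version pointing to its standard, is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
SAMPLE = """
<spec id="SpecExample">
  <titleStmt><title>Example Standard</title><abbr>EX</abbr></titleStmt>
  <version id="SpecExample-2014">
    <relation type="isVersionOf" target="SpecExample"/>
  </version>
</spec>
"""


def relation_edges(spec):
    """Collect (source id, relation type, target id) triples."""
    edges = []
    for node in spec.iter():
        # Only direct <relation> children belong to the current node.
        for relation in node.findall("relation"):
            edges.append((node.get("id"), relation.get("type"), relation.get("target")))
    return edges


edges = relation_edges(ET.fromstring(SAMPLE))
```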

Part

A standard is sometimes divided into several parts. The <part> node describes a standard part. A part must have an <id> and a <title>. In addition, it can have the other elements that a <spec> can have. However, a part cannot have sub-parts, so it must not contain any <part>. Instead, a part must have at least one version.

Version

A standard is typically published more than once, because it may be improved over time. Each publication delivers a new standard version. A <version> node describes a standard version. A <version> has the attribute @id, whose value must start with “Spec” like a standard id, and the attribute @status, indicating the current status of the version, such as working draft, final or recommendation.

A <version> must have an <abbr> in its <titleStmt> and a published date <date>. Besides, it can contain one or more optional nodes: <versionNumber>, <info>, <features>, <address>, <relation> and <asset>. A <versionNumber> can have @type major or minor. A major version number usually corresponds to a major revision with significant changes in the standard version, whereas a minor one contains only small changes.
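The rules for <version> can be sketched as a small check as well. The Python function below is hypothetical and assumes elements without a namespace; the sample values, including the date format, are invented.

```python
import xml.etree.ElementTree as ET


def check_version(version):
    """Return a list of problems found in a <version> element."""
    problems = []
    if not (version.get("id") or "").startswith("Spec"):
        problems.append('@id must start with "Spec"')
    if version.find("titleStmt/abbr") is None:
        problems.append("missing <abbr> in <titleStmt>")
    if version.find("date") is None:
        problems.append("missing <date>")
    for number in version.findall("versionNumber"):
        number_type = number.get("type")
        # @type is only checked when it is present.
        if number_type is not None and number_type not in ("major", "minor"):
            problems.append(f"unsupported versionNumber type {number_type!r}")
    return problems


# Hypothetical samples for illustration.
good_version = ET.fromstring("""
<version id="SpecExample-2014" status="final">
  <titleStmt><abbr internal="yes">EX-2014</abbr></titleStmt>
  <date>2014</date>
  <versionNumber type="major">1</versionNumber>
</version>
""")

bad_version = ET.fromstring("""
<version id="v1"><versionNumber type="patch">3</versionNumber></version>
""")

good_problems = check_version(good_version)
bad_problems = check_version(bad_version)
```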

Asset

The <asset> node of a standard lists links referring to some standard documents, such as examples of standard applications.

Schema for Standard Body

A standard body XML file has the root node <sbs>, which stands for Standard Body Set. It describes all the standard bodies whose ids are referred to in the attribute @standardSettingBody in <spec>.

Sbs

The <sbs> root element contains child elements <sb> describing information about each standard body individually.

Sb

The <sb> has three attributes: @id, @type and @display. The @id defines the identifier for the standard body. It must start with “SB” and in a normal case should be complemented with the short name or acronym of a standard organization, for example “SBISO” for International Organization for Standardization or “SBW3C” for World Wide Web Consortium.

The <sb> must contain a <titleStmt> and an <info>. Optionally it can also contain the elements <address> and <relation>. The elements in <sb> must follow a certain order, namely <titleStmt>, <info>, <address>, <relation>.

An <sb> can describe not only a standard organization, but also a technical committee, a subcommittee or a working group within a standard organization. This is indicated by the optional @type attribute. A relation between standard organizations can be specified in a <relation> element.

The attribute values “hide” and “show” of @display determine whether the information about a standard body is shown on the web page. For instance, a standard body is hidden while its information is still incomplete.
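The effect of @display can be illustrated with a short Python sketch that lists the standard bodies to be shown. The sample ids are hypothetical, and treating a missing @display as "show" is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real data is stored in sbs.xml.
SAMPLE = """
<sbs>
  <sb id="SBExample" display="show">
    <titleStmt><title>Example Body</title><abbr>EB</abbr></titleStmt>
    <info>Complete entry.</info>
  </sb>
  <sb id="SBDraft" display="hide">
    <titleStmt><title>Draft Body</title><abbr>DB</abbr></titleStmt>
    <info>Still incomplete.</info>
  </sb>
</sbs>
"""


def visible_standard_bodies(sbs):
    """Return the ids of all standard bodies that are not hidden."""
    # Assumption: an <sb> without @display is shown.
    return [sb.get("id") for sb in sbs.findall("sb") if sb.get("display") != "hide"]


visible_ids = visible_standard_bodies(ET.fromstring(SAMPLE))
```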

Schema for Topic

Each standard can be assigned to one or more topics. These topics should be listed in the <spec> element as the value of the attribute @topic. Multiple topics must be separated with a space character. By means of these topics, standards of similar topics can be grouped together.

The topic XML file has the root node <topics> with child elements <topic>. The element <topic> includes the information about each topic individually and has a mandatory attribute @id. The attribute defines the identifier for the topic, which must start with “Topic” and should be complemented with the short name or acronym of the topic name, for example “TopicSemAnn” stands for Topic Semantic Annotation. A <topic> must include the elements <titleStmt> and <info>.
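Resolving the space-separated @topic value against the topic list can be sketched as follows. This Python example is illustrative only: the id TopicSemAnn is taken from the text above, while the sample documents, the id TopicUnknown and the function name topic_titles are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real data is stored in topics.xml.
TOPICS = """
<topics>
  <topic id="TopicSemAnn">
    <titleStmt><title>Semantic Annotation</title></titleStmt>
    <info>Illustration only.</info>
  </topic>
</topics>
"""


def topic_titles(spec, topics):
    """Return (topic id, title) pairs; the title is None for unknown ids."""
    by_id = {
        topic.get("id"): topic.findtext("titleStmt/title")
        for topic in topics.findall("topic")
    }
    # Multiple topic ids are separated with a space character.
    return [(topic_id, by_id.get(topic_id)) for topic_id in (spec.get("topic") or "").split()]


spec = ET.fromstring('<spec id="SpecExample" topic="TopicSemAnn TopicUnknown"/>')
resolved = topic_titles(spec, ET.fromstring(TOPICS))
```

An id that resolves to None points to a topic that is missing from the topic list.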

Search

The search/ folder contains the XQuery view code for searching standard descriptions.

User

The user/ folder contains the XQuery view code for user registration and login.

Views

The views/ folder contains the XQuery view code defining the web pages.

Controller

The only controller XQuery file needed, for example for URL rewriting [13], is controller.xql.

Index

The index.xq file manages the homepage of the website.

Web Login

The website already has a web-admin account and a test user account. The credentials of the accounts can be obtained from Antonina Werthmann (werthmann@ids-mannheim.de). Although the website provides a registration feature for new users, a new web-admin account has to be added manually. Please set the email address and the MD5-encoded password in the user XML data (see the User XML Data section).
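An MD5 value for a manually added account can be computed with a few lines of Python. This is a sketch under assumptions: it produces the hexadecimal digest of the UTF-8 encoded password without a salt, which may differ from the exact form the system expects, and the function name md5_hex is hypothetical. Compare the result with an existing entry in users.xml before relying on it.

```python
import hashlib


def md5_hex(password):
    """Return the MD5 digest of a password as a hexadecimal string."""
    # Assumption: UTF-8 encoding, hexadecimal output, no salt.
    return hashlib.md5(password.encode("utf-8")).hexdigest()


digest = md5_hex("hello")
```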

Further Work

The following tasks are planned to be done in the future:

  • Expansion of the standard collection in the specifications/ folder with descriptions of further standards or best-practice guidelines used in the CLARIN-D project and their relations to the project. For example, the collection lacks standards for linguistic annotation, metadata annotation, data retention, data archiving and so on.
  • Extension and update of the existing standard descriptions including filling any information gaps that may exist.
  • Expansion of the collection of the examples in the doc/ folder for the existing standards with direct links to them.
  • Addition of actual bibliography entries for the standards.
  • Description and completion of missing information for all existing topics.
  • Addition of new topics.
  • Continuous monitoring, maintenance and updating of links to external websites.
  • Extensive testing to make sure that each function works properly.