Diffstat (limited to 'data/doc/manuals_generated/sisu_manual/sisu_introduction/sax.xml')
-rw-r--r-- data/doc/manuals_generated/sisu_manual/sisu_introduction/sax.xml 599
1 files changed, 599 insertions, 0 deletions
diff --git a/data/doc/manuals_generated/sisu_manual/sisu_introduction/sax.xml b/data/doc/manuals_generated/sisu_manual/sisu_introduction/sax.xml
new file mode 100644
index 00000000..2b0d3432
--- /dev/null
+++ b/data/doc/manuals_generated/sisu_manual/sisu_introduction/sax.xml
@@ -0,0 +1,599 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<?xml-stylesheet type="text/css" href="../_sisu/css/sax.css"?>
+<!-- Document processing information:
+ * Generated by: SiSU 0.59.0 of 2007w38/0 (2007-09-23)
+ * Ruby version: ruby 1.8.6 (2007-06-07 patchlevel 36) [i486-linux]
+ *
+ * Last Generated on: Sun Sep 23 04:12:00 +0100 2007
+ * SiSU http://www.jus.uio.no/sisu
+-->
+
+<document>
+<head>
+ <meta>Title:</meta>
+ <title class="dc">
+ SiSU - Commands [0.58]
+ </title>
+ <br />
+ <meta>Creator:</meta>
+ <creator class="dc">
+ Ralph Amissah
+ </creator>
+ <br />
+ <meta>Rights:</meta>
+ <rights class="dc">
+ Copyright (C) Ralph Amissah 2007, part of SiSU documentation, License GPL 3
+ </rights>
+ <br />
+ <meta>Type:</meta>
+ <type class="dc">
+ information
+ </type>
+ <br />
+ <meta>Subject:</meta>
+ <subject class="dc">
+ ebook, epublishing, electronic book, electronic publishing, electronic document, electronic citation, data structure, citation systems, search
+ </subject>
+ <br />
+ <meta>Date created:</meta>
+ <date_created class="extra">
+ 2002-08-28
+ </date_created>
+ <br />
+ <meta>Date issued:</meta>
+ <date_issued class="extra">
+ 2002-08-28
+ </date_issued>
+ <br />
+ <meta>Date available:</meta>
+ <date_available class="extra">
+ 2002-08-28
+ </date_available>
+ <br />
+ <meta>Date modified:</meta>
+ <date_modified class="extra">
+ 2007-09-16
+ </date_modified>
+ <br />
+ <meta>Date:</meta>
+ <date class="dc">
+ 2007-09-16
+ </date>
+ <br />
+</head>
+<body>
+<object id="1">
+ <ocn>1</ocn>
+ <text class="h1">
+ SiSU - Commands [0.58],<br /> Ralph Amissah
+ </text>
+</object>
+<object id="2">
+ <ocn>2</ocn>
+ <text class="h2">
+ What is SiSU?
+ </text>
+</object>
+<object id="3">
+ <ocn>3</ocn>
+ <text class="h3">
+ Description
+ </text>
+</object>
+<object id="4">
+ <ocn>4</ocn>
+ <text class="h4">
+ 1. Introduction - What is SiSU?
+ </text>
+</object>
+<object id="5">
+ <ocn>5</ocn>
+ <text class="norm">
+ <b>SiSU</b> is a system for document markup, publishing (in multiple
+open standard formats) and search.
+ </text>
+</object>
+<object id="6">
+ <ocn>6</ocn>
+ <text class="norm">
+ <b>SiSU</b><en>1</en> is a<en>2</en> framework for document
+structuring, publishing and search, comprising (a) a lightweight
+document structure and presentation markup syntax and (b) an
+accompanying engine that generates standard document format outputs
+from documents prepared in sisu markup syntax. The engine is able to
+produce multiple standard outputs that (can) share a common numbering
+system for the citation of text within a document.
+ </text>
+ <endnote notenumber="1">
+ <number>1</number>
+ <note>
+ "<b>SiSU</b> information Structuring Universe" or "Structured
+information, Serialized Units".<br /> Also chosen for the meaning of
+the Finnish term "sisu".
+ </note>
+ </endnote>
+ <endnote notenumber="2">
+ <number>2</number>
+ <note>
+ Unix command line oriented
+ </note>
+ </endnote>
+</object>
+<object id="7">
+ <ocn>7</ocn>
+ <text class="norm">
+ <b>SiSU</b> is developed under an open source, software libre license
+(GPL3). It has been developed in the context of coping with large
+document sets with evolving markup related technologies, for which you
+want multiple output formats, a common mechanism for
+cross-output-format citation, and search.
+ </text>
+</object>
+<object id="8">
+ <ocn>8</ocn>
+ <text class="norm">
+ <b>SiSU</b> both defines a markup syntax and provides an engine that
+produces open standards format outputs from documents prepared with
+<b>SiSU</b> markup. From a single lightly prepared document sisu custom
+builds several standard output formats which share a common (text
+object) numbering system for citation of content within a document
+(that also has implications for search). The sisu engine works with an
+abstraction of the document's structure and content from which it is
+possible to generate different forms of representation of the document.
+Significantly, <b>SiSU</b> markup is more sparse than html, and the
+outputs, which include html, LaTeX, landscape and portrait pdfs and Open
+Document Format (ODF), can all be added to and updated. <b>SiSU</b> is
+also able to populate SQL type databases at an object level, which
+means that searches can be made with that degree of granularity.
+Matching objects (primarily paragraphs and headings) can be viewed
+directly in the database, or just their object numbers shown, i.e. your
+search criteria are met in these documents and at these locations within
+each document.
+ </text>
+</object>
+<object id="9">
+ <ocn>9</ocn>
+ <text class="norm">
+ Source document preparation and output generation is a two-step
+process: (i) document source is prepared, that is, marked up in sisu
+markup syntax and (ii) the desired output is subsequently generated by
+running the sisu engine against document source. Output representations,
+if updated (in the sisu engine), can be regenerated by re-running the
+engine against the prepared source. Using <b>SiSU</b> markup applied to
+a document, <b>SiSU</b> custom builds various standard open output
+formats including plain text, HTML, XHTML, XML, OpenDocument, LaTeX and
+PDF files, and populates an SQL database with objects<en>3</en>
+(equating generally to paragraph-sized chunks) so searches may be
+performed and matches returned with that degree of granularity (e.g.
+your search criteria are met by these documents and at these locations
+within each document). Document output formats share a common object
+numbering system for locating content. This is particularly suitable
+for "published" works (finalized texts as opposed to works that are
+frequently changed or updated) for which it provides a fixed means of
+reference of content.
+ </text>
+ <endnote notenumber="3">
+ <number>3</number>
+ <note>
+ objects include: headings, paragraphs, verse, tables, images, but not
+footnotes/endnotes which are numbered separately and tied to the object
+from which they are referenced.
+ </note>
+ </endnote>
+</object>
+<object id="10">
+ <ocn>10</ocn>
+ <text class="norm">
+ In preparing a <b>SiSU</b> document you optionally provide semantic
+information related to the document in a document header, and in
+marking up the substantive text provide information on the structure of
+the document, primarily indicating heading levels and footnotes. You
+also provide information on basic text attributes where used. The rest
+is automatic: from this information sisu custom builds<en>4</en> the
+different forms of output requested.
+ </text>
+ <endnote notenumber="4">
+ <number>4</number>
+ <note>
+ i.e. the html, pdf, odf outputs are each built individually and
+optimised for that form of presentation, rather than for example the
+html being a saved version of the odf, or the pdf being a saved version
+of the html.
+ </note>
+ </endnote>
+</object>
+<object id="11">
+ <ocn>11</ocn>
+ <text class="norm">
+ <b>SiSU</b> works with an abstraction of the document based on its
+structure, which comprises its frame<en>5</en> and the
+objects<en>6</en> it contains. This enables <b>SiSU</b> to represent
+the document in many different ways, and to take advantage of the
+strengths of different ways of presenting documents. The objects are
+numbered, and these numbers can be used to provide a common base for
+citing material within a document across the different output format
+types. This is significant as page numbers are not suited to the
+digital age: in web publishing, changing a browser's default font or
+using a different browser means that text appears on different pages;
+and in publishing in different formats (html, landscape and portrait
+pdf etc.) page numbers are again of no use for citing text in a manner
+that is relevant across the different output types. Dealing with documents
+at an object level together with object numbering also has implications
+for search.
+ </text>
+ <endnote notenumber="5">
+ <number>5</number>
+ <note>
+ the different heading levels
+ </note>
+ </endnote>
+ <endnote notenumber="6">
+ <number>6</number>
+ <note>
+ units of text, primarily paragraphs and headings, also any tables,
+poems, code-blocks
+ </note>
+ </endnote>
+</object>
+<object id="12">
+ <ocn>12</ocn>
+ <text class="norm">
+ One of the challenges of maintaining documents is to keep them in a
+format that allows users to use them without depending on whatever
+proprietary software is popular at the time. Consider the ease of dealing
+with legacy proprietary formats today and what guarantee you have that
+old proprietary formats will remain (or can be read without proprietary
+software/equipment) in 15 years' time, or the way in which html
+has evolved over its relatively short span of existence. <b>SiSU</b>
+provides the flexibility of outputting documents in multiple
+non-proprietary open formats including html, pdf<en>7</en> and the ISO
+standard ODF.<en>8</en> Whilst <b>SiSU</b> relies on software, the
+markup is uncomplicated and minimalistic, which helps ensure that future
+engines can be written to run against it. It is also easily converted
+to other formats, which means documents prepared in <b>SiSU</b> can be
+migrated to other document formats. Further security is provided by the
+fact that the software itself, <b>SiSU</b>, is available under GPL3, a
+license that guarantees that the source code will always be open, and
+free as in libre, which means that the code base can be used, updated
+and further developed as required under the terms of its license.
+Another challenge is to keep up with a moving target. <b>SiSU</b>
+permits new forms of output to be added as they become important (Open
+Document Format text was added in 2006), and existing output to be
+updated (html has evolved and the related module has been updated
+repeatedly over the years; presumably when the World Wide Web
+Consortium (W3C) finalises html 5, which is currently under development,
+the html module will again be updated, allowing all existing documents
+to be regenerated as html 5).
+ </text>
+ <endnote notenumber="7">
+ <number>7</number>
+ <note>
+ Specification submitted by Adobe to ISO to become a full open ISO
+specification <br /> &lt;<link
+xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
+xlink:href="http://www.linux-watch.com/news/NS7542722606.html">http://www.linux-watch.com/news/NS7542722606.html</link>&gt;
+ </note>
+ </endnote>
+ <endnote notenumber="8">
+ <number>8</number>
+ <note>
+ ISO/IEC 26300:2006
+ </note>
+ </endnote>
+</object>
+<object id="13">
+ <ocn>13</ocn>
+ <text class="norm">
+ The document formats are written to the file-system and available for
+indexing by independent indexing tools, whether off the web like Google
+and Yahoo or on the site like Lucene and Hyperestraier.
+ </text>
+</object>
+<object id="14">
+ <ocn>14</ocn>
+ <text class="norm">
+ <b>SiSU</b> also provides other features such as concordance files and
+document content certificates. Working against an abstraction of
+document structure opens further possibilities for the research and
+development of other document representations; the availability of
+objects is useful, for example, for topic maps and for the commercial law
+thesaurus by Vikki Rogers and Al Kritzer, which together with the
+flexibility of <b>SiSU</b> offers great possibilities.
+ </text>
+</object>
+<object id="15">
+ <ocn>15</ocn>
+ <text class="norm">
+ <b>SiSU</b> is primarily for published works, which can take advantage
+of the citation system to reliably reference its documents. <b>SiSU</b>
+works well in a complementary manner with such collaborative
+technologies as Wikis, which can take advantage of and be used to
+discuss the substance of content prepared in <b>SiSU</b>.
+ </text>
+</object>
+<object id="16">
+ <ocn>16</ocn>
+ <text class="norm">
+ &lt;<link xmlns:xlink="http://www.w3.org/1999/xlink"
+xlink:type="simple"
+xlink:href="http://www.jus.uio.no/sisu">http://www.jus.uio.no/sisu</link>&gt;
+ </text>
+</object>
+<object id="17">
+ <ocn>17</ocn>
+ <text class="h4">
+ 2. How does sisu work?
+ </text>
+</object>
+<object id="18">
+ <ocn>18</ocn>
+ <text class="norm">
+ <b>SiSU</b> markup is fairly minimalistic. It consists of: a (largely
+optional) document header, made up of information about the document
+(such as when it was published, who authored it, and granting what
+rights) and any processing instructions; and markup within the
+substantive text of the document, which is related to document
+structure and typeface. <b>SiSU</b> must be able to discern the
+structure of a document (text headings and their levels in relation to
+each other), either from information provided in the document header or
+from markup within the text (or from a combination of both). Processing
+is done against an abstraction of the document comprising
+information on the document's structure and its objects, which the
+program serializes (providing the object numbers) and which are
+assigned hash sum values based on their content. This abstraction of
+information about document structure, objects (and hash sums)
+provides considerable flexibility in representing documents in different
+ways and for different purposes (e.g. search, document layout,
+publishing, content certification, concordance etc.), and makes it
+possible to take advantage of some of the strengths of established ways
+of representing documents (or indeed to create new ones).
+ </text>
+</object>
+<object id="19">
+ <ocn>19</ocn>
+ <text class="h4">
+ 3. Summary of features
+ </text>
+</object>
+<object id="20">
+ <ocn>20</ocn>
+ <text class="indent_bullet">
+ sparse/minimal markup (clean utf-8 source texts). Documents are
+prepared in a single UTF-8 file using a minimalistic mnemonic syntax.
+Typical literature (documents like "War and Peace") requires almost no
+markup, and most of the headers are optional.
+ </text>
+</object>
+<object id="21">
+ <ocn>21</ocn>
+ <text class="indent_bullet">
+ markup is easily readable/parsable by the human eye, (basic markup is
+simpler and more sparse than the most basic HTML), [this may also be
+converted to XML representations of the same input/source document].
+ </text>
+</object>
+<object id="22">
+ <ocn>22</ocn>
+ <text class="indent_bullet">
+ markup defines document structure (this may be done once in a header
+pattern-match description, or for heading levels individually); basic
+text attributes (bold, italics, underscore, strike-through etc.) as
+required; and semantic information related to the document (header
+information, extended beyond the Dublin core and easily further
+extended as required); the headers may also contain processing
+instructions. <b>SiSU</b> markup is primarily an abstraction of
+document structure and document metadata to permit taking advantage of
+the basic strengths of existing alternative practical standard ways of
+representing documents [be that browser viewing, paper publication, sql
+search etc.] (html, xml, odf, latex, pdf, sql)
+ </text>
+</object>
+<object id="23">
+ <ocn>23</ocn>
+ <text class="indent_bullet">
+ for output, produces reasonably elegant output in established industry
+and institutionally accepted open standard formats, and takes advantage
+of the different strengths of various standard formats for representing
+documents; amongst the output formats currently supported are:
+ </text>
+</object>
+<object id="24">
+ <ocn>24</ocn>
+ <text class="indent_bullet1">
+ html - both as a single scrollable text and a segmented document
+ </text>
+</object>
+<object id="25">
+ <ocn>25</ocn>
+ <text class="indent_bullet1">
+ xhtml
+ </text>
+</object>
+<object id="26">
+ <ocn>26</ocn>
+ <text class="indent_bullet1">
+ XML - both in sax and dom style xml structures for further
+development as required
+ </text>
+</object>
+<object id="27">
+ <ocn>27</ocn>
+ <text class="indent_bullet1">
+ ODF - open document format, the iso standard for document storage
+ </text>
+</object>
+<object id="28">
+ <ocn>28</ocn>
+ <text class="indent_bullet1">
+ LaTeX - used to generate pdf
+ </text>
+</object>
+<object id="29">
+ <ocn>29</ocn>
+ <text class="indent_bullet1">
+ pdf (via LaTeX)
+ </text>
+</object>
+<object id="30">
+ <ocn>30</ocn>
+ <text class="indent_bullet1">
+ sql - population of an sql database, (at the same object level
+that is used to cite text within a document)
+ </text>
+</object>
+<object id="31">
+ <ocn>31</ocn>
+ <text class="norm">
+ Also produces: concordance files; document content certificates (md5 or
+sha256 digests of headings, paragraphs, images etc.) and html manifests
+(and sitemaps of content). Takes advantage of the strengths
+implicit in these very different output types (e.g. PDFs produced
+using the typesetting of LaTeX, and databases populated with documents at
+an individual object/paragraph level, making possible granular search (and
+related possibilities)).
+ </text>
+</object>
+<object id="32">
+ <ocn>32</ocn>
+ <text class="indent_bullet">
+ ensuring content can be cited in a meaningful way regardless of
+selected output format. Online publishing (and publishing in multiple
+document formats) lacks a useful way of citing text internally within
+documents (important to academics generally and to lawyers) as page
+numbers are meaningless across browsers and formats. sisu seeks to
+provide a common way of pinpointing text within a document (which can
+be utilized for citation and by search engines). The outputs share a
+common numbering system that is meaningful (to man and machine) across
+all digital outputs whether paper, screen, or database oriented (pdf,
+HTML, xml, sqlite, postgresql); this numbering system can be used to
+reference content.
+ </text>
+</object>
+<object id="33">
+ <ocn>33</ocn>
+ <text class="indent_bullet">
+ Granular search within documents. SQL databases are populated at an
+object level (roughly headings, paragraphs, verse, tables) and become
+searchable with that degree of granularity, the output information
+provides the object/paragraph numbers which are relevant across all
+generated outputs; it is also possible to look at just the matching
+paragraphs of the documents in the database; [output indexing also works
+well with search indexing tools like hyperestraier].
+ </text>
+</object>
+<object id="34">
+ <ocn>34</ocn>
+ <text class="indent_bullet">
+ long term maintainability of document collections in a world of
+changing formats: with a very sparsely marked-up source document
+base there is a considerable degree of future-proofing, output
+representations are "upgradeable", and new document formats may be
+added (e.g. the addition of the odf (open document text) module in 2006,
+and html5 output sometime in future) without modification of
+existing prepared texts
+ </text>
+</object>
+<object id="35">
+ <ocn>35</ocn>
+ <text class="indent_bullet">
+ SQL search aside, documents are generated as required and static once
+generated.
+ </text>
+</object>
+<object id="36">
+ <ocn>36</ocn>
+ <text class="indent_bullet">
+ documents produced are static files, and may be batch processed; this
+needs to be done only once but may be repeated for various reasons as
+desired (updated content, addition of new output formats, updated
+technology document presentations/representations)
+ </text>
+</object>
+<object id="37">
+ <ocn>37</ocn>
+ <text class="indent_bullet">
+ document source (plaintext utf-8) if shared on the net may be used as
+input and processed locally to produce the different document outputs
+ </text>
+</object>
+<object id="38">
+ <ocn>38</ocn>
+ <text class="indent_bullet">
+ document source may be bundled together (automatically) with associated
+documents (multiple language versions or master document with
+inclusions) and images and sent as a zip file called a sisupod, if
+shared on the net these too may be processed locally to produce the
+desired document outputs
+ </text>
+</object>
+<object id="39">
+ <ocn>39</ocn>
+ <text class="indent_bullet">
+ generated document outputs may automatically be posted to remote sites.
+ </text>
+</object>
+<object id="40">
+ <ocn>40</ocn>
+ <text class="indent_bullet">
+ for basic document generation, the only software dependency is
+<b>Ruby</b>, and a few standard Unix tools (this covers plaintext,
+HTML, XML, ODF, LaTeX). To use a database you of course need that, and
+to convert the LaTeX generated to pdf, a latex processor like tetex or
+texlive.
+ </text>
+</object>
+<object id="41">
+ <ocn>41</ocn>
+ <text class="indent_bullet">
+ as a developer's tool it is flexible and extensible
+ </text>
+</object>
+<object id="42">
+ <ocn>42</ocn>
+ <text class="norm">
+ Syntax highlighting for <b>SiSU</b> markup is available for a number of
+text editors.
+ </text>
+</object>
+<object id="43">
+ <ocn>43</ocn>
+ <text class="norm">
+ <b>SiSU</b> is less about document layout than about finding a way with
+little markup to be able to construct an abstract representation of a
+document that makes it possible to produce multiple representations of
+it which may be rather different from each other and used for different
+purposes, whether layout and publishing, or search of content
+ </text>
+</object>
+<object id="44">
+ <ocn>44</ocn>
+ <text class="norm">
+ i.e. to be able to take advantage from this minimal preparation
+starting point of some of the strengths of rather different established
+ways of representing documents for different purposes, whether for
+search (relational database, or indexed flat files generated for that
+purpose whether of complete documents, or say of files made up of
+objects), online viewing (e.g. html, xml, pdf), or paper publication
+(e.g. pdf)...
+ </text>
+</object>
+<object id="45">
+ <ocn>45</ocn>
+ <text class="norm">
+ The solution arrived at is to extract structural information about
+the document (about headings within the document) and to track
+objects (which are serialized and also given hash values) in the manner
+described. It makes possible representations that are quite different
+from those offered at present. For example, objects could be saved
+individually and identified by their hashes, with an index of how the
+objects relate to each other to form a document.
+ </text>
+</object>
+<object id="0">
+ <ocn>0</ocn>
+ <text class="h4">
+ Endnotes
+ </text>
+</object>
+</body>
+</document>
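
The object abstraction described under "2. How does sisu work?" (objects that are serialized with object numbers and assigned hash sums based on their content) can be illustrated with a minimal Python sketch. This is not SiSU's implementation, which is written in Ruby. The names `TextObject`, `build_objects` and `find`, the blank-line splitting rule and the choice of SHA-256 are assumptions made purely for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TextObject:
    """One citable unit of text (hypothetical structure, not SiSU's own)."""
    ocn: int      # object number, assigned in document order
    text: str     # whitespace-normalised content of the object
    digest: str   # SHA-256 hex digest of the content


def build_objects(source: str) -> List[TextObject]:
    """Split source on blank lines, number each block and hash its content."""
    objects: List[TextObject] = []
    for block in source.split("\n\n"):
        text = " ".join(block.split())  # collapse line wraps and extra spaces
        if not text:
            continue  # skip empty blocks left by runs of blank lines
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        objects.append(TextObject(len(objects) + 1, text, digest))
    return objects


def find(objects: List[TextObject], term: str) -> List[int]:
    """Return the object numbers whose text contains term (case-insensitive)."""
    needle = term.lower()
    return [obj.ocn for obj in objects if needle in obj.text.lower()]


if __name__ == "__main__":
    sample = "A heading\n\nFirst paragraph about search.\n\nSecond paragraph."
    for obj in build_objects(sample):
        print(obj.ocn, obj.digest[:12], obj.text)
```

Because the numbers depend only on the order of objects in the source, the same number can locate a passage in every output generated from that source, which is the basis for the cross-format citation and object-level search described in the document.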