Diffstat (limited to 'data/doc/manuals_generated/sisu_manual/sisu_introduction/sax.xml')
-rw-r--r-- | data/doc/manuals_generated/sisu_manual/sisu_introduction/sax.xml | 599 |
1 files changed, 599 insertions, 0 deletions
diff --git a/data/doc/manuals_generated/sisu_manual/sisu_introduction/sax.xml b/data/doc/manuals_generated/sisu_manual/sisu_introduction/sax.xml new file mode 100644 index 00000000..2b0d3432 --- /dev/null +++ b/data/doc/manuals_generated/sisu_manual/sisu_introduction/sax.xml @@ -0,0 +1,599 @@ +<?xml version="1.0" encoding="UTF-8" standalone="no"?> +<?xml-stylesheet type="text/css" href="../_sisu/css/sax.css"?> +<!-- Document processing information: + * Generated by: SiSU 0.59.0 of 2007w38/0 (2007-09-23) + * Ruby version: ruby 1.8.6 (2007-06-07 patchlevel 36) [i486-linux] + * + * Last Generated on: Sun Sep 23 04:12:00 +0100 2007 + * SiSU http://www.jus.uio.no/sisu +--> + +<document> +<head> + <meta>Title:</meta> + <title class="dc"> + SiSU - Commands [0.58] + </title> + <br /> + <meta>Creator:</meta> + <creator class="dc"> + Ralph Amissah + </creator> + <br /> + <meta>Rights:</meta> + <rights class="dc"> + Copyright (C) Ralph Amissah 2007, part of SiSU documentation, License GPL 3 + </rights> + <br /> + <meta>Type:</meta> + <type class="dc"> + information + </type> + <br /> + <meta>Subject:</meta> + <subject class="dc"> + ebook, epublishing, electronic book, electronic publishing, electronic document, electronic citation, data structure, citation systems, search + </subject> + <br /> + <meta>Date created:</meta> + <date_created class="extra"> + 2002-08-28 + </date_created> + <br /> + <meta>Date issued:</meta> + <date_issued class="extra"> + 2002-08-28 + </date_issued> + <br /> + <meta>Date available:</meta> + <date_available class="extra"> + 2002-08-28 + </date_available> + <br /> + <meta>Date modified:</meta> + <date_modified class="extra"> + 2007-09-16 + </date_modified> + <br /> + <meta>Date:</meta> + <date class="dc"> + 2007-09-16 + </date> + <br /> +</head> +<body> +<object id="1"> + <ocn>1</ocn> + <text class="h1"> + SiSU - Commands [0.58],<br /> Ralph Amissah + </text> +</object> +<object id="2"> + <ocn>2</ocn> + <text class="h2"> + What is SiSU? 
+ </text> +</object> +<object id="3"> + <ocn>3</ocn> + <text class="h3"> + Description + </text> +</object> +<object id="4"> + <ocn>4</ocn> + <text class="h4"> + 1. Introduction - What is SiSU? + </text> +</object> +<object id="5"> + <ocn>5</ocn> + <text class="norm"> + <b>SiSU</b> is a system for document markup, publishing (in multiple +open standard formats) and search + </text> +</object> +<object id="6"> + <ocn>6</ocn> + <text class="norm"> + <b>SiSU</b><en>1</en> is a<en>2</en> framework for document +structuring, publishing and search, comprising (a) a lightweight +document structure and presentation markup syntax and (b) an +accompanying engine for generating standard document format outputs +from documents prepared in sisu markup syntax, which is able to produce +multiple standard outputs that (can) share a common numbering system +for the citation of text within a document. + </text> + <endnote notenumber="1"> + <number>1</number> + <note> + "<b>SiSU</b> information Structuring Universe" or "Structured +information, Serialized Units".<br /> also chosen for the meaning of +the Finnish term "sisu". + </note> + </endnote> + <endnote notenumber="2"> + <number>2</number> + <note> + Unix command line oriented + </note> + </endnote> +</object> +<object id="7"> + <ocn>7</ocn> + <text class="norm"> + <b>SiSU</b> is developed under an open source, software libre license +(GPL3). It has been developed in the context of coping with large +document sets with evolving markup related technologies, for which you +want multiple output formats, a common mechanism for +cross-output-format citation, and search. + </text> +</object> +<object id="8"> + <ocn>8</ocn> + <text class="norm"> + <b>SiSU</b> both defines a markup syntax and provides an engine that +produces open standards format outputs from documents prepared with +<b>SiSU</b> markup.
From a single lightly prepared document sisu custom +builds several standard output formats which share a common (text +object) numbering system for citation of content within a document +(that also has implications for search). The sisu engine works with an +abstraction of the document's structure and content from which it is +possible to generate different forms of representation of the document. +Significantly, <b>SiSU</b> markup is more sparse than html, and the +outputs, which include html, LaTeX, landscape and portrait pdfs and Open +Document Format (ODF), can all be added to and updated. <b>SiSU</b> is +also able to populate SQL type databases at an object level, which +means that searches can be made with that degree of granularity. +Results of objects (primarily paragraphs and headings) can be viewed +directly in the database, or just the object numbers shown - your +search criteria are met in these documents and at these locations within +each document. + </text> +</object> +<object id="9"> + <ocn>9</ocn> + <text class="norm"> + Source document preparation and output generation is a two-step +process: (i) document source is prepared, that is, marked up in sisu +markup syntax and (ii) the desired output is subsequently generated by +running the sisu engine against document source. Output representations +if updated (in the sisu engine) can be generated by re-running the +engine against the prepared source. Using <b>SiSU</b> markup applied to +a document, <b>SiSU</b> custom builds various standard open output +formats including plain text, HTML, XHTML, XML, OpenDocument, LaTeX or +PDF files, and populates an SQL database with objects<en>3</en> +(equating generally to paragraph-sized chunks) so searches may be +performed and matches returned with that degree of granularity (e.g. +your search criteria are met by these documents and at these locations +within each document). Document output formats share a common object +numbering system for locating content.
This is particularly suitable +for "published" works (finalized texts as opposed to works that are +frequently changed or updated) for which it provides a fixed means of +reference of content. + </text> + <endnote notenumber="3"> + <number>3</number> + <note> + objects include: headings, paragraphs, verse, tables, images, but not +footnotes/endnotes which are numbered separately and tied to the object +from which they are referenced. + </note> + </endnote> +</object> +<object id="10"> + <ocn>10</ocn> + <text class="norm"> + In preparing a <b>SiSU</b> document you optionally provide semantic +information related to the document in a document header, and in +marking up the substantive text provide information on the structure of +the document, primarily indicating heading levels and footnotes. You +also provide information on basic text attributes where used. The rest +is automatic, sisu from this information custom builds<en>4</en> the +different forms of output requested. + </text> + <endnote notenumber="4"> + <number>4</number> + <note> + i.e. the html, pdf, odf outputs are each built individually and +optimised for that form of presentation, rather than for example the +html being a saved version of the odf, or the pdf being a saved version +of the html. + </note> + </endnote> +</object> +<object id="11"> + <ocn>11</ocn> + <text class="norm"> + <b>SiSU</b> works with an abstraction of the document based on its +structure which is comprised of its frame<en>5</en> and the +objects<en>6</en> it contains, which enables <b>SiSU</b> to represent +the document in many different ways, and to take advantage of the +strengths of different ways of presenting documents. The objects are +numbered, and these numbers can be used to provide a common base for +citing material within a document across the different output format +types. 
This is significant as page numbers are not suited to the +digital age: in web publishing, changing a browser's default font or +using a different browser means that text appears on different pages; +and in publishing in different formats, html, landscape and portrait +pdf etc., again page numbers are of no use to cite text in a manner that +is relevant against the different output types. Dealing with documents +at an object level together with object numbering also has implications +for search. + </text> + <endnote notenumber="5"> + <number>5</number> + <note> + the different heading levels + </note> + </endnote> + <endnote notenumber="6"> + <number>6</number> + <note> + units of text, primarily paragraphs and headings, also any tables, +poems, code-blocks + </note> + </endnote> +</object> +<object id="12"> + <ocn>12</ocn> + <text class="norm"> + One of the challenges of maintaining documents is to keep them in a +format that would allow users to use them without depending on +proprietary software popular at the time. Consider the ease of dealing +with legacy proprietary formats today and what guarantee you have that +old proprietary formats will remain (or can be read without proprietary +software/equipment) in 15 years' time, or the way in which html +has evolved over its relatively short span of existence. <b>SiSU</b> +provides the flexibility of outputting documents in multiple +non-proprietary open formats including html, pdf<en>7</en> and the ISO +standard ODF.<en>8</en> Whilst <b>SiSU</b> relies on software, the +markup is uncomplicated and minimalistic which guarantees that future +engines can be written to run against it. It is also easily converted +to other formats, which means documents prepared in <b>SiSU</b> can be +migrated to other document formats.
Further security is provided by the +fact that the software itself, <b>SiSU</b>, is available under GPL3, a +licence that guarantees that the source code will always be open, and +free as in libre, which means that the code base can be used, updated +and further developed as required under the terms of its license. +Another challenge is to keep up with a moving target. <b>SiSU</b> +permits new forms of output to be added as they become important (Open +Document Format text was added in 2006), and existing output to be +updated (html has evolved and the related module has been updated +repeatedly over the years; presumably when the World Wide Web +Consortium (w3c) finalises html 5, which is currently under development, +the html module will again be updated, allowing all existing documents +to be regenerated as html 5). + </text> + <endnote notenumber="7"> + <number>7</number> + <note> + Specification submitted by Adobe to ISO to become a full open ISO +specification <br /> &lt;<link +xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" +xlink:href="http://www.linux-watch.com/news/NS7542722606.html">http://www.linux-watch.com/news/NS7542722606.html</link>&gt; + </note> + </endnote> + <endnote notenumber="8"> + <number>8</number> + <note> + ISO/IEC 26300:2006 + </note> + </endnote> +</object> +<object id="13"> + <ocn>13</ocn> + <text class="norm"> + The document formats are written to the file-system and available for +indexing by independent indexing tools, whether off the web like Google +and Yahoo or on the site like Lucene and Hyperestraier.
+ </text> +</object> +<object id="14"> + <ocn>14</ocn> + <text class="norm"> + <b>SiSU</b> also provides other features such as concordance files and +document content certificates, and working against an abstraction +of document structure has further possibilities for the research and +development of other document representations. The availability of +objects is useful, for example, for topic maps and the commercial law +thesaurus by Vikki Rogers and Al Kritzer, which together with the +flexibility of <b>SiSU</b> offers great possibilities. + </text> +</object> +<object id="15"> + <ocn>15</ocn> + <text class="norm"> + <b>SiSU</b> is primarily for published works, which can take advantage +of the citation system to reliably reference its documents. <b>SiSU</b> +works well in a complementary manner with such collaborative +technologies as Wikis, which can take advantage of and be used to +discuss the substance of content prepared in <b>SiSU</b>. + </text> +</object> +<object id="16"> + <ocn>16</ocn> + <text class="norm"> + &lt;<link xmlns:xlink="http://www.w3.org/1999/xlink" +xlink:type="simple" +xlink:href="http://www.jus.uio.no/sisu">http://www.jus.uio.no/sisu</link>&gt; + </text> +</object> +<object id="17"> + <ocn>17</ocn> + <text class="h4"> + 2. How does sisu work? + </text> +</object> +<object id="18"> + <ocn>18</ocn> + <text class="norm"> + <b>SiSU</b> markup is fairly minimalistic; it consists of: a (largely +optional) document header, made up of information about the document +(such as when it was published, who authored it, and granting what +rights) and any processing instructions; and markup within the +substantive text of the document, which is related to document +structure and typeface. <b>SiSU</b> must be able to discern the +structure of a document (text headings and their levels in relation to +each other), either from information provided in the document header or +from markup within the text (or from a combination of both).
Processing +is done against an abstraction of the document comprising +information on the document's structure and its objects,[2] which the +program serializes (providing the object numbers) and which are +assigned hash sum values based on their content. This abstraction of +information about document structure, objects (and hash sums) +provides considerable flexibility in representing documents in different +ways and for different purposes (e.g. search, document layout, +publishing, content certification, concordance etc.), and makes it +possible to take advantage of some of the strengths of established ways +of representing documents (or indeed to create new ones). + </text> +</object> +<object id="19"> + <ocn>19</ocn> + <text class="h4"> + 3. Summary of features + </text> +</object> +<object id="20"> + <ocn>20</ocn> + <text class="indent_bullet"> + sparse/minimal markup (clean utf-8 source texts). Documents are +prepared in a single UTF-8 file using a minimalistic mnemonic syntax. +Typical literature, documents like "War and Peace" require almost no +markup, and most of the headers are optional. + </text> +</object> +<object id="21"> + <ocn>21</ocn> + <text class="indent_bullet"> + markup is easily readable/parsable by the human eye, (basic markup is +simpler and more sparse than the most basic HTML), [this may also be +converted to XML representations of the same input/source document]. + </text> +</object> +<object id="22"> + <ocn>22</ocn> + <text class="indent_bullet"> + markup defines document structure (this may be done once in a header +pattern-match description, or for heading levels individually); basic +text attributes (bold, italics, underscore, strike-through etc.) as +required; and semantic information related to the document (header +information, extended beyond the Dublin core and easily further +extended as required); the headers may also contain processing +instructions.
<b>SiSU</b> markup is primarily an abstraction of +document structure and document metadata to permit taking advantage of +the basic strengths of existing alternative practical standard ways of +representing documents [be that browser viewing, paper publication, sql +search etc.] (html, xml, odf, latex, pdf, sql) + </text> +</object> +<object id="23"> + <ocn>23</ocn> + <text class="indent_bullet"> + for output produces reasonably elegant output of established industry +and institutionally accepted open standard formats.[3] takes advantage +of the different strengths of various standard formats for representing +documents, amongst the output formats currently supported are: + </text> +</object> +<object id="24"> + <ocn>24</ocn> + <text class="indent_bullet1"> + html - both as a single scrollable text and a segmented document + </text> +</object> +<object id="25"> + <ocn>25</ocn> + <text class="indent_bullet1"> + xhtml + </text> +</object> +<object id="26"> + <ocn>26</ocn> + <text class="indent_bullet1"> + XML - both in sax and dom style xml structures for further +development as required + </text> +</object> +<object id="27"> + <ocn>27</ocn> + <text class="indent_bullet1"> + ODF - open document format, the iso standard for document storage + </text> +</object> +<object id="28"> + <ocn>28</ocn> + <text class="indent_bullet1"> + LaTeX - used to generate pdf + </text> +</object> +<object id="29"> + <ocn>29</ocn> + <text class="indent_bullet1"> + pdf (via LaTeX) + </text> +</object> +<object id="30"> + <ocn>30</ocn> + <text class="indent_bullet1"> + sql - population of an sql database, (at the same object level +that is used to cite text within a document) + </text> +</object> +<object id="31"> + <ocn>31</ocn> + <text class="norm"> + Also produces: concordance files; document content certificates (md5 or +sha256 digests of headings, paragraphs, images etc.) and html manifests +(and sitemaps of content). 
(b) takes advantage of the strengths +implicit in these very different output types, (e.g. PDFs produced +using typesetting of LaTeX, databases populated with documents at an +individual object/paragraph level, making possible granular search (and +related possibilities)) + </text> +</object> +<object id="32"> + <ocn>32</ocn> + <text class="indent_bullet"> + ensuring content can be cited in a meaningful way regardless of +selected output format. Online publishing (and publishing in multiple +document formats) lacks a useful way of citing text internally within +documents (important to academics generally and to lawyers) as page +numbers are meaningless across browsers and formats. sisu seeks to +provide a common way of pinpointing text within a document (which can +be utilized for citation and by search engines). The outputs share a +common numbering system that is meaningful (to man and machine) across +all digital outputs whether paper, screen, or database oriented (pdf, +HTML, xml, sqlite, postgresql); this numbering system can be used to +reference content. + </text> +</object> +<object id="33"> + <ocn>33</ocn> + <text class="indent_bullet"> + Granular search within documents. SQL databases are populated at an +object level (roughly headings, paragraphs, verse, tables) and become +searchable with that degree of granularity; the output information +provides the object/paragraph numbers which are relevant across all +generated outputs; it is also possible to look at just the matching +paragraphs of the documents in the database; [output indexing also works +well with search indexing tools like hyperestraier]. + </text> +</object> +<object id="34"> + <ocn>34</ocn> + <text class="indent_bullet"> + long term maintainability of document collections in a world of +changing formats, having a very sparsely marked-up source document +base.
There is a considerable degree of future-proofing: output +representations are "upgradeable", and new document formats may be +added, e.g. the addition of the odf (open document text) module in 2006 +and of html5 output at some point in future, without modification of +existing prepared texts + </text> +</object> +<object id="35"> + <ocn>35</ocn> + <text class="indent_bullet"> + SQL search aside, documents are generated as required and static once +generated. + </text> +</object> +<object id="36"> + <ocn>36</ocn> + <text class="indent_bullet"> + documents produced are static files, and may be batch processed, this +needs to be done only once but may be repeated for various reasons as +desired (updated content, addition of new output formats, updated +technology document presentations/representations) + </text> +</object> +<object id="37"> + <ocn>37</ocn> + <text class="indent_bullet"> + document source (plaintext utf-8) if shared on the net may be used as +input and processed locally to produce the different document outputs + </text> +</object> +<object id="38"> + <ocn>38</ocn> + <text class="indent_bullet"> + document source may be bundled together (automatically) with associated +documents (multiple language versions or master document with +inclusions) and images and sent as a zip file called a sisupod, if +shared on the net these too may be processed locally to produce the +desired document outputs + </text> +</object> +<object id="39"> + <ocn>39</ocn> + <text class="indent_bullet"> + generated document outputs may automatically be posted to remote sites. + </text> +</object> +<object id="40"> + <ocn>40</ocn> + <text class="indent_bullet"> + for basic document generation, the only software dependency is +<b>Ruby</b>, and a few standard Unix tools (this covers plaintext, +HTML, XML, ODF, LaTeX). To use a database you of course need that, and +to convert the LaTeX generated to pdf, a latex processor like tetex or +texlive.
+ </text> +</object> +<object id="41"> + <ocn>41</ocn> + <text class="indent_bullet"> + as a developer's tool it is flexible and extensible + </text> +</object> +<object id="42"> + <ocn>42</ocn> + <text class="norm"> + Syntax highlighting for <b>SiSU</b> markup is available for a number of +text editors. + </text> +</object> +<object id="43"> + <ocn>43</ocn> + <text class="norm"> + <b>SiSU</b> is less about document layout than about finding a way with +little markup to be able to construct an abstract representation of a +document that makes it possible to produce multiple representations of +it which may be rather different from each other and used for different +purposes, whether layout and publishing, or search of content + </text> +</object> +<object id="44"> + <ocn>44</ocn> + <text class="norm"> + i.e. to be able, from this minimal-preparation starting point, to take +advantage of some of the strengths of rather different established +ways of representing documents for different purposes, whether for +search (relational database, or indexed flat files generated for that +purpose whether of complete documents, or say of files made up of +objects), online viewing (e.g. html, xml, pdf), or paper publication +(e.g. pdf)... + </text> +</object> +<object id="45"> + <ocn>45</ocn> + <text class="norm"> + the solution arrived at is to extract structural information about +the document (about headings within the document) and to track +objects (which are serialized and also given hash values) in the manner +described. It makes possible representations that are quite different +from those offered at present. For example objects could be saved +individually and identified by their hashes, with an index of how the +objects relate to each other to form a document. + </text> +</object> +<object id="0"> + <ocn>0</ocn> + <text class="h4"> + Endnotes + </text> +</object> +</body> +</document> |
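Reviewer note: the manual text added by this diff describes objects that are serialized (numbered) and given hash values, with the same numbers reused across every output format. A minimal Python sketch of that idea follows. It is not SiSU's implementation (SiSU is written in Ruby and its markup is richer): the blank-line splitting rule, the `TextObject` class and the `serialize`, `as_html` and `as_plaintext` functions are all hypothetical names chosen for illustration, and `ocn` is borrowed from the `<ocn>` element in the XML above.

```python
import hashlib
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class TextObject:
    ocn: int     # object number, assigned in document order starting at 1
    text: str    # whitespace-normalized content of a heading or paragraph
    digest: str  # sha256 hex digest of the content


def serialize(source: str) -> list[TextObject]:
    """Split source on blank lines (an assumed rule) and number each block."""
    objects = []
    for block in source.split("\n\n"):
        text = " ".join(block.split())  # collapse line wraps and extra spaces
        if not text:
            continue  # skip empty blocks left by runs of blank lines
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        objects.append(TextObject(len(objects) + 1, text, digest))
    return objects


def as_html(objects: list[TextObject]) -> str:
    """One possible output: each object carries its number as an anchor id."""
    return "\n".join(
        f'<p id="o{o.ocn}">{html.escape(o.text)}</p>' for o in objects
    )


def as_plaintext(objects: list[TextObject]) -> str:
    """Another output built from the same objects, with the same numbers."""
    return "\n\n".join(f"[{o.ocn}] {o.text}" for o in objects)
```

Because both renderers read the same numbered objects, a citation such as "object 2" points at the same paragraph in either output, and the digest changes only if that paragraph's content changes.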