Diffstat (limited to 'data/doc/manuals_generated/sisu_manual/sisu_introduction/sax.xml')
-rw-r--r-- | data/doc/manuals_generated/sisu_manual/sisu_introduction/sax.xml | 599 |
1 files changed, 0 insertions, 599 deletions
diff --git a/data/doc/manuals_generated/sisu_manual/sisu_introduction/sax.xml b/data/doc/manuals_generated/sisu_manual/sisu_introduction/sax.xml deleted file mode 100644 index 4264c388..00000000 --- a/data/doc/manuals_generated/sisu_manual/sisu_introduction/sax.xml +++ /dev/null @@ -1,599 +0,0 @@ -<?xml version="1.0" encoding="UTF-8" standalone="no"?> -<?xml-stylesheet type="text/css" href="../_sisu/css/sax.css"?> -<!-- Document processing information: - * Generated by: SiSU 0.59.1 of 2007w39/2 (2007-09-25) - * Ruby version: ruby 1.8.6 (2007-06-07 patchlevel 36) [i486-linux] - * - * Last Generated on: Tue Sep 25 02:52:54 +0100 2007 - * SiSU http://www.jus.uio.no/sisu ---> - -<document> -<head> - <meta>Title:</meta> - <title class="dc"> - SiSU - Commands - </title> - <br /> - <meta>Creator:</meta> - <creator class="dc"> - Ralph Amissah - </creator> - <br /> - <meta>Rights:</meta> - <rights class="dc"> - Copyright (C) Ralph Amissah 2007, part of SiSU documentation, License GPL 3 - </rights> - <br /> - <meta>Type:</meta> - <type class="dc"> - information - </type> - <br /> - <meta>Subject:</meta> - <subject class="dc"> - ebook, epublishing, electronic book, electronic publishing, electronic document, electronic citation, data structure, citation systems, search - </subject> - <br /> - <meta>Date created:</meta> - <date_created class="extra"> - 2002-08-28 - </date_created> - <br /> - <meta>Date issued:</meta> - <date_issued class="extra"> - 2002-08-28 - </date_issued> - <br /> - <meta>Date available:</meta> - <date_available class="extra"> - 2002-08-28 - </date_available> - <br /> - <meta>Date modified:</meta> - <date_modified class="extra"> - 2007-09-16 - </date_modified> - <br /> - <meta>Date:</meta> - <date class="dc"> - 2007-09-16 - </date> - <br /> -</head> -<body> -<object id="1"> - <ocn>1</ocn> - <text class="h1"> - SiSU - Commands,<br /> Ralph Amissah - </text> -</object> -<object id="2"> - <ocn>2</ocn> - <text class="h2"> - What is SiSU? 
- </text> -</object> -<object id="3"> - <ocn>3</ocn> - <text class="h3"> - Description - </text> -</object> -<object id="4"> - <ocn>4</ocn> - <text class="h4"> - 1. Introduction - What is SiSU? - </text> -</object> -<object id="5"> - <ocn>5</ocn> - <text class="norm"> - <b>SiSU</b> is a system for document markup, publishing (in multiple -open standard formats) and search. - </text> -</object> -<object id="6"> - <ocn>6</ocn> - <text class="norm"> - <b>SiSU</b><en>1</en> is a<en>2</en> framework for document -structuring, publishing and search, comprising (a) a lightweight -document structure and presentation markup syntax and (b) an -accompanying engine for generating standard document format outputs -from documents prepared in sisu markup syntax, which is able to produce -multiple standard outputs that (can) share a common numbering system -for the citation of text within a document. - </text> - <endnote notenumber="1"> - <number>1</number> - <note> - "<b>SiSU</b> information Structuring Universe" or "Structured -information, Serialized Units".<br /> also chosen for the meaning of -the Finnish term "sisu". - </note> - </endnote> - <endnote notenumber="2"> - <number>2</number> - <note> - Unix command line oriented - </note> - </endnote> -</object> -<object id="7"> - <ocn>7</ocn> - <text class="norm"> - <b>SiSU</b> is developed under an open source, software libre license -(GPL3). It has been developed in the context of coping with large -document sets with evolving markup-related technologies, for which you -want multiple output formats, a common mechanism for -cross-output-format citation, and search. - </text> -</object> -<object id="8"> - <ocn>8</ocn> - <text class="norm"> - <b>SiSU</b> both defines a markup syntax and provides an engine that -produces open standards format outputs from documents prepared with -<b>SiSU</b> markup. 
From a single lightly prepared document, sisu custom -builds several standard output formats which share a common (text -object) numbering system for citation of content within a document -(that also has implications for search). The sisu engine works with an -abstraction of the document's structure and content from which it is -possible to generate different forms of representation of the document. -Significantly, <b>SiSU</b> markup is more sparse than html, and the outputs -include html, LaTeX, landscape and portrait pdfs, and Open Document -Format (ODF), all of which can be added to and updated. <b>SiSU</b> is -also able to populate SQL type databases at an object level, which -means that searches can be made with that degree of granularity. -Results of objects (primarily paragraphs and headings) can be viewed -directly in the database, or just the object numbers shown - your -search criteria are met in these documents and at these locations within -each document. - </text> -</object> -<object id="9"> - <ocn>9</ocn> - <text class="norm"> - Source document preparation and output generation is a two-step -process: (i) document source is prepared, that is, marked up in sisu -markup syntax and (ii) the desired output is subsequently generated by -running the sisu engine against document source. Output representations, -if updated (in the sisu engine), can be generated by re-running the -engine against the prepared source. Using <b>SiSU</b> markup applied to -a document, <b>SiSU</b> custom builds various standard open output -formats including plain text, HTML, XHTML, XML, OpenDocument, LaTeX and -PDF files, and populates an SQL database with objects<en>3</en> -(equating generally to paragraph-sized chunks) so searches may be -performed and matches returned with that degree of granularity (e.g. -your search criteria are met by these documents and at these locations -within each document). Document output formats share a common object -numbering system for locating content. 
This is particularly suitable -for "published" works (finalized texts as opposed to works that are -frequently changed or updated) for which it provides a fixed means of -reference of content. - </text> - <endnote notenumber="3"> - <number>3</number> - <note> - objects include: headings, paragraphs, verse, tables, images, but not -footnotes/endnotes which are numbered separately and tied to the object -from which they are referenced. - </note> - </endnote> -</object> -<object id="10"> - <ocn>10</ocn> - <text class="norm"> - In preparing a <b>SiSU</b> document you optionally provide semantic -information related to the document in a document header, and in -marking up the substantive text provide information on the structure of -the document, primarily indicating heading levels and footnotes. You -also provide information on basic text attributes where used. The rest -is automatic, sisu from this information custom builds<en>4</en> the -different forms of output requested. - </text> - <endnote notenumber="4"> - <number>4</number> - <note> - i.e. the html, pdf, odf outputs are each built individually and -optimised for that form of presentation, rather than for example the -html being a saved version of the odf, or the pdf being a saved version -of the html. - </note> - </endnote> -</object> -<object id="11"> - <ocn>11</ocn> - <text class="norm"> - <b>SiSU</b> works with an abstraction of the document based on its -structure which is comprised of its frame<en>5</en> and the -objects<en>6</en> it contains, which enables <b>SiSU</b> to represent -the document in many different ways, and to take advantage of the -strengths of different ways of presenting documents. The objects are -numbered, and these numbers can be used to provide a common base for -citing material within a document across the different output format -types. 
This is significant as page numbers are not suited to the -digital age: in web publishing, changing a browser's default font or -using a different browser means that text appears on different pages; -and in publishing in different formats, html, landscape and portrait -pdf etc., again page numbers are of no use to cite text in a manner that -is relevant against the different output types. Dealing with documents -at an object level together with object numbering also has implications -for search. - </text> - <endnote notenumber="5"> - <number>5</number> - <note> - the different heading levels - </note> - </endnote> - <endnote notenumber="6"> - <number>6</number> - <note> - units of text, primarily paragraphs and headings, also any tables, -poems, code-blocks - </note> - </endnote> -</object> -<object id="12"> - <ocn>12</ocn> - <text class="norm"> - One of the challenges of maintaining documents is to keep them in a -format that would allow users to use them without depending on -proprietary software popular at the time. Consider the ease of dealing -with legacy proprietary formats today and what guarantee you have that -old proprietary formats will remain (or can be read without proprietary -software/equipment) in 15 years' time, or the way in which html -has evolved over its relatively short span of existence. <b>SiSU</b> -provides the flexibility of outputting documents in multiple -non-proprietary open formats including html, pdf<en>7</en> and the ISO -standard ODF.<en>8</en> Whilst <b>SiSU</b> relies on software, the -markup is uncomplicated and minimalistic which guarantees that future -engines can be written to run against it. It is also easily converted -to other formats, which means documents prepared in <b>SiSU</b> can be -migrated to other document formats. 
Further security is provided by the -fact that the software itself, <b>SiSU</b> is available under GPL3 a -licence that guarantees that the source code will always be open, and -free as in libre which means that that code base can be used updated -and further developed as required under the terms of its license. -Another challenge is to keep up with a moving target. <b>SiSU</b> -permits new forms of output to be added as they become important, (Open -Document Format text was added in 2006), and existing output to be -updated (html has evolved and the related module has been updated -repeatedly over the years, presumably when the World Wide Web -Consortium (w3c) finalises html 5 which is currently under development, -the html module will again be updated allowing all existing documents -to be regenerated as html 5). - </text> - <endnote notenumber="7"> - <number>7</number> - <note> - Specification submitted by Adobe to ISO to become a full open ISO -specification <br /> <<link -xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" -xlink:href="http://www.linux-watch.com/news/NS7542722606.html">http://www.linux-watch.com/news/NS7542722606.html</link>> - </note> - </endnote> - <endnote notenumber="8"> - <number>8</number> - <note> - ISO/IEC 26300:2006 - </note> - </endnote> -</object> -<object id="13"> - <ocn>13</ocn> - <text class="norm"> - The document formats are written to the file-system and available for -indexing by independent indexing tools, whether off the web like Google -and Yahoo or on the site like Lucene and Hyperestraier. 
- </text> -</object> -<object id="14"> - <ocn>14</ocn> - <text class="norm"> - <b>SiSU</b> also provides other features such as concordance files and -document content certificates. Working against an abstraction -of document structure opens further possibilities for the research and -development of other document representations; the availability of -objects is useful, for example, for topic maps and for the commercial law -thesaurus by Vikki Rogers and Al Kritzer, which together with the flexibility -of <b>SiSU</b> offers great possibilities. - </text> -</object> -<object id="15"> - <ocn>15</ocn> - <text class="norm"> - <b>SiSU</b> is primarily for published works, which can take advantage -of the citation system to reliably reference its documents. <b>SiSU</b> -works well in a complementary manner with such collaborative -technologies as Wikis, which can take advantage of and be used to -discuss the substance of content prepared in <b>SiSU</b>. - </text> -</object> -<object id="16"> - <ocn>16</ocn> - <text class="norm"> - <<link xmlns:xlink="http://www.w3.org/1999/xlink" -xlink:type="simple" -xlink:href="http://www.jus.uio.no/sisu">http://www.jus.uio.no/sisu</link>> - </text> -</object> -<object id="17"> - <ocn>17</ocn> - <text class="h4"> - 2. How does sisu work? - </text> -</object> -<object id="18"> - <ocn>18</ocn> - <text class="norm"> - <b>SiSU</b> markup is fairly minimalistic; it consists of: a (largely -optional) document header, made up of information about the document -(such as when it was published, who authored it, and granting what -rights) and any processing instructions; and markup within the -substantive text of the document, which is related to document -structure and typeface. <b>SiSU</b> must be able to discern the -structure of a document (text headings and their levels in relation to -each other), either from information provided in the document header or -from markup within the text (or from a combination of both). 
Processing -is done against an abstraction of the document comprising -information on the document's structure and its objects,[2] which the -program serializes (providing the object numbers) and which are -assigned hash sum values based on their content. This abstraction of -information about document structure, objects, (and hash sums), -provides considerable flexibility in representing documents in different -ways and for different purposes (e.g. search, document layout, -publishing, content certification, concordance etc.), and makes it -possible to take advantage of some of the strengths of established ways -of representing documents, (or indeed to create new ones). - </text> -</object> -<object id="19"> - <ocn>19</ocn> - <text class="h4"> - 3. Summary of features - </text> -</object> -<object id="20"> - <ocn>20</ocn> - <text class="indent_bullet"> - sparse/minimal markup (clean utf-8 source texts). Documents are -prepared in a single UTF-8 file using a minimalistic mnemonic syntax. -Typical literature, such as "War and Peace", requires almost no -markup, and most of the headers are optional. - </text> -</object> -<object id="21"> - <ocn>21</ocn> - <text class="indent_bullet"> - markup is easily readable/parsable by the human eye, (basic markup is -simpler and more sparse than the most basic HTML), [this may also be -converted to XML representations of the same input/source document]. - </text> -</object> -<object id="22"> - <ocn>22</ocn> - <text class="indent_bullet"> - markup defines document structure (this may be done once in a header -pattern-match description, or for heading levels individually); basic -text attributes (bold, italics, underscore, strike-through etc.) as -required; and semantic information related to the document (header -information, extended beyond the Dublin core and easily further -extended as required); the headers may also contain processing -instructions. 
<b>SiSU</b> markup is primarily an abstraction of -document structure and document metadata to permit taking advantage of -the basic strengths of existing alternative practical standard ways of -representing documents [be that browser viewing, paper publication, sql -search etc.] (html, xml, odf, latex, pdf, sql) - </text> -</object> -<object id="23"> - <ocn>23</ocn> - <text class="indent_bullet"> - for output produces reasonably elegant output of established industry -and institutionally accepted open standard formats.[3] takes advantage -of the different strengths of various standard formats for representing -documents, amongst the output formats currently supported are: - </text> -</object> -<object id="24"> - <ocn>24</ocn> - <text class="indent_bullet1"> - html - both as a single scrollable text and a segmented document - </text> -</object> -<object id="25"> - <ocn>25</ocn> - <text class="indent_bullet1"> - xhtml - </text> -</object> -<object id="26"> - <ocn>26</ocn> - <text class="indent_bullet1"> - XML - both in sax and dom style xml structures for further -development as required - </text> -</object> -<object id="27"> - <ocn>27</ocn> - <text class="indent_bullet1"> - ODF - open document format, the iso standard for document storage - </text> -</object> -<object id="28"> - <ocn>28</ocn> - <text class="indent_bullet1"> - LaTeX - used to generate pdf - </text> -</object> -<object id="29"> - <ocn>29</ocn> - <text class="indent_bullet1"> - pdf (via LaTeX) - </text> -</object> -<object id="30"> - <ocn>30</ocn> - <text class="indent_bullet1"> - sql - population of an sql database, (at the same object level -that is used to cite text within a document) - </text> -</object> -<object id="31"> - <ocn>31</ocn> - <text class="norm"> - Also produces: concordance files; document content certificates (md5 or -sha256 digests of headings, paragraphs, images etc.) and html manifests -(and sitemaps of content). 
(b) takes advantage of the strengths -implicit in these very different output types, (e.g. PDFs produced -using typesetting of LaTeX, databases populated with documents at an -individual object/paragraph level, making possible granular search (and -related possibilities)) - </text> -</object> -<object id="32"> - <ocn>32</ocn> - <text class="indent_bullet"> - ensuring content can be cited in a meaningful way regardless of -selected output format. Online publishing (and publishing in multiple -document formats) lacks a useful way of citing text internally within -documents (important to academics generally and to lawyers) as page -numbers are meaningless across browsers and formats. sisu seeks to -provide a common way of pinpointing text within a document, (which can -be utilized for citation and by search engines). The outputs share a -common numbering system that is meaningful (to man and machine) across -all digital outputs whether paper, screen, or database oriented, (pdf, -HTML, xml, sqlite, postgresql); this numbering system can be used to -reference content. - </text> -</object> -<object id="33"> - <ocn>33</ocn> - <text class="indent_bullet"> - Granular search within documents. SQL databases are populated at an -object level (roughly headings, paragraphs, verse, tables) and become -searchable with that degree of granularity; the output information -provides the object/paragraph numbers which are relevant across all -generated outputs; it is also possible to look at just the matching -paragraphs of the documents in the database; [output indexing also works -well with search indexing tools like hyperestraier]. - </text> -</object> -<object id="34"> - <ocn>34</ocn> - <text class="indent_bullet"> - long term maintainability of document collections in a world of -changing formats, having a very sparsely marked-up source document -base. 
There is a considerable degree of future-proofing: output -representations are "upgradeable", and new document formats may be -added, e.g. the addition of the odf (open document text) module in 2006 and, in -future, html5 output, without modification of -existing prepared texts - </text> -</object> -<object id="35"> - <ocn>35</ocn> - <text class="indent_bullet"> - SQL search aside, documents are generated as required and static once -generated. - </text> -</object> -<object id="36"> - <ocn>36</ocn> - <text class="indent_bullet"> - documents produced are static files, and may be batch processed; this -needs to be done only once but may be repeated for various reasons as -desired (updated content, addition of new output formats, updated -technology document presentations/representations) - </text> -</object> -<object id="37"> - <ocn>37</ocn> - <text class="indent_bullet"> - document source (plaintext utf-8) if shared on the net may be used as -input and processed locally to produce the different document outputs - </text> -</object> -<object id="38"> - <ocn>38</ocn> - <text class="indent_bullet"> - document source may be bundled together (automatically) with associated -documents (multiple language versions or master document with -inclusions) and images and sent as a zip file called a sisupod; if -shared on the net these too may be processed locally to produce the -desired document outputs - </text> -</object> -<object id="39"> - <ocn>39</ocn> - <text class="indent_bullet"> - generated document outputs may automatically be posted to remote sites. - </text> -</object> -<object id="40"> - <ocn>40</ocn> - <text class="indent_bullet"> - for basic document generation, the only software dependencies are -<b>Ruby</b> and a few standard Unix tools (this covers plaintext, -HTML, XML, ODF, LaTeX). To use a database you of course need that, and -to convert the LaTeX generated to pdf, a latex processor like tetex or -texlive. 
- </text> -</object> -<object id="41"> - <ocn>41</ocn> - <text class="indent_bullet"> - as a developers tool it is flexible and extensible - </text> -</object> -<object id="42"> - <ocn>42</ocn> - <text class="norm"> - Syntax highlighting for <b>SiSU</b> markup is available for a number of -text editors. - </text> -</object> -<object id="43"> - <ocn>43</ocn> - <text class="norm"> - <b>SiSU</b> is less about document layout than about finding a way with -little markup to be able to construct an abstract representation of a -document that makes it possible to produce multiple representations of -it which may be rather different from each other and used for different -purposes, whether layout and publishing, or search of content - </text> -</object> -<object id="44"> - <ocn>44</ocn> - <text class="norm"> - i.e. to be able to take advantage from this minimal preparation -starting point of some of the strengths of rather different established -ways of representing documents for different purposes, whether for -search (relational database, or indexed flat files generated for that -purpose whether of complete documents, or say of files made up of -objects), online viewing (e.g. html, xml, pdf), or paper publication -(e.g. pdf)... - </text> -</object> -<object id="45"> - <ocn>45</ocn> - <text class="norm"> - the solution arrived at is by extracting structural information about -the document (about headings within the document) and by tracking -objects (which are serialized and also given hash values) in the manner -described. It makes possible representations that are quite different -from those offered at present. For example objects could be saved -individually and identified by their hashes, with an index of how the -objects relate to each other to form a document. - </text> -</object> -<object id="0"> - <ocn>0</ocn> - <text class="h4"> - Endnotes - </text> -</object> -</body> -</document> |
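The removed manual describes documents being broken into objects (headings, paragraphs) that are serialized with object numbers and given hash values based on their content. A minimal sketch of that idea follows. It is illustrative only: SiSU itself is written in Ruby, and the names `TextObject`, `serialize_objects` and `cite`, as well as the rule of splitting on blank lines, are assumptions made for this example and are not taken from the SiSU code base. sha256 is used because the manual mentions md5 or sha256 digests.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class TextObject:
    ocn: int      # object number, assigned in document order (hypothetical field name)
    text: str     # whitespace-normalized content of the heading or paragraph
    digest: str   # sha256 hex digest of the content


def serialize_objects(source: str) -> List[TextObject]:
    """Split text on blank lines and number the non-empty blocks from 1.

    Assumption: blank-line separation stands in for real markup parsing.
    """
    blocks = [" ".join(block.split()) for block in source.split("\n\n")]
    blocks = [block for block in blocks if block]
    return [
        TextObject(
            ocn=number,
            text=block,
            digest=hashlib.sha256(block.encode("utf-8")).hexdigest(),
        )
        for number, block in enumerate(blocks, start=1)
    ]


def cite(objects: List[TextObject], ocn: int) -> TextObject:
    """Return the object with the given number, whatever output format cites it."""
    return next(obj for obj in objects if obj.ocn == ocn)
```

Because the number is attached to the object rather than to a page, the same citation (for example object 2) would point at the same text in every generated output, and the digest changes whenever the content of that object changes.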