From a72e66db913de3a2e508080c8b1fc8d1342a899b Mon Sep 17 00:00:00 2001
From: Ralph Amissah
Date: Tue, 25 Sep 2007 23:23:03 +0100
Subject: remove generated output from main package

---
 .../sisu_manual/sisu_introduction/sax.xml | 599 ---------------------
 1 file changed, 599 deletions(-)
 delete mode 100644 data/doc/manuals_generated/sisu_manual/sisu_introduction/sax.xml

diff --git a/data/doc/manuals_generated/sisu_manual/sisu_introduction/sax.xml b/data/doc/manuals_generated/sisu_manual/sisu_introduction/sax.xml
deleted file mode 100644
index 4264c388..00000000
--- a/data/doc/manuals_generated/sisu_manual/sisu_introduction/sax.xml
+++ /dev/null
@@ -1,599 +0,0 @@
- - - - - - - Title: - - SiSU - Commands - -
- Creator: - - Ralph Amissah - -
- Rights: - - Copyright (C) Ralph Amissah 2007, part of SiSU documentation, License GPL 3 - -
- Type: - - information - -
- Subject: - - ebook, epublishing, electronic book, electronic publishing, electronic document, electronic citation, data structure, citation systems, search - -
- Date created: - - 2002-08-28 - -
- Date issued: - - 2002-08-28 - -
- Date available: - - 2002-08-28 - -
- Date modified: - - 2007-09-16 - -
- Date: - - 2007-09-16 - -
- - - - 1 - - SiSU - Commands,
Ralph Amissah -
-
- - 2 - - What is SiSU? - - - - 3 - - Description - - - - 4 - - 1. Introduction - What is SiSU? - - - - 5 - - SiSU is a system for document markup, publishing (in multiple -open standard formats) and search - - - - 6 - - SiSU[1] is a[2] framework for document -structuring, publishing and search, comprising (a) a lightweight -document structure and presentation markup syntax and (b) an -accompanying engine for generating standard document format outputs -from documents prepared in sisu markup syntax, which is able to produce -multiple standard outputs that (can) share a common numbering system -for the citation of text within a document. - - - 1 - - "SiSU information Structuring Universe" or "Structured -information, Serialized Units".
also chosen for the meaning of -the Finnish term "sisu". -
-
- - 2 - - Unix command line oriented - - -
- - 7 - - SiSU is developed under an open source, software libre license -(GPL3). It has been developed in the context of coping with large -document sets with evolving markup related technologies, for which you -want multiple output formats, a common mechanism for -cross-output-format citation, and search. - - - - 8 - - SiSU both defines a markup syntax and provides an engine that -produces open standards format outputs from documents prepared with -SiSU markup. From a single lightly prepared document sisu custom -builds several standard output formats which share a common (text -object) numbering system for citation of content within a document -(that also has implications for search). The sisu engine works with an -abstraction of the document's structure and content from which it is -possible to generate different forms of representation of the document. -Significantly, SiSU markup is sparser than html, and the outputs, -which include html, LaTeX, landscape and portrait pdfs and Open Document -Format (ODF), can all be added to and updated. SiSU is -also able to populate SQL type databases at an object level, which -means that searches can be made with that degree of granularity. -Results of objects (primarily paragraphs and headings) can be viewed -directly in the database, or just the object numbers shown - your -search criteria are met in these documents and at these locations within -each document. - - - - 9 - - Source document preparation and output generation is a two-step -process: (i) document source is prepared, that is, marked up in sisu -markup syntax and (ii) the desired output is subsequently generated by -running the sisu engine against document source. Output representations, -if updated (in the sisu engine), can be regenerated by re-running the -engine against the prepared source.
Using SiSU markup applied to -a document, SiSU custom builds various standard open output -formats including plain text, HTML, XHTML, XML, OpenDocument, LaTeX and -PDF files, and populates an SQL database with objects[3] -(equating generally to paragraph-sized chunks) so searches may be -performed and matches returned with that degree of granularity (e.g. -your search criteria are met by these documents and at these locations -within each document). Document output formats share a common object -numbering system for locating content. This is particularly suitable -for "published" works (finalized texts as opposed to works that are -frequently changed or updated) for which it provides a fixed means of -reference of content. - - - 3 - - objects include: headings, paragraphs, verse, tables, images, but not -footnotes/endnotes which are numbered separately and tied to the object -from which they are referenced. - - - - - 10 - - In preparing a SiSU document you optionally provide semantic -information related to the document in a document header, and in -marking up the substantive text provide information on the structure of -the document, primarily indicating heading levels and footnotes. You -also provide information on basic text attributes where used. The rest -is automatic: from this information sisu custom builds[4] the -different forms of output requested. - - - 4 - - i.e. the html, pdf, odf outputs are each built individually and -optimised for that form of presentation, rather than for example the -html being a saved version of the odf, or the pdf being a saved version -of the html. - - - - - 11 - - SiSU works with an abstraction of the document based on its -structure, which comprises its frame[5] and the -objects[6] it contains, which enables SiSU to represent -the document in many different ways, and to take advantage of the -strengths of different ways of presenting documents.
The objects are -numbered, and these numbers can be used to provide a common base for -citing material within a document across the different output format -types. This is significant as page numbers are not suited to the -digital age: in web publishing, changing a browser's default font or -using a different browser means that text appears on different pages; -and in publishing in different formats, html, landscape and portrait -pdf etc., again page numbers are of no use to cite text in a manner that -is relevant against the different output types. Dealing with documents -at an object level together with object numbering also has implications -for search. - - - 5 - - the different heading levels - - - - 6 - - units of text, primarily paragraphs and headings, also any tables, -poems, code-blocks - - - - - 12 - - One of the challenges of maintaining documents is to keep them in a -format that would allow users to use them without depending on -proprietary software popular at the time. Consider the ease of dealing -with legacy proprietary formats today and what guarantee you have that -old proprietary formats will remain (or can be read without proprietary -software/equipment) in 15 years' time, or the way in which html -has evolved over its relatively short span of existence. SiSU -provides the flexibility of outputting documents in multiple -non-proprietary open formats including html, pdf[7] and the ISO -standard ODF.[8] Whilst SiSU relies on software, the -markup is uncomplicated and minimalistic, which guarantees that future -engines can be written to run against it. It is also easily converted -to other formats, which means documents prepared in SiSU can be -migrated to other document formats.
Further security is provided by the -fact that the software itself, SiSU, is available under GPL3, a -license that guarantees that the source code will always be open, and -free as in libre, which means that the code base can be used, updated -and further developed as required under the terms of its license. -Another challenge is to keep up with a moving target. SiSU -permits new forms of output to be added as they become important (Open -Document Format text was added in 2006), and existing output to be -updated (html has evolved and the related module has been updated -repeatedly over the years; presumably when the World Wide Web -Consortium (W3C) finalises html 5, which is currently under development, -the html module will again be updated, allowing all existing documents -to be regenerated as html 5). - - - 7 - - Specification submitted by Adobe to ISO to become a full open ISO -specification
<http://www.linux-watch.com/news/NS7542722606.html> -
-
- - 8 - - ISO/IEC 26300:2006 - - -
- - 13 - - The document formats are written to the file-system and are available for -indexing by independent indexing tools, whether off the web like Google -and Yahoo or on the site like Lucene and Hyperestraier. - - - - 14 - - SiSU also provides other features such as concordance files and -document content certificates, and working against an abstraction -of document structure opens further possibilities for the research and -development of other document representations. The availability of -objects is useful, for example, for topic maps and the commercial law -thesaurus by Vikki Rogers and Al Kritzer, which together with the flexibility -of SiSU offers great possibilities. - - - - 15 - - SiSU is primarily for published works, which can take advantage -of the citation system to reliably reference its documents. SiSU -works well in a complementary manner with such collaborative -technologies as Wikis, which can take advantage of and be used to -discuss the substance of content prepared in SiSU. - - - - 16 - - <http://www.jus.uio.no/sisu> - - - - 17 - - 2. How does sisu work? - - - - 18 - - SiSU markup is fairly minimalistic; it consists of: a (largely -optional) document header, made up of information about the document -(such as when it was published, who authored it, and granting what -rights) and any processing instructions; and markup within the -substantive text of the document, which is related to document -structure and typeface. SiSU must be able to discern the -structure of a document (text headings and their levels in relation to -each other), either from information provided in the document header or -from markup within the text (or from a combination of both). Processing -is done against an abstraction of the document comprising -information on the document's structure and its objects,[2] which the -program serializes (providing the object numbers) and which are -assigned hash sum values based on their content.
This abstraction of -information about document structure, objects (and hash sums) -provides considerable flexibility in representing documents in different -ways and for different purposes (e.g. search, document layout, -publishing, content certification, concordance etc.), and makes it -possible to take advantage of some of the strengths of established ways -of representing documents (or indeed to create new ones). - - - - 19 - - 3. Summary of features - - - - 20 - - sparse/minimal markup (clean utf-8 source texts). Documents are -prepared in a single UTF-8 file using a minimalistic mnemonic syntax. -Typical literature, documents like "War and Peace", require almost no -markup, and most of the headers are optional. - - - - 21 - - markup is easily readable/parsable by the human eye (basic markup is -simpler and more sparse than the most basic HTML), [this may also be -converted to XML representations of the same input/source document]. - - - - 22 - - markup defines document structure (this may be done once in a header -pattern-match description, or for heading levels individually); basic -text attributes (bold, italics, underscore, strike-through etc.) as -required; and semantic information related to the document (header -information, extended beyond the Dublin core and easily further -extended as required); the headers may also contain processing -instructions. SiSU markup is primarily an abstraction of -document structure and document metadata to permit taking advantage of -the basic strengths of existing alternative practical standard ways of -representing documents [be that browser viewing, paper publication, sql -search etc.]
(html, xml, odf, latex, pdf, sql) - - - - 23 - - for output, produces reasonably elegant output in established industry -and institutionally accepted open standard formats.[3] Takes advantage -of the different strengths of various standard formats for representing -documents; amongst the output formats currently supported are: - - - - 24 - - html - both as a single scrollable text and a segmented document - - - - 25 - - xhtml - - - - 26 - - XML - both in sax and dom style xml structures for further -development as required - - - - 27 - - ODF - open document format, the iso standard for document storage - - - - 28 - - LaTeX - used to generate pdf - - - - 29 - - pdf (via LaTeX) - - - - 30 - - sql - population of an sql database (at the same object level -that is used to cite text within a document) - - - - 31 - - Also produces: concordance files; document content certificates (md5 or -sha256 digests of headings, paragraphs, images etc.) and html manifests -(and sitemaps of content). (b) takes advantage of the strengths -implicit in these very different output types (e.g. PDFs produced -using typesetting of LaTeX, databases populated with documents at an -individual object/paragraph level, making possible granular search (and -related possibilities)) - - - - 32 - - ensuring content can be cited in a meaningful way regardless of -selected output format. Online publishing (and publishing in multiple -document formats) lacks a useful way of citing text internally within -documents (important to academics generally and to lawyers) as page -numbers are meaningless across browsers and formats. sisu seeks to -provide a common way of pinpointing text within a document (which can -be utilized for citation and by search engines).
The outputs share a -common numbering system that is meaningful (to man and machine) across -all digital outputs whether paper, screen, or database oriented (pdf, -HTML, xml, sqlite, postgresql); this numbering system can be used to -reference content. - - - - 33 - - Granular search within documents. SQL databases are populated at an -object level (roughly headings, paragraphs, verse, tables) and become -searchable with that degree of granularity; the output information -provides the object/paragraph numbers which are relevant across all -generated outputs; it is also possible to look at just the matching -paragraphs of the documents in the database; [output indexing also works -well with search indexing tools like hyperestraier]. - - - - 34 - - long term maintainability of document collections in a world of -changing formats, having a very sparsely marked-up source document -base. There is a considerable degree of future-proofing, output -representations are "upgradeable", and new document formats may be -added, e.g. the addition of the odf (open document text) module in 2006 and -html5 output at some point in the future, without modification of -existing prepared texts - - - - 35 - - SQL search aside, documents are generated as required and are static once -generated.
- - - - 36 - - documents produced are static files, and may be batch processed; this -needs to be done only once but may be repeated for various reasons as -desired (updated content, addition of new output formats, updated -technology document presentations/representations) - - - - 37 - - document source (plaintext utf-8) if shared on the net may be used as -input and processed locally to produce the different document outputs - - - - 38 - - document source may be bundled together (automatically) with associated -documents (multiple language versions or master document with -inclusions) and images and sent as a zip file called a sisupod; if -shared on the net these too may be processed locally to produce the -desired document outputs - - - - 39 - - generated document outputs may automatically be posted to remote sites. - - - - 40 - - for basic document generation, the only software dependencies are -Ruby and a few standard Unix tools (this covers plaintext, -HTML, XML, ODF, LaTeX). To use a database you of course need that, and -to convert the LaTeX generated to pdf, a latex processor like tetex or -texlive. - - - - 41 - - as a developer's tool it is flexible and extensible - - - - 42 - - Syntax highlighting for SiSU markup is available for a number of -text editors. - - - - 43 - - SiSU is less about document layout than about finding a way with -little markup to be able to construct an abstract representation of a -document that makes it possible to produce multiple representations of -it which may be rather different from each other and used for different -purposes, whether layout and publishing, or search of content - - - - 44 - - i.e.
to be able to take advantage from this minimal preparation -starting point of some of the strengths of rather different established -ways of representing documents for different purposes, whether for -search (relational database, or indexed flat files generated for that -purpose whether of complete documents, or say of files made up of -objects), online viewing (e.g. html, xml, pdf), or paper publication -(e.g. pdf)... - - - - 45 - - the solution arrived at is by extracting structural information about -the document (about headings within the document) and by tracking -objects (which are serialized and also given hash values) in the manner -described. It makes possible representations that are quite different -from those offered at present. For example objects could be saved -individually and identified by their hashes, with an index of how the -objects relate to each other to form a document. - - - - 0 - - Endnotes - - - -