<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/css" href="../_sisu/css/sax.css"?>
<!-- Document processing information:
     * Generated by: SiSU 0.59.1 of 2007w39/2 (2007-09-25)
     * Ruby version: ruby 1.8.6 (2007-06-07 patchlevel 36) [i486-linux]
     * 
     * Last Generated on: Tue Sep 25 02:52:54 +0100 2007
     * SiSU http://www.jus.uio.no/sisu
-->

<document>
<head>
	<meta>Title:</meta>
	<title class="dc">
		SiSU - Commands
	</title>
	<br />
	<meta>Creator:</meta>
	<creator class="dc">
		Ralph Amissah
	</creator>
	<br />
	<meta>Rights:</meta>
	<rights class="dc">
		Copyright (C) Ralph Amissah 2007, part of SiSU documentation, License GPL 3
	</rights>
	<br />
	<meta>Type:</meta>
	<type class="dc">
		information
	</type>
	<br />
	<meta>Subject:</meta>
	<subject class="dc">
		ebook, epublishing, electronic book, electronic publishing, electronic document, electronic citation, data structure, citation systems, search
	</subject>
	<br />
	<meta>Date created:</meta>
	<date_created class="extra">
		2002-08-28
	</date_created>
	<br />
	<meta>Date issued:</meta>
	<date_issued class="extra">
		2002-08-28
	</date_issued>
	<br />
	<meta>Date available:</meta>
	<date_available class="extra">
		2002-08-28
	</date_available>
	<br />
	<meta>Date modified:</meta>
	<date_modified class="extra">
		2007-09-16
	</date_modified>
	<br />
	<meta>Date:</meta>
	<date class="dc">
		2007-09-16
	</date>
	<br />
</head>
<body>
<object id="1">
	<ocn>1</ocn>
	<text class="h1">
		SiSU - Commands,<br /> Ralph Amissah
	</text>
</object>
<object id="2">
	<ocn>2</ocn>
	<text class="h2">
		What is SiSU?
	</text>
</object>
<object id="3">
	<ocn>3</ocn>
	<text class="h3">
		Description
	</text>
</object>
<object id="4">
	<ocn>4</ocn>
	<text class="h4">
		1. Introduction - What is SiSU? 
	</text>
</object>
<object id="5">
	<ocn>5</ocn>
	<text class="norm">
		<b>SiSU</b> is a system for document markup, publishing (in multiple
open standard formats) and search.
	</text>
</object>
<object id="6">
	<ocn>6</ocn>
	<text class="norm">
		<b>SiSU</b><en>1</en> is a<en>2</en> framework for document
structuring, publishing and search, comprising (a) a lightweight
document structure and presentation markup syntax and (b) an
accompanying engine for generating standard document format outputs
from documents prepared in sisu markup syntax. The engine is able to
produce multiple standard outputs that (can) share a common numbering
system for the citation of text within a document.
	</text>
	<endnote notenumber="1">
		<number>1</number>
		<note>
			"<b>SiSU</b> information Structuring Universe" or "Structured
information, Serialized Units".<br /> Also chosen for the meaning of
the Finnish term "sisu".
		</note>
	</endnote>
	<endnote notenumber="2">
		<number>2</number>
		<note>
			Unix command line oriented
		</note>
	</endnote>
</object>
<object id="7">
	<ocn>7</ocn>
	<text class="norm">
		<b>SiSU</b> is developed under an open source, software libre license
(GPL3). It has been developed in the context of coping with large
document sets with evolving markup related technologies, for which you
want multiple output formats, a common mechanism for
cross-output-format citation, and search.
	</text>
</object>
<object id="8">
	<ocn>8</ocn>
	<text class="norm">
		<b>SiSU</b> both defines a markup syntax and provides an engine that
produces open standards format outputs from documents prepared with
<b>SiSU</b> markup. From a single lightly prepared document sisu custom
builds several standard output formats which share a common (text
object) numbering system for citation of content within a document
(that also has implications for search). The sisu engine works with an
abstraction of the document's structure and content from which it is
possible to generate different forms of representation of the document.
Significantly, <b>SiSU</b> markup is more sparse than html, and the
outputs, which include html, LaTeX, landscape and portrait pdfs and Open
Document Format (ODF), can all be added to and updated. <b>SiSU</b> is
also able to populate SQL type databases at an object level, which
means that searches can be made with that degree of granularity.
Matching objects (primarily paragraphs and headings) can be viewed
directly in the database, or just the object numbers shown, i.e. your
search criteria are met in these documents and at these locations within
each document.
	</text>
</object>
<object id="9">
	<ocn>9</ocn>
	<text class="norm">
		Source document preparation and output generation is a two step
process: (i) document source is prepared, that is, marked up in sisu
markup syntax and (ii) the desired output subsequently generated by
running the sisu engine against document source. Output representations
if updated (in the sisu engine) can be generated by re-running the
engine against the prepared source. From a document prepared in
<b>SiSU</b> markup, <b>SiSU</b> custom builds various standard open output
formats including plain text, HTML, XHTML, XML, OpenDocument, LaTeX or
PDF files, and populates an SQL database with objects<en>3</en>
(equating generally to paragraph-sized chunks) so searches may be
performed and matches returned with that degree of granularity (e.g.
your search criteria are met by these documents and at these locations
within each document). Document output formats share a common object
numbering system for locating content. This is particularly suitable
for "published" works (finalized texts as opposed to works that are
frequently changed or updated) for which it provides a fixed means of
reference of content.
	</text>
	<endnote notenumber="3">
		<number>3</number>
		<note>
			objects include: headings, paragraphs, verse, tables, images, but not
footnotes/endnotes which are numbered separately and tied to the object
from which they are referenced.
		</note>
	</endnote>
</object>
<object id="10">
	<ocn>10</ocn>
	<text class="norm">
		In preparing a <b>SiSU</b> document you optionally provide semantic
information related to the document in a document header, and in
marking up the substantive text provide information on the structure of
the document, primarily indicating heading levels and footnotes. You
also provide information on basic text attributes where used. The rest
is automatic: from this information sisu custom builds<en>4</en> the
different forms of output requested.
	</text>
	<endnote notenumber="4">
		<number>4</number>
		<note>
			i.e. the html, pdf, odf outputs are each built individually and
optimised for that form of presentation, rather than for example the
html being a saved version of the odf, or the pdf being a saved version
of the html.
		</note>
	</endnote>
</object>
<object id="11">
	<ocn>11</ocn>
	<text class="norm">
		<b>SiSU</b> works with an abstraction of the document based on its
structure, which comprises its frame<en>5</en> and the
objects<en>6</en> it contains. This enables <b>SiSU</b> to represent
the document in many different ways, and to take advantage of the
strengths of different ways of presenting documents. The objects are
numbered, and these numbers can be used to provide a common base for
citing material within a document across the different output format
types. This is significant as page numbers are not suited to the
digital age: in web publishing, changing a browser's default font or
using a different browser means that text appears on different pages;
and in publishing in different formats (html, landscape and portrait
pdf etc.) page numbers are again of no use for citing text in a manner
that is relevant across the different output types. Dealing with
documents at an object level together with object numbering also has
implications for search.
	</text>
	<endnote notenumber="5">
		<number>5</number>
		<note>
			the different heading levels
		</note>
	</endnote>
	<endnote notenumber="6">
		<number>6</number>
		<note>
			units of text, primarily paragraphs and headings, also any tables,
poems, code-blocks
		</note>
	</endnote>
</object>
<object id="12">
	<ocn>12</ocn>
	<text class="norm">
		One of the challenges of maintaining documents is to keep them in a
format that allows users to use them without depending on whichever
proprietary software is popular at the time. Consider the ease of dealing
with legacy proprietary formats today and what guarantee you have that
old proprietary formats will remain readable (or can be read without
proprietary software/equipment) in 15 years' time, or the way in which
html has evolved over its relatively short span of existence. <b>SiSU</b>
provides the flexibility of outputting documents in multiple
non-proprietary open formats including html, pdf<en>7</en> and the ISO
standard ODF.<en>8</en> Whilst <b>SiSU</b> relies on software, the
markup is uncomplicated and minimalistic, which helps ensure that future
engines can be written to run against it. It is also easily converted
to other formats, which means documents prepared in <b>SiSU</b> can be
migrated to other document formats. Further security is provided by the
fact that the software itself, <b>SiSU</b>, is available under GPL3, a
license that guarantees that the source code will always be open, and
free as in libre, which means that the code base can be used, updated
and further developed as required under the terms of its license.
Another challenge is to keep up with a moving target. <b>SiSU</b>
permits new forms of output to be added as they become important (Open
Document Format text was added in 2006), and existing output to be
updated (html has evolved and the related module has been updated
repeatedly over the years; presumably when the World Wide Web
Consortium (W3C) finalises html 5, which is currently under development,
the html module will again be updated, allowing all existing documents
to be regenerated as html 5).
	</text>
	<endnote notenumber="7">
		<number>7</number>
		<note>
			Specification submitted by Adobe to ISO to become a full open ISO
specification <br /> &lt;<link
xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
xlink:href="http://www.linux-watch.com/news/NS7542722606.html">http://www.linux-watch.com/news/NS7542722606.html</link>&gt;
		</note>
	</endnote>
	<endnote notenumber="8">
		<number>8</number>
		<note>
			ISO/IEC 26300:2006
		</note>
	</endnote>
</object>
<object id="13">
	<ocn>13</ocn>
	<text class="norm">
		The document formats are written to the file-system and available for
indexing by independent indexing tools, whether off the web like Google
and Yahoo or on the site like Lucene and Hyperestraier.
	</text>
</object>
<object id="14">
	<ocn>14</ocn>
	<text class="norm">
		<b>SiSU</b> also provides other features such as concordance files and
document content certificates. Working against an abstraction
of document structure opens further possibilities for the research and
development of other document representations: the availability of
objects is useful, for example, for topic maps and for the commercial law
thesaurus by Vikki Rogers and Al Kritzer, which together with the
flexibility of <b>SiSU</b> offers great possibilities.
	</text>
</object>
<object id="15">
	<ocn>15</ocn>
	<text class="norm">
		<b>SiSU</b> is primarily for published works, which can take advantage
of the citation system to reliably reference its documents. <b>SiSU</b>
works well in a complementary manner with such collaborative
technologies as Wikis, which can take advantage of and be used to
discuss the substance of content prepared in <b>SiSU</b>.
	</text>
</object>
<object id="16">
	<ocn>16</ocn>
	<text class="norm">
		&lt;<link xmlns:xlink="http://www.w3.org/1999/xlink"
xlink:type="simple"
xlink:href="http://www.jus.uio.no/sisu">http://www.jus.uio.no/sisu</link>&gt;
	</text>
</object>
<object id="17">
	<ocn>17</ocn>
	<text class="h4">
		2. How does sisu work? 
	</text>
</object>
<object id="18">
	<ocn>18</ocn>
	<text class="norm">
		<b>SiSU</b> markup is fairly minimalistic, it consists of: a (largely
optional) document header, made up of information about the document
(such as when it was published, who authored it, and granting what
rights) and any processing instructions; and markup within the
substantive text of the document, which is related to document
structure and typeface. <b>SiSU</b> must be able to discern the
structure of a document, (text headings and their levels in relation to
each other), either from information provided in the document header or
from markup within the text (or from a combination of both). Processing
is done against an abstraction of the document comprising
information on the document's structure and its objects, which the
program serializes (providing the object numbers) and which are
assigned hash sum values based on their content. This abstraction of
information about document structure, objects (and hash sums)
provides considerable flexibility in representing documents in different
ways and for different purposes (e.g. search, document layout,
publishing, content certification, concordance etc.), and makes it
possible to take advantage of some of the strengths of established ways
of representing documents, (or indeed to create new ones).
	</text>
</object>
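<!--
The abstraction described above (objects serialized with object numbers and
given hash sums based on their content) can be illustrated with a minimal
Python sketch. This is not the sisu engine's code or its data model: the
names TextObject and abstract_document, the blank-line object boundary and
the leading "#" heading marker are all assumptions made for illustration,
and sha256 is simply one of the digests this manual mentions.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class TextObject:
    ocn: int      # object citation number, assigned in document order
    kind: str     # "heading" or "para" in this sketch
    text: str     # whitespace-normalized content of the object
    digest: str   # sha256 hex digest of the object's text


def abstract_document(source: str) -> list[TextObject]:
    """Split source into blank-line separated objects, then number and hash them."""
    objects = []
    blocks = [b.strip() for b in source.split("\n\n") if b.strip()]
    for ocn, block in enumerate(blocks, start=1):
        # Hypothetical convention for this sketch only: a leading "#" marks a heading.
        kind = "heading" if block.startswith("#") else "para"
        text = " ".join(block.lstrip("# ").split())
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        objects.append(TextObject(ocn, kind, text, digest))
    return objects
```

Because each object carries its own number and digest, any output format
built from this list can cite the same object number, and a changed object
can be detected by comparing digests.
-->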
<object id="19">
	<ocn>19</ocn>
	<text class="h4">
		3. Summary of features 
	</text>
</object>
<object id="20">
	<ocn>20</ocn>
	<text class="indent_bullet">
		sparse/minimal markup (clean utf-8 source texts). Documents are
prepared in a single UTF-8 file using a minimalistic mnemonic syntax.
Typical literature, documents like "War and Peace" require almost no
markup, and most of the headers are optional.
	</text>
</object>
<object id="21">
	<ocn>21</ocn>
	<text class="indent_bullet">
		markup is easily readable/parsable by the human eye, (basic markup is
simpler and more sparse than the most basic HTML), [this may also be
converted to XML representations of the same input/source document].
	</text>
</object>
<object id="22">
	<ocn>22</ocn>
	<text class="indent_bullet">
		markup defines document structure (this may be done once in a header
pattern-match description, or for heading levels individually); basic
text attributes (bold, italics, underscore, strike-through etc.) as
required; and semantic information related to the document (header
information, extended beyond the Dublin core and easily further
extended as required); the headers may also contain processing
instructions. <b>SiSU</b> markup is primarily an abstraction of
document structure and document metadata to permit taking advantage of
the basic strengths of existing alternative practical standard ways of
representing documents [be that browser viewing, paper publication, sql
search etc.] (html, xml, odf, latex, pdf, sql)
	</text>
</object>
<object id="23">
	<ocn>23</ocn>
	<text class="indent_bullet">
		produces reasonably elegant output in established industry
and institutionally accepted open standard formats, taking advantage
of the different strengths of various standard formats for representing
documents. Amongst the output formats currently supported are:
	</text>
</object>
<object id="24">
	<ocn>24</ocn>
	<text class="indent_bullet1">
		 html - both as a single scrollable text and a segmented document
	</text>
</object>
<object id="25">
	<ocn>25</ocn>
	<text class="indent_bullet1">
		 xhtml
	</text>
</object>
<object id="26">
	<ocn>26</ocn>
	<text class="indent_bullet1">
		 XML - both in sax and dom style xml structures for further
development as required
	</text>
</object>
<object id="27">
	<ocn>27</ocn>
	<text class="indent_bullet1">
		 ODF - open document format, the iso standard for document storage
	</text>
</object>
<object id="28">
	<ocn>28</ocn>
	<text class="indent_bullet1">
		 LaTeX - used to generate pdf
	</text>
</object>
<object id="29">
	<ocn>29</ocn>
	<text class="indent_bullet1">
		 pdf (via LaTeX)
	</text>
</object>
<object id="30">
	<ocn>30</ocn>
	<text class="indent_bullet1">
		 sql - population of an sql database, (at the same object level
that is used to cite text within a document)
	</text>
</object>
<object id="31">
	<ocn>31</ocn>
	<text class="norm">
		Also produces: concordance files; document content certificates (md5 or
sha256 digests of headings, paragraphs, images etc.) and html manifests
(and sitemaps of content). Takes advantage of the strengths
implicit in these very different output types (e.g. PDFs produced
using the typesetting of LaTeX, databases populated with documents at an
individual object/paragraph level, making possible granular search and
related possibilities).
	</text>
</object>
<object id="32">
	<ocn>32</ocn>
	<text class="indent_bullet">
		ensuring content can be cited in a meaningful way regardless of
selected output format. Online publishing (and publishing in multiple
document formats) lacks a useful way of citing text internally within
documents (important to academics generally and to lawyers) as page
numbers are meaningless across browsers and formats. sisu seeks to
provide a common way of pinpointing text within a document (which can
be utilized for citation and by search engines). The outputs share a
common numbering system that is meaningful (to man and machine) across
all outputs, whether paper, screen, or database oriented (pdf,
HTML, xml, sqlite, postgresql); this numbering system can be used to
reference content.
	</text>
</object>
<object id="33">
	<ocn>33</ocn>
	<text class="indent_bullet">
		Granular search within documents. SQL databases are populated at an
object level (roughly headings, paragraphs, verse, tables) and become
searchable with that degree of granularity, the output information
provides the object/paragraph numbers which are relevant across all
generated outputs; it is also possible to look at just the matching
paragraphs of the documents in the database; [output indexing also works
well with search indexing tools like hyperestraier].
	</text>
</object>
<object id="34">
	<ocn>34</ocn>
	<text class="indent_bullet">
		long term maintainability of document collections in a world of
changing formats, having a very sparsely marked-up source document
base. There is a considerable degree of future-proofing: output
representations are "upgradeable", and new document formats may be
added (e.g. the addition of the odf (open document text) module in 2006,
and html5 output at some point in future) without modification of
existing prepared texts.
	</text>
</object>
<object id="35">
	<ocn>35</ocn>
	<text class="indent_bullet">
		SQL search aside, documents are generated as required and static once
generated.
	</text>
</object>
<object id="36">
	<ocn>36</ocn>
	<text class="indent_bullet">
		documents produced are static files, and may be batch processed, this
needs to be done only once but may be repeated for various reasons as
desired (updated content, addition of new output formats, updated
technology document presentations/representations)
	</text>
</object>
<object id="37">
	<ocn>37</ocn>
	<text class="indent_bullet">
		document source (plaintext utf-8) if shared on the net may be used as
input and processed locally to produce the different document outputs
	</text>
</object>
<object id="38">
	<ocn>38</ocn>
	<text class="indent_bullet">
		document source may be bundled together (automatically) with associated
documents (multiple language versions or master document with
inclusions) and images and sent as a zip file called a sisupod, if
shared on the net these too may be processed locally to produce the
desired document outputs
	</text>
</object>
<object id="39">
	<ocn>39</ocn>
	<text class="indent_bullet">
		generated document outputs may automatically be posted to remote sites.
	</text>
</object>
<object id="40">
	<ocn>40</ocn>
	<text class="indent_bullet">
		for basic document generation, the only software dependency is
<b>Ruby</b>, and a few standard Unix tools (this covers plaintext,
HTML, XML, ODF, LaTeX). To use a database you of course need one
installed, and to convert the generated LaTeX to pdf you need a LaTeX
processor like tetex or texlive.
	</text>
</object>
<object id="41">
	<ocn>41</ocn>
	<text class="indent_bullet">
		as a developer's tool it is flexible and extensible
	</text>
</object>
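<!--
The granular search listed among the features above (SQL databases populated
at an object level, with results reported as object numbers) can be sketched
with Python's standard sqlite3 module. The table layout and the function
names build_index and search are assumptions for illustration only; this is
not the schema or query that SiSU itself generates.

```python
import sqlite3

# Hypothetical table layout for this sketch; not the schema SiSU itself creates.
SCHEMA = """
CREATE TABLE objects (
    doc  TEXT    NOT NULL,
    ocn  INTEGER NOT NULL,
    body TEXT    NOT NULL,
    PRIMARY KEY (doc, ocn)
)
"""


def build_index(docs: dict[str, list[str]]) -> sqlite3.Connection:
    """Store every object of every document as its own row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    for name, objects in docs.items():
        rows = [(name, ocn, body) for ocn, body in enumerate(objects, start=1)]
        conn.executemany("INSERT INTO objects VALUES (?, ?, ?)", rows)
    return conn


def search(conn: sqlite3.Connection, term: str) -> list[tuple[str, int]]:
    """Return (document, object number) pairs whose text contains term."""
    cur = conn.execute(
        "SELECT doc, ocn FROM objects WHERE body LIKE ? ORDER BY doc, ocn",
        (f"%{term}%",),
    )
    return cur.fetchall()
```

A result such as ("a", 2) reads as "your search criteria are met in document
a at object 2", and that object number is the same one a reader would use to
cite the passage in any other generated output.
-->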
<object id="42">
	<ocn>42</ocn>
	<text class="norm">
		Syntax highlighting for <b>SiSU</b> markup is available for a number of
text editors.
	</text>
</object>
<object id="43">
	<ocn>43</ocn>
	<text class="norm">
		<b>SiSU</b> is less about document layout than about finding a way with
little markup to be able to construct an abstract representation of a
document that makes it possible to produce multiple representations of
it which may be rather different from each other and used for different
purposes, whether layout and publishing, or search of content
	</text>
</object>
<object id="44">
	<ocn>44</ocn>
	<text class="norm">
		i.e. to be able, from this minimal preparation starting point, to take
advantage of some of the strengths of rather different established
ways of representing documents for different purposes, whether for
search (relational database, or indexed flat files generated for that
purpose whether of complete documents, or say of files made up of
objects), online viewing (e.g. html, xml, pdf), or paper publication
(e.g. pdf)...
	</text>
</object>
<object id="45">
	<ocn>45</ocn>
	<text class="norm">
		The solution arrived at is to extract structural information about
the document (about headings within the document) and to track
objects (which are serialized and also given hash values) in the manner
described. It makes possible representations that are quite different
from those offered at present. For example objects could be saved
individually and identified by their hashes, with an index of how the
objects relate to each other to form a document.
	</text>
</object>
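<!--
The closing idea above (objects saved individually and identified by their
hashes, with an index of how the objects relate to each other to form a
document) can be sketched as follows. This is a speculative illustration of
that possibility, not something the manual says SiSU implements: the
in-memory dict used as a store and the names store_document and
rebuild_document are assumptions.

```python
import hashlib


def store_document(objects: list[str], store: dict[str, str]) -> list[str]:
    """Save each object under its sha256 digest; return the ordered digest index."""
    index = []
    for text in objects:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        store[digest] = text   # identical objects are stored only once
        index.append(digest)
    return index


def rebuild_document(index: list[str], store: dict[str, str]) -> list[str]:
    """Reassemble the document's objects in order from the index."""
    return [store[digest] for digest in index]
```

In this arrangement the index is the document: two documents that share an
object point at the same stored entry, and the position of a digest in the
index plays the role of the object number.
-->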
<object id="0">
	<ocn>0</ocn>
	<text class="h4">
		Endnotes
	</text>
</object>
</body>
</document>