.TH "sisu_introduction" "1" "2007-09-16" "0.59.0" "SiSU"
.SH
SISU \- COMMANDS \ [0.58],
RALPH AMISSAH
.BR

.SH
WHAT IS SISU?
.BR

.SH
DESCRIPTION
.BR

.SH
1. INTRODUCTION \- WHAT IS SISU?
.BR

.BR
.B SiSU
is a system for document markup, publishing (in multiple open standard
formats) and search.

.BR
.B SiSU
[^1] is a[^2] framework for document structuring, publishing and search,
comprising (a) a lightweight document structure and presentation markup
syntax and (b) an accompanying engine that generates standard document format
outputs from documents prepared in sisu markup syntax. The engine is able to
produce multiple standard outputs that (can) share a common numbering system
for the citation of text within a document.

.BR
.B SiSU
is developed under an open source, software libre license (GPL3). It has been
developed in the context of coping with large document sets with evolving
markup related technologies, for which you want multiple output formats, a
common mechanism for cross\-output\-format citation, and search.

.BR
.B SiSU
both defines a markup syntax and provides an engine that produces open
standards format outputs from documents prepared with
.B SiSU
markup. From a single lightly prepared document sisu custom builds several
standard output formats which share a common (text object) numbering system for
citation of content within a document (that also has implications for search).
The sisu engine works with an abstraction of the document's structure and
content from which it is possible to generate different forms of representation
of the document. Significantly,
.B SiSU
markup is more sparse than html, and the outputs, which include html, LaTeX,
landscape and portrait pdfs and Open Document Format (ODF), can all be
added to and updated.
.B SiSU
is also able to populate SQL type databases at an object level, which means
that searches can be made with that degree of granularity. Results of objects
(primarily paragraphs and headings) can be viewed directly in the database, or
just the object numbers shown \- your search criteria are met in these
documents and at these locations within each document.

.BR
Source document preparation and output generation is a two step process: (i)
document source is prepared, that is, marked up in sisu markup syntax and (ii)
the desired output subsequently generated by running the sisu engine against
document source. Output representations, if updated (in the sisu engine), can
be regenerated by re\-running the engine against the prepared source. Using
.B SiSU
markup applied to a document,
.B SiSU
custom builds various standard open output formats including plain text,
HTML, XHTML, XML, OpenDocument, LaTeX and PDF files, and populates an SQL
database with objects[^3] (equating generally to paragraph\-sized chunks) so
searches may be performed and matches returned with that degree of granularity
(e.g. your search criteria are met by these documents and at these locations
within each document). Document output formats share a common object numbering
system for locating content. This is particularly suitable for "published"
works (finalized texts as opposed to works that are frequently changed or
updated) for which it provides a fixed means of reference of content.

.BR
In preparing a
.B SiSU
document you optionally provide semantic information related to the document
in a document header, and in marking up the substantive text provide
information on the structure of the document, primarily indicating heading
levels and footnotes. You also provide information on basic text attributes
where used. The rest is automatic: from this information sisu custom
builds[^4] the different forms of output requested.

.BR
.B SiSU
works with an abstraction of the document based on its structure, which
comprises its frame[^5] and the objects[^6] it contains. This enables
.B SiSU
to represent the document in many different ways, and to take advantage of
the strengths of different ways of presenting documents. The objects are
numbered, and these numbers can be used to provide a common base for citing
material within a document across the different output format types. This is
significant as page numbers are not suited to the digital age: in web
publishing, changing a browser's default font or using a different browser
means that text appears on different pages; and in publishing in different
formats (html, landscape and portrait pdf etc.) page numbers are again of no
use for citing text in a manner that is relevant across the different output
types. Dealing with documents at an object level together with object numbering
also has implications for search.
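
.BR
As an illustration of the shared numbering described above, the following
Python sketch renders the same list of objects as plain text and as html, with
each object carrying the same number in both outputs. It is not taken from the
sisu engine: the function names to_text and to_html, the bracketed number in
the text output and the id prefix used in the html output are all assumptions
made for this example.

.nf
```python
import html


def to_text(objects):
    """Plain-text output: each object is followed by its number in brackets."""
    return [obj + " [" + str(n) + "]" for n, obj in enumerate(objects, start=1)]


def to_html(objects):
    """HTML output: the same number becomes the id of each paragraph."""
    return [
        "<p id=" + chr(34) + "o" + str(n) + chr(34) + ">" + html.escape(obj) + "</p>"
        for n, obj in enumerate(objects, start=1)
    ]


# Hypothetical sample objects (a heading and two paragraphs).
sample = ["A Heading", "First paragraph.", "Second paragraph."]
for line in to_text(sample) + to_html(sample):
    print(line)
```
.fi

.BR
A citation to object 2 then points at the same paragraph whether the reader
holds the text or the html version, which page numbers cannot offer.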

.BR
One of the challenges of maintaining documents is to keep them in a format that
allows users to use them without depending on whichever proprietary software is
popular at the time. Consider the ease of dealing with legacy proprietary
formats today and what guarantee you have that old proprietary formats will
remain (or can be read without proprietary software/equipment) in 15 years'
time, or the way in which html has evolved over its relatively short
span of existence.
.B SiSU
provides the flexibility of outputting documents in multiple non\-proprietary
open formats including html, pdf[^7] and the ISO standard ODF.[^8] Whilst
.B SiSU
relies on software, the markup is uncomplicated and minimalistic, which
helps ensure that future engines can be written to run against it. It is also
easily converted to other formats, which means documents prepared in
.B SiSU
can be migrated to other document formats. Further security is provided by
the fact that the software itself,
.BR SiSU ,
is available under GPL3, a licence that guarantees that the source code will
always be open, and free as in libre, which means that the code base can be
used, updated and further developed as required under the terms of its license.
Another challenge is to keep up with a moving target.
.B SiSU
permits new forms of output to be added as they become important (Open
Document Format text was added in 2006), and existing output to be updated.
Html has evolved and the related module has been updated repeatedly over the
years; presumably when the World Wide Web Consortium (w3c) finalises html 5,
which is currently under development, the html module will again be updated,
allowing all existing documents to be regenerated as html 5.

.BR
The document formats are written to the file\-system and available for indexing
by independent indexing tools, whether off the web like Google and Yahoo or on
the site like Lucene and Hyperestraier.

.BR
.B SiSU
also provides other features such as concordance files and document content
certificates. Working against an abstraction of document structure opens
further possibilities for the research and development of other document
representations: the availability of objects is useful, for example, for topic
maps and for the commercial law thesaurus by Vikki Rogers and Al Kritzer, which
together with the flexibility of
.B SiSU
offers great possibilities.

.BR
.B SiSU
is primarily for published works, which can take advantage of the citation
system to reliably reference its documents.
.B SiSU
works well in a complementary manner with such collaborative technologies as
Wikis, which can take advantage of and be used to discuss the substance of
content prepared in
.BR SiSU .

.BR
<http://www.jus.uio.no/sisu>

.SH
2. HOW DOES SISU WORK?
.BR

.BR
.B SiSU
markup is fairly minimalistic. It consists of: a (largely optional) document
header, made up of information about the document (such as when it was
published, who authored it, and granting what rights) and any processing
instructions; and markup within the substantive text of the document, which is
related to document structure and typeface.
.B SiSU
must be able to discern the structure of a document (text headings and their
levels in relation to each other), either from information provided in the
document header or from markup within the text (or from a combination of both).
Processing is done against an abstraction of the document comprising
information on the document's structure and its objects,[2] which the program
serializes (providing the object numbers) and which are assigned hash sum
values based on their content. This abstraction of information about document
structure, objects (and hash sums) provides considerable flexibility in
representing documents in different ways and for different purposes (e.g.
search, document layout, publishing, content certification, concordance etc.),
and makes it possible to take advantage of some of the strengths of established
ways of representing documents (or indeed to create new ones).
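
.BR
As an illustration of the serialization described above, the following Python
sketch numbers the objects of a small text and assigns each a digest of its
content. It is not taken from the sisu engine: the function name
number_objects, the rule that a blank line separates objects, and the choice of
SHA\-256 as the hash are assumptions made for this example.

.nf
```python
import hashlib

# Hypothetical sample source: a heading and two paragraphs.
SAMPLE = """Heading One

First paragraph,
continued on a second line.

Second paragraph."""


def number_objects(text):
    """Split text into blank-line-separated objects; number and hash each."""
    objects = []
    current = []
    # The trailing empty string flushes the final object.
    for line in text.splitlines() + [""]:
        if line.strip():
            current.append(line.strip())
        elif current:
            content = " ".join(current)
            digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
            objects.append(
                {"number": len(objects) + 1, "text": content, "sha256": digest}
            )
            current = []
    return objects


for obj in number_objects(SAMPLE):
    print(obj["number"], obj["sha256"][:12], obj["text"])
```
.fi

.BR
Because each digest depends only on the content of its object, an unchanged
paragraph keeps the same digest when the document is regenerated.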

.SH
3. SUMMARY OF FEATURES
.BR

.BR
* sparse/minimal markup (clean utf\-8 source texts). Documents are prepared in
a single UTF\-8 file using a minimalistic mnemonic syntax. Typical literary
documents, such as "War and Peace", require almost no markup, and most of the
headers are optional.

.BR
* markup is easily readable/parsable by the human eye (basic markup is simpler
and more sparse than the most basic HTML) [this may also be converted to XML
representations of the same input/source document].

.BR
* markup defines document structure (this may be done once in a header
pattern\-match description, or for heading levels individually); basic text
attributes (bold, italics, underscore, strike\-through etc.) as required; and
semantic information related to the document (header information, extended
beyond the Dublin core and easily further extended as required); the headers
may also contain processing instructions.
.B SiSU
markup is primarily an abstraction of document structure and document
metadata to permit taking advantage of the basic strengths of existing
alternative practical standard ways of representing documents [be that
browser viewing, paper publication, sql search etc.] (html, xml,
odf, latex, pdf, sql).

.BR
* produces reasonably elegant output in established, industry and
institutionally accepted open standard formats,[3] taking advantage of the
different strengths of various standard formats for representing documents.
Amongst the output formats currently supported are:

.BR
  * html \- both as a single scrollable text and a segmented document

.BR
  * xhtml

.BR
  * XML \- both in sax and dom style xml structures for further development as
  required

.BR
  * ODF \- open document format, the iso standard for document storage

.BR
  * LaTeX \- used to generate pdf

.BR
  * pdf (via LaTeX)

.BR
  * sql \- population of an sql database, (at the same object level that is
  used to cite text within a document)

.BR
Also produces: concordance files; document content certificates (md5 or sha256
digests of headings, paragraphs, images etc.) and html manifests (and sitemaps
of content). It takes advantage of the strengths implicit in these very
different output types, e.g. PDFs produced using the typesetting of LaTeX, and
databases populated with documents at an individual object/paragraph level,
making granular search (and related possibilities) possible.

.BR
* ensuring content can be cited in a meaningful way regardless of selected
output format. Online publishing (and publishing in multiple document formats)
lacks a useful way of citing text internally within documents (important to
academics generally and to lawyers) as page numbers are meaningless across
browsers and formats. sisu seeks to provide a common way of pinpointing text
within a document (which can be utilized for citation and by search engines).
The outputs share a common numbering system that is meaningful (to man and
machine) across all digital outputs, whether paper, screen, or database
oriented (pdf, HTML, xml, sqlite, postgresql); this numbering system can be
used to reference content.

.BR
* Granular search within documents. SQL databases are populated at an object
level (roughly headings, paragraphs, verse, tables) and become searchable with
that degree of granularity. The output information provides the
object/paragraph numbers, which are relevant across all generated outputs; it
is also possible to look at just the matching paragraphs of the documents in
the database [output indexing also works well with search indexing tools like
hyperestraier].

.BR
* long term maintainability of document collections in a world of changing
formats, having a very sparsely marked\-up source document base. There is a
considerable degree of future\-proofing: output representations are
"upgradeable", and new document formats may be added without modification of
existing prepared texts, e.g. the addition of an odf (open document text)
module in 2006 and html5 output at some point in future.

.BR
* SQL search aside, documents are generated as required and static once
generated.

.BR
* documents produced are static files, and may be batch processed, this needs
to be done only once but may be repeated for various reasons as desired
(updated content, addition of new output formats, updated technology document
presentations/representations)

.BR
* document source (plaintext utf\-8) if shared on the net may be used as input
and processed locally to produce the different document outputs

.BR
* document source may be bundled together (automatically) with associated
documents (multiple language versions or master document with inclusions) and
images and sent as a zip file called a sisupod, if shared on the net these too
may be processed locally to produce the desired document outputs

.BR
* generated document outputs may automatically be posted to remote sites.

.BR
* for basic document generation, the only software dependencies are
.BR Ruby ,
and a few standard Unix tools (this covers plaintext, HTML, XML, ODF,
LaTeX). To use a database you of course need that, and to convert the LaTeX
generated to pdf, a latex processor like tetex or texlive.

.BR
* as a developer's tool it is flexible and extensible
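
.BR
The granular search listed among the features above can be sketched in Python
with the standard sqlite3 module. This is an illustration only: the table
layout, the column names, the sample rows and the function names build_db and
search are assumptions made for this example and are not the schema or code
used by sisu.

.nf
```python
import sqlite3

# Hypothetical table layout for this sketch; not the schema used by sisu.
SCHEMA = "CREATE TABLE objects (doc TEXT, number INTEGER, body TEXT)"

# Hypothetical sample data: one row per document object.
ROWS = [
    ("doc_a", 1, "Introduction"),
    ("doc_a", 2, "Page numbers differ between output formats."),
    ("doc_a", 3, "Object numbers stay the same in every output."),
    ("doc_b", 1, "Search"),
    ("doc_b", 2, "Matches are reported by object number."),
]


def build_db(rows):
    """Create an in-memory database holding one row per document object."""
    db = sqlite3.connect(":memory:")
    db.execute(SCHEMA)
    db.executemany("INSERT INTO objects VALUES (?, ?, ?)", rows)
    return db


def search(db, term):
    """Return (document, object number) pairs whose text contains term."""
    query = (
        "SELECT doc, number FROM objects "
        "WHERE body LIKE ? ORDER BY doc, number"
    )
    return db.execute(query, ("%" + term + "%",)).fetchall()


db = build_db(ROWS)
for doc, number in search(db, "object number"):
    print(doc, number)
```
.fi

.BR
The result names documents and object numbers rather than pages, so each match
can be located in any of the generated outputs.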

.BR
Syntax highlighting for
.B SiSU
markup is available for a number of text editors.

.BR
.B SiSU
is less about document layout than about finding a way with little markup to
be able to construct an abstract representation of a document that makes it
possible to produce multiple representations of it which may be rather
different from each other and used for different purposes, whether layout and
publishing, or search of content.

.BR
i.e. to be able to take advantage from this minimal preparation starting point
of some of the strengths of rather different established ways of representing
documents for different purposes, whether for search (relational database, or
indexed flat files generated for that purpose whether of complete documents, or
say of files made up of objects), online viewing (e.g. html, xml, pdf), or
paper publication (e.g. pdf)...

.BR
The solution arrived at is to extract structural information about the
document (about headings within the document) and to track objects (which
are serialized and also given hash values) in the manner described. It makes
possible representations that are quite different from those offered at
present. For example objects could be saved individually and identified by
their hashes, with an index of how the objects relate to each other to form a
document.
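
.BR
The last possibility mentioned above, saving objects individually under their
hashes with an index that relates them to a document, can be sketched in
Python as follows. This is a hypothetical illustration rather than an existing
sisu feature: the function names store_document and rebuild_document, the use
of a dictionary as the object store and the choice of SHA\-256 are assumptions
made for this example.

.nf
```python
import hashlib


def store_document(objects, store):
    """Save each object under its digest; return the ordered index of digests."""
    index = []
    for text in objects:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        store[digest] = text      # identical objects are stored only once
        index.append(digest)      # position in the index gives the object order
    return index


def rebuild_document(index, store):
    """Reassemble the objects of a document, in order, from its index."""
    return [store[digest] for digest in index]


store = {}
doc = ["A Heading", "First paragraph.", "Second paragraph.", "First paragraph."]
index = store_document(doc, store)
print(len(index), "objects in the index,", len(store), "objects stored")
```
.fi

.BR
The index preserves the order and repetition of objects, while the store holds
each distinct object once, identified by the hash of its content.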

.SH
DOCUMENT INFORMATION (METADATA)
.BR

.SH
METADATA
.BR

.BR
Document Manifest @
<http://www.jus.uio.no/sisu/sisu_manual/sisu_introduction/sisu_manifest.html>

.BR
.B Dublin Core
(DC)

.BR
.I DC tags included with this document are provided here.

.BR
DC Title:
.I SiSU \- Commands \ [0.58]

.BR
DC Creator:
.I Ralph Amissah

.BR
DC Rights:
.I Copyright (C) Ralph Amissah 2007, part of SiSU documentation, License GPL
3

.BR
DC Type:
.I information

.BR
DC Date created:
.I 2002\-08\-28

.BR
DC Date issued:
.I 2002\-08\-28

.BR
DC Date available:
.I 2002\-08\-28

.BR
DC Date modified:
.I 2007\-09\-16

.BR
DC Date:
.I 2007\-09\-16

.BR
.B Version Information

.BR
Sourcefile:
.I sisu_introduction.sst

.BR
Filetype:
.I SiSU text 0.58

.BR
Sourcefile Digest, MD5(sisu_introduction.sst)=
.I b2a6da5bd22fa1eaa92a08d81f11d1c7

.BR
Skin_Digest:
MD5(/home/ralph/grotto/theatre/dbld/sisu\-dev/sisu/data/doc/sisu/sisu_markup_samples/sisu_manual/_sisu/skin/doc/skin_sisu_manual.rb)=
.I 20fc43cf3eb6590bc3399a1aef65c5a9

.BR
.B Generated

.BR
Document (metaverse) last generated:
.I Sun Sep 23 04:13:42 +0100 2007

.BR
Generated by:
.I SiSU
.I 0.59.0
of 2007w38/0 (2007\-09\-23)

.BR
Ruby version:
.I ruby 1.8.6 (2007\-06\-07 patchlevel 36) \ [i486\-linux]

.TP
.BI 1.
.RB \(dq SiSU
information Structuring Universe" or "Structured information, Serialized
Units";
the name was also chosen for the meaning of the Finnish term "sisu".
.TP
.BI 2.
Unix command line oriented
.TP
.BI 3.
objects include: headings, paragraphs, verse, tables, images, but not
footnotes/endnotes which are numbered separately and tied to the object from
which they are referenced.
.TP
.BI 4.
i.e. the html, pdf, odf outputs are each built individually and optimised for
that form of presentation, rather than for example the html being a saved
version of the odf, or the pdf being a saved version of the html.
.TP
.BI 5.
the different heading levels
.TP
.BI 6.
units of text, primarily paragraphs and headings, also any tables, poems,
code-blocks
.TP
.BI 7.
Specification submitted by Adobe to ISO to become a full open ISO
specification
 <http://www.linux-watch.com/news/NS7542722606.html>
.TP
.BI 8.
ISO/IEC 26300:2006

.TP
Other versions of this document:
.TP
manifest: <http://www.jus.uio.no/sisu/sisu_introduction/sisu_manifest.html>
.TP
html: <http://www.jus.uio.no/sisu/sisu_introduction/toc.html>
.TP
pdf: <http://www.jus.uio.no/sisu/sisu_introduction/portrait.pdf>
.TP
pdf: <http://www.jus.uio.no/sisu/sisu_introduction/landscape.pdf>
.\" .TP
.\" manpage: http://www.jus.uio.no/sisu/sisu_introduction/sisu_introduction.1
.TP
at: <http://www.jus.uio.no/sisu>
.TP
* Last Generated on: Sun Sep 23 04:13:49 +0100 2007
.TP
* SiSU http://www.jus.uio.no/sisu