path: root/data/doc/sisu/org/sisu.org
blob: e7f3e7e94a28d466000cb5aa7efe78800fbcedc5 (plain)
#+TITLE: SiSU
#+AUTHOR: Ralph Amissah
#+EMAIL: ralph.amissah@gmail.com
#+STARTUP: content
#+LANGUAGE: en
#+OPTIONS: H:3 num:nil toc:t \n:nil @:t ::t |:t ^:nil _:nil -:t f:t *:t <:t
#+OPTIONS: TeX:t LaTeX:t skip:nil d:nil todo:t pri:nil tags:not-in-toc
#+OPTIONS: author:nil email:nil creator:nil timestamp:nil
#+PRIORITIES: A F E
#+EXPORT_SELECT_TAGS: export
#+EXPORT_EXCLUDE_TAGS: noexport
#+FILETAGS: :sisu:notes:

* What is SiSU?

SiSU is a document generator that, run against (sisu) marked-up documents,
produces multiple document formats, with a nod to the strengths of each document
format, while making them easily citable across available formats.

** debian/control desc

documents - structuring, publishing in multiple formats and search
 SiSU is a lightweight markup based, command line oriented, document
 structuring, publishing and search, static content tool for document
 collections.
 .
 With minimal preparation of a plain-text (UTF-8) file, using sisu markup syntax
 in your text editor of choice, SiSU can generate various document formats, most
 of which share a common object numbering system for locating content, including
 plain text, HTML, XHTML, XML, EPUB, OpenDocument text (ODF:ODT), LaTeX, PDF
 files, and populate an SQL database with objects (roughly paragraph-sized
 chunks) so searches may be performed and matches returned with that degree of
 granularity. Think of being able to finely match text in documents, using
 common object numbers, across different output formats and across languages if
 you have translations of the same document.  For search, your criteria are met
 by these documents at these locations within each document (equally relevant
 across different output formats and languages). To be clear (if obvious) page
 numbers provide none of this functionality. Object numbering is particularly
 suitable for "published" works (finalized texts as opposed to works that are
 frequently changed or updated) for which it provides a fixed means of reference
 of content. Document outputs can also share provided semantic meta-data.
 .
 SiSU also provides concordance files, document content certificates and
 manifests of generated output and the means to make book indexes that make use
 of its object numbering.
 .
 Syntax highlighting and folding (outlining) files are provided for the Vim and
 Emacs editors.
 .
 Dependencies for various features are taken care of in sisu related packages.
 The package sisu-complete installs the whole of SiSU.
 .
 Additional document markup samples are provided in the package
 sisu-markup-samples which is found in the non-free archive. The license for
 the substantive content of each marked up document provided is that provided
 by the author or original publisher.
 .
 SiSU uses utf-8 & parses left to right. Currently supported languages:
 am bg bn br ca cs cy da de el en eo es et eu fi fr ga gl he hi hr hy ia is it
 ja ko la lo lt lv ml mr nl nn no oc pl pt pt_BR ro ru sa se sk sl sq sr sv ta
 te th tk tr uk ur us vi zh (see XeTeX polyglossia & cjk)
 .
 SiSU works well under po4a translation management, for which an administrative
 sample Rakefile is provided with sisu_manual under markup-samples.

** take two

SiSU may be regarded as an open access document publishing platform, applicable
to a modest but substantial domain of documents (typically law and literature,
but also some forms of technical writing), that is tasked to address certain
challenges I identified as being of interest to me over the years in open
publishing.

The idea and implementation may be of interest, as some of the issues
encountered, which it seeks to address, are known and common to such
endeavors. Amongst them:

 * how do you ensure what you do now can be read in decades?
 * how do you keep up with new and changing technologies?
 * do you select a canonical format to represent your documents, if so
   what?
 * how do you reliably cite (locate) material in different document
   representations?
 * how do you deal with multilingual texts?
 * what of search?
 * how are documents contributed to the collection?

(These questions are selected to help describe the direction of efforts with
regard to sisu.)

*** My Dabblings in the Domain of Open Publishing

The system is called SiSU. It is an offshoot of my early efforts at finding out
what to make of the web, which started at the University of Tromsø in 1993 (an
early law website Ananse / International Trade Law Project / Lex Mercatoria). I
have worked on SiSU continually since 1997 and it has been open source since
2005 (under a license called GPL3+), though I remain its developer.

In working in this field I have had to address some of the common issues.

So how do you ensure what you do now can be read in decades to come? There are
alternative solutions: (i) stick with a widely used and not overly complicated,
well documented open standard, and for that the likes of odf is an excellent
choice; (ii) alternatively go for the most basic representation of a document
that meets your needs, in my case based on UTF-8 text and some markup tags,
fairly easily parsable by the human eye; as long as UTF-8 is in use it will
always be possible to extract the information.

How do you keep up with new and changing technologies? Here my solution has
been to generate new versions of the substantive content so as to always have
the latest document representations available e.g. HTML has changed a lot over
the years, different specifications come out for various formats including ODF,
electronic readers have become an important viewing alternative, introducing
the open reader format EPUB. Output representations are generated from source
documents.  Different open document file formats can be produced and databases
and search engines populated. (The source documents and interpreter are all
that are required to re-create site content. Source documents can be made
public or retained privately). The strict separation of a simple source
document from the output produced, means that with updates to SiSU (the
interpreter/processor/generator), outputs can be updated technically as
necessary, and new output formats added when needed. Amongst the output formats
currently supported are HTML, LaTeX generated Pdfs (A4, letter, other;
landscape, portrait), Epub, Open Document Format text. Returning to HTML as an
example, it has changed a lot over the years I have worked with it, this way of
working has meant it is possible to keep producing current versions of HTML,
retaining the original substantive document... and new formats have been added
as thought desired. There is no attempt to make output in different document
formats/ representations look alike let alone identical. Rather the attempt is
to optimize output for the particular document filetype, (there is no reason
why an epub document would look or behave like an open document text or that a
Pdf would look like HTML output; rather PDF is optimized for paper viewing,
HTML for screen etc.)  Wherever possible features associated with the
particular output type are taken advantage of. This freedom is made possible to
a large extent by the answer to the question that follows.

How do you reliably cite (locate) material in different document
representations? The traditional answer has been to have a canonical
publication, and resulting fixed page numbers. This was not a viable solution
for HTML (which changes from one viewer to another and with selectable font
faces & size etc.); nor is it otherwise ideal in an electronic age with the
possibility of presenting/interacting with material/documents in so many
different ways. Why be so restricted? Here my solution has been "object
citation numbering".  What the various generated document formats have in
common is a shared object numbering system that identifies the location of text
and that is available for citation purposes. Object numbers are: sequential
numbers assigned to each identified object in a document. Objects are logical
units of text (or equivalent parts of a document), usually paragraphs, but also
document headings, tables, images, in a poem a verse etc.  [In an electronic
publishing age are page numbers the best we can come up with?  Change font
type, font size, page orientation, paper size (sometimes even the viewer) and
where are you with them? And paper though a favorite medium of mine is no
longer the sole (or sometimes primary) means of interacting with documents/text
or of sharing knowledge]

What object numbers mean (unlike page numbers) is e.g.

 * if you cite text in any format, the resulting output can be reliably located
   in any other document format type. Cite HTML and the reader can choose to
   view in Epub or Pdf (the PDFs being an independent output, generated by
   book publishing software XeTeX/LaTeX).

 * if you do a search, you can be given a result "index" indicating that your
   search criteria are met by these documents, and at these specific locations
   within each document, and the "index" is relevant not only for content
   within the database, but for all document formats.

 * if you have a translated text prepared for sisu, then your citations are
   relevant across languages e.g. you can specify exactly where in a Chinese
   document text is to be found.

 * generated document index references & concordance list references etc. are
   relevant across all output formats.

What of search? For search, see the implications of object numbers for search
mentioned above. The system currently loads an SQL server (Postgresql) with
object sized text chunks. It could just as well populate an analytical engine
with larger sections or chapters of text for analytical purposes (such as the
currently popular Elasticsearch), whilst availing itself also of the concept of
objects and object numbers in search results.

How do you deal with multilingual texts? If you have translated text prepared
for sisu, then your citations are relevant across languages.  Object numbers
also provide an easy way to compare, discuss text (translations) across
languages. Text found/cited in one language has the same object number in its
translations, a given paragraph will be the same in another language, just
change the language code. (Documents are prepared in UTF-8; current language
restrictions arise from the use of the LaTeX tools Polyglossia & CJK (Chinese,
Japanese & Korean), and from the fact that sisu parses left to right.)
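
The idea that a given object keeps its number across translations can be
sketched as follows. This is only an illustration, not how sisu stores
documents: the dictionary layout, the sample sentences and the helper names
(documents, locate, parallel) are all assumptions made for the example.

```python
# Hypothetical data: one document in two languages, each keyed by object number.
documents = {
    "en": {1: "Introduction", 2: "The parties must act in good faith."},
    "fr": {1: "Introduction", 2: "Les parties doivent agir de bonne foi."},
}


def locate(ocn, lang):
    """Return the text of object number `ocn` in the given language."""
    return documents[lang][ocn]


def parallel(ocn):
    """Return the same object in every language that has it."""
    return {lang: objs[ocn] for lang, objs in documents.items() if ocn in objs}


# A citation made against the English text points at the same place in French.
print(locate(2, "en"))
print(locate(2, "fr"))
```

Changing only the language code while keeping the object number is all that is
needed to move between the versions.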

How are materials prepared for contribution to the collection? (a) The easiest
solution if the system allows is for submission in the format in which work is
authored, usually a word processor, for which odf may be a decent selection.
(b) I have stuck with enhanced plaintext, UTF-8 with minimal markup.  Source
documents are prepared in UTF-8 text, with a minimalist native markup to
indicate the document structure (headings and their relative levels),
footnotes, and other document "features". This markup is easily parsable to the
human eye, and plays well with version control systems. Documents are prepared
in a text editor. Front ends such as markup assistants in a word processor that
can save to sisu text format, or other tools, whilst possible, do not exist. [(c)
yet another form of submission for collaborative work are wikis which have
shown their strength in efforts such as Wikipedia.]

The system has proven to be a good testing ground for ideas and is flexible and
extensible. (things that could usefully be done: apart from a front end for
simpler user interaction; feed text to an analytical search engine, like
Elasticsearch/Lucene; it still needs a bibliography parser (auto-generation of
a bibliography from footnotes); and it might be useful to allow rough
auto-translation of documents on the fly by passing text through a translator
(such as Google translate)).

In any event, my resulting technical opinions (in my modest domain of
action) may be regarded as encapsulated within SiSU
[http://www.sisudoc.org/]

http://www.sisudoc.org/
http://www.jus.uio.no/sisu/

git clone git://git.sisudoc.org/git/code/sisu.git --branch upstream
http://git.sisudoc.org/gitweb/?p=code/sisu.git;a=summary
(there may be additional commits in the upstream branch)
git clone --depth 1 git://git.sisudoc.org/git/code/sisu.git --branch upstream

git clone git://git.sisudoc.org/git/doc/sisu-markup-samples.git --branch upstream
git clone --depth 1 git://git.sisudoc.org/git/doc/sisu-markup-samples.git --branch upstream

Development work is on Linux and the easiest way to install it is through the
Debian Linux package as this takes care of optional external dependencies such
as XeTeX for PDF output and Postgresql or Sqlite for search.

** multiple document formats

Text can be represented in multiple output formats with different
characteristics that are (or may be) regarded as strengths/advantages and
therefore preferred in different contexts.

Given the different strengths and characteristics of various output formats, it
makes little sense to try too hard to make different representations of a
document look the same. More interesting is to have document representations
that take advantage of each given output's strengths. As valuable, if not more
so, is the ability to cite, find and discuss text with ease, across the
different output formats.

For citation across output formats, SiSU uses object citation numbers.

** document structure and document objects

SiSU breaks marked up text into document structure and objects.

Document structure is the document heading hierarchy (once the document header
has been separated out).

*** What are document objects?

An object is an identified meaningful unit of a document, most commonly a
paragraph of text, but also for example a table, code block, verse or image.

SiSU tracks these substantive document units as document objects (and their
relationship to the document structure).

** object citation numbers

*** What are object citation numbers?

An object citation number is a sequential number assigned to a document object.

In sisu, output documents share this common object numbering system (dubbed
"object citation numbering" (ocn)) that is meaningful (machine & human readable)
across various digital outputs whether paper, screen, or database oriented,
(PDF, html, XML, EPUB, sqlite, postgresql), and across multilingual content if
prepared appropriately. This numbering system can be used to reference content
across output types.
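
A minimal sketch of sequential object numbering follows. It assumes, purely for
illustration, that objects are blocks of text separated by blank lines; real
sisu markup distinguishes headings, tables, verse and so on, and the function
name number_objects is hypothetical.

```python
import re


def number_objects(text):
    """Split text on blank lines and number each non-empty block from 1."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    return list(enumerate(blocks, start=1))


sample = "Chapter One\n\nFirst paragraph of text.\n\nSecond paragraph of text."
objects = number_objects(sample)
for ocn, block in objects:
    print(ocn, block)
```

Because the numbers are attached to the objects rather than to a page layout,
every output generated from the same source can carry the same numbers.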

*** Why might I want object citation numbering?

The ability to cite and quickly locate text can be invaluable, if not essential
(whether for instruction or discussion).

In this digital & Internet age we have multiple ways to represent documents and
multiple document output formats as options with different characteristics,
strengths/advantages etc. We need a way to cite text that works and is relevant
independent of the document format used.

I want to discuss (cite) html text. How do I do this?
How do I refer to / cite / discuss text in html?
Issue: html may be viewed online or printed; it is not tied to paper (as
e.g. pdf is) and prints differently depending on selected font face and font
size.

I want to discuss (cite) text that is available in multiple formats (e.g. pdf,
epub, html) without having to worry about the output format that is referred
to.
How do I refer to / discuss text that is available in more than one format,
uncertain of what format is preferred, used or available to my colleagues?
e.g. html and epub or pdf have rather different text representations, how do I
discuss ...

I would like to have a book index that is relevant (can be used) across multiple
output formats (e.g. pdf, epub, html)

How do I make a book index (or a concordance file) that works across multiple
output formats?

I would like to have search results indicating where in a document matches are
found and I would like it to be relevant across available output formats (e.g.
pdf, epub, html)
How do I get search results for locations of text within each relevant document?

I would like to be able to discuss a text that has been translated ...
How do I find text across languages?
Where I have a nicely translated document, how do I point to or discuss with my
foreign language counterpart some detail of the text, or how do I point my
foreign language counterpart to the text I would like to bring to their
attention?

** "Granular" Search

Of interest is the ease of streaming documents to a relational database, at an
object (roughly paragraph) level and the potential for increased precision in
the presentation of matches that results thereby. The ability to serialize
html, LaTeX, XML, SQL, (whatever) is also inherent in / incidental to the
design.
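
The sketch below illustrates object-level search with an in-memory SQLite
database. The table layout (doc, ocn, body), the sample rows and the search
helper are assumptions made for the example and are not the schema sisu
actually creates.

```python
import sqlite3

# Hypothetical schema: one row per document object (roughly a paragraph).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (doc TEXT, ocn INTEGER, body TEXT)")
rows = [
    ("doc_a", 1, "Free software and open standards"),
    ("doc_a", 2, "Page numbers change with font size"),
    ("doc_b", 1, "Object numbers stay fixed across formats"),
    ("doc_b", 2, "Search returns object numbers"),
]
conn.executemany("INSERT INTO objects VALUES (?, ?, ?)", rows)


def search(term):
    """Return (document, object number) pairs whose text contains `term`."""
    cur = conn.execute(
        "SELECT doc, ocn FROM objects WHERE body LIKE ? ORDER BY doc, ocn",
        (f"%{term}%",),
    )
    return cur.fetchall()


matches = search("numbers")
for doc, ocn in matches:
    print(doc, ocn)
```

Each match names a document and an object number, so the result can be followed
into any output format generated from the same source.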

** Summary

SiSU information Structuring Universe / Structured information, Serialized Units
<www.sisudoc.org> or <www.jus.uio.no/sisu/>

Software for electronic texts, document collections, books, digital libraries,
and search, with "atomic search" and a text positioning system (shared text
citation numbering: "ocn").

Outputs include: plaintext, html, XHTML, XML, ODF (OpenDocument), EPUB, LaTeX,
PDF, SQL (PostgreSQL and SQLite).

** SiSU Short Description

SiSU is a comprehensive future-resilient electronic document management system.
Built-in search capabilities allow you to search across multiple documents and
highlight matches in an easy-to-follow format. A paragraph numbering system
allows you to cite your electronic documents in a consistent manner across
multiple file formats. Multiple format outputs allow you to display your
documents in plain text, PDF (portrait and landscape), OpenDocument format,
HTML, or e-book reading format (EPUB). Word mapping allows you to easily create
word indexes for your documents. Future-resilient flexibility allows you to
quickly adapt your documents to newer output formats as needed. All these and
many other features are achieved with little or no additional work on your
documents: you mark up the documents with a very simple markup language and
leave the heavy lifting to the SiSU engine.

Potential users of SiSU include individual authors who want to publish their
books or articles electronically to reach a broad audience, web publishers who
want to provide multiple channels of access to their electronic documents, or
any organizations which centrally manage a medium or large set of electronic
documents, especially governmental organizations which may prefer to keep their
documents in easily accessible yet non-proprietary formats.

SiSU is an Open Source project initiated and led by Ralph Amissah
<ralph.amissah@gmail.com>, who can be contacted via the mailing list
<sisu@lists.sisudoc.org> (see <http://lists.sisudoc.org/listinfo/sisu>). SiSU is
licensed under the GNU General Public License.

*** notes

For less markup than the most elementary HTML you can have more.  SiSU -
Structured information, Serialized Units for electronic documents, is an
information structuring, transforming, publishing and search framework with the
following features:

(i) markup syntax: (a) simpler than html, (b) mnemonic, influenced by
mail/messaging/wiki markup practices, (c) human readable, and easily writable,

(ii) (a) minimal markup requirement, (b) single file marked up for multiple outputs,

 * documents are prepared in a single UTF-8 file using a minimalistic mnemonic
syntax. Typical literature documents, like "War and Peace", require almost no
markup, and most of the headers are optional.

 * markup is easily readable/parsed by the human eye, (basic markup is simpler
and more sparse than the most basic html), [this may also be converted to XML
representations of the same input/source document].

 * markup defines document structure (this may be done once in a header
pattern-match description, or for heading levels individually); basic text
attributes (bold, italics, underscore, strike-through etc.) as required; and
semantic information related to the document (header information, extended
beyond the Dublin core and easily further extended as required); the headers
may also contain processing instructions.

(iii) (a) multiple output formats, including amongst others: plaintext (UTF-8);
html; (structured) XML; ODF (Open Document text); EPUB; LaTeX; PDF (via LaTeX);
SQL type databases (currently PostgreSQL and SQLite). SiSU produces:
concordance files; document content certificates (md5 or sha256 digests of
headings, paragraphs, images etc.) and html manifests (and sitemaps of
content). (b) takes advantage of the strengths implicit in these very different
output types, (e.g. PDFs produced using typesetting of LaTeX, databases
populated with documents at an individual object/paragraph level, making
possible granular search (and related possibilities))

(iv) outputs share a common numbering system (dubbed "object citation
numbering" (ocn)) that is meaningful (to man and machine) across various
digital outputs whether paper, screen, or database oriented, (PDF, html, XML,
EPUB, sqlite, postgresql), this numbering system can be used to reference
content.

(v) SQL databases are populated at an object level (roughly headings,
paragraphs, verse, tables) and become searchable with that degree of
granularity, the output information provides the object/paragraph numbers which
are relevant across all generated outputs; it is also possible to look at just
the matching paragraphs of the documents in the database; [output indexing also
works well with search indexing tools like hyperestraier].

(vi) use of semantic meta-tags in headers permits the addition of semantic
information on documents, (the available fields are easily extended)

(vii) creates organised directory/file structure for (file-system) output,
easily mapped with its clearly defined structure, with all text objects
numbered, you know in advance where in each document output type, a bit of text
will be found (e.g. from an SQL search, you know where to go to find the
prepared html output or PDF etc.)... there is more; easy directory management
and document associations, the document preparation (sub-)directory may be used
to determine output (sub-)directory, the skin used, and the SQL database used,

(viii) "Concordance file" wordmap, consisting of all the words in a document
and their (text/ object) locations within the text, (and the possibility of
adding vocabularies),

(ix) document content certification and comparison considerations: (a) the
document and each object within it stamped with an sha256 hash making it
possible to easily check or guarantee that the substantive content of a document
is unchanged, (b) version control, documents integrated with time based source
control system, default RCS or CVS with use of $Id$ tag, which SiSU checks

(x) SiSU's minimalist markup makes for meaningful "diffing" of the substantive
content of markup-files,

(xi) easily skinnable, document appearance on a project/site wide, directory
wide, or document instance level easily controlled/changed,

(xii) in many cases a regular expression may be used (once in the document
header) to define all or part of a document's structure obviating or reducing
the need to provide structural markup within the document,

(xiii) prepared files may be batch processed, documents produced are static files
so this needs to be done only once but may be repeated for various reasons as
desired (updated content, addition of new output formats, updated technology
document presentations/representations)

(xiv) possible to pre-process, which permits: the easy creation of standard
form documents, and templates/term-sheets, or; building of composite documents
(master documents) from other sisu marked up documents, or marked up parts,
i.e. import documents or parts of text into a main document should this be
desired

(xv) there is a considerable degree of future-resilience, output representations
are "upgradeable", and new document formats may be added: (a) modular, (thanks
in no small part to Ruby) another output format required, write another
module.... (b) easy to update output formats (eg html, XHTML, LaTeX/PDF
produced can be updated in program and run against whole document set), (c)
easy to add, modify, or have alternative syntax rules for input, should you
need to,

(xvi) scalability, dependent on your file-system (ext3, Reiserfs, XFS,
whatever) and on the relational database used (currently Postgresql and
SQLite), and your hardware,

(xvii) only marked up files need be backed up, to secure the larger document
set produced,

(xviii) document management,

(xix) Syntax highlighting for SiSU markup is available for a number of text
editors.

(xx) remote operations: (a) run SiSU on a remote server, (having prepared sisu
markup documents locally or on that server, i.e. this solution where sisu is
installed on the remote server, would work whatever type of machine you chose
to prepare your markup documents on), (b) generated document outputs may be
posted by sisu to remote sites (using rsync/scp) (c) document source (plaintext
utf-8) if shared on the net may be identified by its url and processed locally
to produce the different document outputs.

(xxi) document source may be bundled together (automatically) with associated
documents (multiple language versions or master document with inclusions) and
images and sent as a zip file called a sisupod, if shared on the net these too
may be processed locally to produce the desired document outputs, these may be
downloaded, shared as email attachments, or processed by running sisu against
them, either using a url or the filename.

(xxii) for basic document generation, the only software dependency is Ruby, and
a few standard Unix tools (this covers plaintext, html, XML, ODF, EPUB, LaTeX).
To use a database you of course need that, and to convert the LaTeX generated
to PDF, a LaTeX processor like tetex or texlive.

As a developer's tool it is flexible and extensible.
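
Feature (viii) in the list above, the concordance wordmap, can be illustrated
with a short sketch. It assumes objects are already available as (object
number, text) pairs; the function name concordance and the sample data are
hypothetical, and sisu's own concordance output is considerably richer.

```python
import re
from collections import defaultdict


def concordance(objects):
    """Map each lower-cased word to the sorted object numbers where it occurs."""
    index = defaultdict(set)
    for ocn, text in objects:
        # [^\W\d_]+ matches runs of letters, including non-ASCII letters.
        for word in re.findall(r"[^\W\d_]+", text.lower()):
            index[word].add(ocn)
    return {word: sorted(ocns) for word, ocns in sorted(index.items())}


sample_objects = [
    (1, "Of Liberty"),
    (2, "Liberty of thought"),
    (3, "Thought and discussion"),
]
wordmap = concordance(sample_objects)
for word, ocns in wordmap.items():
    print(word, ocns)
```

Because the locations are object numbers rather than page numbers, the same
wordmap applies to every output format.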

** description

SiSU ("SiSU information Structuring Universe" or "Structured information,
Serialized Units"),1 is a Unix command line oriented framework for document
structuring, publishing and search. Featuring minimalistic markup, multiple
standard outputs, a common citation system, and granular search.  Using markup
applied to a document, SiSU can produce plain text, HTML, XHTML, XML,
OpenDocument, LaTeX or PDF files, and populate an SQL database with objects2
(equating generally to paragraph-sized chunks) so searches may be performed and
matches returned with that degree of granularity (e.g. your search criteria is
met by these documents and at these locations within each document). Document
output formats share a common object numbering system for locating content.
This is particularly suitable for "published" works (finalized texts as opposed
to works that are frequently changed or updated) for which it provides a fixed
means of reference of content.  How it works

SiSU markup is fairly minimalistic. It consists of: a (largely optional)
document header, made up of information about the document (such as when it was
published, who authored it, and granting what rights) and any processing
instructions; and markup within text which is related to document structure and
typeface. SiSU must be able to discern the structure of a document (text
headings and their levels in relation to each other), either from information
provided in the instruction header or from markup within the text (or from a
combination of both). Processing is done against an abstraction of the document
comprising information on the document's structure and its objects,[2] which
the program serializes (providing the object numbers) and which are assigned
hash sum values based on their content. This abstraction of information about
document structure, objects (and hash sums) provides considerable flexibility
in representing documents in different ways and for different purposes (e.g.
search, document layout, publishing, content certification, concordance etc.),
and makes it possible to take advantage of some of the strengths of established
ways of representing documents (or indeed to create new ones).
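
The idea of numbered, hashed objects can be illustrated with a minimal Ruby
sketch. It is not SiSU's own abstraction code: the names DocObject and
serialize_objects are hypothetical, objects are taken to be blocks of text
separated by empty lines, and SHA-256 is an assumed choice of hash sum.

```ruby
require 'digest'

# Hypothetical illustration only, not SiSU's actual document abstraction.
DocObject = Struct.new(:ocn, :text, :digest)

# Split text into paragraph-sized objects (blocks separated by empty lines),
# number them sequentially and attach a digest of each object's content.
# SHA-256 is an assumption made for this sketch.
def serialize_objects(text)
  text.split(/\n\s*\n/)
      .map(&:strip)
      .reject(&:empty?)
      .each_with_index
      .map { |para, i| DocObject.new(i + 1, para, Digest::SHA256.hexdigest(para)) }
end

sample = "First paragraph.\n\nSecond paragraph,\nspanning two lines.\n\nThird paragraph."
serialize_objects(sample).each do |o|
  # Print the object number, a shortened digest and the first line of text.
  puts format('%d %s %s', o.ocn, o.digest[0, 12], o.text.lines.first.chomp)
end
```

Because the digest depends only on an object's content, an unchanged paragraph
keeps the same hash sum across versions of a document, while its object number
gives a location that is the same in every output format.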

[1] Also chosen for the meaning of the Finnish term "sisu".

[2] Objects include: headings, paragraphs, verse, tables, images, but not
footnotes/endnotes, which are numbered separately and tied to the object from
which they are referenced.

More information on SiSU is provided at: <www.sisudoc.org/sisu/SiSU>

SiSU was developed in relation to legal documents, and is strong across a wide
variety of texts (law, literature and the humanities, and part of the social
sciences). SiSU handles images but is not at this time suitable for formulae or
statistics, or for technical writing.

SiSU has been developed and has been in use for several years. Requirements to
cover a wide range of documents within its use domain have been explored.

<ralph@amissah.com>
<ralph.amissah@gmail.com>
<sisu@lists.sisudoc.org>
<http://lists.sisudoc.org/listinfo/sisu>
2010
w3 since October 3 1993

* Finding SiSU
** source

http://git.sisudoc.org/gitweb/

*** sisu

sisu git repo:
http://git.sisudoc.org/gitweb/?p=code/sisu.git;a=summary

**** most recent source without repo history

git clone --depth 1 git://git.sisudoc.org/git/code/sisu.git --branch upstream

**** full clone

git clone git://git.sisudoc.org/git/code/sisu.git --branch upstream

*** sisu-markup-samples git repo:

http://git.sisudoc.org/gitweb/?p=doc/sisu-markup-samples.git;a=summary

** mailing list

sisu at lists.sisudoc.org
http://lists.sisudoc.org/listinfo/sisu

** irc oftc #sisu

** home pages
  <http://www.sisudoc.org/>
  <http://search.sisudoc.org/>
  <http://www.jus.uio.no/sisu>

* Installation

** where you take responsibility for having the correct dependencies

Provided you have *Ruby*, *SiSU* can be run.

SiSU should be run from the directory containing your sisu marked up document
set.

This works so long as you already have the sisu external dependencies in
place. For many operations such as html, epub, odt this is likely to be fine.
Note however, that additional external package dependencies, such as texlive
(for pdfs), sqlite3 or postgresql (for search), are not taken care of for you,
should you wish to use them.

*** run off the source tarball without installation

**** 1. Obtain the latest sisu source

using git:

http://git.sisudoc.org/gitweb/?p=code/sisu.git;a=summary
http://git.sisudoc.org/gitweb/?p=code/sisu.git;a=log

  git clone git://git.sisudoc.org/git/code/sisu.git --branch upstream
  git clone --depth 1 git://git.sisudoc.org/git/code/sisu.git --branch upstream

or, identify latest available source:

https://packages.debian.org/sid/sisu
http://packages.qa.debian.org/s/sisu.html
http://qa.debian.org/developer.php?login=sisu@lists.sisudoc.org

http://sisudoc.org/sisu/archive/pool/main/s/sisu/

and download the source tarball:

  sisu_5.4.5.orig.tar.xz

using the Debian tool dget:

The dget tool is included in the devscripts package
https://packages.debian.org/search?keywords=devscripts
so to get dget, install devscripts:

  apt-get install devscripts

and then you can get it from Debian:
  dget -xu http://ftp.fi.debian.org/debian/pool/main/s/sisu/sisu_5.4.5-1.dsc

or off sisu repos
  dget -x http://www.jus.uio.no/sisu/archive/pool/main/s/sisu/sisu_5.4.5-1.dsc
or
  dget -x http://sisudoc.org/sisu/archive/pool/main/s/sisu/sisu_5.4.5-1.dsc

**** 2. Unpack the source

Provided you have *Ruby*, *SiSU* can be run without installation straight from
the source package directory tree.

Run ruby against the full path to bin/sisu (in the unzipped source package
directory tree). SiSU should be run from the directory containing your sisu
marked up document set.

  ruby ~/sisu-5.4.5/bin/sisu --html -v document_name.sst

*** gem install (with rake)

(i) create the gemspec; (ii) build the gem (from the gemspec); (iii) install
the gem

Provided you have ruby & rake, this can be done with the single command:

  rake gem_create_build_install

to build and install sisu v5 & sisu v6, alias gemcbi

separate gems are made/installed for sisu v5 & sisu v6 contained in source.

to build and install sisu v5, alias gem5cbi:

  rake gem_create_build_install_stable

to build and install sisu v6, alias gem6cbi:

  rake gem_create_build_install_unstable

for individual steps (create, build, install) see the rake options: rake -T

to specify the sisu version to run, for sisu installed via gem:

  gem search sisu

  sisu _5.4.5_ --version

  sisu _6.0.11_ --version

to uninstall sisu installed via gem

  sudo gem uninstall --verbose sisu

For a list of alternative actions you may type:

  rake help

  rake -T

Rake: <http://rake.rubyforge.org/> <http://rubyforge.org/frs/?group_id=50>

*** installation with setup.rb

This is a three step process. In the root directory of the unpacked *SiSU*
type the following (the install step as root):

ruby setup.rb config
ruby setup.rb setup
#[as root:]
ruby setup.rb install

or as a single command:

  ruby setup.rb config && ruby setup.rb setup && sudo ruby setup.rb install

further information:
<http://i.loveruby.net/en/projects/setup/>
<http://i.loveruby.net/en/projects/setup/doc/usage.html>

** Debian install

*SiSU* is available off the *Debian* archives. It should be necessary only to
run as root, using apt-get:

  apt-get update

  apt-get install sisu-complete

(all sisu dependencies should be taken care of)

If there are newer versions of *SiSU* upstream, they will be available by
adding the following to your sources list /etc/apt/sources.list

#/etc/apt/sources.list

deb http://www.jus.uio.no/sisu/archive unstable main non-free
deb-src http://www.jus.uio.no/sisu/archive unstable main non-free

The non-free section is for sisu markup samples provided, which contain
authored works the substantive text of which cannot be changed, and which as a
result do not meet the debian free software guidelines.

*SiSU* is developed on *Debian*, and packages are available for *Debian* that
take care of the dependencies encountered on installation.

The package is divided into the following components:

  *sisu*, the base code, (the main package on which the others depend), without
  any dependencies other than ruby (and for convenience the ruby webrick web
  server), this generates a number of types of output on its own, other
  packages provide additional functionality, and have their dependencies

  *sisu-complete*, a dummy package that installs the whole of greater sisu as
  described below, apart from sisu -examples

  *sisu-pdf*, dependencies used by sisu to produce pdf from generated /LaTeX/

  *sisu-postgresql*, dependencies used by sisu to populate postgresql database
  (further configuration is necessary)

  *sisu-sqlite*, dependencies used by sisu to populate sqlite database

  *sisu-markup-samples*, sisu markup samples and other miscellany (under
  *Debian* Free Software Guidelines non-free)

  *SiSU* is available off Debian Unstable and Testing
  <http://packages.debian.org/cgi-bin/search_packages.pl?searchon=names&subword=1&version=all&release=all&keywords=sisu>
  install it using apt-get, aptitude or alternative *Debian* install tools.

** Arch Linux

* sisu markup                                                          :sisu:

** markup                                                            :markup:

*** sisu document parts

- header
  - metadata
  - make instructions
- substantive (& other) content
  (sisu markup)
- endnotes
  (markup within substantive content)
- glossary
  (section, special markup)
- bibliography
  (section, special markup)
- book index
  (markup attached to substantive content objects)

|---------------------+-----------------------------------------------------------------------+------------------------+--------|
| header              | sisu /header markup/                                                    | markup                 |        |
| - metadata          |                                                                       |                        |        |
| - make instructions |                                                                       |                        |        |
|---------------------+-----------------------------------------------------------------------+------------------------+--------|
| substantive content | sisu /content markup/                                                   | markup                 | output |
|                     | headings (providing document structure), paragraphs,                  | (regular content)      |        |
|                     | blocks (code, poem, group, table)                                     |                        |        |
|---------------------+-----------------------------------------------------------------------+------------------------+--------|
| endnotes            | markup within substantive content                                     | markup                 | output |
|                     | (extracted from sisu /content markup/)                                  | (from regular content) |        |
|---------------------+-----------------------------------------------------------------------+------------------------+--------|
| glossary            | identify special section, regular /content markup/                      | markup                 | output |
|---------------------+-----------------------------------------------------------------------+------------------------+--------|
| bibliography        | identify section, special /bibliography markup/                         | markup                 | output |
|---------------------+-----------------------------------------------------------------------+------------------------+--------|
| book index          | extracted from markup attached to related substantive content objects | markup                 | output |
|                     | (special tags in sisu /content markup/)                                 | (from regular content) |        |
|---------------------+-----------------------------------------------------------------------+------------------------+--------|
| metadata            |                                                                       | (from regular header)  | output |
|---------------------+-----------------------------------------------------------------------+------------------------+--------|

*** structure - headings, levels

- headings (A-D, 1-3)

'A~ ' NOTE title level

'B~ ' NOTE optional
'C~ ' NOTE optional
'D~ ' NOTE optional

'1~ ' NOTE chapter level
'2~ ' NOTE optional
'3~ ' NOTE optional

  * node
    * parent
    * children
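
As an illustration of how these heading markers can carry document structure,
here is a small Ruby sketch. It is not taken from SiSU: parse_heading, toc and
the numeric depths (A-D as 1-4, 1-3 as 5-7) are assumptions made for the
example, and only the simple 'X~ title' form listed above is recognised.

```ruby
# Hypothetical helper, not part of SiSU: map the heading markers listed above
# to a nesting depth. The numbering (A-D => 1-4, 1-3 => 5-7) is an assumption.
HEADING_LEVELS = %w[A B C D 1 2 3].each_with_index.map { |mark, i| [mark, i + 1] }.to_h.freeze

Heading = Struct.new(:marker, :depth, :title)

# Return a Heading for a line such as "1~ Chapter one", or nil for other lines.
def parse_heading(line)
  m = /\A([A-D1-3])~ (.+)\z/.match(line.chomp)
  m && Heading.new(m[1], HEADING_LEVELS.fetch(m[1]), m[2].strip)
end

# Build an indented table of contents from the headings found in the text,
# using two spaces of indent per level below the title level.
def toc(text)
  text.each_line
      .map { |line| parse_heading(line) }
      .compact
      .map { |h| "#{'  ' * (h.depth - 1)}#{h.title}" }
end

sample = <<~SST
  A~ Book Title

  1~ First chapter

  Some paragraph text.

  2~ A section
SST
puts toc(sample)
```

Each heading's parent is the nearest preceding heading of lesser depth, which
is all the information needed to derive the node, parent and children
relationships.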

***  font face NOTE open & close marks, inline within paragraph

  * emphasize '*{ ... }*' NOTE configure whether bold, italics or underscore, default bold
  * bold '!{ ... }!'
  * italics '/{ ... }/'
  * underscore '_{ ... }_'
  * superscript '^{ ... }^'
  * subscript ',{ ... },'
  * strike '-{ ... }-'
  * add '+{ ... }+'
  * monospace '#{ ... }#'
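
The open and close marks above lend themselves to simple pattern matching. The
following Ruby sketch rewrites them as HTML tags; the tag mapping and the name
render_font_faces are assumptions for illustration and do not describe SiSU's
own HTML output.

```ruby
# Assumed mapping from the inline marks listed above to HTML tags; the tag
# choices are illustrative and are not taken from SiSU's own output.
FONT_FACE_TAGS = {
  '*' => 'em',   # emphasize (rendering is configurable, see the note above)
  '!' => 'b',    # bold
  '/' => 'i',    # italics
  '_' => 'u',    # underscore
  '^' => 'sup',  # superscript
  ',' => 'sub',  # subscript
  '-' => 'del',  # strike
  '+' => 'ins',  # add
  '#' => 'code'  # monospace
}.freeze

# Replace each "<mark>{ text }<mark>" span with the corresponding tag pair.
# Marks are handled one after another, so simple nesting also works.
def render_font_faces(text)
  FONT_FACE_TAGS.reduce(text) do |out, (mark, tag)|
    m = Regexp.escape(mark)
    out.gsub(/#{m}\{\s*(.+?)\s*\}#{m}/) { "<#{tag}>#{Regexp.last_match(1)}</#{tag}>" }
  end
end

puts render_font_faces('SiSU markup is !{ minimal }! and /{ readable }/.')
```

Note that in Ruby source the monospace mark has to be written in a
single-quoted string, since a hash sign followed by an opening brace starts
interpolation in double-quoted strings.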

*** para

NOTE paragraph controls are at the start of a paragraph
  * a para is a block of text separated from others by an empty line
  * indent
    * default, all '_1 ' up to '_9 '
    * first line hang '_1_0 '
    * first line indent further '_0_1 '
  * bullet
    [levels 1-6]
      '_* '
      '_1* '
      '_2* '
  * numbered list
    [levels 1-3]
      '# '

*** blocks

NOTE text blocks that are not to be treated in the way that ordinary paragraphs would be
  * code
    * [type of markup if any]
  * poem
  * group
  * alt
  * tables

*** notes (footnotes/ endnotes)

 NOTE inline within paragraph at the location where the note reference is to occur
  * footnotes '~{ ... }~'
  * [bibliography] [NB N/A not implemented]
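
Since notes are written inline at the point of reference, a processor has to
lift them out of the paragraph and leave a reference behind. A minimal Ruby
sketch of that step follows; extract_footnotes and the "[n]" reference style
are hypothetical and are not SiSU's implementation.

```ruby
# Hypothetical sketch: pull '~{ ... }~' notes out of a paragraph, leaving a
# numbered reference in the text. The "[n]" reference style is an assumption.
# Returns the cleaned paragraph and the list of note texts, in order.
def extract_footnotes(paragraph, start: 1)
  notes = []
  body = paragraph.gsub(/\s*~\{\s*(.+?)\s*\}~/) do
    notes << Regexp.last_match(1)
    "[#{start + notes.size - 1}]"
  end
  [body, notes]
end

body, notes = extract_footnotes('A claim~{ its source }~ and another ~{ a second source }~.')
puts body
notes.each_with_index { |note, i| puts "[#{i + 1}] #{note}" }
```

Keeping the notes in a separate list, tied to the paragraph they came from,
matches the description of notes as being numbered separately from objects.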

*** links, linking

  * links - external, web, url
  * links - internal

*** images [multimedia?]

  * images
  * [base64 inline] [N/A not implemented]

*** object numbers

  * ocn (object numbers)
    automatically attributed to substantive objects, paragraphs, tables, blocks, verse (unless exclude marker provided)

*** contents

  * toc (table of contents)
    autogenerated from structure/headings information
  * index (book index)
    built from hints given on a new line following a paragraph and starting with ={}; there are rules identifying main and subsidiary text

*** breaks
  * line break ' \\ ' inline
  * page break, column break ' -\\- ' start of line, breaks a column, starts a new column, if using columns, else breaks the page, starts a new page.
  * page break, page new ' =\\= ' start of line, breaks the page, starts a new page.
  * horizontal rule '-..-' start of line, draws a line across the page (dividing paragraphs)

*** book type index

built from hints given on a new line following a paragraph and starting with
={}; there are rules identifying main and subsidiary text

#% comment
  * comment

#% misc
  * term & definition

** syntax highlighting                                  :syntax:highlighting:

*** vim

data/sisu/conf/editor-syntax-etc/vim/
data/sisu/conf/editor-syntax-etc/vim/syntax/sisu.vim

*** emacs

data/sisu/conf/editor-syntax-etc/emacs/
data/sisu/conf/editor-syntax-etc/emacs/sisu-mode.el

* todo

sisu_todo.org