Extract/Transform Document Structure API

This API was added in Collabora Online version 24.04.5.3

Collabora Online allows you to extract the structure of a Writer or an Impress document, and transform it. The structure is extracted as a JSON document for easy parsing with JavaScript and other languages. The transformation is done following a recipe that is also uploaded as a JSON document.

Currently, Collabora Online supports these features only in:

Writer documents:

  • Content controls

  • Charts

  • Document properties

Impress documents (presentations):

  • Slides

Important

Some of the JSON data is actually not valid JSON. This will be fixed soon.

Data types

Further on we’ll use the following conventions to specify the type of values and data.

  • <num> An integer number value.

  • <string> A string value. When it is the value in JSON it is surrounded by quotes ".

  • <text> A text value.

  • <boolean> A boolean value. The literals are true and false.

  • [<TYPE>,] An array of <TYPE> of any length. <TYPE> can be any of the known data types.

  • [<TYPE>,<TYPE>] A fixed-size array of <TYPE> values. Example: [<num>,<num>] for an array of two integers.

  • <value> A JSON value of any type.

  • NONE There is no value. To satisfy JSON syntax substitute with "".

  • | Indicates alternatives when several types are possible.

Not all types are valid in every situation, but when used as a JSON value, the proper JSON syntax should be used: strings and text are quoted, arrays are JSON arrays, etc.

General Usage

This API is accessed through HTTP POST: the document and, for a transform, the transform recipe are sent to the Collabora Online server, and the result is returned. The examples here are shown both with curl, as a simple tool to interact with the server, and as a web form. Instead of curl, any HTTP client can be used.

No specific configuration is necessary.

General Extraction

The extraction API works simply by sending the document and setting an optional filter that specifies the data you are interested in. In return you receive a JSON document describing the document.

API: HTTP POST to /cool/extract-document-structure

Example:

curl -k -F "data=@test.docx" -F "filter=contentcontrol" https://localhost:9980/cool/extract-document-structure > out.json

Important

The -k in this example disables certificate validation, as it is unlikely that you have a valid certificate in a test setup. On a production system, please make sure you have valid certificates and have not disabled validation with -k.

Or in HTML:

<form action="https://localhost:9980/cool/extract-document-structure" enctype="multipart/form-data" method="post">
  File: <input type="file" name="data"><br/>
  Filter: <input type="text" name="filter" value="contentcontrol"><br/>
  <input type="submit" value="Extract">
</form>

Query parameter

Description

data

The document file itself in the payload.

filter

The filter parameter (optional, recommended, see note) sets what to extract.

Currently supported filter values are:

  • contentcontrol: data for the content controls

  • charts: data for the charts

  • docprops: document properties

  • slides: presentation content

lang

The language parameter (optional) sets the default format language, which is useful for date type cells. If passed, the document is loaded with this language, and it determines the display/output format. Example: lang=fr-FR

Attention

Without a filter set, the API extracts everything. Avoid using it without a filter: as the API is expanded, the extracted data will expand too, which may cause unexpected problems.
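As a rough Python counterpart to the curl example, the sketch below builds the same multipart/form-data request using only the standard library. The helper names build_multipart and extract_structure are hypothetical and not part of the API. Unlike curl -k, urllib validates certificates by default. Because some of the returned data may not be valid JSON (see the note above), the final json.loads call can fail on some documents.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields and one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode("utf-8")
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def extract_structure(server, path, filter_name="contentcontrol"):
    """POST a document to the extract endpoint and return the parsed JSON."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart(
            {"filter": filter_name}, "data", os.path.basename(path), fh.read()
        )
    request = urllib.request.Request(
        f"{server}/cool/extract-document-structure",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call could look like extract_structure("https://localhost:9980", "test.docx", "contentcontrol"), assuming the server presents a certificate that the client trusts.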

Items

The extracted document content JSON is structured in the following way. The top-level object contains a DocStructure object, which in turn contains a set of objects: the items.

{
    "DocStructure": {
        "Charts.ByEmbedIndex.0": {
            "..."
        },
        "ContentControls.ByIndex.0": {
            "..."
        },
        "DocumentProperties": {
            "..."
        }
    }
}

Each item is addressed by a selector.

Item selectors are used to represent an item in the document structure. They are specified by a string composed of up to three dot separated components.

  1. The type of the object. Currently possible values are Charts, ContentControls and DocumentProperties. DocumentProperties is a singleton, so it needs no addressing method or index.

  2. The addressing method. ByIndex, ById, ByAlias and ByTag are valid for ContentControls; ByEmbedIndex, ByEmbedName, ByTitle and BySubTitle are valid for Charts.

  3. The “index”, the parameter to the addressing. It is a string or a number depending on the method.

Examples:

Charts.ByTitle.Untitled Chart
Charts.ByEmbedIndex.2
ContentControls.ByTag.machine-readable
ContentControls.ByIndex.4

Each item is described by properties. The property names and their values are described in the corresponding section. They can be numbers, strings of text, arrays, or other objects. A transform changes these properties.
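The selector format described above can be split into its components with a few lines of Python. This is only a sketch: parse_selector is a hypothetical helper, and it assumes that everything after the second dot belongs to the index, so that a title containing dots stays in one piece.

```python
def parse_selector(selector):
    """Split an item selector into (item_type, method, index).

    Missing components are returned as None; the index of a numeric
    addressing method is converted to int.
    """
    parts = selector.split(".", 2)  # at most three components
    item_type = parts[0]
    method = parts[1] if len(parts) > 1 else None
    index = parts[2] if len(parts) > 2 else None
    if index is not None and method in ("ByIndex", "ById", "ByEmbedIndex"):
        index = int(index)
    return item_type, method, index


print(parse_selector("Charts.ByTitle.Untitled Chart"))
print(parse_selector("ContentControls.ByIndex.4"))
print(parse_selector("DocumentProperties"))
```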

General Transform

Much like the extract API, the transform API works simply by sending the document and, in addition, a transform recipe. The result is a new document.

API: HTTP POST to /cool/transform-document-structure/<format>&<lang=xx-XX>

The format and lang parameters can be omitted:

API: HTTP POST to /cool/transform-document-structure

Tip

You can provide format as another parameter. If no format is defined, the input file format is used.

Example:

curl -v -k -F "data=@test.docx" -F "format=docx" -F "transform=$(cat transform.JSON)" https://localhost:9980/cool/transform-document-structure > out.docx
curl -k -F "data=@test.docx" -F "format=odt" -F "transform={\"Transforms\":{\"ContentControls.ByIndex.1\":{\"content\":\"Short text\"},\"ContentControls.ByIndex.6\":{\"content\":\"5/14/2024\",\"alias\":\"date\"}}}" https://localhost:9980/cool/transform-document-structure > out.odt

Or in HTML:

<form action="https://localhost:9980/cool/transform-document-structure" enctype="multipart/form-data" method="post">
  File: <input type="file" name="data"><br/>
  Format: <input type="text" name="format"><br/>
  Transform: <input type="text" name="transform"><br/>
  <input type="submit" value="Transform">
</form>

Query parameter

Description

data

The document file itself in the payload.

format

The format parameter (optional) is the format of the output document file. e.g. odt, docx

transform

The transform parameter (required) is a JSON-formatted string that contains the transformation commands.

lang

The language parameter (optional) sets the default format language, which is useful for date type cells. If passed, the document is loaded with this language, and it determines the display/output format. Example: lang=fr-FR

Commands

The JSON structure for transformations is close to the extracted document structure. The top-level object contains a Transforms object, which itself contains a set of objects: the transform commands.

Each command addresses items by a selector and contains properties describing the actions to take on those items. The transform commands are executed in the order they are listed in the transform JSON. An item selector (for example Charts.ByTitle.CommonName) may match more than one item, and thus may transform more than one item. In that case the items are handled one after the other, based on an item order (ByIndex, ByEmbedIndex, etc.). Once all of the matching items are processed, the next command is run, and so on. This behaviour can be useful in many cases, but beware of side effects in complex cases: a command may not only overwrite the results of previous commands, but also alter the effect of later commands.
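Because the order of the commands matters, a client that generates the transform JSON needs to keep that order. The sketch below shows one way to do this in Python, where dictionaries keep insertion order and json.dumps writes keys in that order. The two commands are shortened versions of the content control example and serve only as an illustration.

```python
import json

# Python dicts keep insertion order, and json.dumps writes keys in that
# order, so the commands appear in the JSON in the order they were added.
commands = {}
commands["ContentControls.ByIndex.1"] = {"content": "Short text"}
commands["ContentControls.ByTag.datetag"] = {"content": "5/14/2024"}

transform_json = json.dumps({"Transforms": commands}, ensure_ascii=False)
print(transform_json)
```

The resulting string can be sent as the transform parameter. Avoid options such as sort_keys=True, which would reorder the commands.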

Content Controls

Content controls are items in a document that have a special UI optimized for user input, primarily for form filling purposes (e.g. date-picker, checkbox, combo-box, …). This API can extract or overwrite the values of these content control items.

Extract

Use filter=contentcontrol to extract the content controls.

Example output (pretty printed):

{
    "DocStructure": {
        "ContentControls.ByIndex.0": {
            "id": -428815899,
            "tag": "machine-readable",
            "alias": "Human Readable",
            "content": "some text",
            "type": "plain-text"
        },
        "ContentControls.ByIndex.1": {
            "id": -325123259,
            "tag": "",
            "alias": "cb1",
            "content": "☒",
            "type": "checkbox",
            "Checked": "true"
        },
        "ContentControls.ByIndex.2": {
            "id": -1652436866,
            "tag": "",
            "alias": "",
            "content": "text1",
            "type": "drop-down-list"
        },
        "ContentControls.ByIndex.3": {
            "id": 659509202,
            "tag": "",
            "alias": "date",
            "content": "7\/7\/2024",
            "type": "date",
            "DateFormat": "M\/d\/yyyy",
            "DateLanguage": "en-US",
            "CurrentDate": "2024-07-07T00:00:00Z"
        }
    }
}

Each control is defined by a set of properties. id, tag, alias, content and type are present for every control:

Property

Description

id

ID number of the content control; it may not be unique.

alias

Users may (or may not) set a short alias text to identify the content control.

tag

Longer text that may (or may not) be generated by some software, or set by users.

content

Its content as string

type

Type of the content control. See table below.

checked

Only in case of a checkbox, its state, “true” or “false”

DateFormat

Only in case of a date, format of the date like “M/d/yyyy”

DateLanguage

Only in case of a date, language of the date like “en-US”

CurrentDate

Only in case of a date, ISO 8601 formatted date value, independent from language.

Content control types:

Type

Content properties

plain-text

content is the text

rich-text

The text content is extracted without format.

checkbox

  • checked: “true” or “false”

  • content: “☒” or “☐”

The checked and content properties are in sync

drop-down-list

content is the selected item text

combo-box

date

  • content: the text representation of the date

  • DateFormat: the text of the date format

  • DateLanguage: the language used for the date format

  • CurrentDate: the date value in ISO 8601 format

picture

not supported yet
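The extracted properties can be turned into native values on the client side. The Python sketch below works on a trimmed copy of the example output shown above; summarize_controls is a hypothetical helper. It accepts both the "Checked" spelling used in the example output and the "checked" spelling used in the property table, since this section shows both.

```python
from datetime import datetime

# Trimmed copy of the example extract output shown above.
extracted = {
    "DocStructure": {
        "ContentControls.ByIndex.1": {
            "id": -325123259, "tag": "", "alias": "cb1",
            "content": "☒", "type": "checkbox", "Checked": "true",
        },
        "ContentControls.ByIndex.3": {
            "id": 659509202, "tag": "", "alias": "date",
            "content": "7/7/2024", "type": "date",
            "DateFormat": "M/d/yyyy", "DateLanguage": "en-US",
            "CurrentDate": "2024-07-07T00:00:00Z",
        },
    }
}


def summarize_controls(doc):
    """Return {selector: python_value} for checkbox and date controls."""
    result = {}
    for selector, props in doc["DocStructure"].items():
        if not selector.startswith("ContentControls."):
            continue
        if props["type"] == "checkbox":
            # Accept either spelling of the key (see the lead-in).
            state = props.get("Checked", props.get("checked"))
            result[selector] = (state == "true")
        elif props["type"] == "date" and "CurrentDate" in props:
            # CurrentDate is language independent, unlike content.
            result[selector] = datetime.strptime(
                props["CurrentDate"], "%Y-%m-%dT%H:%M:%SZ"
            ).date()
    return result


summary = summarize_controls(extracted)
print(summary)
```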

Transform

Example Transform:

{
    "Transforms": {
        "ContentControls.ByIndex.1": {
            "content":"Short text"
        },
        "ContentControls.ByTag.datetag": {
            "content":"5/14/2024",
            "date":"2024-05-14",
            "alias":"date"
        }
    }
}

To select which content controls you want to transform, you can use the following selectors:

Selector

Description

ByIndex.<num>

Index of the content control, counted from 0. It is always unique.

ById.<num>

ID number of content control, it may not be unique

ByAlias.<string>

Users may (or may not) set a short text alias to identify the content control.

ByTag.<string>

Longer text that may (or may not) be generated by some software, or set by users.

Note

Most of the selectors may identify more than one content control. In that case, transform will change each of the selected content controls.

To transform a content control you can set the following properties:

Property

Value

content

<text>

checked

<boolean>

alias

<string>

date

<string> Only for date content controls. The date value. Does not change the displayed text. Always use ISO 8601 format like “2024-03-22”.

content is a <string>; it can be used to set the data of any type of content control. “picture” controls are not yet supported.

In case of a checkbox:

  • Setting content to “☐” will also set the checked to false.

  • Setting checked to true will also set content to “☒”

And vice-versa.

In case of a date:

  • Setting content only sets the displayed text, like “1999.dec.12”; the actual date value will not be changed.

  • Setting date only sets date value but does not change the displayed text.

In case of a rich-text control, only unformatted text can be set, as if it were plain-text.

Transform makes the changes to content controls in the same order as the commands are listed in the transform JSON string. In case of the example transformation:

First it sets the content control that has an index of 1, setting its text to “Short text”.
Then it finds all the content controls whose tag is equal to datetag and sets their displayed text to “5/14/2024”, their date value to 2024-05-14, and their alias to “date”.

Note

If you set content to an empty string, it may change back to Placeholder value. (e.g. “Click here to enter text” or “Choose a date” or “Choose an item”…)
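Since content and date are independent for date controls, a client usually wants to set both together. The Python sketch below shows one way to generate such command properties. date_command and checkbox_command are hypothetical helpers, and the M/d/yyyy text format is an assumption taken from the example document; a real document may use a different DateFormat. The alias cb1 comes from the extract example above.

```python
import json
from datetime import date


def date_command(value):
    """Command properties that set both the date value and the shown text.

    The text is formatted as M/d/yyyy to match the example document.
    """
    return {
        "content": f"{value.month}/{value.day}/{value.year}",
        "date": value.isoformat(),  # ISO 8601, e.g. 2024-05-14
    }


def checkbox_command(checked):
    """Command properties that set a checkbox state as a JSON boolean."""
    return {"checked": bool(checked)}


transform = {
    "Transforms": {
        "ContentControls.ByTag.datetag": date_command(date(2024, 5, 14)),
        "ContentControls.ByAlias.cb1": checkbox_command(False),
    }
}
print(json.dumps(transform, ensure_ascii=False, indent=4))
```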

Screenshot

../_images/ContentControlScreenshot.png

Example Files

Original Document

Transform JSON

Command for transform:

curl -v -k -F "data=@contentControlsExampleOriginal.odt" -F "transform=$(cat contentControlsTransform.JSON)" https://localhost:9980/cool/transform-document-structure > contentControlsResult.odt

Command for extract:

curl -k -F "data=@contentControlsExampleOriginal.odt" -F "filter=contentcontrol" https://localhost:9980/cool/extract-document-structure > contentControlsOriginalExtract.JSON
curl -k -F "data=@contentControlsExampleResult.odt" -F "filter=contentcontrol" https://localhost:9980/cool/extract-document-structure > contentControlsResultExtract.JSON

Extracted JSON from result odt

Extracted JSON from result odt, Pretty printed

Extracted JSON from original odt, Pretty printed

Transformed Result Document

Screenshot

Charts

Extract

Use filter=charts to extract the charts.

Example output (pretty printed):

{
    "DocStructure": {
        "Charts.ByEmbedIndex.0": {
            "name": "Object1",
            "title": "Paid leave days",
            "subtitle": "Subtitle2",
            "RowDescriptions": [ "James", "Mary", "Patricia", "David"],
            "ColumnDescriptions": [ "2022", "2023"],
            "DataValues": [
                "Row.0": [ "22", "24"],
                "Row.1": [ "18", "16"],
                "Row.2": [ "32", "32"],
                "Row.3": [ "25", "23"]
            ]
        }
    }
}

Data it extracts:

Property

Description

name

Name of the embedded object of the chart, can be used as a filter in transform to select the needed chart.

title

The title of the chart, as a simple string.

subtitle

The Subtitle of the chart, as a simple string.

RowDescriptions

Array of strings, containing the descriptions of the rows.

ColumnDescriptions

Array of strings, containing the descriptions of the columns.

DataValues

Matrix of numbers, containing every cell’s data.

Note

Some of the data values can be “NaN”; this means they are not set.

Transform

Example Transform:

{
    "Transforms": {
        "Charts.ByEmbedIndex.0": {
            "modifyrow.1": [ 19, 15 ],
            "datayx.3.1": 37,
            "deleterow.0": "",
            "insertrow.0": [ 15, 17 ],
            "setrowdesc.0": "Paul",
            "insertcolumn.1": [ 1,2,3,4,5,6 ],
            "setcolumndesc.0": "c0",
            "deletecolumn.3": ""
        },
        "Charts.ByEmbedName.Object3": {
            "resize": [ 3, 3 ],
            "setrowdesc": [ "a", "b", "c"]
        },
        "Charts.ByTitle.Fixed issues": {
            "data": [ [ 3,1 ],
                      [ 2,0,1 ],
                      [ 3 ] ],
            "setrowdesc": ["2023.01",".02",".03"],
            "setcolumndesc": ["Jennifer", "Charles", "Thomas"]
        }
    }
}

To select which chart you want to transform, you can use these selectors:

Selector

Description

ByEmbedIndex.<num>

Index of the embedded object, counted from 0. It is always unique, but the index may reference embedded objects other than charts.

ByEmbedName.<string>

The unique name of the embedded object of the chart.

ByTitle.<string>

Title of the chart. (Title is optional)

BySubTitle.<string>

Subtitle of the chart. (Subtitle is optional)

To transform a chart you can use these commands:

Command

Value

Description

deletecolumn.<num>

NONE

Delete the <num> column

deleterow.<num>

NONE

Delete the <num> row

modifycolumn.<num>

[<num>,]

Set the column <num> data to the values

modifyrow.<num>

[<num>,]

Set the row <num> data to the values

insertcolumn.<num>

NONE

Insert an empty column before column <num>

insertcolumn.<num>

[<num>,]

Insert a column before column <num>, with values

insertrow.<num>

NONE

Insert an empty row before row <num>

insertrow.<num>

[<num>,]

Insert a row before row <num>, with values

setcolumndesc.<num>

<text>

Set column <num> description to <text>

setcolumndesc

[<text>,]

Set the column descriptions to the values, starting from the first column.

setrowdesc.<num>

<text>

Set the row <num> description to <text>

setrowdesc

[<text>,]

Set the row descriptions to the values, starting from the first row.

resize

[<num>,<num>]

Resize the data table to <num> rows and <num> columns. Both numbers are required and must be greater than 1.

datayx.<num>.<num>

<num>

Set the value of the cell row <num> and column <num> to the specified value.

data

[[<num>,],]

Set values of the data table to the values. The table size will grow as needed.

Note

Commands that need an array of values can be used with fewer values than the destination array holds. In that case only the provided elements are changed and the remaining ones are left untouched.
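The chart commands can be generated from tabular data held by the client. The Python sketch below builds the data, setrowdesc and setcolumndesc commands from a list of rows. chart_table_commands is a hypothetical helper, and its checks are local sanity checks chosen for this example rather than rules enforced by the server.

```python
import json


def chart_table_commands(row_names, column_names, rows):
    """Build commands that overwrite a chart's data table and descriptions.

    Key order matters because commands run in the order they are listed:
    the data is written first, then the descriptions.
    """
    if len(rows) != len(row_names):
        raise ValueError("need exactly one name per row")
    for row in rows:
        if len(row) > len(column_names):
            raise ValueError("row has more values than there are column names")
    return {
        "data": [list(row) for row in rows],
        "setrowdesc": list(row_names),
        "setcolumndesc": list(column_names),
    }


transform = {
    "Transforms": {
        "Charts.ByEmbedIndex.0": chart_table_commands(
            ["James", "Mary"], ["2022", "2023"], [[22, 24], [18, 16]]
        )
    }
}
transform_json = json.dumps(transform)
print(transform_json)
```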

Screenshot

../_images/ChartsScreenshot.png

Example Files

Original Document

Transform JSON

Command for transform:

curl -v -k -F "data=@docStructureChartExampleOriginal.odt" -F "transform=$(cat ChartsTransform.JSON)" https://localhost:9980/cool/transform-document-structure > docStructureChartResult.odt

Command for extract:

curl -k -F "data=@docStructureChartExampleOriginal.odt" -F "filter=charts" https://localhost:9980/cool/extract-document-structure > ChartsExtractOriginal.JSON

Extracted JSON

Extracted JSON Pretty printed

Transformed Result Document

Screenshot

Document Properties

You can extract and modify document properties. These properties include meta data and statistics. You can add arbitrary named meta data properties. Most statistics are recalculated when the document is opened and can’t be modified.

Extract

Use filter=docprops to extract the document properties.

Example output (pretty printed):

{
    "DocStructure": {
        "DocumentProperties": {
            "Author": "Author TxT",
            "Generator": "Generator TxT",
            "CreationDate": "2024-01-21T14:45:00",
            "Title": "Title TxT",
            "Subject": "Subject TxT",
            "Description": "Description TxT",
            "Keywords": [ ],
            "Language": "en-GB",
            "ModifiedBy": "ModifiedBy TxT",
            "ModificationDate": "2024-05-23T10:05:50.159530766",
            "PrintedBy": "PrintedBy TxT",
            "PrintDate": "0000-00-00T00:00:00",
            "TemplateName": "TemplateName TxT",
            "TemplateURL": "TemplateURL TxT",
            "TemplateDate": "0000-00-00T00:00:00",
            "AutoloadURL": "",
            "AutoloadSecs": 0,
            "DefaultTarget": "DefaultTarget TxT",
            "DocumentStatistics": {
                "PageCount": 300,
                "TableCount": 60,
                "ImageCount": 10,
                "ObjectCount": 0,
                "ParagraphCount": 2880,
                "WordCount": 78680,
                "CharacterCount": 485920,
                "NonWhitespaceCharacterCount": 411520
            },
            "EditingCycles": 12,
            "EditingDuration": 12345,
            "Contributor": [ "Contributor1 TxT", "Contributor2 TXT"],
            "Coverage": "Coverage TxT",
            "Identifier": "Identifier TxT",
            "Publisher": [ "Publisher TxT", "Publisher2 TXT"],
            "Relation": [ "Relation TxT", "Relation2 TXT"],
            "Rights": "Rights TxT",
            "Source": "Source TxT",
            "Type": "Type TxT",
            "UserDefinedProperties": {
                "NewPropName Bool": {
                    "type": "boolean",
                    "value": true
                },
                "NewPropName Numb": {
                    "type": "long",
                    "value": 1245
                },
                "NewPropName Str": {
                    "type": "string",
                    "value": "this is a string"
                },
                "NewPropName float": {
                    "type": "float",
                    "value": 12.45
                }
            }
        }
    }
}

The following properties are extracted:

Property

Description

Author

The name of the user who first saved the file.

Generator

Identifies which application was used to create or last modify the document.

CreationDate

The date and time when file was first saved.

Title

Title of the document.

Subject

Subject of the document. Can be used to group documents with similar contents.

Description

Comments to help identify the document.

Keywords

[string,] Words used to index the content of the document. Can contain white spaces.

Language

The default language of the document.

ModifiedBy

The name of the user who last saved the file in a LibreOffice file format.

ModificationDate

The date and time when the file was last saved in a LibreOffice file format.

PrintedBy

The name of the user who last printed the file.

PrintDate

The date and time when the file was last printed.

TemplateName

The template that was used to create the file.

TemplateURL

The URL of the template from which the document was created. The value is an empty string if the document was not created from a template or if it was detached from the template.

TemplateDate

The date and time of when the document was created or updated from the template.

AutoloadURL

The URL to load automatically at a specified time after the document is loaded into a desktop frame. An empty URL is valid and describes a case where the document shall be reloaded from its original location. An empty URL together with an AutoloadSecs value of 0 describes a case where no autoload is specified.

AutoloadSecs

The number of seconds after which a specified URL is to be loaded after the document is loaded into a desktop. A value of 0 is valid and describes a redirection. A value of 0 together with an empty string as AutoloadURL describes a case where no autoload is specified.

DefaultTarget

The name of the default frame into which links should be loaded if no target is specified.

DocumentStatistics

Statistics about the document, as separate properties. They will be recalculated and overwritten at document open.

  • PageCount

  • TableCount

  • ImageCount

  • ObjectCount

  • ParagraphCount

  • WordCount

  • CharacterCount

  • NonWhitespaceCharacterCount

EditingCycles

The number of times that the file has been saved.

EditingDuration

The amount of time that the file has been open for editing since the file was created. The editing time is updated when the file is saved.

Contributor

[string,] Names of the people, organizations, or other entities that have made contributions to the document.

Coverage

Time, place, or jurisdiction that the document is relevant to. For example, a range of dates, a place, or an institution that the document applies to.

Identifier

Some unique identifier like ISBN.

Publisher

[string,] Name of the entity that is making the document available. For example, a company, university, or government body.

Relation

[string,] Resources related to the document. For example, a set of volumes the document is part of, or the document’s edition number.

Rights

Intellectual property rights associated with the document. For example, a copyright statement, or information about who has permission to access the document.

Source

Information about other resources from which the document is derived. For example, the name or identifier of a hard copy that the document was scanned from, or a URL that the document was downloaded from.

Type

Information about the category or format of the document. For example, whether the document is a text document, image, or multimedia presentation.

UserDefinedProperties

List of user-defined properties. Date/time related types are not supported yet: their names and types will be extracted, but their values will not.

Note

Extraction of UserDefinedProperties may retrieve other types, depending on what types of document properties the document has. Unfortunately, different parts of LibreOffice have slightly different limitations for these types:

  • In recent LibreOffice versions, these 6 types are available in the UI dialog: string, boolean, double, com.sun.star.util.Date, com.sun.star.util.DateTime, and com.sun.star.util.Duration

  • There are other ways to create document properties, which can add different types. Older versions of LibreOffice probably allowed adding different (deprecated) types as well, which can still be extracted from old documents.

  • Unfortunately, the exact limitations on the possible document property types aren’t well documented. Checking the source code (where a property is added) hints at what types can be expected in some special cases. Seven more types have been identified: typelib_TypeClass_FLOAT, typelib_TypeClass_HYPER, typelib_TypeClass_LONG, typelib_TypeClass_SHORT, Time, DateTimeWithTimezone, and DateWithTimezone.
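The date properties in the example output, such as CreationDate and ModificationDate, can carry more fractional digits than some date parsers accept. The Python sketch below parses them by hand. parse_property_date is a hypothetical helper, and treating an all-zero date such as "0000-00-00T00:00:00" as "not set" is an assumption based on the example output, not documented behaviour.

```python
from datetime import datetime


def parse_property_date(text):
    """Parse a date string such as CreationDate into a datetime, or None.

    Assumption: an all-zero date ("0000-00-00T00:00:00") means "not set".
    """
    if text.startswith("0000-00-00"):
        return None
    main, _, fraction = text.partition(".")
    parsed = datetime.strptime(main, "%Y-%m-%dT%H:%M:%S")
    if fraction:
        # Keep microsecond precision; any extra digits are dropped.
        micro = int(fraction[:6].ljust(6, "0"))
        parsed = parsed.replace(microsecond=micro)
    return parsed


print(parse_property_date("2024-05-23T10:05:50.159530766"))
print(parse_property_date("0000-00-00T00:00:00"))
```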

Transform

Example Transform:

{
    "Transforms": {
        "DocumentProperties": {
            "Author":"Author TxT",
            "Generator":"Generator TxT",
            "CreationDate":"2024-01-21T14:45:00",
            "Title":"Title TxT",
            "Subject":"Subject TxT",
            "Description":"Description TxT",
            "Keywords": [ ],
            "Language":"en-GB",
            "ModifiedBy":"ModifiedBy TxT",
            "ModificationDate":"2024-05-23T10:05:50.159530766",
            "PrintedBy":"PrintedBy TxT",
            "PrintDate":"0000-00-00T00:00:00",
            "TemplateName":"TemplateName TxT",
            "TemplateURL":"TemplateURL TxT",
            "TemplateDate":"0000-00-00T00:00:00",
            "AutoloadURL":"",
            "AutoloadSecs": 0,
            "DefaultTarget":"DefaultTarget TxT",
            "DocumentStatistics": {
                "PageCount": 300,
                "TableCount": 60,
                "ImageCount": 10,
                "ObjectCount": 0,
                "ParagraphCount": 2880,
                "WordCount": 78680,
                "CharacterCount": 485920,
                "NonWhitespaceCharacterCount": 411520
            },
            "EditingCycles":12,
            "EditingDuration":12345,
            "Contributor":["Contributor1 TxT","Contributor2 TXT"],
            "Coverage":"Coverage TxT",
            "Identifier":"Identifier TxT",
            "Publisher":["Publisher TxT","Publisher2 TXT"],
            "Relation":["Relation TxT","Relation2 TXT"],
            "Rights":"Rights TxT",
            "Source":"Source TxT",
            "Type":"Type TxT",
            "UserDefinedProperties":{
                "Add.NewPropName Str": {
                    "type": "string",
                    "value": "this is a string"
                },
                "Add.NewPropName Str": {
                    "type": "boolean",
                    "value": false
                },
                "Add.NewPropName Bool": {
                    "type": "boolean",
                    "value": true
                },
                "Add.NewPropName Numb": {
                    "type": "long",
                    "value": 1245
                },
                "Add.NewPropName float": {
                    "type": "float",
                    "value": 12.45
                },
                "Add.NewPropName Double": {
                    "type": "double",
                    "value": 124.578
                },
                "Delete": "NewPropName Double"
            }
        }
    }
}

To transform a document property, use the same names as in the extracted data. There are some additional commands for UserDefinedProperties to add and remove properties:

Command

Description

Delete

<string> It will delete the user defined property.

Add.<string>

{"type":<string>,"value":<value>} Adds a new named user-defined property, overwriting the existing value when it is already present. The new property has a type and a value. Types are limited: string, boolean, long, float. Date/time related types are not supported.

Note

Some property values are overwritten when the document is opened:

  • ModifiedBy and ModificationDate are overwritten by any save. (That is why they have wrong values in the screenshot.)

  • DocumentStatistics are recalculated and overwritten when the document is opened, but they are not recalculated on extract.
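The Add and Delete commands for UserDefinedProperties can be generated from ordinary values. The Python sketch below maps Python types to the type names listed in the command table (string, boolean, long, float). user_property_commands is a hypothetical helper, and mapping every Python int to long and every Python float to float is a simplification chosen for this example.

```python
def user_property_commands(values, delete=None):
    """Build the UserDefinedProperties object for a transform.

    `values` maps property names to Python values; `delete` is an optional
    name of a property to remove.
    """
    commands = {}
    for name, value in values.items():
        # bool must be tested before int because bool is a subclass of int.
        if isinstance(value, bool):
            type_name = "boolean"
        elif isinstance(value, int):
            type_name = "long"
        elif isinstance(value, float):
            type_name = "float"
        elif isinstance(value, str):
            type_name = "string"
        else:
            raise TypeError(f"unsupported property type: {type(value).__name__}")
        commands[f"Add.{name}"] = {"type": type_name, "value": value}
    if delete is not None:
        commands["Delete"] = delete
    return commands


print(user_property_commands({"Reviewed": True, "Pages": 3}, delete="Draft"))
```

The returned object would be placed under "UserDefinedProperties" inside the "DocumentProperties" command of the transform.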

Screenshot

../_images/DocPropertiesExampleScreenShot.png

Example Files

Original Document

Transform JSON

Command for transform:

curl -v -k -F "data=@docStructureChartExampleOriginal.odt" -F "transform=$(cat DocPropTransform.JSON)" https://localhost:9980/cool/transform-document-structure > DocPropResult.odt

Command for extract:

curl -k -F "data=@temp2.odt" -F "filter=docprops" https://localhost:9980/cool/extract-document-structure > DocPropExtract.JSON

Extracted JSON

Extracted JSON Pretty printed

Transformed Result Document

Screenshot

Slides

Can be used only on Impress documents (presentations).

Slides are individual pages in presentations that can contain various elements, including text, images, videos, audio, shapes, and more. Master slides are template pages used for creating slides. Layouts are templates for the elements on a slide: type, position, size.

This API can extract the slides and master slides structure, and transform slides in the document. It can create, delete and reorder slides, change their layout, and change text of text based elements.

Extract

Use filter=slides to extract the slides.

Example output (pretty printed):

{
    "DocStructure": {
        "SlideCount": 7,
        "MasterSlideCount": 8,
        "MasterSlides": [
            "MasterSlide 0": {
                "Name": "Topic_Separator_Purple"
            },
            "MasterSlide 1": {
                "Name": "Content_sidebar_White"
            },
            "MasterSlide 2": {
                "Name": "Topic Separator white"
            },
            "MasterSlide 3": {
                "Name": "Content_sidebar_White_"
            },
            "MasterSlide 4": {
                "Name": "Topic_Separator_Purple_"
            },
            "MasterSlide 5": {
                "Name": "Content_White_Purple_Sidebar"
            }
        ],
        "Slides": [
            "Slide 0": {
                "SlideName": "Slide3-Renamed",
                "MasterSlideName": "Content_White_Purple_Sidebar",
                "LayoutId": 3,
                "LayoutName": "AUTOLAYOUT_TITLE_2CONTENT",
                "ObjectCount": 4,
                "Objects": [
                    "Objects 0": {
                        "TextCount": 1,
                        "Texts": [
                            "Text 0": {
                                "ParaCount": 1,
                                "Paragraphs": [
                                    "Friendly Open Source Project"
                                ]
                            }
                        ]
                    },
                    "Objects 1": {},
                    "Objects 2": {
                        "TextCount": 1,
                        "Texts": [
                            "Text 0": {
                                "ParaCount": 9,
                                "Paragraphs": [
                                    "Real Open Source",
                                    "100% open-source code",
                                    "Built with LibreOffice technology",
                                    "Built with Free Software technology stacks: primarily C++",
                                    "Runs best on Linux",
                                    "Open Development",
                                    "Anyone can contribute & participate",
                                    "Follow commits and tickets",
                                    "Public community calls - forum has details"
                                ]
                            }
                        ]
                    },
                    "Objects 3": {
                        "TextCount": 1,
                        "Texts": [
                            "Text 0": {
                                "ParaCount": 5,
                                "Paragraphs": [
                                    "Focus:",
                                    "a non-renewable resource.",
                                    "Office Productivity & Documents",
                                    "Excited about migrating your\u0001documents",
                                    "Grateful to our partners for solving\u0001other problems."
                                ]
                            }
                        ]
                    }
                ]
            },
            "Slide 1": {
                "SlideName": "Slide 2",
                "MasterSlideName": "Topic_Separator_Purple",
                "LayoutId": 3,
                "LayoutName": "AUTOLAYOUT_TITLE_2CONTENT",
                "ObjectCount": 1,
                "Objects": [
                    "Objects 0": {
                        "TextCount": 1,
                        "Texts": [
                            "Text 0": {
                                "ParaCount": 3,
                                "Paragraphs": [
                                    "Collabora Online",
                                    "",
                                    "Powerful Online Collaboration"
                                ]
                            }
                        ]
                    }
                ]
            }
        ]
    }
}

Extracted properties from the Impress presentation:

Property

Description

SlideCount

Number of slides in the presentation.

MasterSlideCount

Number of master slides in the presentation. Master slides are real pages in the presentation that are used only as templates for slides.

MasterSlides

List of all the master slides, and some of their data. Currently only their name and ID are extracted.

Slides

List of all the slides, and some of their data. See table below.

Extracted properties from a slide:

Property

Description

SlideName

Name of the slide. A slide without a unique name is named dynamically, like “Slide 1”, “Slide 2”, etc.

MasterSlideName

Name of the master slide this slide is based on.

LayoutId

The ID number of the actual layout used.

LayoutName

Name of the Layout.

ObjectCount

Number of elements in the slide. An element can be text, image, video, audio, shape and more.

Objects

List of all the elements and some of their data. See table below.

Extracted properties from an object. Currently only text based information can be extracted:

Property

Description

TextCount

Number of texts in this object. Some objects, such as tables, can contain more than one text.

Texts

List of all the texts. See table below.

Extracted properties from a text object. Currently only text-based information can be extracted:

Property

Description

ParaCount

Number of paragraphs in this text object.

Paragraphs

Array of all its paragraphs, as simple strings.
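The sample above uses keyed arrays such as "Slides": [ "Slide 0": { ... } ], which a strict JSON parser rejects. Until the format is fixed, a client may need to repair the text before parsing it. The following is a minimal sketch in Python, assuming the output always has the shape shown in the sample; repair_keyed_arrays and slide_paragraphs are hypothetical helper names, not part of the API.

```python
import json
import re

# A JSON string followed by a colon: the start of a "key": value pair.
_KEYED_ITEM = re.compile(r'\s*"(?:[^"\\]|\\.)*"\s*:')


def repair_keyed_arrays(text: str) -> str:
    """Rewrite [ "key": value, ... ] as { "key": value, ... }.

    Assumption: an array is "keyed" when its first item is a key/value pair.
    Plain arrays such as "Paragraphs" are left untouched.
    """
    out = []
    closers = []  # closing character to emit for each open '['
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif ch == "[":
            keyed = _KEYED_ITEM.match(text, i + 1) is not None
            out.append("{" if keyed else "[")
            closers.append("}" if keyed else "]")
        elif ch == "]":
            out.append(closers.pop())
        else:
            out.append(ch)
    return "".join(out)


def slide_paragraphs(slide: dict) -> list:
    """Collect every paragraph of every text of every object in one slide."""
    return [
        paragraph
        for obj in slide.get("Objects", {}).values()
        for text in obj.get("Texts", {}).values()
        for paragraph in text.get("Paragraphs", [])
    ]
```

After the repair, "Slides", "Objects" and "Texts" are ordinary JSON objects keyed by "Slide 0", "Objects 0", "Text 0" and so on, so json.loads(repair_keyed_arrays(raw)) can be walked like any dictionary. Objects without text, such as "Objects 1": {} in the sample, simply contribute no paragraphs.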

Transform

Example Transform:

{
    "Transforms": {
        "SlideCommands": [
            {"JumpToSlideByName": "Slide 3"},
            {"MoveSlide": 0},
            {"RenameSlide": "Slide3-Renamed"},
            {"DeleteSlide": 2},
            {"JumpToSlide": 2},
            {"DeleteSlide": ""},
            {"JumpToSlide": 1},
            {"DuplicateSlide": ""},
            {"RenameSlide": "Slide1-Duplicated"},
            {"InsertMasterSlide": 1},
            {"RenameSlide": "SlideInserted-1"},
            {"ChangeLayout": 18},
            {"JumpToSlide": "last"},
            {"InsertMasterSlideByName": "Topic Separator white"},
            {"RenameSlide": "SlideInserted-Name"},
            {"ChangeLayoutByName": "AUTOLAYOUT_TITLE_2CONTENT"},
            {"SetText.0": "first"},
            {"SetText.1": "second"},
            {"SetText.2": "third"},
            {"DuplicateSlide": 1},
            {"MoveSlide.2": 6}
        ]
    }
}

There is always a current slide that most commands act on, and some commands change the current slide. By default, the current slide is the slide at index 0.
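Because the commands run in order against the current slide, it can be convenient to generate the recipe from code instead of writing it by hand. A minimal sketch in Python, assuming the recipe layout of the example transform; command and build_transform are hypothetical helper names, and the slide names and texts are made up for illustration.

```python
import json


def command(name, value=""):
    """One slide command as a single-key object; "" stands in for NONE."""
    return {name: value}


def build_transform(commands):
    """Wrap an ordered list of commands in the Transforms/SlideCommands envelope."""
    return json.dumps({"Transforms": {"SlideCommands": list(commands)}}, indent=4)


recipe = build_transform([
    command("JumpToSlide", "last"),          # make the last slide current
    command("DuplicateSlide"),               # NONE: duplicate the current slide
    command("RenameSlide", "Summary-Copy"),  # acts on the new current slide
    command("SetText.0", "Summary"),         # set the text of object 0
])
print(recipe)
```

Keeping the commands in a list rather than a single dictionary preserves their order and allows the same command name to appear more than once.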

To transform a slide you can use these commands:

Command

Value

Description

JumpToSlide

<num> | "last"

Jump to the slide at index, or to the last slide. The index is 0-based. Using "last" is useful for adding new slides at the end.

JumpToSlideByName

<string>

Jump to the named slide. Be careful with default slide names like “Slide 1”. Those names can change during slide deletion or insertion.

InsertMasterSlide

<num>

Insert a new slide after the current slide, based on the master slide at index. Jump to the newly created slide, setting the current slide.

InsertMasterSlideByName

<string>

Insert a new slide after the current slide, based on the named master slide. Jump to the newly created slide, setting the current slide.

DeleteSlide

NONE | <num>

Delete the slide at index or, if none is given, the current slide. If the deleted slide is the current slide, jump to the previous slide, or to the new first slide if it was the first slide. Otherwise the index of the current slide is readjusted as needed so that the current slide is unchanged. As there must always be one slide left in the presentation, the last remaining slide cannot be deleted.

MoveSlide

<num>

Move the current slide to the new position. The index of the current slide is readjusted to follow it.

MoveSlide.<num>

<num>

Move the slide at index to a new position. If the index is that of the current slide, this behaves like the previous command. Otherwise the current slide is unchanged, but its index may be adjusted as needed.

DuplicateSlide

NONE | <num>

Duplicate the slide at index, or if none, the current slide, and jump to this new slide.

ChangeLayout

<num> | <string>

Change the layout of the current slide to the layout with the index or the named layout. For Layout names, you can use these:

  • AUTOLAYOUT_TITLE_CONTENT

  • AUTOLAYOUT_TITLE_CONTENT_OVER_CONTENT

  • AUTOLAYOUT_TITLE_CONTENT_2CONTENT

  • AUTOLAYOUT_TITLE_4CONTENT

  • AUTOLAYOUT_ONLY_TEXT

  • AUTOLAYOUT_TITLE_ONLY

  • AUTOLAYOUT_TITLE_6CONTENT

  • AUTOLAYOUT_TITLE

  • AUTOLAYOUT_TITLE_2CONTENT_CONTENT

  • AUTOLAYOUT_TITLE_2CONTENT_OVER_CONTENT

  • AUTOLAYOUT_TITLE_2CONTENT

  • AUTOLAYOUT_VTITLE_VCONTENT

  • AUTOLAYOUT_VTITLE_VCONTENT_OVER_VCONTENT

  • AUTOLAYOUT_TITLE_VCONTENT

  • AUTOLAYOUT_TITLE_2VTEXT

RenameSlide

<string>

Rename the current slide. Use unique names. Two slides cannot have the same name. Default names like “Slide 1”, “Slide 43”, cannot be set.

SetText.<num>

<text>

Set the text of the object at index to text. Supported only on text-based objects.

MarkObject

<num>

Mark (select) the object at index on the current slide. This allows using UNO commands that work on selected objects.

UnMarkObject

<num>

Unmark (deselect) the object at index on the current slide.

UnoCommand

<string>

Call the UNO command. For example, .uno:DefaultBullet toggles the bullets of the selected paragraphs on or off. There are many more UNO commands, but it has not yet been checked which of them work here. Be careful with these: some may even break the transform mechanism.

Note

Which UNO commands work here, and how to use them, still has to be checked. Some may need parameters. For now, only .uno:DefaultBullet has been checked and tested to work.

To obtain the full list of enabled UNO commands, check sfx2/source/control/unoctitm.cxx, under:

const std::map<std::u16string_view, KitUnoCommand>& GetKitUnoCommandList()
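A misspelled command name is easy to miss in a long recipe, so a client-side check before uploading can help. The sketch below, in Python, only compares names against the commands mentioned in this section (the table plus ChangeLayoutByName from the example transform). It is an assumption about what a useful check looks like, not a description of how the server validates recipes, and unknown_commands is a hypothetical helper name.

```python
import re

# Command names taken from the command table and the example transform.
PLAIN_COMMANDS = {
    "JumpToSlide", "JumpToSlideByName", "InsertMasterSlide",
    "InsertMasterSlideByName", "DeleteSlide", "MoveSlide", "DuplicateSlide",
    "ChangeLayout", "ChangeLayoutByName", "RenameSlide",
    "MarkObject", "UnMarkObject", "UnoCommand",
}
# Commands that carry an index in their name, e.g. SetText.0 or MoveSlide.2.
INDEXED_COMMAND = re.compile(r"(SetText|MoveSlide)\.\d+")


def unknown_commands(recipe: dict) -> list:
    """Return (position, name) for every command name not listed above."""
    problems = []
    for position, entry in enumerate(recipe["Transforms"]["SlideCommands"]):
        for name in entry:
            if name not in PLAIN_COMMANDS and not INDEXED_COMMAND.fullmatch(name):
                problems.append((position, name))
    return problems
```

An empty result means every name is one of those listed here; it says nothing about whether the values are acceptable to the server.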

Screenshot

../_images/SlidesScreenshot.png

Example Files

Original Document

Transform JSON

Command for transform:

curl -v -k -F "data=@SlidesExampleOriginal.odp" -F "transform=$(cat SlidesTransform.JSON)" https://localhost:9980/cool/transform-document-structure > SlidesResult.odp

Command for extract:

curl -k -F "data=@SlidesExampleOriginal.odp" -F "filter=slides" https://localhost:9980/cool/extract-document-structure > SlidesExtractOriginal.JSON

Extracted JSON

Extracted JSON Pretty printed

Transformed Result Document

Screenshot
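The curl transform command above posts a multipart form with two fields, data (the document) and transform (the recipe). As a sketch of the same request from another HTTP client, the Python code below builds it with the standard library only. The function names are hypothetical, and disabling certificate checks mirrors curl's -k flag, which is only appropriate for a local test server.

```python
import json
import ssl
import urllib.request
import uuid


def build_transform_request(server, doc_name, doc_bytes, transform):
    """Build (but do not send) the multipart POST for transform-document-structure."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Field "data": the document itself.
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="data"; filename="{doc_name}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode(),
        doc_bytes,
        b"\r\n",
        # Field "transform": the recipe as JSON text.
        (f"--{boundary}\r\n"
         'Content-Disposition: form-data; name="transform"\r\n\r\n'
         f"{json.dumps(transform)}\r\n").encode(),
        f"--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{server}/cool/transform-document-structure",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def send_unverified(request):
    """Send the request without TLS verification (like curl -k); returns the body."""
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context) as response:
        return response.read()
```

With a server running on https://localhost:9980, the bytes returned by send_unverified(build_transform_request(...)) would be written to a file such as SlidesResult.odp, as the curl command does with its output redirection.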

Version updates

24.04.5.3: Added Content Controls.
24.04.8-1: Updated Content Controls: Added “CurrentDate”, and “date” to get/set language independent date.
24.04.8-1: Added Charts.
24.04.9-1: Added Document Properties.
Not released yet: Added Slides.

Important

The extract-document-structure and transform-document-structure endpoints are restricted to the allowed host addresses that can be set in the coolwsd.xml configuration file. The IP addresses have to be added as dot-escaped net.post_allow.host entries.