Extract/Transform Document Structure API

Requirement: This API was added in Collabora Online version 24.04.5.3

Collabora Online allows you to extract and transform document structure. Currently, these features are supported only in Writer documents, and only for:

  • Content controls

  • Charts

  • Document properties.

General Usage:

General Extract:

API: HTTP POST to /cool/extract-document-structure

Example:

curl -k -F "data=@test.docx" -F "filter=contentcontrol" https://localhost:9980/cool/extract-document-structure > out.json

Note: The -k option in this example disables certificate validation, because a test setup is unlikely to have a valid certificate. On a production system, make sure you have valid certificates and do not disable validation with -k.

  • or in HTML:

<form action="https://localhost:9980/cool/extract-document-structure" enctype="multipart/form-data" method="post">
    File: <input type="file" name="data"><br/>
    Filter: <input type="text" name="filter" value="contentcontrol"><br/>
    <input type="submit" value="Extract">
</form>

  • data: the document file itself in the payload.

  • filter: the filter parameter (optional, recommended, see Note) sets what to extract. contentcontrol, charts, and docprops filters are currently supported.

  • lang: the language parameter (optional) sets the default format language, which is useful for date-type cells. If passed, it is used as the load language and determines the display/output format. Example: lang=fr-FR

Note: Without a filter, the API extracts everything. Do not rely on unfiltered extraction in final (long-term) solutions: when the API is expanded, the extracted data will grow too, which may cause unexpected problems.
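
The curl request above can also be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the function names encode_multipart and extract_structure and the base_url argument are assumptions made for illustration, and it assumes the response body parses as JSON, as the content control example output does.

```python
import json
import os
import ssl
import urllib.request
import uuid


def encode_multipart(fields, filename, file_bytes, boundary=None):
    """Encode one file part named "data" plus text fields as multipart/form-data."""
    boundary = boundary or uuid.uuid4().hex
    parts = [
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="data"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        file_bytes,
        b"\r\n",
    ]
    for name, value in fields.items():
        parts.append(f"--{boundary}\r\n".encode())
        parts.append(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        parts.append(value.encode("utf-8") + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def extract_structure(base_url, path, doc_filter="contentcontrol", verify_tls=True):
    """POST a document to the extract endpoint and return the parsed JSON."""
    with open(path, "rb") as handle:
        body, content_type = encode_multipart(
            {"filter": doc_filter}, os.path.basename(path), handle.read()
        )
    request = urllib.request.Request(
        base_url + "/cool/extract-document-structure",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    context = ssl.create_default_context()
    if not verify_tls:
        # Equivalent of curl -k: only for test setups without a valid certificate.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context) as response:
        return json.loads(response.read().decode("utf-8"))
```

Against a local test server this would be called as extract_structure("https://localhost:9980", "test.docx", verify_tls=False). Always passing an explicit doc_filter follows the recommendation in the note above.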

General Transform:

API: HTTP POST to /cool/transform-document-structure/<format>&<lang=xx-XX>

The <format> and lang parameters can be omitted:

API: HTTP POST to /cool/transform-document-structure

Note: You can also provide format as a separate form parameter. If no format is defined, the input file format is used.

Example:

curl -v -k -F "data=@test.docx" -F "format=docx" -F "transform=$(cat transform.JSON)" https://localhost:9980/cool/transform-document-structure > out.docx
curl -k -F "data=@test.docx" -F "format=odt" -F "transform={\"Transforms\":{\"ContentControls.ByIndex.1\":{\"content\":\"Short text\"},\"ContentControls.ByIndex.6\":{\"content\":\"5/14/2024\",\"alias\":\"date\"}}}" https://localhost:9980/cool/transform-document-structure > out.odt
  • or in HTML:

<form action="https://localhost:9980/cool/transform-document-structure" enctype="multipart/form-data" method="post">
    File: <input type="file" name="data"><br/>
    Format: <input type="text" name="format"><br/>
    Transform: <input type="text" name="transform"><br/>
    <input type="submit" value="Transform">
</form>

  • data: the document file itself in the payload.

  • format: the format parameter (optional) is the format of the output document file, e.g. odt, docx.

  • transform: the transform parameter (required) is a JSON-formatted string that contains the transformation commands.

  • lang: the language parameter (optional) sets the default format language, which is useful for date-type cells. If passed, it is used as the load language and determines the display/output format. Example: lang=fr-FR

The order of the commands:

  • The transform commands are executed in the order in which they are listed in the transform JSON.

  • A filter command may match several items, and so transform several items (for example: “Charts.ByTitle.CommonName”).
    • In that case the items are handled one after another, based on an item order (ByIndex, ByEmbedIndex, …). Once all sub-commands of a filter have been executed on the first item, the same sub-commands are executed on the second item, and so on.

  • This can be useful in many cases, but be careful with complex cases: commands may not only overwrite the results of previous commands, they can also alter the effect of future commands.
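
Because execution order follows the order of the keys in the JSON text, the client has to emit the keys in the intended order. A small Python sketch of one way to do that: Python dicts keep insertion order and json.dumps writes keys in that order. The variable names are illustrative only.

```python
import json

# Commands are added in the order they should run; dicts preserve that order.
transforms = {}
transforms["ContentControls.ByIndex.1"] = {"content": "Short text"}
transforms["ContentControls.ByTag.datetag"] = {"content": "5/14/2024", "date": "2024-05-14"}

# json.dumps serializes keys in insertion order (do not pass sort_keys=True here).
payload = json.dumps({"Transforms": transforms}, ensure_ascii=False)
print(payload)
```

Passing sort_keys=True, or building the object in a language whose maps do not preserve order, could silently reorder the commands.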

Content Controls:

Content controls are items in a document that have a special UI optimized for user input, primarily for form filling (e.g. date picker, checkbox, combo box, …). This API can extract or overwrite the values of these content control items.

Extract

Use filter=contentcontrol

Example output (pretty-printed here):

{
    "DocStructure": {
        "ContentControls.ByIndex.0": {
            "id": -428815899,
            "tag": "machine-readable",
            "alias": "Human Readable",
            "content": "some text",
            "type": "plain-text"
        },
        "ContentControls.ByIndex.1": {
            "id": -325123259,
            "tag": "",
            "alias": "cb1",
            "content": "☒",
            "type": "checkbox",
            "Checked": "true"
        },
        "ContentControls.ByIndex.2": {
            "id": -1652436866,
            "tag": "",
            "alias": "",
            "content": "text1",
            "type": "drop-down-list"
        },
        "ContentControls.ByIndex.3": {
            "id": 659509202,
            "tag": "",
            "alias": "date",
            "content": "7\/7\/2024",
            "type": "date",
            "DateFormat": "M\/d\/yyyy",
            "DateLanguage": "en-US",
            "CurrentDate": "2024-07-07T00:00:00Z"
        }
    }
}

Content control types:

  • “plain-text”

  • “rich-text” (only its text is extracted, without formatting)

  • “checkbox” (“checked” = “true” or “false”, content = “☒” or “☐”)

  • “drop-down-list” (content is the selected item text)

  • “combo-box”

  • “date” (content = the text of date, DateFormat = text of the format of the date, DateLanguage = language of the date format)

  • “picture” (not supported yet)

Data it extracts:

  • “id” (ID number of the content control; it may not be unique.)

  • “alias” (Users may (or may not) set a short alias text to identify the content control.)

  • “tag” (Longer text that may (or may not) be generated by some software, or set by users.)

  • “content” (Its content as string)

  • “type” (Type of the content control)

  • “checked” (Only in case of a checkbox, its state, “true” or “false”)

  • “DateFormat” (Only in case of a date, format of the date like “M/d/yyyy”)

  • “DateLanguage” (Only in case of a date, language of the date like “en-US”)

  • “CurrentDate” (Only in case of a date, ISO8601 formatted date value, independent from language.)

Note: “id”, “tag”, “alias”, “content”, “type” are extracted from every type of content control.
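
As an illustration of reading these fields, here is a hedged Python sketch that reduces an extract result to one value per content control. The helper name summarize_controls is an assumption, and the input is a shortened copy of the example output above. Since the example output spells the checkbox key “Checked” while the field list spells it “checked”, the sketch accepts both.

```python
import json

# Shortened copy of the example output above (raw string keeps the \/ escapes).
EXAMPLE = r"""
{
    "DocStructure": {
        "ContentControls.ByIndex.0": {
            "id": -428815899, "tag": "machine-readable", "alias": "Human Readable",
            "content": "some text", "type": "plain-text"
        },
        "ContentControls.ByIndex.1": {
            "id": -325123259, "tag": "", "alias": "cb1",
            "content": "☒", "type": "checkbox", "Checked": "true"
        },
        "ContentControls.ByIndex.3": {
            "id": 659509202, "tag": "", "alias": "date",
            "content": "7\/7\/2024", "type": "date", "DateFormat": "M\/d\/yyyy",
            "DateLanguage": "en-US", "CurrentDate": "2024-07-07T00:00:00Z"
        }
    }
}
"""


def summarize_controls(doc):
    """Map each content control key to a bool (checkbox), ISO date (date) or text."""
    summary = {}
    for key, control in doc["DocStructure"].items():
        if not key.startswith("ContentControls."):
            continue
        kind = control.get("type")
        if kind == "checkbox":
            # Accept either spelling of the key, and either a string or a boolean.
            state = control.get("Checked", control.get("checked"))
            summary[key] = state in ("true", True)
        elif kind == "date":
            # Prefer the language-independent value when it is present.
            summary[key] = control.get("CurrentDate", control["content"])
        else:
            summary[key] = control["content"]
    return summary


print(summarize_controls(json.loads(EXAMPLE)))
```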

Transform

Example Transform:

{
    "Transforms": {
        "ContentControls.ByIndex.1": {
            "content":"Short text"
        },
        "ContentControls.ByTag.datetag": {
            "content":"5/14/2024",
            "date":"2024-05-14",
            "alias":"date"
        }
    }
}

To select which content controls you want to transform, you can use:

  • ByIndex.<number> Index of the content control, counted from 0. It is always unique.

  • ById.<number> ID number of the content control; it may not be unique.

  • ByAlias.<string> Users may (or may not) set a short alias text to identify the content control.

  • ByTag.<string> Longer text that may (or may not) be generated by some software, or set by users.

Note: Most of these identifiers are not unique, so one selector can address several content controls at once. In that case the transformation is applied to every addressed content control.

To transform a content control you can set:

  • “content”: <text>

  • “checked”: <true or false>

  • “alias”: <text>

  • “date”: <date text> Only for date content controls. Does not change the displayed text. Always use ISO8601 format like “2024-03-22”

“content” is a general-purpose string: it can be used to set the data of any type of content control. (“picture” is not supported yet.)

In case of a “checkbox”:
  • Setting “content” to “☐” will also set “checked” to false.

  • Setting “checked”:true will also set “content” to “☒”.

In case of a “date”:
  • Setting “content” only sets the displayed text, like “1999.dec.12”; the actual date value is not changed.

  • Setting “date” only sets the date value; it does not change the displayed text.

In case of a “rich-text”, only simple unformatted text can be set (as with “plain-text”).

Transform makes the changes to content controls in the same order as the ContentControls entries are listed in the transform JSON string. In the Example Transform, it first sets the text of the content control with index 1 to “Short text”. Then it finds all content controls whose tag is “datetag” and sets their content to “5/14/2024”, their date value to 2024-05-14, and their alias to “date”.

Note: If you set “content” to an empty string, the control may revert to its placeholder value (e.g. “Click here to enter text”, “Choose a date”, or “Choose an item”).
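
Since “content” and “date” are independent for date controls, a client that wants the displayed text and the stored value to agree has to set both. A minimal Python sketch under that assumption: the helper name date_control_update is hypothetical, and the M/d/yyyy display style is simply the one used in the example.

```python
import json
from datetime import date


def date_control_update(value, alias=None):
    """Build sub-commands that set both the displayed text and the stored date."""
    update = {
        # Displayed text, written here in the M/d/yyyy style of the example.
        "content": f"{value.month}/{value.day}/{value.year}",
        # Language-independent value in ISO 8601 format.
        "date": value.isoformat(),
    }
    if alias is not None:
        update["alias"] = alias
    return update


transform = {
    "Transforms": {
        "ContentControls.ByTag.datetag": date_control_update(date(2024, 5, 14), alias="date")
    }
}
print(json.dumps(transform))
```

The resulting object has the same shape as the second entry of the Example Transform.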

Screenshot

../_images/ContentControlScreenshot.png

Example Files

Original Document

Transform JSON

Command for transform:

curl -v -k -F "data=@contentControlsExampleOriginal.odt" -F "transform=$(cat DocPropTransform.JSON)" https://localhost:9980/cool/transform-document-structure > contentControlsExampleResult.odt

Command for extract:

curl -k -F "data=@contentControlsExampleOriginal.odt" -F "filter=contentcontrol" https://localhost:9980/cool/extract-document-structure > contentControlsOriginalExtract.JSON
curl -k -F "data=@contentControlsExampleResult.odt" -F "filter=contentcontrol" https://localhost:9980/cool/extract-document-structure > contentControlsResultExtract.JSON

Extracted JSON from result odt

Extracted JSON from result odt, Pretty printed

Extracted JSON from original odt, Pretty printed

Screenshot

Charts:

Extract

Use filter=charts

Example output (pretty-printed here):

{
    "DocStructure": {
        "Charts.ByEmbedIndex.0": {
            "name": "Object1",
            "title": "Paid leave days",
            "subtitle": "Subtitle2",
            "RowDescriptions": [ "James", "Mary", "Patricia", "David"],
            "ColumnDescriptions": [ "2022", "2023"],
            "DataValues": [
                "Row.0": [ "22", "24"],
                "Row.1": [ "18", "16"],
                "Row.2": [ "32", "32"],
                "Row.3": [ "25", "23"]
            ]
        }
    }
}

Data it extracts:

  • “name” (Name of the embedded object of the chart, can be used as a filter in transform to select the needed chart.)

  • “title” (The title of the chart, as a simple string.)

  • “subtitle” (The Subtitle of the chart, as a simple string.)

  • “RowDescriptions” (Array of strings, containing the descriptions of the rows.)

  • “ColumnDescriptions” (Array of strings, containing the descriptions of the columns.)

  • “DataValues” (Matrix of numbers, containing the data of every cell.)

Note: Some of the data values can be “NaN”; this means they are not set.

Transform

Example Transform:

{
    "Transforms": {
        "Charts.ByEmbedIndex.0": {
            "modifyrow.1": [ 19, 15 ],
            "datayx.3.1": 37,
            "deleterow.0": "",
            "insertrow.0": [ 15, 17 ],
            "setrowdesc.0": "Paul",
            "insertcolumn.1": [ 1,2,3,4,5,6 ],
            "setcolumndesc.0": "c0",
            "deletecolumn.3": ""
        },
        "Charts.ByEmbedName.Object3": {
            "resize": [ 3, 3 ],
            "setrowdesc": [ "a", "b", "c"]
        },
        "Charts.ByTitle.Fixed issues": {
            "data": [ [ 3,1 ],
                      [ 2,0,1 ],
                      [ 3 ] ],
            "setrowdesc": ["2023.01",".02",".03"],
            "setcolumndesc": ["Jennifer", "Charles", "Thomas"]
        }
    }
}

To select which chart you want to transform, you can use these filters:

  • ByEmbedIndex.<number> Index of the embedded object, counted from 0. It is always unique. Some embedded objects are not charts.

  • ByEmbedName.<string> The unique name of the embedded object of the chart.

  • ByTitle.<string> Title of the chart. (Title is optional)

  • BySubTitle.<string> Subtitle of the chart. (Subtitle is optional)

To transform a chart you can use these commands:

  • deletecolumn.<number>: “” Delete column <number>.

  • deleterow.<number>: “” Delete row <number>.

  • modifycolumn.<number>: [<numberArray>] Set the data of column <number> to the values in <numberArray>.

  • modifyrow.<number>: [<numberArray>] Set the data of row <number> to the values in <numberArray>.

  • insertcolumn.<number>: “” Insert a column before column <number>, without setting any values in it.

  • insertcolumn.<number>: [<numberArray>] Insert a column before column <number>, and set its values to <numberArray>.

  • insertrow.<number>: “” Insert a row before row <number>, without setting any values in it.

  • insertrow.<number>: [<numberArray>] Insert a row before row <number>, and set its values to <numberArray>.

  • setcolumndesc.<number>: “<text>” Set the description of column <number> to <text>.

  • setcolumndesc: [<textArray>] Set the descriptions of the first N columns to <textArray>, where N is the number of elements in <textArray>.

  • setrowdesc.<number>: “<text>” Set the description of row <number> to <text>.

  • setrowdesc: [<textArray>] Set the descriptions of the first N rows to <textArray>, where N is the number of elements in <textArray>.

  • resize: [<numberR>,<numberC>] Resize the data table to <numberR> rows and <numberC> columns. Both numbers are required, and must be >= 1.

  • datayx.<numberR>.<numberC>: <numberV> Set the value of the cell at row <numberR>, column <numberC> to <numberV>.

  • data: [ <numberArrayArray> ] Set the values of the data table to <numberArrayArray>, resizing the table as needed if the original table was smaller.

Note: Commands that take an array of values can be given fewer values than the destination holds; in that case the remaining (last) elements are not transformed.
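
The command keys above embed their row and column numbers, so building them programmatically helps avoid typos. A small Python sketch: the helper names resize and set_cell are hypothetical, and only the documented >= 1 rule for resize is validated locally; everything else is left to the server.

```python
import json


def resize(rows, columns):
    """Return the resize command; both numbers are required and must be >= 1."""
    if rows < 1 or columns < 1:
        raise ValueError("resize needs rows >= 1 and columns >= 1")
    return {"resize": [rows, columns]}


def set_cell(row, column, value):
    """Return a datayx.<row>.<column> command for a single cell."""
    return {f"datayx.{row}.{column}": value}


# Commands are added in the order they should be executed.
commands = {}
commands.update(resize(3, 2))
commands.update({"setrowdesc": ["a", "b", "c"]})
commands.update(set_cell(2, 1, 37))

transform = {"Transforms": {"Charts.ByEmbedName.Object3": commands}}
print(json.dumps(transform, indent=4))
```

Resizing first and writing cells afterwards keeps every later index inside the table, which matters because the commands run in listed order.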

Screenshot

../_images/ChartsScreenshot.png

Example Files

Original Document

Transform JSON

Command for transform:

curl -v -k -F "data=@docStructureChartExampleOriginal.odt" -F "transform=$(cat ChartsTransform.JSON)" https://localhost:9980/cool/transform-document-structure > docStructureChartResult.odt

Command for extract:

curl -k -F "data=@docStructureChartExampleOriginal.odt" -F "filter=charts" https://localhost:9980/cool/extract-document-structure > ChartsExtractOriginal.JSON

Extracted JSON

Extracted JSON Pretty printed

Screenshot

Document Properties:

Extract

Use filter=docprops

Example output (pretty-printed here):

{
    "DocStructure": {
        "DocPops": {
            "Author": "Author TxT",
            "Generator": "Generator TxT",
            "CreationDate": "2024-01-21T14:45:00",
            "Title": "Title TxT",
            "Subject": "Subject TxT",
            "Description": "Description TxT",
            "Keywords": [ ],
            "Language": "en-GB",
            "ModifiedBy": "ModifiedBy TxT",
            "ModificationDate": "2024-05-23T10:05:50.159530766",
            "PrintedBy": "PrintedBy TxT",
            "PrintDate": "0000-00-00T00:00:00",
            "TemplateName": "TemplateName TxT",
            "TemplateURL": "TemplateURL TxT",
            "TemplateDate": "0000-00-00T00:00:00",
            "AutoloadURL": "",
            "AutoloadSecs": 0,
            "DefaultTarget": "DefaultTarget TxT",
            "DocumentStatistics": {
                "PageCount": 300,
                "TableCount": 60,
                "ImageCount": 10,
                "ObjectCount": 0,
                "ParagraphCount": 2880,
                "WordCount": 78680,
                "CharacterCount": 485920,
                "NonWhitespaceCharacterCount": 411520
            },
            "EditingCycles": 12,
            "EditingDuration": 12345,
            "Contributor": [ "Contributor1 TxT", "Contributor2 TXT"],
            "Coverage": "Coverage TxT",
            "Identifier": "Identifier TxT",
            "Publisher": [ "Publisher TxT", "Publisher2 TXT"],
            "Relation": [ "Relation TxT", "Relation2 TXT"],
            "Rights": "Rights TxT",
            "Source": "Source TxT",
            "Type": "Type TxT",
            "UserDefinedProperties": {
                "NewPropName Bool": {
                    "type": "boolean",
                    "value": true
                },
                "NewPropName Numb": {
                    "type": "long",
                    "value": 1245
                },
                "NewPropName Str": {
                    "type": "string",
                    "value": "this is a string"
                },
                "NewPropName float": {
                    "type": "float",
                    "value": 12.45
                }
            }
        }
    }
}

Data it extracts:

  • “Author” (The name of the user who first saved the file.)

  • “Generator” (Identifies which application was used to create or last modify the document.)

  • “CreationDate” (The date and time when the file was first saved.)

  • “Title” (Title of the document.)

  • “Subject” (Subject of the document. Can be used to group documents with similar contents.)

  • “Description” (Comments to help identify the document.)

  • “Keywords” (Array of texts. Words used to index the content of the document. Can contain white spaces.)

  • “Language” (The default language of the document.)

  • “ModifiedBy” (The name of the user who last saved the file in a LibreOffice file format.)

  • “ModificationDate” (The date and time when the file was last saved in a LibreOffice file format.)

  • “PrintedBy” (The name of the user who last printed the file.)

  • “PrintDate” (The date and time when the file was last printed.)

  • “TemplateName” (The template that was used to create the file.)

  • “TemplateURL” (The URL of the template from which the document was created. The value is an empty string if the document was not created from a template or if it was detached from the template.)

  • “TemplateDate” (The date and time of when the document was created or updated from the template.)

  • “AutoloadURL” (The URL to load automatically at a specified time after the document is loaded into a desktop frame. An empty URL is valid and describes a case where the document shall be reloaded from its original location. An empty URL together with an “AutoloadSecs” value of 0 describes a case where no autoload is specified.)

  • “AutoloadSecs” (The number of seconds after which a specified URL is to be loaded after the document is loaded into a desktop frame. A value of 0 is valid and describes a redirection. A value of 0 together with an empty string as “AutoloadURL” describes a case where no autoload is specified.)

  • “DefaultTarget” (The name of the default frame into which links should be loaded if no target is specified.)

  • “DocumentStatistics” (Statistics about the document, as separate properties. They will be recalculated and overwritten at document open.)
    • “PageCount”

    • “TableCount”

    • “ImageCount”

    • “ObjectCount”

    • “ParagraphCount”

    • “WordCount”

    • “CharacterCount”

    • “NonWhitespaceCharacterCount”

  • “EditingCycles” (The number of times that the file has been saved.)

  • “EditingDuration” (The amount of time that the file has been open for editing since the file was created. The editing time is updated when the file is saved.)

  • “Contributor” (Array of texts. Names of the people, organizations, or other entities that have made contributions to the document.)

  • “Coverage” (Time, place, or jurisdiction that the document is relevant to. For example, a range of dates, a place, or an institution that the document applies to.)

  • “Identifier” (Some unique identifier like ISBN)

  • “Publisher” (Array of texts. Name of the entity that is making the document available. For example, a company, university, or government body.)

  • “Relation” (Array of texts. Resources related to the document. For example, a set of volumes the document is part of, or the document’s edition number.)

  • “Rights” (Information about intellectual property rights associated with the document. For example, a copyright statement, or information about who has permission to access the document.)

  • “Source” (Information about other resources from which the document is derived. For example, the name or identifier of a hard copy that the document was scanned from, or a URL that the document was downloaded from.)

  • “Type” (Information about the category or format of the document. For example, whether the document is a text document, image, or multimedia presentation.)

  • “UserDefinedProperties” (List of user-defined properties. Date/time-related types are not supported yet: their names and types are extracted, but their values are not.)

Note: Extracting UserDefinedProperties may retrieve other types, depending on what types of document properties the document has. Unfortunately, different parts of LibreOffice have slightly different limitations for these types:

  • With a recent LibreOffice, I was able to create only these 6 types with the UI dialog: string, boolean, double, com.sun.star.util.Date, com.sun.star.util.DateTime, com.sun.star.util.Duration

  • There are other ways to create document properties, which can add different types. Older versions of LibreOffice were probably also able to add different (outdated) types, which can still be extracted from old documents.

  • Unfortunately, I did not find an exact limit on the possible document property types in a document. I found a type check in the source code (applied when a property is added) that may hint at which types to expect in special cases. It lists 7 more types: typelib_TypeClass_FLOAT, typelib_TypeClass_HYPER, typelib_TypeClass_LONG, typelib_TypeClass_SHORT, Time, DateTimeWithTimezone, DateWithTimezone
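
Because the set of possible types is open-ended, client code is safer if it keeps the reported type next to each value and tolerates entries without a value. A hedged Python sketch: the helper name user_defined_values is an assumption, the “Hypothetical Date” entry is invented to illustrate a date-typed property whose value was not extracted, and the object under “DocStructure” is taken by position rather than by a hard-coded key.

```python
import json

# Reduced example; "Hypothetical Date" is an invented entry with no value.
EXAMPLE = """
{
    "DocStructure": {
        "DocPops": {
            "Title": "Title TxT",
            "UserDefinedProperties": {
                "NewPropName Bool": {"type": "boolean", "value": true},
                "NewPropName Numb": {"type": "long", "value": 1245},
                "Hypothetical Date": {"type": "com.sun.star.util.Date"}
            }
        }
    }
}
"""


def user_defined_values(doc):
    """Return {property name: (type, value or None)} for user-defined properties."""
    # Take the single object under DocStructure instead of relying on its key name.
    props = next(iter(doc["DocStructure"].values()))
    result = {}
    for name, entry in props.get("UserDefinedProperties", {}).items():
        # Unsupported (date/time) types may come without a value, hence .get().
        result[name] = (entry["type"], entry.get("value"))
    return result


print(user_defined_values(json.loads(EXAMPLE)))
```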

Transform

Example Transform:

{
    "Transforms": {
        "DocumentProperties": {
            "Author":"Author TxT",
            "Generator":"Generator TxT",
            "CreationDate":"2024-01-21T14:45:00",
            "Title":"Title TxT",
            "Subject":"Subject TxT",
            "Description":"Description TxT",
            "Keywords": [ ],
            "Language":"en-GB",
            "ModifiedBy":"ModifiedBy TxT",
            "ModificationDate":"2024-05-23T10:05:50.159530766",
            "PrintedBy":"PrintedBy TxT",
            "PrintDate":"0000-00-00T00:00:00",
            "TemplateName":"TemplateName TxT",
            "TemplateURL":"TemplateURL TxT",
            "TemplateDate":"0000-00-00T00:00:00",
            "AutoloadURL":"",
            "AutoloadSecs": 0,
            "DefaultTarget":"DefaultTarget TxT",
            "DocumentStatistics": {
                "PageCount": 300,
                "TableCount": 60,
                "ImageCount": 10,
                "ObjectCount": 0,
                "ParagraphCount": 2880,
                "WordCount": 78680,
                "CharacterCount": 485920,
                "NonWhitespaceCharacterCount": 411520
            },
            "EditingCycles":12,
            "EditingDuration":12345,
            "Contributor":["Contributor1 TxT","Contributor2 TXT"],
            "Coverage":"Coverage TxT",
            "Identifier":"Identifier TxT",
            "Publisher":["Publisher TxT","Publisher2 TXT"],
            "Relation":["Relation TxT","Relation2 TXT"],
            "Rights":"Rights TxT",
            "Source":"Source TxT",
            "Type":"Type TxT",
            "UserDefinedProperties":{
                "Add.NewPropName Str": {
                    "type": "string",
                    "value": "this is a string"
                },
                "Add.NewPropName Str": {
                    "type": "boolean",
                    "value": false
                },
                "Add.NewPropName Bool": {
                    "type": "boolean",
                    "value": true
                },
                "Add.NewPropName Numb": {
                    "type": "long",
                    "value": 1245
                },
                "Add.NewPropName float": {
                    "type": "float",
                    "value": 12.45
                },
                "Add.NewPropName Double": {
                    "type": "double",
                    "value": 124.578
                },
                "Delete": "NewPropName Double"
            }
        }
    }
}

To transform a document property, use commands with the same names as the extracted data. There are some additional commands for “UserDefinedProperties”:

  • “Delete”.<PropName>

    Deletes the user-defined property named <PropName>.

  • “Add”.<PropName>:{“type”:<PropType>,“value”:<PropValue>}
    Adds a new user-defined property named <PropName>, overwriting the existing value if the property is already present. The new property has type <PropType> and value <PropValue>.
    Types are limited; it should work with string, boolean, long, and float.
    Date/time-related types are not supported.

Note: Some property values are overwritten when the document is opened or saved.

  • ModifiedBy and ModificationDate are overwritten by any save. (That is why they have wrong values in the screenshot.)

  • DocumentStatistics are recalculated and overwritten when the document is opened, but they are not recalculated on extract.
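
As an illustration of the “Add” command, the following Python sketch maps Python values to the four type names listed above. The helper name add_property and the property names “Reviewed” and “Revision” are hypothetical; the “Delete” entry is written the way it appears in the Example Transform.

```python
import json


def add_property(name, value):
    """Return an Add.<name> command, choosing one of the documented type names."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        type_name = "boolean"
    elif isinstance(value, int):
        type_name = "long"
    elif isinstance(value, float):
        type_name = "float"
    elif isinstance(value, str):
        type_name = "string"
    else:
        # Date/time-related types are not supported by the API.
        raise TypeError(f"unsupported property type: {type(value).__name__}")
    return {f"Add.{name}": {"type": type_name, "value": value}}


user_props = {}
user_props.update(add_property("Reviewed", True))
user_props.update(add_property("Revision", 7))
user_props["Delete"] = "NewPropName Double"  # same form as in the Example Transform

transform = {
    "Transforms": {
        "DocumentProperties": {
            "Title": "Title TxT",
            "UserDefinedProperties": user_props,
        }
    }
}
print(json.dumps(transform, indent=4))
```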

Screenshot

../_images/DocPropertiesExampleScreenShot.png

Example Files

Original Document

Transform JSON

Command for transform:

curl -v -k -F "data=@docStructureChartExampleOriginal.odt" -F "transform=$(cat DocPropTransform.JSON)" https://localhost:9980/cool/transform-document-structure > temp2.odt

Command for extract:

curl -k -F "data=@temp2.odt" -F "filter=docprops" https://localhost:9980/cool/extract-document-structure > DocPropExtract.JSON

Extracted JSON

Extracted JSON Pretty printed

Screenshot

Version updates:

24.04.5.3: Added Content Controls.
24.04.8-1: Updated Content Controls: Added “CurrentDate” and “date” to get/set a language-independent date.
24.04.8-1: Added Charts.
24.04.9-1: Added Document Properties.

Note: The extract-document-structure and transform-document-structure endpoints are restricted to allowed host addresses, which can be set in the /etc/coolwsd/coolwsd.xml configuration file. The IP addresses have to be added as dot-escaped net.post_allow.host entries.
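
“Dot-escaped” here is read as putting a backslash before every dot of the address, which suggests the entries are matched as patterns; that reading is an assumption. A tiny Python sketch of the escaping, with a hypothetical helper name:

```python
def dot_escape(address):
    """Put a backslash before every dot so it is matched as a literal dot."""
    return address.replace(".", "\\.")


print(dot_escape("192.168.1.10"))  # prints 192\.168\.1\.10
```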