What is the encoding used for XML files exported using Export Publication, and can you change it?

The output files contained in the .zip created by the Export Publication feature do not appear to be in ASCII flat-file format. I have a utility I use to do full-text searches, but the utility doesn't appear to recognize the encoding. If I use Notepad++ to create a new file, cut and paste the text from the exported file into the new file, and then save it, my utility has no problem reading the content. First, I need to know what encoding is used for exported files, and second, I would like to know if you can set a particular encoding specification before exporting.

Thanks.

-jean farnsworth
OpenText

  • Echoing Mark's reply, yes, it's UTF-16LE, and to the best of my knowledge, there's no configuration setting anywhere in TD14 to change that. (I have spent time trying to find the magic option.) This basically comes down to UTF-16 being a Microsoft default in XML handling, and maybe in all text string handling. I also recently learned that PowerShell does the same thing by default, even with plain text files, such as when writing out to a log file.

    It's been suggested that, in the case of a publish output process, you could create a small publish plug-in to iterate over all the exported XML and convert it to UTF-8. The simplest way to do this in the plug-in would probably be to run a "dummy" XSLT (i.e. input=output) with <xsl:output method="xml" encoding="UTF-8" . . . /> in the XSLT.

    Otherwise, it's up to you to manually convert them, or create a tool that might automate it in whatever processing pipeline you are putting your Export Publication output through. Linux is involved in a lot of our post-processing, so I go to "iconv", as one of the things Mark mentioned. We also do a lot of Perl here, so when writing strings/content to filehandles using open/print/close, you can set the output encoding in the "open" statement with " '>:encoding(UTF-8)' ". Or, run an XSLT.
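
For anyone who wants to script the conversion step described above without iconv, Perl, or XSLT, here is a minimal Python sketch. It is not part of TD14 or any OpenText tooling. The function names (convert_xml_bytes, convert_tree) are made up for illustration, and it assumes the exported files are UTF-16 with an encoding="..." attribute in the XML declaration, and that a file with no byte order mark is little-endian.

```python
import codecs
import re
from pathlib import Path

# Matches the encoding="..." pseudo-attribute inside an XML declaration.
_DECL_ENCODING = re.compile(r'(<\?xml[^>]*?encoding\s*=\s*)(["\'])[^"\']*\2')


def convert_xml_bytes(data: bytes) -> bytes:
    """Re-encode UTF-16 XML as UTF-8 and update the XML declaration."""
    # Already 8-bit (for example, converted on an earlier run): leave it alone.
    if data.startswith(codecs.BOM_UTF8) or data.startswith(b"<?xml"):
        return data
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order and drops it.
        text = data.decode("utf-16")
    else:
        # Assumption: no BOM means little-endian, as reported in this thread.
        text = data.decode("utf-16-le")
    # Keep the declaration truthful about the new encoding.
    text = _DECL_ENCODING.sub(r"\g<1>\g<2>UTF-8\g<2>", text, count=1)
    return text.encode("utf-8")


def convert_tree(root: Path) -> int:
    """Convert every .xml file under root in place; return the file count."""
    count = 0
    for path in root.rglob("*.xml"):
        path.write_bytes(convert_xml_bytes(path.read_bytes()))
        count += 1
    return count
```

To use it, unzip the Export Publication output into a folder and call convert_tree(Path("that_folder")). It overwrites files in place, so try it on a copy first.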
