What encoding is used for XML files exported with Export Publication, and can it be changed?

The output files contained in the .zip created by the Export Publication feature do not appear to be in ASCII flat-file format. I have a utility I use to do full-text searches, but the utility doesn't appear to recognize the encoding. If I use Notepad++ to create a new file, cut and paste the text from the exported file into the new file, and then save it, my utility has no problem reading the content. First, I need to know what encoding is used for exported files, and second, I would like to know if you can set a particular encoding specification before exporting.

Thanks.

-jean farnsworth
OpenText

  • The export encoding is UTF-16LE with a signature (byte order mark). I'm not sure whether it can be changed or configured. A variety of tools can convert the files to UTF-8 for compatibility with various text editors, though some editors handle UTF-16 just fine (EmEditor, for example). I have used UTFCast Express by RotatingScroll. iconv seems to be popular as well; it is native to Linux, and a Windows port is available too.
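For anyone who would rather script the conversion than install a tool, here is a minimal Python sketch of the same idea. It assumes the exported file really does start with a UTF-16 byte order mark, as described above; the function name and the file paths are hypothetical, and the XML declaration inside the file is left untouched.

```python
import codecs
from pathlib import Path


def utf16_to_utf8(src: Path, dst: Path) -> bool:
    """Convert src to UTF-8 at dst if src starts with a UTF-16 BOM.

    Returns True when a conversion happened, and False when no BOM was
    found (in which case dst is not written).
    """
    raw = src.read_bytes()
    if not raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return False
    # The "utf-16" codec reads the BOM, picks the byte order, and drops the BOM.
    text = raw.decode("utf-16")
    # Working on bytes keeps the original line endings exactly as exported.
    dst.write_bytes(text.encode("utf-8"))
    return True
```

Calling `utf16_to_utf8(Path("topic.xml"), Path("topic-utf8.xml"))` should do roughly what `iconv -f UTF-16 -t UTF-8 topic.xml > topic-utf8.xml` does, with the added safety of skipping files that carry no signature.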

  • Echoing Mark's reply: yes, it's UTF-16LE, and to the best of my knowledge there is no configuration setting anywhere in TD14 to change that. (I have spent time looking for the magic option.) This seems to come down to UTF-16 being a Microsoft default for XML handling, and maybe for text string handling in general. I also recently learned that PowerShell does the same thing by default, even with plain text files, such as when writing out to a log file.

    It's been suggested that, in the case of a publish output process, you could create a small publish plug-in that iterates over all the exported XML and converts it to UTF-8. The simplest way to do this in the plug-in would probably be to run a "dummy" identity XSLT (i.e. input=output) with <xsl:output method="xml" encoding="UTF-8" . . . /> in the XSLT.

    Otherwise, it's up to you to convert them manually, or to create a tool that automates the conversion in whatever processing pipeline you run your Export Publication output through. Linux is involved in a lot of our post-processing, so I reach for iconv, one of the tools Mark mentioned. We also do a lot of Perl here; when writing strings/content to filehandles using open/print/close, you can set the output encoding in the open statement with '>:encoding(UTF-8)'. Or, run an XSLT.
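If neither Perl nor XSLT is handy, the "create a tool" option can be sketched in Python along these lines. This is only an illustration under a few assumptions: the .zip has already been extracted to a folder, the files of interest end in .xml, and each one starts with a UTF-16 byte order mark. The function name is hypothetical. One thing a raw byte conversion (iconv included) does not do is update an encoding="..." value in the XML declaration, if the exported files carry one, so the sketch rewrites that as well.

```python
import re
from pathlib import Path

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_DECL_ENCODING = re.compile(r'^(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2')


def convert_tree(root: Path) -> int:
    """Rewrite every UTF-16 *.xml file under root as UTF-8, in place.

    Returns the number of files converted. Files that do not start with
    a UTF-16 BOM are skipped, so running it twice is harmless.
    """
    converted = 0
    for path in sorted(root.rglob("*.xml")):
        raw = path.read_bytes()
        if not raw.startswith((b"\xff\xfe", b"\xfe\xff")):
            continue
        # Decoding with "utf-16" honours the BOM and removes it from the text.
        text = raw.decode("utf-16")
        # Keep the declaration consistent with the new byte encoding.
        text = _DECL_ENCODING.sub(r"\1\2UTF-8\2", text, count=1)
        path.write_bytes(text.encode("utf-8"))
        converted += 1
    return converted
```

Because it overwrites files in place, it is safer to point it at a copy of the extracted export, e.g. `convert_tree(Path("export_copy"))`, rather than at the only copy you have.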
