utf8 in fromxsf - inconsistent output

Hello. I have an issue with the -utf8 switch not working consistently for the fromxsf command.

My input data contains a section sign: §.

I am running this command on the DIV composed from that data: fromxsf -utf8 -cd $DIVpath $OUTfile

One DIV outputs the section sign as the literal "§" character, while another outputs it as a numeric character reference.

If I switch the data around between these DIVs, one still consistently outputs the literal character and the other consistently outputs the character reference.

I can only conclude the issue is not with the data but instead there is something in the "bad" DIV that does not recognize the -utf8 switch.

Any ideas why this is happening?

All help is appreciated.

Thank you.

  • My guess is that you're talking about XML data, and in one of your DIVs the data came in (via toxsf) with the character specified as a numeric character reference rather than as the literal "§" character.

    An excerpt from the on-line help file for fromxsf says:
    Characters which are input (via toxsf) as entities or as numeric character references are always output unchanged.

    Specifying the -utf8 option does not change that behavior. XPP is trying its best to maintain the data as originally input when round-tripping the data (at least in this aspect).

    You should be able to confirm this in xyview: find the section sign character and open the Line Editor, where the character will show as the numeric character reference sequence rather than just as the section sign character. If you cursor past the sequence in the Line Editor, the cursor jumps past the whole sequence, indicating that it represents a single character. All of that indicates that the data came in that way.

    I hope that helps explain what you are seeing.

  • Hi, Jonathan. I tried looking at the markup in the Line Editor and it behaved exactly as you said. So, that would support the idea that the section sign is coming into XPP as a numeric character reference.

    However, looking at our process, where we have a lot of temporary files before reaching XPP, I only see the literal "§" and not a character reference.

    I even went to our database to look at the original file and it only shows the literal "§". As a precaution, I even looked at the source data in Notepad (since it acts like a crucible for text and usually removes any "impurities"). In every case, I only see the literal character and not a character reference.

    Still, I trust your information, so I'll look further to see where the character reference is coming into play.

    Thank you for your help.
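
One way to narrow down where the character reference enters the pipeline is to scan each temporary file for both forms of the section sign. The sketch below is only an illustration, not part of XPP: it assumes the files are UTF-8 text and that the reference takes one of the standard XML/HTML forms for U+00A7 (decimal, hexadecimal, or the named entity). The function names and the script itself are hypothetical.

```python
import re
import sys
from pathlib import Path

# Forms that all resolve to U+00A7 (section sign) in XML/HTML.
# Which form, if any, appears in the real data is an assumption to verify.
REFERENCE = re.compile(r"&#0*167;|&#[xX]0*[aA]7;|&sect;")
LITERAL = "\u00a7"


def count_section_signs(text):
    """Return (literal_count, reference_count) for one chunk of text."""
    return text.count(LITERAL), len(REFERENCE.findall(text))


def scan(paths):
    """Yield (path, literal_count, reference_count) for each file given."""
    for path in paths:
        # errors="replace" keeps the scan going if a file is not valid UTF-8.
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        literal, reference = count_section_signs(text)
        yield path, literal, reference


if __name__ == "__main__":
    for path, literal, reference in scan(sys.argv[1:]):
        print(f"{path}: literal={literal} reference={reference}")
```

Passing the temporary files on the command line in pipeline order, ending with the file handed to toxsf, should show the first stage at which the reference count becomes nonzero, if the conversion happens in one of those files at all.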