Issue
What exactly is the BOM in an ANSI XML document, and should it be removed? Should an XML document be in UTF-8 instead? Can anyone point me to a Java method that will detect the BOM? The BOM consists of the bytes EF BB BF.
Solution
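For an ANSI XML file, the BOM should be removed: in a single-byte code page such as windows-1252, the bytes EF BB BF are not a marker at all and are read as the three characters "ï»¿". If you want to use UTF-8, you don't really need it. It is only needed for UTF-16 and UTF-32.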
The Byte-Order-Mark (or BOM) is a special marker added at the very beginning of a Unicode file encoded in UTF-8, UTF-16 or UTF-32. It is used to indicate whether the file uses the big-endian or little-endian byte order. The BOM is mandatory for UTF-16 and UTF-32, but it is optional for UTF-8.
(Source: https://www.opentag.com/xfaq_enc.htm#enc_bom)
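If the BOM does have to be removed, as recommended above for an ANSI file, the following is a minimal sketch of one way to do it for content that has already been read into memory. The class name `BomStripper` and the method `stripUtf8Bom` are made up for this illustration and are not part of any library.

```java
import java.util.Arrays;

final class BomStripper {

    /**
     * Returns the content without a leading UTF-8 BOM (EF BB BF).
     * If no BOM is present, the input array is returned unchanged.
     */
    static byte[] stripUtf8Bom(byte[] data) {
        // Mask with 0xFF because Java bytes are signed.
        boolean hasBom = data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
        return hasBom ? Arrays.copyOfRange(data, 3, data.length) : data;
    }
}
```

A typical call might be `BomStripper.stripUtf8Bom(Files.readAllBytes(Paths.get("document.xml")))`, where `document.xml` is only a placeholder file name. This approach suits small files; for large inputs a streaming variant would be more appropriate.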
Regarding the question of how to detect this in Java:
See the answers to the question "Java : How to determine the correct charset encoding of a stream". If you want to detect the BOM yourself (at your own risk), see for example the code in "Java Tip: How to read a file and automatically specify the correct encoding".
Basically, just read the first few bytes yourself and check whether they match a known BOM.
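The idea of reading the first few bytes and comparing them against the known BOM signatures can be sketched roughly as follows. This is not the code from the linked pages; `BomDetector`, `detectBom` and `detectBomFromStream` are hypothetical names chosen for this example. A missing BOM says nothing about the actual encoding, so a `null` result only means "no BOM found".

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

final class BomDetector {

    /**
     * Returns the encoding name implied by a BOM at the start of head,
     * or null if no known BOM is found. Pass at least the first 4 bytes.
     */
    static String detectBom(byte[] head) {
        // UTF-32 is checked before UTF-16 because the UTF-32LE BOM (FF FE 00 00)
        // begins with the UTF-16LE BOM (FF FE). FF FE 00 00 could in theory also be
        // UTF-16LE text starting with a NUL character; this sketch ignores that case.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    /** Reads up to 4 bytes from the stream (consuming them) and checks them for a BOM. */
    static String detectBomFromStream(InputStream in) throws IOException {
        byte[] head = new byte[4];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break; // end of stream before 4 bytes were available
            n += r;
        }
        return detectBom(Arrays.copyOf(head, n));
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask with 0xFF because Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Note that `detectBomFromStream` consumes the bytes it inspects, so the caller would need to reopen the file, or wrap the stream in something like `java.io.PushbackInputStream`, before parsing the content.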
Answered By - jitter