Package net.sf.okapi.common
Class BOMNewlineEncodingDetector
java.lang.Object
    net.sf.okapi.common.BOMNewlineEncodingDetector
-
public final class BOMNewlineEncodingDetector extends Object
Helper class to detect the byte-order mark (BOM) and other easily guessed encodings, as well as the type of line break used in a given input. Based on information in http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info and http://www.w3.org/TR/html401/charset.html#h-5.2
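To make the idea of BOM-based guessing concrete, here is a minimal standalone sketch that compares the leading bytes of an input against the well-known Unicode BOM signatures. It is not the Okapi implementation and does not call it; the class `BomSketch` and the methods `guessFromBom` and `startsWith` are hypothetical names chosen for illustration.

```java
final class BomSketch {
    /** Hypothetical helper: charset name implied by a leading BOM, or null if no BOM is found. */
    static String guessFromBom(byte[] b) {
        // Test the 4-byte signatures first: the UTF-32LE BOM begins with the UTF-16LE BOM.
        if (startsWith(b, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(b, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(b, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(b, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(b, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    /** True if data begins with the given unsigned byte values. */
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(guessFromBom(utf8)); // prints UTF-8
    }
}
```

The order of the checks matters here: testing the 2-byte UTF-16LE signature before the 4-byte UTF-32LE one would misreport UTF-32LE input as UTF-16LE.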
-
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static class | BOMNewlineEncodingDetector.NewlineType | Defines type friendly newline types.
-
Field Summary
Fields
Modifier and Type | Field | Description
static String | BOCU_1 | BOCU (Binary Ordered Compression for Unicode)
static String | EBCDIC | Java friendly EBCDIC encoding name.
static String | ISO_8859_1 | Java friendly ISO-8859-1 encoding name.
static String | SCSU | SCSU (Standard Compression Scheme for Unicode)
static String | UTF_16 | Java friendly UTF-16 encoding name.
static String | UTF_16BE | Java friendly UTF-16 big endian encoding name.
static String | UTF_16LE | Java friendly UTF-16 little endian encoding name.
static String | UTF_32 | Java friendly UTF-32 encoding name.
static String | UTF_32BE | Java friendly UTF-32 big endian encoding name.
static String | UTF_32LE | Java friendly UTF-32 little endian encoding name.
static String | UTF_7 | Java friendly UTF-7 encoding name.
static String | UTF_8 | Java friendly UTF-8 encoding name.
static String | UTF_EBCDIC | Java friendly UTF-EBCDIC encoding name.
-
Constructor Summary
Constructors
Constructor | Description
BOMNewlineEncodingDetector(InputStream inputStream) | Create a new BOMNewlineEncodingDetector from an InputStream.
BOMNewlineEncodingDetector(InputStream inputStream, String defaultEncoding) | Create a new BOMNewlineEncodingDetector from an InputStream and a user provided encoding.
BOMNewlineEncodingDetector(InputStream inputStream, Charset defaultEncoding) |
-
Method Summary
Methods
Modifier and Type | Method | Description
void | detectAndRemoveBom() |
void | detectBom() |
int | getBomSize() | Gets the number of bytes used by the byte-order mark in this document.
String | getDefaultEncoding() | Get the defaultEncoding set by the user.
String | getEncoding() | Get the guessed encoding or, if the encoding couldn't be guessed, return the user-supplied encoding.
String | getEncodingSpecificationInfo() | Return a short description of the encoding.
InputStream | getInputStream() | Get the input stream passed in to the constructor.
BOMNewlineEncodingDetector.NewlineType | getNewlineType() | Detects newline type using the inputStream itself.
static BOMNewlineEncodingDetector.NewlineType | getNewlineType(CharSequence text) | Static helper method for detecting newline type used in a run of text.
boolean | hasBom() | Does this document have a byte order mark?
boolean | hasUtf7Bom() | Does this document have a UTF-7 byte order mark?
boolean | hasUtf8Bom() | Indicates if the guessed encoding is UTF-8 and this file has a BOM.
boolean | hasUtf8Encoding() |
boolean | isAutodetected() | Indicates if the guessed encoding was auto-detected.
boolean | isDefinitive() | Are we confident of the document encoding?
void | setDefaultEncoding(String defaultEncoding) | Set the default encoding.
-
-
-
Field Detail
-
UTF_16
public static final String UTF_16
Java friendly UTF-16 encoding name.
See Also:
- Constant Field Values
-
UTF_16BE
public static final String UTF_16BE
Java friendly UTF-16 big endian encoding name.
See Also:
- Constant Field Values
-
UTF_16LE
public static final String UTF_16LE
Java friendly UTF-16 little endian encoding name.
See Also:
- Constant Field Values
-
UTF_8
public static final String UTF_8
Java friendly UTF-8 encoding name.
See Also:
- Constant Field Values
-
ISO_8859_1
public static final String ISO_8859_1
Java friendly ISO-8859-1 encoding name.
See Also:
- Constant Field Values
-
EBCDIC
public static final String EBCDIC
Java friendly EBCDIC encoding name.
See Also:
- Constant Field Values
-
SCSU
public static final String SCSU
SCSU (Standard Compression Scheme for Unicode)
See Also:
- Constant Field Values
-
UTF_7
public static final String UTF_7
Java friendly UTF-7 encoding name.
See Also:
- Constant Field Values
-
UTF_EBCDIC
public static final String UTF_EBCDIC
Java friendly UTF-EBCDIC encoding name.
See Also:
- Constant Field Values
-
BOCU_1
public static final String BOCU_1
BOCU (Binary Ordered Compression for Unicode)
See Also:
- Constant Field Values
-
UTF_32
public static final String UTF_32
Java friendly UTF-32 encoding name.
See Also:
- Constant Field Values
-
UTF_32BE
public static final String UTF_32BE
Java friendly UTF-32 big endian encoding name.
See Also:
- Constant Field Values
-
UTF_32LE
public static final String UTF_32LE
Java friendly UTF-32 little endian encoding name.
See Also:
- Constant Field Values
-
-
Constructor Detail
-
BOMNewlineEncodingDetector
public BOMNewlineEncodingDetector(InputStream inputStream)
Create a new BOMNewlineEncodingDetector from an InputStream. Cannot detect BOMNewlineEncodingDetector.NewlineType unless a valid encoding is detected.
Parameters:
inputStream - the input stream
-
BOMNewlineEncodingDetector
public BOMNewlineEncodingDetector(InputStream inputStream, String defaultEncoding)
Create a new BOMNewlineEncodingDetector from an InputStream and a user provided encoding. This BOMNewlineEncodingDetector can convert the input bytes to Unicode for detection of the BOMNewlineEncodingDetector.NewlineType.
Parameters:
inputStream - the input stream
defaultEncoding - the default encoding
-
BOMNewlineEncodingDetector
public BOMNewlineEncodingDetector(InputStream inputStream, Charset defaultEncoding)
-
-
Method Detail
-
getNewlineType
public static BOMNewlineEncodingDetector.NewlineType getNewlineType(CharSequence text)
Static helper method for detecting the newline type used in a run of text.
Parameters:
text - text which includes newlines.
Returns:
- the detected or guessed BOMNewlineEncodingDetector.NewlineType
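Detecting the newline type in a run of text can be sketched as a scan for the first line break. The example below is an assumption-laden illustration, not the Okapi code: the enum `Newline` is a hypothetical stand-in for BOMNewlineEncodingDetector.NewlineType (whose actual constants are not listed on this page), and `firstNewline` is an invented name.

```java
final class NewlineSketch {
    /** Hypothetical stand-in for a newline-type enum; not the Okapi NewlineType. */
    enum Newline { CRLF, LF, CR, NONE }

    /** Classifies the first line break found in the text, or NONE if there is none. */
    static Newline firstNewline(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\n') {
                return Newline.LF;
            }
            if (c == '\r') {
                // A carriage return directly followed by a line feed is one CRLF break.
                boolean followedByLf = i + 1 < text.length() && text.charAt(i + 1) == '\n';
                return followedByLf ? Newline.CRLF : Newline.CR;
            }
        }
        return Newline.NONE;
    }

    public static void main(String[] args) {
        System.out.println(firstNewline("first\r\nsecond")); // prints CRLF
    }
}
```

This sketch trusts the first break it sees; a real detector might prefer to sample more of the text or fall back to a platform default when no break is found.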
-
getNewlineType
public BOMNewlineEncodingDetector.NewlineType getNewlineType()
Detects newline type using the inputStream itself.
Returns:
- the detected or guessed BOMNewlineEncodingDetector.NewlineType
-
getInputStream
public InputStream getInputStream()
Get the input stream passed in to the constructor.
Returns:
- the InputStream
-
getEncoding
public String getEncoding()
Get the guessed encoding or, if the encoding couldn't be guessed, return the user-supplied encoding. If no user-supplied encoding is found, use ISO_8859_1.
Returns:
- the guessed or user-supplied encoding.
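The fallback order described for getEncoding() (guessed encoding first, then the user-supplied default, then ISO-8859-1) can be sketched as below. This is only an illustration of that order under the assumption that "not guessed" and "not supplied" are represented by null or an empty string; `EncodingFallbackSketch` and `resolve` are hypothetical names, not part of the Okapi API.

```java
import java.nio.charset.StandardCharsets;

final class EncodingFallbackSketch {
    /**
     * Hypothetical helper mirroring the documented fallback order:
     * guessed encoding, then user default, then ISO-8859-1.
     */
    static String resolve(String guessed, String userDefault) {
        if (guessed != null && !guessed.isEmpty()) {
            return guessed;
        }
        if (userDefault != null && !userDefault.isEmpty()) {
            return userDefault;
        }
        // Last resort, matching the documented ISO_8859_1 fallback.
        return StandardCharsets.ISO_8859_1.name();
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, "windows-1252")); // prints windows-1252
        System.out.println(resolve(null, null));           // prints ISO-8859-1
    }
}
```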
-
getEncodingSpecificationInfo
public String getEncodingSpecificationInfo()
Return a short description of the encoding.
Returns:
- String containing the specification.
-
isDefinitive
public boolean isDefinitive()
Are we confident of the document encoding?
Returns:
- true if the encoding is obvious from the BOM or bytes, false if the encoding must be guessed.
-
detectBom
public void detectBom()
-
detectAndRemoveBom
public void detectAndRemoveBom()
-
getDefaultEncoding
public String getDefaultEncoding()
Get the defaultEncoding set by the user.
Returns:
- String representation of the encoding
-
setDefaultEncoding
public void setDefaultEncoding(String defaultEncoding)
Set the default encoding.
Parameters:
defaultEncoding - default encoding
-
hasBom
public boolean hasBom()
Does this document have a byte order mark?
Returns:
- true if there is a BOM, false otherwise.
-
hasUtf8Bom
public boolean hasUtf8Bom()
Indicates if the guessed encoding is UTF-8 and this file has a BOM.
Returns:
- True if the guessed encoding is UTF-8 and this file has a BOM, false otherwise.
-
hasUtf7Bom
public boolean hasUtf7Bom()
Does this document have a UTF-7 byte order mark?
Returns:
- true if there is a BOM, false otherwise.
-
isAutodetected
public boolean isAutodetected()
Indicates if the guessed encoding was auto-detected. If not, it is the default encoding that was provided.
Returns:
- True if the guessed encoding was auto-detected, false if not.
-
getBomSize
public int getBomSize()
Gets the number of bytes used by the byte-order mark in this document.
Returns:
- The byte size of the BOM in this document.
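One common use for a BOM size is to skip those bytes before decoding, since Java's UTF-8 decoder otherwise leaves the BOM in the text as U+FEFF. The sketch below shows that idea with standard-library calls only; `BomSizeSketch`, `bomSize`, and `decodeSkippingBom` are hypothetical names, and the sizes cover just the UTF-8, UTF-16, and UTF-32 signatures rather than everything this class may recognize.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomSizeSketch {
    /** Hypothetical helper: length in bytes of a leading UTF-8/16/32 BOM, or 0 if none. */
    static int bomSize(byte[] b) {
        if (hasPrefix(b, 0xFF, 0xFE, 0x00, 0x00) || hasPrefix(b, 0x00, 0x00, 0xFE, 0xFF)) return 4; // UTF-32
        if (hasPrefix(b, 0xEF, 0xBB, 0xBF)) return 3; // UTF-8
        if (hasPrefix(b, 0xFE, 0xFF) || hasPrefix(b, 0xFF, 0xFE)) return 2; // UTF-16
        return 0;
    }

    static boolean hasPrefix(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    /** Decodes data with the given charset, skipping any leading BOM bytes first. */
    static String decodeSkippingBom(byte[] data, Charset charset) {
        int skip = bomSize(data);
        return new String(data, skip, data.length - skip, charset);
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(bomSize(utf8));                                    // prints 3
        System.out.println(decodeSkippingBom(utf8, StandardCharsets.UTF_8)); // prints hi
    }
}
```

The caller is responsible for passing a charset that matches the BOM; the sketch does not check that the two agree.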
-
hasUtf8Encoding
public boolean hasUtf8Encoding()
-
-