Byte order mark screws up file reading in Java

107

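I'm trying to read CSV files using Java. Some of the files may have a byte order mark at the beginning, but not all. When present, the byte order mark gets read along with the rest of the first line, which causes problems with string comparisons.

Is there an easy way to skip the byte order mark when it is present?

Thanks!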

java utf-8 byte-order-mark

— Tom

Maybe: rgagnon.com/javadetails/java-handle-utf8-file-with-bom.html

— Chris

114

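EDIT: I've made a proper release on GitHub: https://github.com/gpakosz/UnicodeBOMInputStream

Here is a class I coded a while ago; I only edited the package name before pasting. Nothing special, it is quite similar to the solutions posted in SUN's bug database. Incorporate it into your code and you're fine.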

/* ____________________________________________________________________________
 * 
 * File:    UnicodeBOMInputStream.java
 * Author:  Gregory Pakosz.
 * Date:    02 - November - 2005    
 * ____________________________________________________________________________
 */
package com.stackoverflow.answer;

import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/**
 * The <code>UnicodeBOMInputStream</code> class wraps any
 * <code>InputStream</code> and detects the presence of any Unicode BOM
 * (Byte Order Mark) at its beginning, as defined by
 * <a href="http://www.faqs.org/rfcs/rfc3629.html">RFC 3629 - UTF-8, a transformation format of ISO 10646</a>
 * 
 * <p>The
 * <a href="http://www.unicode.org/unicode/faq/utf_bom.html">Unicode FAQ</a>
 * defines 5 types of BOMs:<ul>
 * <li><pre>00 00 FE FF  = UTF-32, big-endian</pre></li>
 * <li><pre>FF FE 00 00  = UTF-32, little-endian</pre></li>
 * <li><pre>FE FF        = UTF-16, big-endian</pre></li>
 * <li><pre>FF FE        = UTF-16, little-endian</pre></li>
 * <li><pre>EF BB BF     = UTF-8</pre></li>
 * </ul></p>
 * 
 * <p>Use the {@link #getBOM()} method to know whether a BOM has been detected
 * or not.
 * </p>
 * <p>Use the {@link #skipBOM()} method to remove the detected BOM from the
 * wrapped <code>InputStream</code> object.</p>
 */
public class UnicodeBOMInputStream extends InputStream
{
  /**
   * Type safe enumeration class that describes the different types of Unicode
   * BOMs.
   */
  public static final class BOM
  {
    /**
     * NONE.
     */
    public static final BOM NONE = new BOM(new byte[]{},"NONE");

    /**
     * UTF-8 BOM (EF BB BF).
     */
    public static final BOM UTF_8 = new BOM(new byte[]{(byte)0xEF,
                                                       (byte)0xBB,
                                                       (byte)0xBF},
                                            "UTF-8");

    /**
     * UTF-16, little-endian (FF FE).
     */
    public static final BOM UTF_16_LE = new BOM(new byte[]{ (byte)0xFF,
                                                            (byte)0xFE},
                                                "UTF-16 little-endian");

    /**
     * UTF-16, big-endian (FE FF).
     */
    public static final BOM UTF_16_BE = new BOM(new byte[]{ (byte)0xFE,
                                                            (byte)0xFF},
                                                "UTF-16 big-endian");

    /**
     * UTF-32, little-endian (FF FE 00 00).
     */
    public static final BOM UTF_32_LE = new BOM(new byte[]{ (byte)0xFF,
                                                            (byte)0xFE,
                                                            (byte)0x00,
                                                            (byte)0x00},
                                                "UTF-32 little-endian");

    /**
     * UTF-32, big-endian (00 00 FE FF).
     */
    public static final BOM UTF_32_BE = new BOM(new byte[]{ (byte)0x00,
                                                            (byte)0x00,
                                                            (byte)0xFE,
                                                            (byte)0xFF},
                                                "UTF-32 big-endian");

    /**
     * Returns a <code>String</code> representation of this <code>BOM</code>
     * value.
     */
    public final String toString()
    {
      return description;
    }

    /**
     * Returns the bytes corresponding to this <code>BOM</code> value.
     */
    public final byte[] getBytes()
    {
      final int     length = bytes.length;
      final byte[]  result = new byte[length];

      // Make a defensive copy
      System.arraycopy(bytes,0,result,0,length);

      return result;
    }

    private BOM(final byte bom[], final String description)
    {
      assert(bom != null)               : "invalid BOM: null is not allowed";
      assert(description != null)       : "invalid description: null is not allowed";
      assert(description.length() != 0) : "invalid description: empty string is not allowed";

      this.bytes          = bom;
      this.description  = description;
    }

            final byte    bytes[];
    private final String  description;

  } // BOM

  /**
   * Constructs a new <code>UnicodeBOMInputStream</code> that wraps the
   * specified <code>InputStream</code>.
   * 
   * @param inputStream an <code>InputStream</code>.
   * 
   * @throws NullPointerException when <code>inputStream</code> is
   * <code>null</code>.
   * @throws IOException on reading from the specified <code>InputStream</code>
   * when trying to detect the Unicode BOM.
   */
  public UnicodeBOMInputStream(final InputStream inputStream) throws  NullPointerException,
                                                                      IOException

  {
    if (inputStream == null)
      throw new NullPointerException("invalid input stream: null is not allowed");

    in = new PushbackInputStream(inputStream,4);

    final byte  bom[] = new byte[4];
    final int   read  = in.read(bom);

    switch(read)
    {
      case 4:
        if ((bom[0] == (byte)0xFF) &&
            (bom[1] == (byte)0xFE) &&
            (bom[2] == (byte)0x00) &&
            (bom[3] == (byte)0x00))
        {
          this.bom = BOM.UTF_32_LE;
          break;
        }
        else
        if ((bom[0] == (byte)0x00) &&
            (bom[1] == (byte)0x00) &&
            (bom[2] == (byte)0xFE) &&
            (bom[3] == (byte)0xFF))
        {
          this.bom = BOM.UTF_32_BE;
          break;
        }

      case 3:
        if ((bom[0] == (byte)0xEF) &&
            (bom[1] == (byte)0xBB) &&
            (bom[2] == (byte)0xBF))
        {
          this.bom = BOM.UTF_8;
          break;
        }

      case 2:
        if ((bom[0] == (byte)0xFF) &&
            (bom[1] == (byte)0xFE))
        {
          this.bom = BOM.UTF_16_LE;
          break;
        }
        else
        if ((bom[0] == (byte)0xFE) &&
            (bom[1] == (byte)0xFF))
        {
          this.bom = BOM.UTF_16_BE;
          break;
        }

      default:
        this.bom = BOM.NONE;
        break;
    }

    if (read > 0)
      in.unread(bom,0,read);
  }

  /**
   * Returns the <code>BOM</code> that was detected in the wrapped
   * <code>InputStream</code> object.
   * 
   * @return a <code>BOM</code> value.
   */
  public final BOM getBOM()
  {
    // BOM type is immutable.
    return bom;
  }

  /**
   * Skips the <code>BOM</code> that was found in the wrapped
   * <code>InputStream</code> object.
   * 
   * @return this <code>UnicodeBOMInputStream</code>.
   * 
   * @throws IOException when trying to skip the BOM from the wrapped
   * <code>InputStream</code> object.
   */
  public final synchronized UnicodeBOMInputStream skipBOM() throws IOException
  {
    if (!skipped)
    {
      in.skip(bom.bytes.length);
      skipped = true;
    }
    return this;
  }

  /**
   * {@inheritDoc}
   */
  public int read() throws IOException
  {
    return in.read();
  }

  /**
   * {@inheritDoc}
   */
  public int read(final byte b[]) throws  IOException,
                                          NullPointerException
  {
    return in.read(b,0,b.length);
  }

  /**
   * {@inheritDoc}
   */
  public int read(final byte b[],
                  final int off,
                  final int len) throws IOException,
                                        NullPointerException
  {
    return in.read(b,off,len);
  }

  /**
   * {@inheritDoc}
   */
  public long skip(final long n) throws IOException
  {
    return in.skip(n);
  }

  /**
   * {@inheritDoc}
   */
  public int available() throws IOException
  {
    return in.available();
  }

  /**
   * {@inheritDoc}
   */
  public void close() throws IOException
  {
    in.close();
  }

  /**
   * {@inheritDoc}
   */
  public synchronized void mark(final int readlimit)
  {
    in.mark(readlimit);
  }

  /**
   * {@inheritDoc}
   */
  public synchronized void reset() throws IOException
  {
    in.reset();
  }

  /**
   * {@inheritDoc}
   */
  public boolean markSupported() 
  {
    return in.markSupported();
  }

  private final PushbackInputStream in;
  private final BOM                 bom;
  private       boolean             skipped = false;

} // UnicodeBOMInputStream

And you use it like this:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public final class UnicodeBOMInputStreamUsage
{
  public static void main(final String[] args) throws Exception
  {
    FileInputStream fis = new FileInputStream("test/offending_bom.txt");
    UnicodeBOMInputStream ubis = new UnicodeBOMInputStream(fis);

    System.out.println("detected BOM: " + ubis.getBOM());

    System.out.print("Reading the content of the file without skipping the BOM: ");
    InputStreamReader isr = new InputStreamReader(ubis);
    BufferedReader br = new BufferedReader(isr);

    System.out.println(br.readLine());

    br.close();
    isr.close();
    ubis.close();
    fis.close();

    fis = new FileInputStream("test/offending_bom.txt");
    ubis = new UnicodeBOMInputStream(fis);
    isr = new InputStreamReader(ubis);
    br = new BufferedReader(isr);

    ubis.skipBOM();

    System.out.print("Reading the content of the file after skipping the BOM: ");
    System.out.println(br.readLine());

    br.close();
    isr.close();
    ubis.close();
    fis.close();
  }

} // UnicodeBOMInputStreamUsage

— Gregory Pakosz

2

Sorry for the long scrolling area; too bad there is no attachment feature.

— Gregory Pakosz 2009

Thanks Gregory, that's just what I'm looking for.

— Tom

3

This should be in the core Java API.

— Denis Kniazhev

7

10 years have passed and I'm still receiving karma for this :D I'm looking at you, Java!

— Gregory Pakosz

1

Upvoted because the answer provides some history on why the file input stream doesn't offer an option to discard the BOM by default.

— MxLDevs 2018

94

The Apache Commons IO library has an InputStream that can detect and discard BOMs: BOMInputStream (javadoc):

BOMInputStream bomIn = new BOMInputStream(in);
int firstNonBOMByte = bomIn.read(); // Skips BOM
if (bomIn.hasBOM()) {
    // has a UTF-8 BOM
}

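If you also need to detect different encodings, it can distinguish among the various byte order marks, e.g. UTF-8 vs. UTF-16 big and little endian (details at the doc link above). You can then use the detected ByteOrderMark to choose a Charset for decoding the stream. (There is probably a more streamlined way to do this if you need all of this functionality; maybe the UnicodeReader in BalusC's answer?) Note that, in general, there is no very good way to detect which encoding some bytes are in, but if the stream starts with a BOM, that can apparently help.

Edit: If you need to detect the BOM in UTF-16, UTF-32, etc., the constructor call should be: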

new BOMInputStream(is, ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE,
        ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE)

Upvote @martin-charlesworth's comment :)

— Rescdsk

Just skips the BOM. Should be the perfect solution for 99% of use cases.

— atamanroman

7

I used this answer successfully. However, I would respectfully add the boolean argument that specifies whether to include or exclude the BOM. Example: BOMInputStream bomIn = new BOMInputStream(in, false); // don't include the BOM

— Kevin Meredith, 2013

19

I would also add that this only detects the UTF-8 BOM. If you want to detect all UTF-X BOMs, you need to pass them to the BOMInputStream constructor:

BOMInputStream bomIn = new BOMInputStream(is, ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE,
        ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE);

— Martin Charlesworth

Regarding @KevinMeredith's comment: the constructor with the boolean is clearer, but the default constructor already strips the UTF-8 BOM, as the JavaDoc says: BOMInputStream(InputStream delegate) Constructs a new BOM InputStream that excludes a ByteOrderMark.UTF_8 BOM.

— WesternGun

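Skipping solves most of my problems. If my file starts with a UTF_16BE BOM, can I create an InputReader by skipping the BOM and reading the file as UTF_8? So far it works, but I want to understand whether there are any edge cases. Thanks in advance.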

— Bhaskar
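
One edge case can be checked directly. This is a minimal sketch, assuming the file content really is UTF-16BE encoded; the class name WrongCharsetDemo and its method are made up for illustration. In UTF-16BE every ASCII character carries a leading 0x00 byte, and decoding those bytes as UTF-8 keeps each 0x00 as a NUL character in the resulting string:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class WrongCharsetDemo {
    /** Encodes text with the charset the file really uses, then decodes with the assumed one. */
    static String reinterpret(String text, Charset actual, Charset assumed) {
        byte[] raw = text.getBytes(actual);
        return new String(raw, assumed);
    }

    public static void main(String[] args) {
        String wrong = reinterpret("hi", StandardCharsets.UTF_16BE, StandardCharsets.UTF_8);
        String right = reinterpret("hi", StandardCharsets.UTF_16BE, StandardCharsets.UTF_16BE);
        System.out.println(wrong.length()); // 4: NUL, 'h', NUL, 'i'
        System.out.println(right.length()); // 2
    }
}
```

So skipping the BOM is not enough on its own; the reader should be opened with the charset that the BOM indicates. If reading as UTF-8 produces clean text, the content after the BOM is probably not UTF-16BE in the first place.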

31

Simpler solution:

import java.io.IOException;
import java.io.Reader;

public class BOMSkipper
{
    public static void skip(Reader reader) throws IOException
    {
        reader.mark(1);
        char[] possibleBOM = new char[1];
        reader.read(possibleBOM);

        if (possibleBOM[0] != '\ufeff')
        {
            reader.reset();
        }
    }
}

Usage sample:

BufferedReader input = new BufferedReader(new InputStreamReader(new FileInputStream(file), fileExpectedCharset));
BOMSkipper.skip(input);
//Now UTF prefix not present:
input.readLine();
...

It works with all 5 UTF encodings!

1

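Very nice, Andrei. But could you explain why it works? How does the pattern 0xFEFF successfully match UTF-8 files, which seem to have a different pattern with 3 bytes instead of 2? And how can that pattern match both endiannesses of UTF-16 and UTF-32?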

— Vahid Pazirandeh

1

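As you can see, I don't use a byte stream but a character stream opened with the expected charset. So if the first character of this stream is a BOM, I skip it. The BOM can have a different byte representation in each encoding, but it is a single character. Please read this article, it helped me: joelonsoftware.com/articles/Unicode.html

Nice solution; just be sure to check that the file is not empty, to avoid an IOException in the skip method before reading. You can do that by calling if (reader.ready()) { reader.read(possibleBOM) ... }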
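
The "one character, several byte representations" point can be seen by encoding U+FEFF with different charsets. A minimal sketch; the class BomBytesDemo and its helper names are made up for illustration, and only the charsets guaranteed by StandardCharsets are used:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomBytesDemo {
    /** Formats bytes as upper-case hex pairs separated by spaces. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes the single character U+FEFF with the given charset. */
    static String bomBytes(Charset charset) {
        return hex("\uFEFF".getBytes(charset));
    }

    /** Decodes raw bytes and reports whether the result is exactly the one character U+FEFF. */
    static boolean decodesToBomChar(byte[] raw, Charset charset) {
        return new String(raw, charset).equals("\uFEFF");
    }

    public static void main(String[] args) {
        System.out.println(bomBytes(StandardCharsets.UTF_8));    // EF BB BF
        System.out.println(bomBytes(StandardCharsets.UTF_16BE)); // FE FF
        System.out.println(bomBytes(StandardCharsets.UTF_16LE)); // FF FE
    }
}
```

Going the other way, a Reader opened with the matching charset turns each of those byte sequences back into the single char '\ufeff', which is why a one-character comparison is enough.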

— Snow
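
For the empty-input case, one possible variant reads a single char with read(), which returns -1 at end of stream, so the empty case can be handled from the return value. This is only a sketch: SafeBomSkipper, skipBom and firstLine are hypothetical names, not part of the answer above, and the wrapping in BufferedReader is an assumption for readers that do not support mark().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class SafeBomSkipper {
    /** Returns a reader positioned after a leading U+FEFF, if there is one. */
    static Reader skipBom(Reader reader) throws IOException {
        // mark/reset is optional for Readers, so wrap when it is unsupported.
        Reader r = reader.markSupported() ? reader : new BufferedReader(reader);
        r.mark(1);
        int first = r.read(); // -1 means the stream is empty
        if (first != -1 && first != 0xFEFF) {
            r.reset(); // not a BOM: put the character back
        }
        return r;
    }

    /** Demo helper: first line of in-memory text after BOM skipping, or null when empty. */
    static String firstLine(String content) {
        try (BufferedReader br = new BufferedReader(skipBom(new StringReader(content)))) {
            return br.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Callers should keep using the returned reader rather than the one they passed in, since it may be a new wrapper.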

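It seems you have covered 0xFE 0xFF, which is the byte order mark for UTF-16BE. But what if the first 3 bytes are 0xEF 0xBB 0xBF (the byte order mark for UTF-8)? You claim that this works for all UTF formats. That could be true (I haven't tested your code), but then how does it work?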

— bvdb

1

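See my answer to Vahid: I open a character stream, not a byte stream, and read one character from it. It doesn't matter which UTF encoding the file uses: the BOM prefix can be represented by a different number of bytes, but in terms of characters it is just one character.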

24

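The Google Data API has a UnicodeReader which automatically detects the encoding.

You can use it instead of InputStreamReader. Here is a slightly compacted extract of its source, which is pretty straightforward:

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

public class UnicodeReader extends Reader {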
    private static final int BOM_SIZE = 4;
    private final InputStreamReader reader;

    /**
     * Construct UnicodeReader
     * @param in Input stream.
     * @param defaultEncoding Default encoding to be used if BOM is not found,
     * or <code>null</code> to use system default encoding.
     * @throws IOException If an I/O error occurs.
     */
    public UnicodeReader(InputStream in, String defaultEncoding) throws IOException {
        byte bom[] = new byte[BOM_SIZE];
        String encoding;
        int unread;
        PushbackInputStream pushbackStream = new PushbackInputStream(in, BOM_SIZE);
        int n = pushbackStream.read(bom, 0, bom.length);

        // Read ahead four bytes and check for BOM marks.
        if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB) && (bom[2] == (byte) 0xBF)) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00) && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
            encoding = "UTF-32LE";
            unread = n - 4;
        } else {
            encoding = defaultEncoding;
            unread = n;
        }

        // Unread bytes if necessary and skip BOM marks.
        if (unread > 0) {
            pushbackStream.unread(bom, (n - unread), unread);
        } else if (unread < -1) {
            pushbackStream.unread(bom, 0, 0);
        }

        // Use given encoding.
        if (encoding == null) {
            reader = new InputStreamReader(pushbackStream);
        } else {
            reader = new InputStreamReader(pushbackStream, encoding);
        }
    }

    public String getEncoding() {
        return reader.getEncoding();
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        return reader.read(cbuf, off, len);
    }

    public void close() throws IOException {
        reader.close();
    }
}

— BalusC

The link seems to say that the Google Data API is deprecated. Where should one look for the Google Data API now?

— SOUser

1

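@XichenLi: The GData API has been deprecated for its intended purpose. I didn't mean to suggest using the GData API directly (the OP isn't using any GData service); I meant to take its source code as an example for your own implementation. That's also why I included it in my answer, ready for copy-paste.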

— BalusC

There is a bug in this. The UTF-32LE case is unreachable. For (bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00) to be true, the UTF-16LE case ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) would already have matched.

— Joshua Taylor

Since this code comes from the Google Data API, I posted issue 471 about it.

— Joshua Taylor
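
One way around the ordering problem is to test the 4-byte marks before the 2-byte ones. The following is a sketch only, not the GData code; BomSniffer and detect are hypothetical names, and n stands for the number of bytes actually read into the buffer:

```java
final class BomSniffer {
    /**
     * Returns the charset name implied by a leading BOM, or null if none is found.
     * Only the first n bytes of head are considered valid.
     */
    static String detect(byte[] head, int n) {
        // Longest marks first: FF FE 00 00 also starts with the UTF-16LE mark FF FE.
        if (n >= 4 && head[0] == (byte) 0x00 && head[1] == (byte) 0x00
                && head[2] == (byte) 0xFE && head[3] == (byte) 0xFF) {
            return "UTF-32BE";
        }
        if (n >= 4 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE
                && head[2] == (byte) 0x00 && head[3] == (byte) 0x00) {
            return "UTF-32LE";
        }
        if (n >= 3 && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF) {
            return "UTF-8";
        }
        if (n >= 2 && head[0] == (byte) 0xFE && head[1] == (byte) 0xFF) {
            return "UTF-16BE";
        }
        if (n >= 2 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }
}
```

The remaining ambiguity is inherent in the byte patterns: a UTF-16LE file whose first character after the BOM is U+0000 also starts with FF FE 00 00 and would be reported as UTF-32LE here. Checking n also avoids matching against bytes that were never read when the stream is shorter than four bytes.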

13

The Apache Commons IO library's BOMInputStream has already been mentioned by @rescdsk, but I did not see it mentioned how to get an InputStream without the BOM.

Here is how I did it in Scala:

 import java.io._
 import org.apache.commons.io.input.BOMInputStream
 val file = new File(path_to_xml_file_with_BOM)
 val fileInpStream = new FileInputStream(file)   
 val bomIn = new BOMInputStream(fileInpStream, 
         false); // false means don't include BOM

— Kevin Meredith

The single-argument constructor already does that: public BOMInputStream(InputStream delegate) { this(delegate, false, ByteOrderMark.UTF_8); }. It excludes the UTF-8 BOM by default.

— Vladimir Vagaytsev

Good point, Vladimir. I see that in its docs - commons.apache.org/proper/commons-io/javadocs/api-2.2/org/... : Constructs a new BOM InputStream that excludes a ByteOrderMark.UTF_8 BOM.

— Kevin Meredith

4

I would recommend Apache Commons IO for simply removing the BOM characters from your file.

public BOMInputStream(InputStream delegate,
              boolean include)
Constructs a new BOM InputStream that detects a a ByteOrderMark.UTF_8 and optionally includes it.
Parameters:
delegate - the InputStream to delegate to
include - true to include the UTF-8 BOM or false to exclude it

Setting include to false will exclude the BOM characters.

— Andreas Baaserud

2

Unfortunately not. You'll have to identify it and skip it yourself. This page details what you have to watch for. Also see this SO question for more details.

— Brian Agnew

1

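I had the same problem, and because I wasn't reading in a lot of files I went for a simpler solution. I think my encoding was UTF-8, because when I printed the offending character with the help of this page, Get unicode value of a character, it turned out to be \ufeff. I used the code System.out.println( "\\u" + Integer.toHexString(str.charAt(0) | 0x10000).substring(1) ); to print the offending unicode value.

Once I had the offending unicode value, I replaced it in the first line of my file before going on with reading. The business logic of that section: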

String str = reader.readLine().trim();
str = str.replace("\ufeff", "");

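This fixed my problem. After that I was able to go on processing the file with no issues. I added trim() just in case of leading or trailing whitespace; you can do that or not, depending on your specific needs.

— Amy B Higgins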
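
A slightly narrower variant of the same idea removes U+FEFF only when it is the very first character of the line, so any later occurrence inside the data is left alone. A minimal sketch; FirstLineBom and stripLeadingBom are hypothetical names:

```java
final class FirstLineBom {
    /** Removes a single leading U+FEFF from the line; null and other strings pass through. */
    static String stripLeadingBom(String line) {
        if (line != null && line.startsWith("\uFEFF")) {
            return line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) {
        System.out.println(stripLeadingBom("\uFEFFid,name")); // id,name
        // trim() alone does not help: it only strips characters up to U+0020.
        System.out.println("\uFEFFx".trim().length());        // 2
    }
}
```

This only works when the reader decoded the file with the right charset, so that the BOM actually arrives as the single character U+FEFF.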

1

That didn't work for me, but I used .replaceFirst("\u00EF\u00BB\u00BF", "") instead.

— StackUMan
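
A possible explanation for why one pattern works and not the other, offered as a guess since the comment does not say which charset the reader used: if a UTF-8 file is decoded with a single-byte charset such as ISO-8859-1, the three BOM bytes become three separate characters instead of one U+FEFF. A small sketch (MisdecodedBomDemo is a made-up name):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MisdecodedBomDemo {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Decodes the UTF-8 BOM bytes with the given charset. */
    static String decodeBom(Charset charset) {
        return new String(UTF8_BOM, charset);
    }

    public static void main(String[] args) {
        // Decoded as UTF-8: one character, U+FEFF.
        System.out.println(decodeBom(StandardCharsets.UTF_8).length());      // 1
        // Decoded as ISO-8859-1: three characters, U+00EF U+00BB U+00BF.
        System.out.println(decodeBom(StandardCharsets.ISO_8859_1).length()); // 3
    }
}
```

If that is what is happening, the rest of the file is being decoded with the wrong charset too, so opening the reader with UTF-8 explicitly is likely the more robust fix.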