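When wide (Unicode) strings are read from or written to the Windows Command Prompt through the C++ standard streams (wcin/wcout) under mingw-w64, only ASCII characters get through; anything else, Japanese for example, is not output. For the standard streams to convert encodings through the codecvt<wchar_t,char,mbstate_t> facet, they must first be desynchronized from C standard I/O. In addition, as with C++ file streams in general, codecvt itself has to convert the encoding correctly. To make the default standard codecvt perform that conversion, setting the global C++ locale with the C++ standard library static member function locale::global is not the way; the C locale must be set with the C standard library function setlocale. This article explains why and confirms it in the source code.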
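The setup described above can be sketched as follows. This is a minimal illustration, not code from this site: the helper name prepare_wide_console is hypothetical, and the sketch assumes it is called before any other use of the standard streams. Whether Japanese text then actually appears also depends on the console code page matching the selected C locale.

```cpp
#include <clocale>
#include <ios>
#include <string>

// Hypothetical helper (name chosen for this sketch only).
// Selects the C locale that codecvt<wchar_t,char,mbstate_t> relies on and
// detaches the C++ standard streams from C stdio.
// Returns the locale name reported by setlocale, or an empty string when
// the requested locale is not available.
std::string prepare_wide_console(const char* locale_name = "")
{
    // "" asks for the system locale; on Windows a name such as
    // "japanese_japan.932" can be passed instead.
    const char* selected = std::setlocale(LC_ALL, locale_name);

    // Should happen before the first I/O on the standard streams.
    std::ios_base::sync_with_stdio(false);

    return selected ? std::string(selected) : std::string();
}
```

A typical use would be to call prepare_wide_console() at the top of main and only then write to std::wcout.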

About this site's misunderstanding

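Up to version 1.3.1.X, this site claimed that the standard codecvt<wchar_t,char,mbstate_t> provided by mingw-w64 can handle only ASCII, but I recently realized that this is wrong. I had believed it since before this site existed, so the misunderstanding lasted about ten years, and having published it here I am thoroughly embarrassed. Along with adding this article, I have completely rewritten the related articles.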

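Let me offer an excuse. C++ streams obviously belong to the C++ standard library, and it never occurred to me that the C locale would have to be set with a C standard library function. The misunderstanding came about roughly as follows.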

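The standard leaves codecvt<wchar_t,char,mbstate_t> implementation-defined (JTC1/SC22/WG21 N4659 25.4.1.4/p3). I expected mingw-w64 to follow the Windows system locale and, by default in a Japanese environment, to convert between Shift JIS and UTF-16, but it did not, and nothing other than ASCII could be handled.
I expected to be able to set the locale of codecvt<wchar_t,char,mbstate_t> with the C++ standard library static member function locale::global. However, passing plausible locale names such as "ja_JP.sjis", "japanese_japan.932", or "" changed nothing, and nothing other than ASCII could be handled.
I therefore concluded that the codecvt<wchar_t,char,mbstate_t> of mingw-w64 cannot handle anything other than ASCII, and that handling anything else requires writing a codecvt of one's own.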

The newly gained findings are summarized below.

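The codecvt<wchar_t,char,mbstate_t> of mingw-w64 converts encodings with the C standard library functions mbrtowc/wcrtomb (JTC1/SC22/WG14 N1570 7.29.6.3.2, 7.29.6.3.3). Their locale (the C locale) is set with the C standard library function setlocale (7.11.1.1).
mingw-w64 takes these functions from the MSVC runtime (msvcrt.dll). Passing a Windows-defined locale name (for example "japanese_japan.932") to setlocale sets the C locale; passing "" selects the system locale. codecvt<wchar_t,char,mbstate_t> behaves according to the C locale.
According to the standard, passing a named locale to locale::global also sets the C locale through setlocale at the same time (N4659 25.3.1.5/p2). Unfortunately, the mingw-w64 implementation does not behave that way.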
- The libstdc++ locale problem
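One way to observe the last point on a given toolchain is to install a global C++ locale and then ask setlocale what the C locale is. The following is a hedged diagnostic sketch; the function name c_locale_after_global is hypothetical, and the result for any particular locale name depends on the implementation.

```cpp
#include <clocale>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical diagnostic (name chosen for this sketch only).
// Tries to install `name` as the global C++ locale, then returns the
// C locale name that setlocale reports afterwards.
std::string c_locale_after_global(const char* name)
{
    try {
        std::locale::global(std::locale(name));
    } catch (const std::runtime_error&) {
        // The implementation does not know this locale name; the global
        // C++ locale is left unchanged.
    }

    // A null second argument only queries the current C locale.
    const char* current = std::setlocale(LC_ALL, nullptr);
    return current ? std::string(current) : std::string();
}
```

If the value returned for a Windows locale name is still "C", the global C++ locale was not propagated to the C locale, and an explicit setlocale call is needed.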

Additional note on the wxWidgets library

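This problem is not limited to the C++ standard streams (wcin/wcout, assuming they have been desynchronized from C standard I/O). It applies to C++ file streams (wifstream/wofstream) in general, so it also affects desktop applications built on the wxWidgets library that perform file I/O through wifstream/wofstream. The wxLocale class, which the GNU gettext-based internationalization feature uses, calls the setlocale function with the system locale name from its Init member function, by way of the wxLanguageInfo::TrySetLocale member function and the wxSetlocale function. In other words, when wxWidgets is used, whether wifstream/wofstream with the standard codecvt can convert non-ASCII characters (in a Japanese environment, whether Unicode strings can be converted to Shift JIS) depends on whether the wxLocale internationalization feature is in use.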
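The file stream case can be tried without wxWidgets. The sketch below uses only the standard library; write_wide_text is a hypothetical helper name, and the exact stream state after a failed conversion is an assumption about typical library behavior rather than something this article verifies.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper (name chosen for this sketch only).
// Writes `text` to `path` through wofstream with the default codecvt.
// Returns true when the stream is still in a good state after flushing.
bool write_wide_text(const std::string& path, const std::wstring& text)
{
    std::wofstream out(path.c_str(), std::ios::binary);
    if (!out)
        return false;

    out << text;
    // The codecvt conversion runs when the buffer is written out, so the
    // flush is what reveals a character that could not be converted.
    out.flush();
    return out.good();
}
```

In an application that does not create a wxLocale, calling std::setlocale(LC_ALL, "") once before the first write would take the place of the setlocale call that wxLocale::Init makes.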

Download link

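The latest GCC source can be browsed on its public GitHub mirror. The mingw-w64 compiler is one build of GCC, so the source should be the same. codecvt<wchar_t,char,mbstate_t> lives in the C++ standard library (libstdc++-v3). The source file that defines its member functions (codecvt_members.cc) depends on the --enable-clocale configure option (GCC 14.1.0 Standard C++ Library Manual 2.2 Configure) and exists in four variants: gnu, generic, dragonfly, and vxworks. dragonfly and vxworks are for special targets, gnu is the version that uses glibc, the C standard library supplied by GNU, and generic is the one mingw-w64 uses.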

gcc/libstdc++-v3/config/locale/generic/codecvt_members.cc

Source code

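Here we check the member function definitions of codecvt<wchar_t,char,mbstate_t> in the source code. The do_out member function calls the C standard library function wcrtomb to convert an internal Unicode string to an external multibyte string. The do_in member function calls mbrtowc to convert an external multibyte string to an internal Unicode string.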
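The two C functions can also be exercised on their own, outside any stream. The following sketch converts one character at a time in the same spirit as do_out and do_in; the function names to_multibyte and to_wide are hypothetical, and the error handling (throwing) is a simplification that the facet itself does not use.

```cpp
#include <climits>   // MB_LEN_MAX
#include <cwchar>    // std::wcrtomb, std::mbrtowc, std::mbstate_t
#include <stdexcept>
#include <string>

// Wide -> multibyte, one character per wcrtomb call.
std::string to_multibyte(const std::wstring& wide)
{
    std::string out;
    std::mbstate_t state = std::mbstate_t();
    char buf[MB_LEN_MAX];
    for (wchar_t wc : wide) {
        const std::size_t n = std::wcrtomb(buf, wc, &state);
        if (n == static_cast<std::size_t>(-1))
            throw std::runtime_error("wcrtomb: not representable in the C locale");
        out.append(buf, n);
    }
    return out;
}

// Multibyte -> wide, one character per mbrtowc call.
std::wstring to_wide(const std::string& bytes)
{
    std::wstring out;
    std::mbstate_t state = std::mbstate_t();
    const char* p = bytes.data();
    const char* end = p + bytes.size();
    while (p < end) {
        wchar_t wc;
        std::size_t n = std::mbrtowc(&wc, p, end - p, &state);
        if (n == static_cast<std::size_t>(-1) || n == static_cast<std::size_t>(-2))
            throw std::runtime_error("mbrtowc: invalid or incomplete sequence");
        if (n == 0)      // a null byte was converted; it occupies one byte
            n = 1;
        out.push_back(wc);
        p += n;
    }
    return out;
}
```

Both functions follow the current C locale, so their results for non-ASCII input change once setlocale has selected a multibyte locale.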

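For the purposes of this article the do_out/do_in member functions alone would suffice, but the full source code (commit of January 3, 2022) is shown so that it can be compared with a hand-written codecvt.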

codecvt_members.cc

// std::codecvt implementation details, generic version -*- C++ -*-
 
// Copyright (C) 2002-2022 Free Software Foundation, Inc.
//
// This file is part of the GNU ISO C++ Library.  This library is free
// software; you can redistribute it and/or modify it under the
// terms of the GNU General Public License as published by the
// Free Software Foundation; either version 3, or (at your option)
// any later version.
 
// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
// GNU General Public License for more details.
 
// Under Section 7 of GPL version 3, you are granted additional
// permissions described in the GCC Runtime Library Exception, version
// 3.1, as published by the Free Software Foundation.
 
// You should have received a copy of the GNU General Public License and
// a copy of the GCC Runtime Library Exception along with this program;
// see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
// <http://www.gnu.org/licenses/>.
 
//
// ISO C++ 14882: 22.2.1.5 - Template class codecvt
//
 
// Written by Benjamin Kosnik <bkoz@redhat.com>
 
#include <locale>
#include <cstdlib>  // For MB_CUR_MAX
#include <climits>  // For MB_LEN_MAX
#include <cstring>
 
namespace std _GLIBCXX_VISIBILITY(default)
{
_GLIBCXX_BEGIN_NAMESPACE_VERSION
 
  // Specializations.
#ifdef _GLIBCXX_USE_WCHAR_T
  codecvt_base::result
  codecvt<wchar_t, char, mbstate_t>::
  do_out(state_type& __state, const intern_type* __from,
         const intern_type* __from_end, const intern_type*& __from_next,
         extern_type* __to, extern_type* __to_end,
         extern_type*& __to_next) const
  {
    result __ret = ok;
    // The conversion must be done using a temporary destination buffer
    // since it is not possible to pass the size of the buffer to wcrtomb
    state_type __tmp_state(__state);
 
    // The conversion must be done by calling wcrtomb in a loop rather
    // than using wcsrtombs because wcsrtombs assumes that the input is
    // zero-terminated.
 
    // Either we can upper bound the total number of external characters to
    // something smaller than __to_end - __to or the conversion must be done
    // using a temporary destination buffer since it is not possible to
    // pass the size of the buffer to wcrtomb
    if (MB_CUR_MAX * (__from_end - __from) - (__to_end - __to) <= 0)
      while (__from < __from_end)
        {
          const size_t __conv = wcrtomb(__to, *__from, &__tmp_state);
          if (__conv == static_cast<size_t>(-1))
            {
              __ret = error;
              break;
            }
          __state = __tmp_state;
          __to += __conv;
          __from++;
        }
    else
      {
        extern_type __buf[MB_LEN_MAX];
        while (__from < __from_end && __to < __to_end)
          {
            const size_t __conv = wcrtomb(__buf, *__from, &__tmp_state);
            if (__conv == static_cast<size_t>(-1))
              {
                __ret = error;
                break;
              }
            else if (__conv > static_cast<size_t>(__to_end - __to))
              {
                __ret = partial;
                break;
              }

            memcpy(__to, __buf, __conv);
            __state = __tmp_state;
            __to += __conv;
            __from++;
          }
      }
 
    if (__ret == ok && __from < __from_end)
      __ret = partial;
 
    __from_next = __from;
    __to_next = __to;
    return __ret;
  }
 
  codecvt_base::result
  codecvt<wchar_t, char, mbstate_t>::
  do_in(state_type& __state, const extern_type* __from,
        const extern_type* __from_end, const extern_type*& __from_next,
        intern_type* __to, intern_type* __to_end,
        intern_type*& __to_next) const
  {
    result __ret = ok;
    // This temporary state object is necessary so __state won't be modified
    // if [__from, __from_end) is a partial multibyte character.
    state_type __tmp_state(__state);
 
    // Conversion must be done by calling mbrtowc in a loop rather than
    // by calling mbsrtowcs because mbsrtowcs assumes that the input
    // sequence is zero-terminated.
    while (__from < __from_end && __to < __to_end)
      {
        size_t __conv = mbrtowc(__to, __from, __from_end - __from,
                                &__tmp_state);
        if (__conv == static_cast<size_t>(-1))
          {
            __ret = error;
            break;
          }
        else if (__conv == static_cast<size_t>(-2))
          {
            // It is unclear what to return in this case (see DR 382).
            __ret = partial;
            break;
          }
        else if (__conv == 0)
          {
            // XXX Probably wrong for stateful encodings
            __conv = 1;
            *__to = L'\0';
          }

        __state = __tmp_state;
        __to++;
        __from += __conv;
      }
 
    // It is not clear that __from < __from_end implies __ret != ok
    // (see DR 382).
    if (__ret == ok && __from < __from_end)
      __ret = partial;
 
    __from_next = __from;
    __to_next = __to;
    return __ret;
  }
 
  int
  codecvt<wchar_t, char, mbstate_t>::
  do_encoding() const throw()
  {
    // XXX This implementation assumes that the encoding is
    // stateless and is either single-byte or variable-width.
    int __ret = 0;
    if (MB_CUR_MAX == 1)
      __ret = 1;
    return __ret;
  }
 
  int
  codecvt<wchar_t, char, mbstate_t>::
  do_max_length() const throw()
  {
    // XXX Probably wrong for stateful encodings.
    int __ret = MB_CUR_MAX;
    return __ret;
  }
 
  int
  codecvt<wchar_t, char, mbstate_t>::
  do_length(state_type& __state, const extern_type* __from,
            const extern_type* __end, size_t __max) const
  {
    int __ret = 0;
    state_type __tmp_state(__state);
 
    while (__from < __end && __max)
      {
        size_t __conv = mbrtowc(0, __from, __end - __from, &__tmp_state);
        if (__conv == static_cast<size_t>(-1))
          {
            // Invalid source character
            break;
          }
        else if (__conv == static_cast<size_t>(-2))
          {
            // Remainder of input does not form a complete destination
            // character.
            break;
          }
        else if (__conv == 0)
          {
            // XXX Probably wrong for stateful encodings
            __conv = 1;
          }

        __state = __tmp_state;
        __from += __conv;
        __ret += __conv;
        __max--;
      }
 
    return __ret;
  }
#endif
 
_GLIBCXX_END_NAMESPACE_VERSION
} // namespace
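
The facet listed above can also be called directly, without a stream, to see what result code it returns under the current C locale. This is a rough sketch: convert_out and the wide_codecvt typedef are hypothetical names, and the buffer sizing assumes max_length() is a sufficient per-character upper bound.

```cpp
#include <cwchar>
#include <locale>
#include <string>

typedef std::codecvt<wchar_t, char, std::mbstate_t> wide_codecvt;

// Hypothetical helper (name chosen for this sketch only).
// Runs the facet's out() once over `wide`, stores the converted bytes in
// `bytes`, and returns the facet's result code.
std::codecvt_base::result
convert_out(const std::wstring& wide, std::string& bytes)
{
    std::locale loc;   // copy of the current global C++ locale
    const wide_codecvt& cvt = std::use_facet<wide_codecvt>(loc);
    std::mbstate_t state = std::mbstate_t();

    // Room for the worst case, plus one byte so the buffer is never empty.
    std::string buf(wide.size() * cvt.max_length() + 1, '\0');

    const wchar_t* from_next = wide.data();
    char* to_next = &buf[0];
    const std::codecvt_base::result r =
        cvt.out(state, wide.data(), wide.data() + wide.size(), from_next,
                &buf[0], &buf[0] + buf.size(), to_next);

    // Keep only the bytes the facet actually produced.
    bytes.assign(&buf[0], to_next);
    return r;
}
```

A result of std::codecvt_base::error for non-ASCII input would suggest that the C locale is still "C"; comparing the result before and after a setlocale call is a quick way to check the behavior discussed in this article.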
