Add an iconv-based codec engine to support all system locale encodings #152997

Description

@serhiy-storchaka

Feature or enhancement

Python ships a fixed set of built-in codecs.
When the system locale uses an encoding that none of them covers, codecs.lookup(), str.encode() and bytes.decode() raise LookupError: unknown encoding. Python may even fail to start, aborting in init_fs_encoding because it cannot find a codec for the filesystem encoding.

These are ordinary, standard locales — zh_TW.euctw, hy_AM.armscii8, ka_GE.georgianps, and others ship with glibc and work in every other program.
Every such encoding is already known to the C library's iconv().
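
To see the gap on a given build, one can ask the codec registry about the codeset part of each locale name above. This is only an illustrative sketch: has_codec is a hypothetical helper, not part of the proposal, and the answer for each name depends on the Python build, so no output is claimed.

```python
import codecs


def has_codec(name):
    """Return True if the codec registry can resolve this encoding name."""
    try:
        codecs.lookup(name)
    except LookupError:
        # This is the "unknown encoding" failure described above.
        return False
    return True


# Codeset suffixes taken from the locale names mentioned in this issue.
for name in ("utf-8", "euctw", "armscii8", "georgianps"):
    print(f"{name}: {'found' if has_codec(name) else 'unknown encoding'}")
```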

Proposal

Add an iconv-based codec engine, so Python can use any encoding iconv() knows, even without a dedicated built-in codec.

  • A last-resort search function, so it never shadows a built-in codec — it only catches otherwise-unknown names.
  • An explicit iconv: prefix (e.g. "iconv:latin1") forces the engine even when a built-in exists.
  • Full codec API: stateless encode/decode with all error handlers, incremental codecs, and stream reader/writer.
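
The lookup order described in the list above could be modelled roughly as follows. This is a sketch under stated assumptions, not the proposed implementation: choose_engine and the iconv_knows predicate are hypothetical names, and the predicate stands in for "iconv() supports this encoding", which the real engine would determine in C.

```python
import codecs

ICONV_PREFIX = "iconv:"  # prefix named in the proposal


def choose_engine(name, iconv_knows):
    """Return (engine, encoding) following the proposed lookup order.

    iconv_knows is a hypothetical callable: name -> bool.
    """
    # Explicit prefix: force the iconv engine, even if a built-in exists.
    if name.lower().startswith(ICONV_PREFIX):
        target = name[len(ICONV_PREFIX):]
        if not iconv_knows(target):
            raise LookupError(f"unknown encoding: {name}")
        return ("iconv", target)

    # Built-in codecs always win, so iconv never shadows them.
    try:
        codecs.lookup(name)
    except LookupError:
        # Last resort: only names nobody else recognises reach iconv.
        if iconv_knows(name):
            return ("iconv", name)
        raise
    return ("builtin", name)


# A stand-in set of names that a platform's iconv() might accept.
fake_iconv = {"latin1", "armscii8"}.__contains__
print(choose_engine("latin1", fake_iconv))        # ('builtin', 'latin1')
print(choose_engine("iconv:latin1", fake_iconv))  # ('iconv', 'latin1')
```

In a real build the same ordering would presumably come from registering the iconv search function after the built-in one, rather than from an explicit try/except.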

Design

  • C: _PyUnicode_DecodeIconv / _PyUnicode_EncodeIconv in Objects/unicodeobject.c, exposed as _codecs.iconv_decode / _codecs.iconv_encode.
    The pivot is native-endian UTF-32 (a raw array of Py_UCS4), so the conversion is independent of wchar_t's representation (correct even where wchar_t is not Unicode, e.g. Solaris/AIX) and keeps a 1:1 code-point/position mapping for error handlers.
  • Python: Lib/encodings/_iconv_codecs.py plus a search function in Lib/encodings/__init__.py, mirroring _win_cp_codecs.py.
  • Build: configure gains an iconv check defining HAVE_ICONV.
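
The 1:1 position mapping that motivates the UTF-32 pivot can be illustrated in pure Python. This is a sketch of the idea only, not of the C code: to_pivot and char_index are hypothetical helpers, the encoding name "iconv:example" is made up, and the failing byte offset is invented for the demonstration.

```python
import sys

# Native-endian UTF-32 without a BOM, comparable to a raw Py_UCS4 array.
PIVOT = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"


def to_pivot(text):
    """Encode text as fixed-width, native-endian UTF-32."""
    return text.encode(PIVOT)


def char_index(byte_offset):
    """Map a byte offset in the pivot buffer to a code-point index."""
    return byte_offset // 4


# Characters that need 1, 3, 4 and 1 bytes in UTF-8.
text = "a\u20ac\U0001f600z"
pivot = to_pivot(text)
assert len(pivot) == 4 * len(text)  # every code point is exactly 4 bytes

# Suppose a converter reported a failure at byte offset 8 of the pivot.
bad = char_index(8)
err = UnicodeEncodeError("iconv:example", text, bad, bad + 1,
                         "illegal sequence")
# err.start indexes the original string directly, so an error handler
# can be given the exact offending character.
print(err.start, hex(ord(text[err.start])))
```

With a variable-width pivot such as UTF-8, the same byte offset would have to be translated back to a character index by rescanning the buffer.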

Prior art

gh-37800 ("Universal Unicode Codec for POSIX iconv", 2003) added an iconvcodec module. It was removed after causing crashes (gh-37835, gh-37840, gh-37869) and build failures across Unixes (gh-38012 IRIX, gh-38019/38020 Solaris, gh-38033 Tru64, gh-38068 HP-UX, gh-38240 OpenBSD, gh-37865 Cygwin). The follow-ups gh-38224, gh-38255 and gh-38297 stalled.
This proposal avoids those pitfalls: it adds no separate extension module (the conversion lives in the core unicodeobject.c), it is gated on a configure check, and it pivots through UTF-32 rather than wchar_t.

It is the POSIX counterpart of gh-123803, which made arbitrary Windows code pages usable as cpXXX encodings by delegating to the OS API.

Availability

Which encodings iconv provides is platform-dependent (glibc, musl, the citrus *BSDs and macOS/GNU libiconv all differ); OpenBSD, whose iconv has only UTF-8, gains nothing.
The feature is active only where iconv() is present. Where it is absent, nothing changes and Python behaves as it does today.
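
To get a rough idea of what a given platform offers, one can list the encodings reported by the iconv command-line utility. This is a hedged sketch: parse_iconv_list and iconv_encodings are hypothetical helpers, the output format of iconv -l differs between implementations, and the utility is not guaranteed to match the iconv() library that Python would be built against.

```python
import shutil
import subprocess


def parse_iconv_list(output):
    """Split `iconv -l` output into a set of upper-cased encoding names.

    Handles both comma-separated names with a trailing "//" and plain
    whitespace-separated aliases.
    """
    names = set()
    for token in output.replace(",", " ").split():
        token = token.rstrip("/")
        if token:
            names.add(token.upper())
    return names


def iconv_encodings():
    """Return the names reported by `iconv -l`, or an empty set."""
    exe = shutil.which("iconv")
    if exe is None:
        return set()  # no iconv utility on this system
    try:
        result = subprocess.run([exe, "-l"], capture_output=True,
                                text=True, errors="replace", timeout=10)
    except (OSError, subprocess.SubprocessError):
        return set()
    return parse_iconv_list(result.stdout)


if __name__ == "__main__":
    available = iconv_encodings()
    for name in ("EUC-TW", "ARMSCII-8", "GEORGIAN-PS"):
        print(name, name in available)
```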
