Feature or enhancement
Python ships a fixed set of built-in codecs.
When the system locale uses an encoding none of them covers, codecs.lookup() / str.encode() / bytes.decode() raise LookupError: unknown encoding, and Python may fail to even start, aborting in init_fs_encoding when it cannot find a codec for the filesystem encoding.
These are ordinary, standard locales — zh_TW.euctw, hy_AM.armscii8, ka_GE.georgianps, and others ship with glibc and work in every other program.
Every such encoding is already known to the C library's iconv().
Proposal
Add an iconv-based codec engine, so Python can use any encoding iconv() knows, even without a dedicated built-in codec.
- A last-resort search function, so it never shadows a built-in codec — it only catches otherwise-unknown names.
- An explicit iconv: prefix (e.g. "iconv:latin1") forces the engine even when a built-in exists.
- Full codec API: stateless encode/decode with all error handlers, incremental codecs, and stream reader/writer.
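The last-resort behaviour described above can be sketched with the existing codecs.register() API. This is only an illustration, not the proposed implementation: fallback_search, consulted and the encoding name "demofallback" are hypothetical, and UTF-8 stands in for the conversion that iconv() would perform.

```python
import codecs

consulted = []  # names that reached the fallback, recorded for illustration


def fallback_search(name):
    """Hypothetical last-resort search function (stand-in for the iconv engine)."""
    consulted.append(name)
    if name == "demofallback":
        # Assumption: reuse an existing codec instead of calling iconv().
        info = codecs.lookup("utf-8")
        return codecs.CodecInfo(
            name="demofallback",
            encode=info.encode,
            decode=info.decode,
            incrementalencoder=info.incrementalencoder,
            incrementaldecoder=info.incrementaldecoder,
            streamreader=info.streamreader,
            streamwriter=info.streamwriter,
        )
    return None  # not known here either, so lookup() raises LookupError


# codecs.register() appends to the search path, so this function is asked
# only after the standard library's own search function has returned None.
codecs.register(fallback_search)

builtin = codecs.lookup("latin1")       # resolved by the stdlib, fallback not asked
custom = codecs.lookup("demofallback")  # no built-in match, fallback answers
```

Because the fallback is consulted only for names nothing else recognises, it cannot change the behaviour of any existing codec name.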
Design
- C: _PyUnicode_DecodeIconv / _PyUnicode_EncodeIconv in Objects/unicodeobject.c, exposed as _codecs.iconv_decode / _codecs.iconv_encode. The pivot is native-endian UTF-32 (a raw array of Py_UCS4), so the conversion is independent of wchar_t's representation (correct even where wchar_t is not Unicode, e.g. Solaris/AIX) and keeps a 1:1 code-point/position mapping for error handlers.
- Python: Lib/encodings/_iconv_codecs.py plus a search function in Lib/encodings/__init__.py, mirroring _win_cp_codecs.py.
- Build: configure gains an iconv check defining HAVE_ICONV.
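The 1:1 code-point/position property of the UTF-32 pivot can be illustrated from Python. This is a small sketch using only the existing utf-32 and utf-16 codecs; the sample string and variable names are made up for the illustration, and it does not exercise the proposed C functions.

```python
import sys

text = "a\U0001F600\u00e9"  # three code points, one outside the BMP
order = "le" if sys.byteorder == "little" else "be"

# Native-endian UTF-32 without a BOM: exactly 4 bytes per code point.
pivot = text.encode("utf-32-" + order)
assert len(pivot) == 4 * len(text)

# A byte offset in the pivot maps directly to a code-point index, which is
# what an error handler needs in order to report start/end positions.
offset = 8
index = offset // 4
code_point = int.from_bytes(pivot[offset:offset + 4], sys.byteorder)
assert chr(code_point) == text[index]

# With 16-bit units (as a 16-bit wchar_t would use), U+1F600 takes a
# surrogate pair, so unit positions no longer match code-point positions.
utf16_units = len(text.encode("utf-16-" + order)) // 2
```

Here utf16_units is 4 for a 3-code-point string, while the UTF-32 pivot keeps one fixed-width slot per code point.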
Prior art
gh-37800 ("Universal Unicode Codec for POSIX iconv", 2003) added an iconvcodec module, but it was removed after causing crashes (gh-37835, gh-37840, gh-37869) and build failures across Unixes (gh-38012 IRIX, gh-38019/38020 Solaris, gh-38033 Tru64, gh-38068 HP-UX, gh-38240 OpenBSD, gh-37865 Cygwin); follow-ups gh-38224/gh-38255/gh-38297 stalled.
This proposal avoids those pitfalls: no separate extension module (the conversion lives in the core unicodeobject.c), gated on a configure check, and pivoting through UTF-32 rather than wchar_t.
It is the POSIX counterpart of gh-123803, which made arbitrary Windows code pages usable as cpXXX encodings by delegating to the OS API.
Availability
Which encodings iconv provides is platform-dependent (glibc, musl, the citrus *BSDs and macOS/GNU libiconv all differ); OpenBSD, whose iconv has only UTF-8, gains nothing.
The feature is active only where iconv() is present; where it is absent, behaviour is unchanged.
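Since the set of available encodings would vary by platform, portable code would probably need to probe for an encoding rather than assume it. A minimal sketch using the existing codecs.lookup() API follows; has_codec is a hypothetical helper, not part of the proposal.

```python
import codecs


def has_codec(name):
    """Return True if this interpreter can resolve the encoding name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


# Hypothetical usage: fall back to UTF-8 when a locale encoding is unavailable.
preferred = "armscii8"
encoding = preferred if has_codec(preferred) else "utf-8"
```

Whether a name such as "armscii8" resolves would depend on the platform's iconv, so the result of the probe is deliberately not assumed here.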
Issues this resolves
Linked PRs