yt-dlp not properly internally handling unicode in titles.
Checklist
- I’m reporting a bug unrelated to a specific site
- I’ve verified that I’m running yt-dlp version 2022.10.04 (update instructions) or later (specify commit)
- I’ve checked that all provided URLs are playable in a browser with the same IP and same login details
- I’ve checked that all URLs and arguments with special characters are properly quoted or escaped
- I’ve searched the bugtracker for similar issues including closed ones. DO NOT post duplicates
- I’ve read the guidelines for opening an issue
Provide a description that is worded well enough to be understood
yt-dlp does not properly handle Unicode in titles internally.
I am trying to download the audio from a video. The title contains a Unicode character, and yt-dlp errors out on that character. If the error came from trying to open a file with an invalid character and hitting a filesystem limitation, I would not consider it a bug, but it seems to happen inside yt-dlp, so I am reporting it as one.
Running with --restrict-filenames bypasses this issue.
If this is expected behavior and gets marked as wontfix, that’s fine (as a developer, I would attempt to fix it myself, though my Unicode experience is weak). This report may still help someone else running into the same issue.
Video is: https://www.youtube.com/watch?v=45Tqi-KYzDE
ERROR: 'latin-1' codec can't encode character '\uff1a' in position 24: ordinal not in range(256)
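For anyone reading along, here is a minimal sketch of what the error message is saying. It only uses the Python standard library. The two file names are hypothetical stand-ins (the real title is not reproduced here), and the helper `encodable` is made up for illustration. It shows that U+FF1A cannot be represented in Latin-1, while an ASCII-only name, which is what --restrict-filenames limits output to, can.

```python
import unicodedata

# Character named in the error message.
char = "\uff1a"
print(f"U+{ord(char):04X} {unicodedata.name(char)}")


def encodable(text: str, encoding: str) -> bool:
    """Return True if `text` can be encoded with the given codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Hypothetical file names, NOT the actual title of the video.
unrestricted_name = "Some title\uff1a part 1.m4a"
ascii_only_name = "Some_title_-_part_1.m4a"

for name in (unrestricted_name, ascii_only_name):
    # ascii() keeps the output printable even on a Latin-1 terminal.
    print(ascii(name),
          "latin-1:", encodable(name, "latin-1"),
          "utf-8:", encodable(name, "utf-8"))
```

The character itself is valid Unicode. The failure only appears once something tries to squeeze it into a single-byte codec.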
Provide verbose output that clearly demonstrates the problem
Complete Verbose Output
$ yt-dlp https://www.youtube.com/watch?v=45Tqi-KYzDE -f 139 -vU
[debug] Command-line config: ['https://www.youtube.com/watch?v=45Tqi-KYzDE', '-f', '139', '-vU']
[debug] Encodings: locale ISO-8859-1, fs iso8859-1, pref ISO-8859-1, out ISO-8859-1, error ISO-8859-1, screen ISO-8859-1
[debug] yt-dlp version 2022.10.04 [4e0511f] (zip)
[debug] Python 3.7.2 (CPython 64bit) - Linux-3.10.17-x86_64-AMD_A6-5400K_APU_with_Radeon-tm-_HD_Graphics-with-slackware-14.1 (glibc 2.2.5)
[debug] Checking exe version: ffmpeg -bsfs
[debug] Checking exe version: ffprobe -bsfs
[debug] exe versions: ffmpeg 3.2.4, ffprobe 3.2.4, rtmpdump 2.4
[debug] Optional libraries: sqlite3-2.6.0
[debug] Proxy map: {}
[debug] Loaded 1690 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Latest version: 2022.10.04, Current version: 2022.10.04
yt-dlp is up to date (2022.10.04)
[debug] [youtube] Extracting URL: https://www.youtube.com/watch?v=45Tqi-KYzDE
[youtube] 45Tqi-KYzDE: Downloading webpage
[youtube] 45Tqi-KYzDE: Downloading android player API JSON
[youtube] 45Tqi-KYzDE: Downloading MPD manifest
[youtube] 45Tqi-KYzDE: Downloading MPD manifest
[debug] Sort order given by extractor: quality, res, fps, hdr:12, source, vcodec:vp9.2, channels, acodec, lang, proto
[debug] Formats sorted by: hasvid, ie_pref, quality, res, fps, hdr:12(7), source, vcodec:vp9.2(10), channels, acodec, lang, proto, filesize, fs_approx, tbr, vbr, abr, asr, vext, aext, hasaud, id
[info] 45Tqi-KYzDE: Downloading 1 format(s): 139
ERROR: 'latin-1' codec can't encode character '\uff1a' in position 24: ordinal not in range(256)
Traceback (most recent call last):
  File "/usr/local/bin/yt-dlp/yt_dlp/YoutubeDL.py", line 1477, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/bin/yt-dlp/yt_dlp/YoutubeDL.py", line 1574, in __extract_info
    return self.process_ie_result(ie_result, download, extra_info)
  File "/usr/local/bin/yt-dlp/yt_dlp/YoutubeDL.py", line 1632, in process_ie_result
    ie_result = self.process_video_result(ie_result, download=download)
  File "/usr/local/bin/yt-dlp/yt_dlp/YoutubeDL.py", line 2733, in process_video_result
    self.process_info(new_info)
  File "/usr/local/bin/yt-dlp/yt_dlp/YoutubeDL.py", line 3192, in process_info
    dl_filename = existing_video_file(full_filename, temp_filename)
  File "/usr/local/bin/yt-dlp/yt_dlp/YoutubeDL.py", line 3087, in existing_video_file
    default_overwrite=False)
  File "/usr/local/bin/yt-dlp/yt_dlp/YoutubeDL.py", line 2923, in existing_file
    existing_files = list(filter(os.path.exists, orderedSet(filepaths)))
  File "/usr/lib64/python3.7/genericpath.py", line 19, in exists
    os.stat(path)
UnicodeEncodeError: 'latin-1' codec can't encode character '\uff1a' in position 24: ordinal not in range(256)
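The traceback ends in os.stat(path), and the debug header reports "fs iso8859-1". A rough sketch of what probably happens at that point: on POSIX, Python encodes a str path with the filesystem encoding (and, as far as I know, the surrogateescape error handler) before making the system call, which is what os.fsencode() does. The helper `encode_path` below is hypothetical and takes the codec as an explicit parameter so the failure can be reproduced on a machine whose locale is UTF-8. The path is a made-up example, not the real file name.

```python
import sys


def encode_path(path: str, encoding: str) -> bytes:
    """Approximate os.fsencode() for a str path, with an explicit codec.

    Assumption: POSIX behaviour, where str paths are encoded with the
    filesystem encoding and the 'surrogateescape' error handler.
    """
    return path.encode(encoding, "surrogateescape")


path = "title\uff1a part 1.m4a"  # hypothetical file name

print("filesystem encoding on this machine:", sys.getfilesystemencoding())

for codec in ("iso8859-1", "utf-8"):
    try:
        print(codec, "->", encode_path(path, codec))
    except UnicodeEncodeError as exc:
        # Same exception type as the one at the bottom of the traceback.
        print(codec, "-> failed:", exc)
```

If this reading is right, the exception is raised by Python while preparing the path, before the filesystem is ever consulted, which would explain why it looks like an internal error rather than a filesystem limitation.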
The second character presented is \uff1a. A ':' in filenames is getting replaced by \uff1a unless --restrict-filenames is provided, if I understand correctly.
Is LC_CTYPE set and, if so, what value does it hold? Explicitly setting it to en_US.UTF-8 could work in that case as a workaround.
I would try touch '1:1' from within the shell and open('2\uff1a2') (maybe os.stat('2\uff1a2') as well) from within Python, and see if those work as expected before and after setting the variable.
I think I know what you’re trying to get at. Unfortunately, I can’t devote much time to this in the immediate future, but when I can, I’ll see if I can get it working.
Edit: It looks like using those escapes requires double quotes. Presumably single-quoted strings are literals, as in Perl.
Edit2: Need to check something that may have a bearing on my testing. More to come.
Edit3: Having read more docs, I can see that this might be a core python issue. When I get some time, I’ll look into that.
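The checks suggested in this thread could be bundled into a small script like the following. This is only a sketch: the function names `locale_report` and `can_stat_fullwidth_colon` are made up for illustration, and it uses a temporary directory so nothing is left behind. Run it once as-is and once with LC_CTYPE=en_US.UTF-8 set in the environment before Python starts, assuming that locale is installed on the system. Note that LC_ALL, if set, takes precedence over LC_CTYPE.

```python
import locale
import os
import sys
import tempfile


def locale_report() -> dict:
    """Collect the settings that influence how str paths are encoded."""
    return {
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
        "LANG": os.environ.get("LANG"),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "preferred_encoding": locale.getpreferredencoding(False),
    }


def can_stat_fullwidth_colon() -> bool:
    """Create and stat a file whose name contains U+FF1A.

    Returns False if Python cannot encode the name or the OS rejects it.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "2\uff1a2")
        try:
            with open(path, "w"):
                pass
            os.stat(path)
        except (UnicodeEncodeError, OSError):
            return False
        return True


if __name__ == "__main__":
    for key, value in locale_report().items():
        print(f"{key}: {value}")
    print("open/stat with U+FF1A works:", can_stat_fullwidth_colon())
```

If the last line flips from False to True after changing the locale, that would point at the environment's encoding rather than at yt-dlp itself.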