Crash on JSONDecodeError from body of YouTube page #171

Open
wjdp opened this issue Jan 26, 2021 · 2 comments

wjdp commented Jan 26, 2021

I have some code that pulls metadata from YouTube:

response = requests.get(video_url)
metadata = extruct.extract(response.text, base_url="https://youtube.com")

I've noticed some recent crashes, but only on some videos.

No crash: https://www.youtube.com/watch?v=ZY48KUAZKhM https://www.youtube.com/watch?v=ZlVI7YJGHq0
Crash: https://www.youtube.com/watch?v=987wzJ2NHBE https://www.youtube.com/watch?v=0-EF60neguk

The common factor among the videos that crash is an apostrophe in the channel name!

Traceback (most recent call last):
  File "/home/will/local/breda/src/dredger/ingest/tests/test_youtube.py", line 72, in test_one
    youtube.get_video_data("https://www.youtube.com/watch?v=987wzJ2NHBE")
  File "/home/will/local/breda/src/dredger/ingest/youtube.py", line 46, in get_video_data
    metadata = extruct.extract(response.text, base_url="https://youtube.com")
  File "/home/will/.virtualenvs/breda/lib/python3.8/site-packages/extruct/_extruct.py", line 108, in extract
    output[syntax] = list(extract(document, base_url=base_url))
  File "/home/will/.virtualenvs/breda/lib/python3.8/site-packages/extruct/jsonld.py", line 25, in extract_items
    return [
  File "/home/will/.virtualenvs/breda/lib/python3.8/site-packages/extruct/jsonld.py", line 25, in <listcomp>
    return [
  File "/home/will/.virtualenvs/breda/lib/python3.8/site-packages/extruct/jsonld.py", line 38, in _extract_items
    data = jstyleson.loads(HTML_OR_JS_COMMENTLINE.sub('', script),strict=False)
  File "/home/will/.virtualenvs/breda/lib/python3.8/site-packages/jstyleson.py", line 123, in loads
    return json.loads(dispose(text), **kwargs)
  File "/usr/lib/python3.8/json/__init__.py", line 370, in loads
    return cls(**kw).decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid \escape: line 1 column 211 (char 210)

I haven't had a chance today to dig in much beyond triaging the above.
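
For reference, here is a minimal standalone reproduction of the parse failure, independent of extruct. The JSON snippet below is made up for illustration; it only assumes the embedded JSON-LD contains a JavaScript-style \x27 escape, which is not a valid JSON escape sequence:

import json

# Hypothetical JSON-LD fragment: a channel name with an apostrophe escaped
# the JavaScript way (\x27). JSON only allows \", \\, \/, \b, \f, \n, \r, \t, \uXXXX.
script = r'{"author": "Will\x27s Channel"}'

try:
    json.loads(script)
except json.JSONDecodeError as e:
    # Prints something like: Invalid \escape: line 1 column ... (char ...),
    # matching the error in the traceback above.
    print(e)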

@udit19281

I haven't been able to replicate the issue. Your crash video links point to videos that have been removed, which may be why you are getting this error. I suggest checking the video links before passing them to extruct.extract().

Here is the code that I used:
import extruct
import requests
from w3lib.html import get_base_url

crash_links = ['https://www.youtube.com/watch?v=987wzJ2NHBE',
               'https://www.youtube.com/watch?v=0-EF60neguk']

for video_url in crash_links:
    response = requests.get(video_url)
    base_url = get_base_url(response.text, response.url)
    metadata = extruct.extract(response.text, base_url=base_url, uniform=True,
                               syntaxes=['json-ld', 'microdata', 'opengraph'])
    print(metadata)

Output:
{'microdata': [], 'json-ld': [], 'opengraph': []}
{'microdata': [], 'json-ld': [], 'opengraph': []}

@AbhinavSE

I replicated the issue using these YouTube links: https://www.youtube.com/watch?v=-J2e8OlBdPs, https://www.youtube.com/watch?v=qP07oyFTRXc, https://www.youtube.com/watch?v=BUrnfkxwozM.

As @wjdp suggested, it is caused by the apostrophe in the channel name. The embedded JSON contains JavaScript-style hex escapes like "\x27" (the apostrophe), which json.loads() rejects as invalid \escape sequences. I created pull request #195, which replaces these hex escapes with the characters themselves before the string is passed to json.loads().
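
This is not the actual change in #195, just a rough sketch of the idea, assuming only \xNN escapes need handling (the regex and helper name here are mine):

import json
import re

# Matches JavaScript-style hex escapes such as \x27, which JSON does not allow.
JS_HEX_ESCAPE = re.compile(r'\\x([0-9a-fA-F]{2})')

def decode_js_hex_escapes(text):
    # Replace each \xNN escape with the literal character it encodes,
    # e.g. \x27 -> ', so the string becomes parseable JSON.
    return JS_HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

script = r'{"author": "Will\x27s Channel"}'
print(json.loads(decode_js_hex_escapes(script)))
# {'author': "Will's Channel"}

A real fix likely needs to be more careful (for example, \x22 would decode to an unescaped double quote and break the JSON), but this shows the shape of the approach.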
