Ignoring recurring ld+json blocks #207

Closed
2 tasks done
slavaGanzin opened this issue Sep 11, 2019 · 1 comment

Comments

@slavaGanzin
Contributor

slavaGanzin commented Sep 11, 2019

Prerequisites

  • My node version is the same as declared in package.json.
  • I'm using the latest version.

Subject of the issue

Hello! Some pages (https://bykvu.com/ru/bukvy/uchenye-nazvali-depressiju-prichinoj-22-opasnyh-zaboleva~/w/readstreams) have multiple ld+json blocks, so why does metascraper use only the first one?

https://github.com/microlinkhq/metascraper/blob/master/packages/metascraper-helpers/index.js#L230-L231

It's obviously by design, but real life differs from our expectations...

Steps to reproduce

Parse this:

<script type="application/ld+json">
	{
		"@context": "http://schema.org",
		"@type": "Organization",
		"url": "https://bykvu.com/ru",
		"logo": "https://bykvu.com/wp-content/themes/bykvu/img/logo.svg"
	}	
</script>
<script type="application/ld+json">
		{
		  "@context": "http://schema.org",
		  "@type": "NewsArticle",
		  "mainEntityOfPage": {
			"@type": "WebPage",
			"@id": "https://bykvu.com/ru/bukvy/uchenye-nazvali-depressiju-prichinoj-22-opasnyh-zabolevanij/"
		  },
		  "headline": "Ученые назвали депрессию причиной 22 опасных заболеваний",
		  "image": [
			"https://bykvu.com/wp-content/themes/bykvu/includes/images/noimage_large.jpg"
		   ],
		  "datePublished": "2019-09-09T00:29:09+02:00",
		  "dateModified": "2019-09-09T00:29:09+02:00",
		  "author": {
			"@type": "Person",
			"name": "Буквы"
		  },
		   "publisher": {
			"@type": "Organization",
			"name": "Буквы",
			"logo": {
			  "@type": "ImageObject",
			  "url": "https://bykvu.com/wp-content/themes/bykvu/img/apple-icon-180x180.png"
			}
		  },
		  "description": "Ученые австралийского центра точного здравоохранения при Университете Южной Австралии выяснили, что депрессия является причиной 22 различных заболеваний."
		}
	</script>
<script type="application/ld+json">
				{
				 "@context": "https://schema.org",
				 "@type": "BreadcrumbList",
				 "itemListElement":
				 [
				  {
				   "@type": "ListItem",
				   "position": 1,
				   "item":
				   {
					"@id": "https://bykvu.com/ru",
					"name": "Буквы"
					}
				  },
				  {
				   "@type": "ListItem",
				  "position": 2,
				  "item":
				   {
					"@id": "https://bykvu.com/ru/category/bukvy/",
					"name": "Новости"
				   }
				  },			  
				  {
				   "@type": "ListItem",
				  "position": 3,
				  "item":
				   {
					 "@id": "https://bykvu.com/ru/bukvy/uchenye-nazvali-depressiju-prichinoj-22-opasnyh-zabolevanij/",
					 "name": "Ученые назвали депрессию причиной 22 опасных заболеваний"
				   }
				  }
				 ]
				}
			</script>

Expected behaviour

Result should contain 'datePublished'

Actual behaviour

Result contains data only from the first <script> tag

@Kikobeats
Member

Right now we extract just the first block because most of the data lives in other HTML tags that are simpler to extract (Open Graph, Twitter, itemprops, ...).

I suppose more effort could be made there to extract as much information as possible. PRs are very welcome for this 🙂
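
For illustration, one possible direction is to parse every ld+json block and return the first one that defines the requested property. This is only a minimal sketch under stated assumptions, not metascraper's actual implementation: `parseAllJsonLd` and `findInJsonLd` are hypothetical names, and the regular expression stands in for a real HTML parser so the example runs without dependencies. The sample HTML is a shortened version of the blocks from the report above.

```javascript
// Hypothetical helpers, not part of metascraper.
// Assumption: a regex is good enough here to locate ld+json script tags;
// a real implementation would use an HTML parser instead.
const LD_JSON_RE = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi

// Parse every ld+json block on the page instead of only the first one.
function parseAllJsonLd (html) {
  const blocks = []
  for (const match of html.matchAll(LD_JSON_RE)) {
    try {
      const parsed = JSON.parse(match[1])
      // A block may hold a single object or an array of objects.
      blocks.push(...(Array.isArray(parsed) ? parsed : [parsed]))
    } catch (err) {
      // Skip malformed blocks rather than failing the whole page.
    }
  }
  return blocks
}

// Return the first defined value of `key` across all parsed blocks.
function findInJsonLd (blocks, key) {
  for (const block of blocks) {
    if (block !== null && typeof block === 'object' && block[key] !== undefined) {
      return block[key]
    }
  }
  return undefined
}

const html = `
<script type="application/ld+json">
  {"@context": "http://schema.org", "@type": "Organization", "url": "https://bykvu.com/ru"}
</script>
<script type="application/ld+json">
  {"@context": "http://schema.org", "@type": "NewsArticle", "datePublished": "2019-09-09T00:29:09+02:00"}
</script>
`

const blocks = parseAllJsonLd(html)
console.log(findInJsonLd(blocks, 'datePublished')) // prints 2019-09-09T00:29:09+02:00
```

With this approach the Organization block no longer hides `datePublished` from the NewsArticle block. The trade-off is that when several blocks define the same property (for example `url` or `logo`), the first match wins, so a real fix might also want to prefer blocks by `@type`.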

slavaGanzin added a commit to slavaGanzin/metascraper that referenced this issue Sep 11, 2019