Ignoring recurring ld+json blocks #207

slavaGanzin · 2019-09-11T12:09:30Z

Prerequisites

My Node version matches the one declared in package.json.
I'm using the latest version.

Subject of the issue

Hello! Some pages (https://bykvu.com/ru/bukvy/uchenye-nazvali-depressiju-prichinoj-22-opasnyh-zaboleva~/w/readstreams) have multiple ld+json blocks, so why does metascraper use only the first one?

https://github.com/microlinkhq/metascraper/blob/master/packages/metascraper-helpers/index.js#L230-L231

It's obviously by design, but real life differs from our expectations...

Steps to reproduce

Parse this:

<script type="application/ld+json">
	{
		"@context": "http://schema.org",
		"@type": "Organization",
		"url": "https://bykvu.com/ru",
		"logo": "https://bykvu.com/wp-content/themes/bykvu/img/logo.svg"
	}	
</script>
<script type="application/ld+json">
		{
		  "@context": "http://schema.org",
		  "@type": "NewsArticle",
		  "mainEntityOfPage": {
			"@type": "WebPage",
			"@id": "https://bykvu.com/ru/bukvy/uchenye-nazvali-depressiju-prichinoj-22-opasnyh-zabolevanij/"
		  },
		  "headline": "Ученые назвали депрессию причиной 22 опасных заболеваний",
		  "image": [
			"https://bykvu.com/wp-content/themes/bykvu/includes/images/noimage_large.jpg"
		   ],
		  "datePublished": "2019-09-09T00:29:09+02:00",
		  "dateModified": "2019-09-09T00:29:09+02:00",
		  "author": {
			"@type": "Person",
			"name": "Буквы"
		  },
		   "publisher": {
			"@type": "Organization",
			"name": "Буквы",
			"logo": {
			  "@type": "ImageObject",
			  "url": "https://bykvu.com/wp-content/themes/bykvu/img/apple-icon-180x180.png"
			}
		  },
		  "description": "Ученые австралийского центра точного здравоохранения при Университете Южной Австралии выяснили, что депрессия является причиной 22 различных заболеваний."
		}
	</script>
<script type="application/ld+json">
				{
				 "@context": "https://schema.org",
				 "@type": "BreadcrumbList",
				 "itemListElement":
				 [
				  {
				   "@type": "ListItem",
				   "position": 1,
				   "item":
				   {
					"@id": "https://bykvu.com/ru",
					"name": "Буквы"
					}
				  },
				  {
				   "@type": "ListItem",
				  "position": 2,
				  "item":
				   {
					"@id": "https://bykvu.com/ru/category/bukvy/",
					"name": "Новости"
				   }
				  },			  
				  {
				   "@type": "ListItem",
				  "position": 3,
				  "item":
				   {
					 "@id": "https://bykvu.com/ru/bukvy/uchenye-nazvali-depressiju-prichinoj-22-opasnyh-zabolevanij/",
					 "name": "Ученые назвали депрессию причиной 22 опасных заболеваний"
				   }
				  }
				 ]
				}
			</script>

Expected behaviour

Result should contain 'datePublished'

Actual behaviour

Result contains data only from the first <script> tag

Kikobeats · 2019-09-11T12:30:42Z

Right now we extract only the first block, because most of the data lives in other HTML tags that are simpler to extract (Open Graph, Twitter, itemprops, ...).

I suppose more could be done there to extract as much information as possible. PRs are very welcome for this 🙂
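
For illustration, here is a minimal, dependency-free sketch of what collecting every ld+json block could look like. It is not metascraper's implementation: `parseJsonLdBlocks` and `findJsonLdValue` are hypothetical helper names, and the sketch uses a regular expression instead of a real HTML parser, which is an assumption that only holds for simple markup like the sample in this issue.

```javascript
// Hypothetical helpers, not part of metascraper.
// Matches <script type="application/ld+json"> ... </script> and captures the body.
const JSON_LD_SCRIPT = /<script\b[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi

// Collect every JSON-LD object found in an HTML string.
function parseJsonLdBlocks (html) {
  const blocks = []
  for (const match of html.matchAll(JSON_LD_SCRIPT)) {
    try {
      const parsed = JSON.parse(match[1])
      // A single script tag may hold one object or an array of objects.
      blocks.push(...(Array.isArray(parsed) ? parsed : [parsed]))
    } catch (err) {
      // Skip malformed JSON instead of failing the whole page.
    }
  }
  return blocks
}

// Return the first defined value of `key` across all collected blocks.
function findJsonLdValue (blocks, key) {
  for (const block of blocks) {
    if (block && block[key] !== undefined) return block[key]
  }
  return undefined
}

// Shortened version of the markup from this issue.
const html = `
<script type="application/ld+json">
  { "@context": "http://schema.org", "@type": "Organization", "url": "https://bykvu.com/ru" }
</script>
<script type="application/ld+json">
  { "@context": "http://schema.org", "@type": "NewsArticle", "datePublished": "2019-09-09T00:29:09+02:00" }
</script>`

const blocks = parseJsonLdBlocks(html)
console.log(findJsonLdValue(blocks, 'datePublished')) // prints 2019-09-09T00:29:09+02:00
```

With this approach `datePublished` is found even though it lives in the second block. Taking the first defined value per key is only one possible merge strategy; a real fix might prefer blocks by `@type` instead.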

Kikobeats added the enhancement label Sep 11, 2019

slavaGanzin added a commit to slavaGanzin/metascraper that referenced this issue Sep 11, 2019

feat: fixes microlinkhq#207: parse multiple json-ld blocks

ae63f19

Kikobeats closed this as completed in 0c8ee94 Sep 12, 2019
