Add an extract method #2523

Open
fb55 opened this issue May 6, 2022 · 23 comments

@fb55
Member

fb55 commented May 6, 2022

One common use-case for cheerio is to extract multiple values from a document, and store them in an object. Doing so manually currently isn't a great experience. There are several packages built on top of cheerio that improve this: For example https://github.com/matthewmueller/x-ray and https://github.com/IonicaBizau/scrape-it. Commercial scraping providers also allow bulk extractions: https://www.scrapingbee.com/documentation/data-extraction/

We should add an API to make this use-case easier. The API should be along the lines of:

$.extract({
  // To get the `textContent` of an element, we can just supply a selector
  title: "title",

  // If we want to get results for more than a single element, we can use an array
  headings: ["h1, h2, h3, h4, h5, h6"],
  // If an array has more than one child, all of them will be queried for.
  // Complication: This should follow document-order.
  headings: ["h1", "h2", "h3", "h4", "h5", "h6"], // (equivalent to the above)

  // To get a property other than the `textContent`, we can pass a string to `out`. This will be passed to the `prop` method.
  links: [{ selector: "a", out: "href" }],

  // We can map over the received elements to customise the behaviour
  links: [
    {
      selector: "a",
      out(el, key, obj) {
        const $el = $(el);
        return { name: $el.text(), href: $el.prop("href") };
      },
    },
  ],

  // To get nested elements, we can pass a nested extract object. Selectors inside the nested extract object will be relative to the current scope.
  posts: [
    {
      selector: ".post",
      out: {
        title: ".title",
        body: ".body",
        link: { selector: "a :has(> .title)", out: "href" },
      },
    },
  ],

  // We can skip the selector in nested extract objects to reference the current scope.
  links: [
    {
      selector: ".post a",
      out: {
        href: { out: "href" },
        name: { out: "textContent" },
        // Equivalent — get the text content of the current scope
        name: "*",
      },
    },
  ],
});
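
One way to read the shorthand forms above is that every value expands to a uniform `{ selector, out }` shape before any querying happens. The sketch below only illustrates that reading: `normalizeEntry` is a hypothetical helper, not part of cheerio, and treating a missing `out` as `"textContent"` is an assumption taken from the comments in the proposal.

```javascript
// Hypothetical helper (not a cheerio API): expands the shorthand forms
// from the proposal into one uniform shape.
function normalizeEntry(value) {
  // An array signals "return every match"; a bare value signals a single result.
  const multiple = Array.isArray(value);
  const items = multiple ? value : [value];

  const configs = items.map((item) =>
    typeof item === "string"
      ? // A plain string is a selector whose text content is wanted.
        { selector: item, out: "textContent" }
      : // A config object keeps its own fields; `out` defaults to text content.
        // A missing `selector` stays undefined, meaning "the current scope".
        { out: "textContent", ...item }
  );

  return { multiple, configs };
}

console.log(normalizeEntry("title"));
console.log(normalizeEntry([{ selector: "a", out: "href" }]));
```

Normalising first would keep the later querying step free of shape checks, at the cost of one extra pass over the extract map.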
@fb55 fb55 added this to Backlog in v1.0 via automation May 6, 2022
@fb55 fb55 moved this from Backlog to To do in v1.0 May 11, 2022
@mikestopcontinues

I'm building something to do this right now. I dig the direction you're going. Some ideas...

What if an array meant that an array should be returned? In addition to being more ergonomic, I think it will also ensure that the result's type is properly inferred. Also, I think there's a pretty easy way to allow objects to mean objects, even when there are config objects in the mix...

const res = $.extract({
  singleStr: 'h1', // throws if more than one element selected
  arrayOfStr: ['h1, h2'], // uses multiple selectors
  tupleOfStr: ['h1', 'h2'], // literally a tuple, throws if either selector returns more than one item
  arrayOfObj: [{
    // an object is a config if it exactly matches config type, otherwise object return is expected
    int: {selector: '#length'},
  }],
});
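
The "exactly matches config type" rule could be checked roughly as follows. This is a sketch under assumptions: `isExtractConfig` is a hypothetical name, and the set of config keys (`selector`, `out`) is taken from the original proposal rather than from any released cheerio API.

```javascript
// Keys that make an object a config rather than a nested result object.
// Assumed from the proposal; a real implementation may differ.
const CONFIG_KEYS = new Set(["selector", "out"]);

// Hypothetical check: an object is a config only if it has at least one
// key and every key is a known config key.
function isExtractConfig(value) {
  if (value === null || typeof value !== "object" || Array.isArray(value)) {
    return false;
  }
  const keys = Object.keys(value);
  return keys.length > 0 && keys.every((key) => CONFIG_KEYS.has(key));
}

console.log(isExtractConfig({ selector: "#length" })); // true
console.log(isExtractConfig({ int: { selector: "#length" } })); // false
```

Under this rule, a result object whose only keys happen to be `selector` or `out` would be misread as a config.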

That object idea does leave room for ambiguity, and it will be a bit annoying to type. What about support for nested $.extract()? Also, I really like your ideas on scoped sub-selectors ({out: {prop, prop, prop}}), so what about $.extract('#scope', {})?

const res = $.extract({
  // nested object
  meta: $.extract('#scope', {
    deep: $.extract({}),
  }),
});

Regarding scoping for performance, do you think there's major gains to be had from scanning the entire extract tree to optimize all the selectors automatically? In my scraping, I trawl every bit of the dom for data redundancy, but the selectors are grouped by the desired bit of data, not their position in the dom. I often wonder if I'm missing out, but I haven't had a chance to test it.


Most scraping also includes data processing. How about first-class support for funcs? Here too, you can infer types, including for more than just strings...

const res = $.extract({
  singleStr: ($) => $('h1').text().trim(),
  arrayOfStr: [
    ($) => $('h1, h2').toArray().map((el) => $(el).text().trim()),
  ],
  arrayOfObj: [{
    int: ($) => parseInt($('#length').text()) || 0,
  }],
});

The last thing I'll comment on is out. I think it might be a little overloaded. Maybe better would be...

type SelectConfig = {
  selector?: string;
  // XOR these...
  parse?: <T>($: CheerioAPI) => T;
  content?: 'text' | 'html';
  prop?: keyof HTMLElement;
  attr?: string;
  data?: string; 
  style?: string;
}
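
For users without TypeScript, the "XOR these" constraint would have to be enforced at runtime. A minimal sketch of such a check, assuming the field names from the type above (`pickSource` itself is a hypothetical helper):

```javascript
// Mutually exclusive fields from the suggested SelectConfig type.
const SOURCE_KEYS = ["parse", "content", "prop", "attr", "data", "style"];

// Hypothetical validator: returns the single source key in use,
// or undefined if none is set, and throws if several are combined.
function pickSource(config) {
  const used = SOURCE_KEYS.filter((key) => config[key] !== undefined);
  if (used.length > 1) {
    throw new TypeError(`Only one of ${used.join(", ")} may be set`);
  }
  return used[0];
}

console.log(pickSource({ selector: "a", attr: "href" })); // "attr"
```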

Anyway, really cool you're thinking about this direction! This really must be a huge portion of what Cheerio users are doing.

@fb55
Member Author

fb55 commented May 16, 2022

Thanks for the feedback! See some responses below.


What if an array meant that an array should be returned?

That's the idea!

tupleOfStr: ['h1', 'h2'], // literally a tuple, throws if either selector returns more than one item

An individual selector should stand for the first match; I've added a limit option to cheerio-select that will enable us to implement this: cheeriojs/cheerio-select#307

The idea with multiple array elements was to allow users to extract different properties. Eg.

$.extract({
	titles: [
		// The document's `<title>` tag. Will use the `textContent`.
		'title',
		// The Open Graph `title` property. Will use the `content` attribute.
		{ selector: 'meta[property="og:title"]', out: 'content' }
	]
})

Ideally, there should still be a way to limit the number of elements retrieved. That way, we could support use-cases such as https://github.com/microlinkhq/metascraper/blob/b3379a9300ad1ed6de155592866b1e555e1f5382/packages/metascraper-title/index.js

what about $.extract('#scope', {})

I tried to model this by allowing out to be an object. From the example above:

$.extract({
  posts: [
    {
      selector: ".post",
      out: {
        title: ".title",
        body: ".body",
        link: { selector: "a :has(> .title)", out: "href" },
      },
    },
  ],
})

This extracts the title, body and link for every post; all the nested selectors are relative to .post.

do you think there's major gains to be had from scanning the entire extract tree to optimize all the selectors automatically?

Yes, although this is quite complicated to do and won't be a part of the initial version of this.

How about first-class support for funcs?

I tried to achieve this by allowing functions for the out property. It might make sense to allow functions for the selector as well.

The last thing I'll comment on is out. I think it might be a little overloaded.

There are currently three different values: (1) a string that will be passed to Cheerio's prop method, (2) an object that will be used as a nested object, and (3) a function that will be called with the object.
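
The three cases could be dispatched on the runtime type of `out`, roughly like this. It is a sketch, not the actual implementation: `applyOut` and the `extractNested` callback are hypothetical, and `$el` stands for any object exposing `prop(name)` and `get(index)` the way a Cheerio selection does.

```javascript
// Hypothetical dispatcher for the overloaded `out` option.
function applyOut($el, out, { key, obj, extractNested }) {
  // (1) A string is passed to the `prop` method.
  if (typeof out === "string") return $el.prop(out);
  // (3) A function is called with the element, the key and the result object,
  //     following the `out(el, key, obj)` signature from the proposal.
  if (typeof out === "function") return out($el.get(0), key, obj);
  // (2) A plain object describes a nested extraction, scoped to `$el`.
  if (out !== null && typeof out === "object") return extractNested($el, out);
  throw new TypeError("`out` must be a string, a function or an object");
}

// Usage with a stand-in for a Cheerio selection.
const fakeSelection = { prop: (name) => `value of ${name}`, get: () => "element" };
console.log(applyOut(fakeSelection, "href", { key: "link", obj: {} }));
```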

If we didn't overload the object, the alternative would be runtime errors for users that don't use TS. Removing that potential issue seems worth the added complexity.

As for using prop: This is a neat way of allowing the most common extractions. It supports attributes, serialisation types (innerHTML, outerHTML, textContent, innerText), and it is able to resolve links (as of #2510).
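
To illustrate what resolving a link against a base means, the standard WHATWG `URL` constructor can be used. This shows the general technique only; it makes no claim about how cheerio resolves `href` internally, and `resolveHref` is a hypothetical name.

```javascript
// Resolve a possibly relative href against a base URI using the
// standard URL constructor (available globally in Node.js and browsers).
function resolveHref(href, baseURI) {
  return new URL(href, baseURI).href;
}

console.log(resolveHref("/docs/intro", "https://example.com/blog/post"));
// → https://example.com/docs/intro
console.log(resolveHref("next", "https://example.com/blog/post"));
// → https://example.com/blog/next
```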

@mikestopcontinues

$.extract({
	titles: [
		// The document's `<title>` tag. Will use the `textContent`.
		'title',
		// The Open Graph `title` property. Will use the `content` attribute.
		{ selector: 'meta[property="og:title"]', out: 'content' }
	]
})

Just to clarify, I like this solution. Two selectors within an array meaning two values. I'm not sure if it's just me, but I still read your initial spec to mean that ['h1, h2'] === ['h1', 'h2'].

Ideally, there should still be a way to limit the number of elements retrieved. That way, we could support use-cases such as https://github.com/microlinkhq/metascraper/blob/b3379a9300ad1ed6de155592866b1e555e1f5382/packages/metascraper-title/index.js

FWIW, this is exactly how I scrape data now. No knowing when Amazon is going to change their DOM, so I have a bunch of selectors for each bit of data, plus a test that picks the best match. I know .extract won't go that far, but I figure it's worth raising a use-case.
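
The "several selectors plus a test that picks the best match" pattern can be sketched without any scraping library. Everything here is hypothetical: `firstValid` is an illustrative helper, and each candidate is a function so that later, more expensive lookups only run when the earlier ones fail.

```javascript
// Try candidate getters in priority order and return the first value
// that passes the validity test. Returns undefined if none does.
function firstValid(candidates, isValid = (value) => value.trim() !== "") {
  for (const getValue of candidates) {
    const value = getValue();
    if (typeof value === "string" && isValid(value)) return value.trim();
  }
  return undefined;
}

// Usage: in a real scraper each getter would wrap a selector lookup.
const title = firstValid([
  () => "", // e.g. a selector that no longer matches anything
  () => undefined, // e.g. a missing attribute
  () => "  Fallback title  ",
]);
console.log(title); // "Fallback title"
```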

what about $.extract('#scope', {})

The one thing about {selector, out: {title, body, link}} is that it requires all nested objects to have scope. Unless selector is optional, of course. Given performance considerations, there's value to the extract config paralleling the DOM structure. But I guess the question is if the API should push in that direction.

I'm still inclined to want to keep all my selectors grouped by the data they return (a la the above example) because it makes it much easier to process in the next step. (Otherwise I need to maintain two mappings, rather than just one.)

I tried to achieve this by allowing functions for the out property. It might make sense to allow functions for the selector as well.

I don't think there'd be any benefit if you still had to nest the function. A string selector with an out func accomplishes the same thing. I was just thinking about streamlining the interface a bit. It's okay either way.

There are currently three different values: (1) a string that will be passed to Cheerio's prop method, (2) an object that will be used as a nested object, and (3) a function that will be called with the object.

I just took a look at the prop API. It's much cooler than I'd realized. Maybe the only thing I'll suggest then is that the prop name be changed from out to value. Semantically, value feels like a better fit for something that holds prop/parsing functionality.

@mvasin

mvasin commented Dec 15, 2022

Hi and thanks for the great library!

I noticed the extract method in the docs https://cheerio.js.org/interfaces/CheerioAPI.html#extract and in the tests https://github.com/cheeriojs/cheerio/blob/dec7cdc9ad21a1fc5667a2ed015aba9ee3b47e5f/src/api/extract.spec.ts, but when I try to use it:

const $ = cheerio.load('<div>hello</div>')
$.extract({
  div: 'div',
})

cheerio blows up with

$.extract({
  ^

TypeError: $.extract is not a function

I'm using version 1.0.0-rc.12.
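
Until a release ships the method, one workaround is to feature-detect it and fall back to manual queries. The sketch below is an assumption-laden illustration: `extractWithFallback` is a hypothetical helper, the fallback only handles plain string selectors, and it uses `.first().text()` as a stand-in for whatever the real `extract` returns.

```javascript
// Use $.extract when the installed cheerio version provides it;
// otherwise emulate the simplest case (string selector → text of first match).
function extractWithFallback($, map) {
  if (typeof $.extract === "function") return $.extract(map);

  const result = {};
  for (const [key, selector] of Object.entries(map)) {
    if (typeof selector !== "string") {
      throw new TypeError("The fallback only supports plain string selectors");
    }
    const match = $(selector).first();
    result[key] = match.length > 0 ? match.text() : undefined;
  }
  return result;
}

// Usage, assuming `$` came from cheerio.load('<div>hello</div>'):
// extractWithFallback($, { div: "div" });
```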

@fb55
Member Author

fb55 commented Dec 15, 2022

This was just merged and a new release hasn't been issued yet. I'm working through my list for remaining changes, so this hopefully won't take long.

@b6t

b6t commented Feb 25, 2023

Hi, it looks like this isn't released yet. Any timing updates?

@sroussey

image

Appears to not match documentation.

@Carleslc

Any estimation for the new release?

@anthonycmain

What happened to this feature? It's exactly what I needed and seems to be documented, but it doesn't seem to be available.

@denkan

denkan commented Mar 28, 2023

Liked what's been discussed here.

Needed this and grew tired of waiting for Cheerio, so I just published my implementation of these ideas + own takes:
https://www.npmjs.com/package/cheerio-json-mapper

Might be useful to others as well.

@anthonycmain

Liked what's been discussed here.

Needed this and grew tired of waiting for Cheerio, so I just published my implementation of these ideas + own takes: https://www.npmjs.com/package/cheerio-json-mapper

Might be useful to others as well.

Thanks for writing and sharing this, @denkan. I've been playing with it this evening and it's exactly what I need. I will feed back any bugs I find in your own GitHub repo.

@quentinlamamy

Any update?
It's still in the docs but not in the code.

@archae0pteryx

This was just merged and a new release hasn't been issued yet. I'm working through my list for remaining changes, so this hopefully won't take long.

Life... am I right? Great work so far on all of this, y'all. Much-needed library for sure. Keep up the good work.

@dRoskar

dRoskar commented Sep 14, 2023

This should not be documented in the user guide, if it's not actually released yet:
https://cheerio.js.org/docs/advanced/extract

@ivanakcheurov

This should not be documented in the user guide, if it's not actually released yet: https://cheerio.js.org/docs/advanced/extract

Since the website is also here in this repo, perhaps it would be better to have each release with a corresponding tag. And only the latest released version of the website (with relevant docs) would get actually deployed to the web.
Just suggesting.
But otherwise cheerio looks solid. Thanks to the contributors!

kodiakhq bot pushed a commit to X-oss-byte/Canary-nextjs that referenced this issue Sep 22, 2023
[![Mend Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Change | Age | Adoption | Passing | Confidence |
|---|---|---|---|---|---|
| [cheerio](https://cheerio.js.org/) ([source](https://togithub.com/cheeriojs/cheerio)) | [`1.0.0-rc.9` -> `1.0.0-rc.12`](https://renovatebot.com/diffs/npm/cheerio/1.0.0-rc.9/1.0.0-rc.12) | [![age](https://developer.mend.io/api/mc/badges/age/npm/cheerio/1.0.0-rc.12?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/npm/cheerio/1.0.0-rc.12?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/npm/cheerio/1.0.0-rc.9/1.0.0-rc.12?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/npm/cheerio/1.0.0-rc.9/1.0.0-rc.12?slim=true)](https://docs.renovatebot.com/merge-confidence/) |

---

### Release Notes

<details>
<summary>cheeriojs/cheerio (cheerio)</summary>

### [`v1.0.0-rc.12`](https://togithub.com/cheeriojs/cheerio/releases/tag/v1.0.0-rc.12)

[Compare Source](https://togithub.com/cheeriojs/cheerio/compare/v1.0.0-rc.11...v1.0.0-rc.12)

Bugfix release. Fixed issues:

-   Align `prop` undefined handling with jQuery by [@fb55](https://togithub.com/fb55) in cheeriojs/cheerio#2557
-   Allow deep imports of `cheerio/lib/utils` by [@blixt](https://togithub.com/blixt) in cheeriojs/cheerio#2601

#### New Contributors

-   [@blixt](https://togithub.com/blixt) made their first contribution in cheeriojs/cheerio#2601

**Full Changelog**: cheeriojs/cheerio@v1.0.0-rc.11...v1.0.0-rc.12

### [`v1.0.0-rc.11`](https://togithub.com/cheeriojs/cheerio/releases/tag/v1.0.0-rc.11)

[Compare Source](https://togithub.com/cheeriojs/cheerio/compare/v1.0.0-rc.10...v1.0.0-rc.11)

`cheerio@1.0.0-rc.11` is hopefully the last RC before the 1.0.0 release of Cheerio. There are two APIs that will be added for the next major release: An `extract` method (cheeriojs/cheerio#2523) and NodeJS specific loader methods (cheeriojs/cheerio#2051). These are still in flux and I'd appreciate feedback on the proposals.

A big thank you to everyone that contributed to this release! This includes code contributors, as well as the amazing financial support on [GitHub Sponsors](https://togithub.com/sponsors/cheeriojs)!

Under the hood, a lot of work for this release went into updating parse5, cheerio's default HTML parser. Have a look at [parse5's release notes](https://togithub.com/inikulin/parse5/releases/tag/v7.0.0) to see what has changed there.

#### Breaking

-   Cheerio is now a dual CommonJS and ESM module. That means that deep imports will now fail in newer versions of Node. (cheeriojs/cheerio#2508)
-   `script` and `style` contents are added again in `.text()` (cheeriojs/cheerio#2509)
    -   To keep the old behavior, switch `.text()` to `.prop('innerText')`
-   The TypeScript types inherited from upstream dependencies have changed. (cheeriojs/cheerio#2503)
    -   Node types are now using tagged unions, which will make consumption a bit easier.

#### Features

-   Relevant options are now forwarded to `cheerio-select` (cheeriojs/cheerio#2511)
    -   Custom pseudo classes can now be specified [using the `pseudos` option](https://cheerio.js.org/interfaces/CheerioOptions.html#pseudos).
-   For the `.prop()` method:
    -   Add `textContent` and `innerText` props (cheeriojs/cheerio#2214)
    -   Users can now specify a `baseURI` option, which will lead to `href` and `src` props to be resolved as URLs. (cheeriojs/cheerio#2510)
-   Added a `slim` export, which will always use htmlparser2 (cheeriojs/cheerio#1960)

#### Fixes

-   Have `text` turn passed values to strings (cheeriojs/cheerio#2047)
-   Include `undefined` in the return type of `get` by [@glen-84](https://togithub.com/glen-84) in cheeriojs/cheerio#2392
-   Recognise comments as HTML (cheeriojs/cheerio#2504)
-   Add missing `undefined` return value (cheeriojs/cheerio#2505)
-   Export missing static methods (cheeriojs/cheerio#2506)
-   Have style parsing add malformed fields to previous field (cheeriojs/cheerio#2521)

#### Refactor

-   Use `domutils` module directly (cheeriojs/cheerio#1928)
-   Hand-roll `isHTML` (cheeriojs/cheerio#1935)
-   Move initialization logic to `load` (cheeriojs/cheerio#1951)
-   Only return elements in `closest` (cheeriojs/cheerio#2057)
-   Remove unnecessary code, be more explicit (cheeriojs/cheerio#2279)
-   Use stricter TS, ESLint configs (cheeriojs/cheerio#2507)
-   Update exported values (cheeriojs/cheerio#2512)

#### Development Experience

-   Migrate husky to v6 by [@DavideViolante](https://togithub.com/DavideViolante) in cheeriojs/cheerio#1934
-   Update CI by [@XhmikosR](https://togithub.com/XhmikosR) in cheeriojs/cheerio#2149
-   Set permissions for GitHub actions by [@neilnaveen](https://togithub.com/neilnaveen) in cheeriojs/cheerio#2453

#### Docs

-   Update README "is not a web browser" section by [@mxschmitt](https://togithub.com/mxschmitt) in cheeriojs/cheerio#2127

#### New Contributors

-   [@DavideViolante](https://togithub.com/DavideViolante) made their first contribution in cheeriojs/cheerio#1934
-   [@mxschmitt](https://togithub.com/mxschmitt) made their first contribution in cheeriojs/cheerio#2127
-   [@glen-84](https://togithub.com/glen-84) made their first contribution in cheeriojs/cheerio#2392
-   [@neilnaveen](https://togithub.com/neilnaveen) made their first contribution in cheeriojs/cheerio#2453

**Full Changelog**: cheeriojs/cheerio@v1.0.0-rc.10...v1.0.0-rc.11

### [`v1.0.0-rc.10`](https://togithub.com/cheeriojs/cheerio/releases/tag/v1.0.0-rc.10)

[Compare Source](https://togithub.com/cheeriojs/cheerio/compare/v1.0.0-rc.9...v1.0.0-rc.10)

**Fixes:**

-   `.html(node)` now moves passed nodes ([#1923](https://togithub.com/cheeriojs/cheerio/issues/1923), fixes [#940](https://togithub.com/cheeriojs/cheerio/issues/940))  [`258b26b`](https://togithub.com/cheeriojs/cheerio/commit/258b26b)
-   Boolean attributes are no longer special in xmlMode ([#1903](https://togithub.com/cheeriojs/cheerio/issues/1903), fixes [#1805](https://togithub.com/cheeriojs/cheerio/issues/1805))  [`b393e4a`](https://togithub.com/cheeriojs/cheerio/commit/b393e4a)
-   Rename parser adapter files ([#1873](https://togithub.com/cheeriojs/cheerio/issues/1873), fixes [#1847](https://togithub.com/cheeriojs/cheerio/issues/1847))  [`8f55dd8`](https://togithub.com/cheeriojs/cheerio/commit/8f55dd8)
-   Make `filter` work on all collections ([#1870](https://togithub.com/cheeriojs/cheerio/issues/1870), fixes [#1867](https://togithub.com/cheeriojs/cheerio/issues/1867))  [`fb8d31e`](https://togithub.com/cheeriojs/cheerio/commit/fb8d31e)
-   Bump cheerio-select ([#1922](https://togithub.com/cheeriojs/cheerio/issues/1922), fixes https://www.npmjs.com/advisories/1754)  [`5cd2b9c`](https://togithub.com/cheeriojs/cheerio/commit/5cd2b9c)

**Documentation:**

-   Document how to define TS types for Plug-Ins ([#1915](https://togithub.com/cheeriojs/cheerio/issues/1915), fixes [#1778](https://togithub.com/cheeriojs/cheerio/issues/1778))  [`880fd2c`](https://togithub.com/cheeriojs/cheerio/commit/880fd2c)
-   Remove obsolete Testing section  [`e0c7cbb`](https://togithub.com/cheeriojs/cheerio/commit/e0c7cbb)
-   Remove now-invalid `require`  [`5dfbd35`](https://togithub.com/cheeriojs/cheerio/commit/5dfbd35)

**Refactors:**

-   Wrap shared behavior in `traversing` ([#1909](https://togithub.com/cheeriojs/cheerio/issues/1909))  [`58e090a`](https://togithub.com/cheeriojs/cheerio/commit/58e090a)
-   Move `is` to `traversing`, optimize ([#1908](https://togithub.com/cheeriojs/cheerio/issues/1908))  [`1c6fa3e`](https://togithub.com/cheeriojs/cheerio/commit/1c6fa3e)
-   Change order of arguments of internal `domEach` ([#1892](https://togithub.com/cheeriojs/cheerio/issues/1892))  [`feda230`](https://togithub.com/cheeriojs/cheerio/commit/feda230)
-   Have `load` export a function ([#1869](https://togithub.com/cheeriojs/cheerio/issues/1869))  [`c370f4e`](https://togithub.com/cheeriojs/cheerio/commit/c370f4e)

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Mend Renovate](https://www.mend.io/free-developer-tools/renovate/). View repository job log [here](https://developer.mend.io/github/sammyfilly/Canary-nextjs).
kodiakhq bot pushed a commit to X-oss-byte/Nextjs that referenced this issue Sep 25, 2023
@adamreisnz

Ugh, why is this feature documented if it's not actually released yet? 😢

@christo

christo commented Nov 9, 2023

Super confusing and time consuming to read docs added by this commit 976b087 for a proposed feature with no apparent implementation work evident in the repo. A new user like me, while not wanting to be mistaken for an ungrateful or entitled whiner, is left wondering if this kind of thing is representative of what I should expect from the rest of cheerio or if this is a rare exception.

@fb55
Member Author

fb55 commented Nov 9, 2023

@bluescorpian

Remove it from the docs if it's not in the latest release.

@piscopancer

Where is extract function? There is none on Root 😭

@sebagr

sebagr commented May 22, 2024

May 2024, still not implemented and still on the docs? Or why am I getting TypeError: $.extract is not a function? Very confusing!

@rikkit

rikkit commented May 27, 2024

Why would you take the time to document a feature that's not implemented? So weird!

@fb55
Member Author

fb55 commented May 27, 2024

It is implemented, just not released yet.
