Documentation updates for v2.3.0 #5593

Merged 26 commits on Jun 16, 2020
Commits
7da3ada
Update website models for v2.3.0
adrianeboyd May 19, 2020
310524d
Add docs for Chinese word segmentation
adrianeboyd May 21, 2020
ee136e5
Merge pull request #5457 from adrianeboyd/website/models-v2.3.0
ines May 21, 2020
8f98bd0
Tighten up Chinese docs section
ines May 21, 2020
8dbac77
Merge pull request #5474 from adrianeboyd/website/chinese-support-v2.3.0
ines May 21, 2020
815e687
Merge branch 'master' into docs/v2.3.0 [ci skip]
ines May 21, 2020
9818222
Merge branch 'master' into docs/v2.3.0 [ci skip]
ines May 21, 2020
75565f8
Auto-format and update version
ines May 22, 2020
e97e1b1
Update matcher.md
ines May 22, 2020
1f2a9e2
Update languages and sorting
ines May 22, 2020
4a87505
Typo in landing page
adrianeboyd May 27, 2020
c8aeed9
Merge remote-tracking branch 'upstream/master' into docs/v2.3.0
adrianeboyd May 27, 2020
6ccdfed
Infobox about token_match behavior
adrianeboyd May 27, 2020
d593f6d
Add meta and basic docs for Japanese
adrianeboyd Jun 10, 2020
4de5142
POS -> TAG in models table
ines Jun 15, 2020
925834a
Add info about lookups for normalization
adrianeboyd Jun 15, 2020
fe41abf
Updates to API docs for v2.3
adrianeboyd Jun 15, 2020
56085d2
Update adding norm exceptions for adding languages
adrianeboyd Jun 16, 2020
ef5aa64
Add --omit-extra-lookups to CLI API docs
adrianeboyd Jun 16, 2020
803e01e
Add initial draft of "What's New in v2.3"
adrianeboyd Jun 16, 2020
d018dc9
Add new in v2.3 tags to Chinese and Japanese sections
adrianeboyd Jun 16, 2020
9425f68
Add tokenizer to migration section
adrianeboyd Jun 16, 2020
93cef9c
Add new in v2.3 flags to init-model
adrianeboyd Jun 16, 2020
cdc1984
Typo
adrianeboyd Jun 16, 2020
e560e7d
More what's new in v2.3
adrianeboyd Jun 16, 2020
3be577e
Merge branch 'v2.3.x' into docs/v2.3.0-v2.3.x
adrianeboyd Jun 16, 2020
17 changes: 9 additions & 8 deletions README.md
@@ -6,12 +6,12 @@ spaCy is a library for advanced Natural Language Processing in Python and
Cython. It's built on the very latest research, and was designed from day one to
be used in real products. spaCy comes with
[pretrained statistical models](https://spacy.io/models) and word vectors, and
currently supports tokenization for **50+ languages**. It features
currently supports tokenization for **60+ languages**. It features
state-of-the-art speed, convolutional **neural network models** for tagging,
parsing and **named entity recognition** and easy **deep learning** integration.
It's commercial open-source software, released under the MIT license.

💫 **Version 2.2 out now!**
💫 **Version 2.3 out now!**
[Check out the release notes here.](https://github.com/explosion/spaCy/releases)

[![Azure Pipelines](<https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-pipelines&style=flat-square&label=build+(3.x)>)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
@@ -32,15 +32,15 @@ It's commercial open-source software, released under the MIT license.
| --------------- | -------------------------------------------------------------- |
| [spaCy 101] | New to spaCy? Here's everything you need to know! |
| [Usage Guides] | How to use spaCy and its features. |
| [New in v2.2] | New features, backwards incompatibilities and migration guide. |
| [New in v2.3] | New features, backwards incompatibilities and migration guide. |
| [API Reference] | The detailed reference for spaCy's API. |
| [Models] | Download statistical language models for spaCy. |
| [Universe] | Libraries, extensions, demos, books and courses. |
| [Changelog] | Changes and version history. |
| [Contribute] | How to contribute to the spaCy project and code base. |

[spacy 101]: https://spacy.io/usage/spacy-101
[new in v2.2]: https://spacy.io/usage/v2-2
[new in v2.3]: https://spacy.io/usage/v2-3
[usage guides]: https://spacy.io/usage/
[api reference]: https://spacy.io/api/
[models]: https://spacy.io/models
@@ -113,12 +113,13 @@ of `v2.0.13`).
pip install spacy
```

To install additional data tables for lemmatization in **spaCy v2.2+** you can
run `pip install spacy[lookups]` or install
To install additional data tables for lemmatization and normalization in
**spaCy v2.2+** you can run `pip install spacy[lookups]` or install
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
separately. The lookups package is needed to create blank models with
lemmatization data, and to lemmatize in languages that don't yet come with
pretrained models and aren't powered by third-party libraries.
lemmatization data for v2.2+ plus normalization data for v2.3+, and to
lemmatize in languages that don't yet come with pretrained models and aren't
powered by third-party libraries.
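
A minimal sketch of what the lookups package enables (assuming `spacy[lookups]` is installed; the exact lemmas depend on the table data):

```python
import spacy

# A blank model has no tagger or pretrained pipeline, so lemmatization
# falls back to the lookup tables shipped in spacy-lookups-data.
nlp = spacy.blank("en")
doc = nlp("The cats were running")
print([token.lemma_ for token in doc])
```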

When using pip it is generally recommended to install packages in a virtual
environment to avoid modifying system state:
21 changes: 11 additions & 10 deletions website/docs/api/cli.md
@@ -541,16 +541,17 @@ $ python -m spacy init-model [lang] [output_dir] [--jsonl-loc] [--vectors-loc]
[--prune-vectors]
```

| Argument | Type | Description |
| ------------------------------------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `lang` | positional | Model language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes), e.g. `en`. |
| `output_dir` | positional | Model output directory. Will be created if it doesn't exist. |
| `--jsonl-loc`, `-j` | option | Optional location of JSONL-formatted [vocabulary file](/api/annotation#vocab-jsonl) with lexical attributes. |
| `--vectors-loc`, `-v` | option | Optional location of vectors. Should be a file where the first row contains the dimensions of the vectors, followed by a space-separated Word2Vec table. File can be provided in `.txt` format or as a zipped text file in `.zip` or `.tar.gz` format. |
| `--truncate-vectors`, `-t` <Tag variant="new">2.3</Tag> | option | Number of vectors to truncate to when reading in vectors file. Defaults to `0` for no truncation. |
| `--prune-vectors`, `-V` | option | Number of vectors to prune the vocabulary to. Defaults to `-1` for no pruning. |
| `--vectors-name`, `-vn` | option | Name to assign to the word vectors in the `meta.json`, e.g. `en_core_web_md.vectors`. |
| **CREATES** | model | A spaCy model containing the vocab and vectors. |
| Argument | Type | Description |
| ----------------------------------------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `lang` | positional | Model language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes), e.g. `en`. |
| `output_dir` | positional | Model output directory. Will be created if it doesn't exist. |
| `--jsonl-loc`, `-j` | option | Optional location of JSONL-formatted [vocabulary file](/api/annotation#vocab-jsonl) with lexical attributes. |
| `--vectors-loc`, `-v` | option | Optional location of vectors. Should be a file where the first row contains the dimensions of the vectors, followed by a space-separated Word2Vec table. File can be provided in `.txt` format or as a zipped text file in `.zip` or `.tar.gz` format. |
| `--truncate-vectors`, `-t` <Tag variant="new">2.3</Tag> | option | Number of vectors to truncate to when reading in vectors file. Defaults to `0` for no truncation. |
| `--prune-vectors`, `-V` | option | Number of vectors to prune the vocabulary to. Defaults to `-1` for no pruning. |
| `--vectors-name`, `-vn` | option | Name to assign to the word vectors in the `meta.json`, e.g. `en_core_web_md.vectors`. |
| `--omit-extra-lookups`, `-OEL` <Tag variant="new">2.3</Tag> | flag | Do not include any of the extra lookups tables (`cluster`/`prob`/`sentiment`) from `spacy-lookups-data` in the model. |
| **CREATES** | model | A spaCy model containing the vocab and vectors. |
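
As a hedged illustration (the output directory and vectors file are placeholders, not from the original docs), a run that skips the extra lookups tables might look like:

```bash
$ python -m spacy init-model en /tmp/en_model --vectors-loc vectors.txt --omit-extra-lookups
```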

## Evaluate {#evaluate new="2"}

3 changes: 0 additions & 3 deletions website/docs/api/cython-structs.md
@@ -171,9 +171,6 @@ struct.
| `shape` | <Abbr title="uint64_t">`attr_t`</Abbr> | Transform of the lexeme's string, to show orthographic features. |
| `prefix` | <Abbr title="uint64_t">`attr_t`</Abbr> | Length-N substring from the start of the lexeme. Defaults to `N=1`. |
| `suffix` | <Abbr title="uint64_t">`attr_t`</Abbr> | Length-N substring from the end of the lexeme. Defaults to `N=3`. |
| `cluster` | <Abbr title="uint64_t">`attr_t`</Abbr> | Brown cluster ID. |
| `prob` | `float` | Smoothed log probability estimate of the lexeme's word type (context-independent entry in the vocabulary). |
| `sentiment` | `float` | A scalar value indicating positivity or negativity. |
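
These fields were removed from the C struct in v2.3. A short sketch of reading the equivalent values from Python instead (assumes a model and the extra lookups tables are installed; `en_core_web_sm` is an example choice):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
lexeme = nlp.vocab["hello"]
# cluster, prob and sentiment are now plain lexeme attributes backed by
# lookups tables instead of LexemeC struct fields.
print(lexeme.cluster, lexeme.prob, lexeme.sentiment)
```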

### Lexeme.get_struct_attr {#lexeme_get_struct_attr tag="staticmethod, nogil" source="spacy/lexeme.pxd"}

1 change: 1 addition & 0 deletions website/docs/api/goldparse.md
@@ -22,6 +22,7 @@ missing – the gradient for those labels will be zero.
| `entities` | iterable | A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None. |
| `cats` | dict | Labels for text classification. Each key in the dictionary is a string label for the category and each value is `1.0` (positive) or `0.0` (negative). |
| `links` | dict | Labels for entity linking. A dict with `(start_char, end_char)` keys, and the values being dicts with `kb_id:value` entries, representing external KB IDs mapped to either `1.0` (positive) or `0.0` (negative). |
| `make_projective` | bool | Whether to projectivize the dependency tree. Defaults to `False`. |
| **RETURNS** | `GoldParse` | The newly constructed object. |
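
A minimal sketch of the new argument (the heads/deps annotations are toy values for illustration):

```python
import spacy
from spacy.gold import GoldParse

nlp = spacy.blank("en")
doc = nlp("I like London")
# make_projective=True converts a non-projective tree into a projective one
gold = GoldParse(doc, heads=[1, 1, 1], deps=["nsubj", "ROOT", "dobj"],
                 make_projective=True)
print(gold.heads)
```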

## GoldParse.\_\_len\_\_ {#len tag="method"}
2 changes: 1 addition & 1 deletion website/docs/api/lexeme.md
@@ -156,7 +156,7 @@ The L2 norm of the lexeme's vector representation.
| `like_url` | bool | Does the lexeme resemble a URL? |
| `like_num` | bool | Does the lexeme represent a number? e.g. "10.9", "10", "ten", etc. |
| `like_email` | bool | Does the lexeme resemble an email address? |
| `is_oov` | bool | Is the lexeme out-of-vocabulary? |
| `is_oov` | bool | Does the lexeme have a word vector? |
| `is_stop` | bool | Is the lexeme part of a "stop list"? |
| `lang` | int | Language of the parent vocabulary. |
| `lang_` | unicode | Language of the parent vocabulary. |
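
A small sketch of the revised semantics (assuming the `en_core_web_md` vectors model is installed; compare `is_oov` with `has_vector` directly rather than relying on the old definition):

```python
import spacy

nlp = spacy.load("en_core_web_md")
# As of v2.3, is_oov reflects word-vector coverage rather than
# vocabulary membership.
for word in ["cat", "afskfsd"]:
    lexeme = nlp.vocab[word]
    print(word, lexeme.is_oov, lexeme.has_vector)
```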
11 changes: 6 additions & 5 deletions website/docs/api/matcher.md
@@ -40,7 +40,8 @@ string where an integer is expected) or unexpected property names.

## Matcher.\_\_call\_\_ {#call tag="method"}

Find all token sequences matching the supplied patterns on the `Doc`.
Find all token sequences matching the supplied patterns on the `Doc`. As of
spaCy v2.3, the `Matcher` can also be called on `Span` objects.

> #### Example
>
@@ -54,10 +55,10 @@ Find all token sequences matching the supplied patterns on the `Doc`.
> matches = matcher(doc)
> ```
| Name | Type | Description |
| ----------- | ----- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `doc` | `Doc` | The document to match over. |
| **RETURNS** | list | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. |
| Name | Type | Description |
| ----------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `doclike` | `Doc`/`Span` | The `Doc` or `Span` to match over (`Span` support is new in v2.3). |
| **RETURNS** | list | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. |
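
A short sketch of matching over a `Span` (sentence segmentation here relies on the parser in `en_core_web_sm`, an assumption of this example):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("HelloWorld", None, [{"LOWER": "hello"}, {"LOWER": "world"}])

doc = nlp("Some intro text. Hello world!")
# New in v2.3: call the matcher on a single sentence span
span = list(doc.sents)[1]
matches = matcher(span)
```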

<Infobox title="Important note" variant="warning">

2 changes: 1 addition & 1 deletion website/docs/api/sentencizer.md
@@ -42,7 +42,7 @@ Initialize the sentencizer.
| Name | Type | Description |
| ------------- | ------------- | ------------------------------------------------------------------------------------------------------ |
| `punct_chars` | list | Optional custom list of punctuation characters that mark sentence ends. Defaults to `[".", "!", "?"]`. |
| `punct_chars` | list | Optional custom list of punctuation characters that mark sentence ends. Defaults to `['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', '।', '॥', '၊', '။', '።', '፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄', '᥅', '᪨', '᪩', '᪪', '᪫', '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿', '‼', '‽', '⁇', '⁈', '⁉', '⸮', '⸼', '꓿', '꘎', '꘏', '꛳', '꛷', '꡶', '꡷', '꣎', '꣏', '꤯', '꧈', '꧉', '꩝', '꩞', '꩟', '꫰', '꫱', '꯫', '﹒', '﹖', '﹗', '!', '.', '?', '𐩖', '𐩗', '𑁇', '𑁈', '𑂾', '𑂿', '𑃀', '𑃁', '𑅁', '𑅂', '𑅃', '𑇅', '𑇆', '𑇍', '𑇞', '𑇟', '𑈸', '𑈹', '𑈻', '𑈼', '𑊩', '𑑋', '𑑌', '𑗂', '𑗃', '𑗉', '𑗊', '𑗋', '𑗌', '𑗍', '𑗎', '𑗏', '𑗐', '𑗑', '𑗒', '𑗓', '𑗔', '𑗕', '𑗖', '𑗗', '𑙁', '𑙂', '𑜼', '𑜽', '𑜾', '𑩂', '𑩃', '𑪛', '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈', '。', '。']`. |
| **RETURNS** | `Sentencizer` | The newly constructed object. |
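
A brief sketch with a custom `punct_chars` list (the character set is illustrative, not the default):

```python
import spacy
from spacy.pipeline import Sentencizer

nlp = spacy.blank("en")
# Only treat "!" and "?" as sentence-final punctuation
nlp.add_pipe(Sentencizer(punct_chars=["!", "?"]))
doc = nlp("Are you here? Yes! Good.")
print([sent.text for sent in doc.sents])
```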

## Sentencizer.\_\_call\_\_ {#call tag="method"}