Commit

Merge pull request #84 from GLAM-Workbench/update
Update Trove Harvester package and add RO-Crate file
wragge committed Oct 25, 2023
2 parents 6202a61 + f897e23 commit 6dbfcee
Showing 17 changed files with 671 additions and 64 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -11,3 +11,4 @@ getting-started/my_nma_dataset.csv
anzac-day*
front_pages*
Untitled*
+scripts/add_nb_metadata.py
8 changes: 4 additions & 4 deletions .zenodo.json
@@ -5,7 +5,7 @@
"related_identifiers": [
{
"scheme": "url",
-"identifier": "https://github.com/GLAM-Workbench/trove-newspaper-harvester/tree/v2.0.0",
+"identifier": "https://github.com/GLAM-Workbench/trove-newspaper-harvester/tree/v2.0.1",
"relation": "isDerivedFrom",
"resource_type": "software"
},
@@ -22,21 +22,21 @@
"resource_type": "other"
}
],
-"version": "v2.0.0",
+"version": "v2.0.1",
"upload_type": "software",
"keywords": [
"Trove",
"newspapers",
"Jupyter",
"GLAM Workbench"
],
-"publication_date": "2023-08-31",
+"publication_date": "2023-10-25",
"creators": [
{
"orcid": "0000-0001-7956-4498",
"name": "Sherratt, Tim"
}
],
"access_right": "open",
-"description": "<p>Current version: <a href=\"https://github.com/GLAM-Workbench/trove-newspaper-harvester/releases/tag/v2.0.0\">v2.0.0</a></p> <p>The <a href=\"https://pypi.org/project/trove-newspaper-harvester/\">Trove Newspaper &amp; Gazette Harvester Harvester</a> makes it easy to download large quantities of digitised articles from Trove’s newspapers and gazettes. Just give it a search from the Trove web interface, and the harvester will save the metadata of all the articles in a CSV (spreadsheet) file for further analysis. You can also save the full text of every article, as well as copies of the articles as JPG images, and even PDFs. While the web interface will only show you the first 2,000 results matching your search, the Newspaper &amp; Gazette Harvester will get <strong>everything</strong>.</p> <p>The Jupyter notebooks in this repository use the Trove Newspaper and Gazette Harvester to download large quantities of digitised newspaper articles from Trove. There’s also a few examples of how you can analyse and explore the harvested data.</p> <p>The notebooks include:</p> <ul> <li><strong>Using TroveHarvester to get newspaper articles in bulk</strong> — an easy introduction to the TroveHarvester tool</li> <li><strong>Trove Harvester web app</strong> — a simple web interface to the TroveHarvester, the easiest way to harvest data from Trove (runs in Voila)</li> <li><strong>Harvesting articles that mention “Anzac Day” on Anzac Day</strong> – import the Harvester as a Python library to harvest a complex search</li> <li><strong>Display the results of a harvest as a searchable database using Datasette</strong> – load your harvested data into a SQLite database and explore it using Datasette</li> <li><strong>Exploring your TroveHarvester data</strong> — use Pandas to analyse your data and create some visualisations</li> <li><strong>Explore harvested text files</strong> (experimental) — analyse the full text content of harvested articles</li> </ul> <p>See the <a href=\"https://glam-workbench.github.io/trove-harvester/\">GLAM Workbench for more details</a>.</p> <h2 id=\"cite-as\">Cite as</h2> <p>See the GLAM Workbench or <a href=\"https://doi.org/10.5281/zenodo.3545044\">Zenodo</a> for up-to-date citation details.</p> <hr /> <p>This repository is part of the <a href=\"https://glam-workbench.github.io/\">GLAM Workbench</a>.<br /> If you think this project is worthwhile, you might like <a href=\"https://github.com/sponsors/wragge?o=esb\">to sponsor me on GitHub</a>.</p>"
+"description": "<p>Current version: <a href=\"https://github.com/GLAM-Workbench/trove-newspaper-harvester/releases/tag/v2.0.1\">v2.0.1</a></p> <p>The <a href=\"https://pypi.org/project/trove-newspaper-harvester/\">Trove Newspaper &amp; Gazette Harvester Harvester</a> makes it easy to download large quantities of digitised articles from Trove’s newspapers and gazettes. Just give it a search from the Trove web interface, and the harvester will save the metadata of all the articles in a CSV (spreadsheet) file for further analysis. You can also save the full text of every article, as well as copies of the articles as JPG images, and even PDFs. While the web interface will only show you the first 2,000 results matching your search, the Newspaper &amp; Gazette Harvester will get <strong>everything</strong>.</p> <p>The Jupyter notebooks in this repository use the Trove Newspaper and Gazette Harvester to download large quantities of digitised newspaper articles from Trove. There’s also a few examples of how you can analyse and explore the harvested data.</p> <p>The notebooks include:</p> <ul> <li><strong>Using TroveHarvester to get newspaper articles in bulk</strong> — an easy introduction to the TroveHarvester tool</li> <li><strong>Trove Harvester web app</strong> — a simple web interface to the TroveHarvester, the easiest way to harvest data from Trove (runs in Voila)</li> <li><strong>Harvesting articles that mention “Anzac Day” on Anzac Day</strong> – import the Harvester as a Python library to harvest a complex search</li> <li><strong>Display the results of a harvest as a searchable database using Datasette</strong> – load your harvested data into a SQLite database and explore it using Datasette</li> <li><strong>Exploring your TroveHarvester data</strong> — use Pandas to analyse your data and create some visualisations</li> <li><strong>Explore harvested text files</strong> (experimental) — analyse the full text content of harvested articles</li> </ul> <p>See the <a href=\"https://glam-workbench.github.io/trove-harvester/\">GLAM Workbench for more details</a>.</p> <h2 id=\"cite-as\">Cite as</h2> <p>See the GLAM Workbench or <a href=\"https://doi.org/10.5281/zenodo.3545044\">Zenodo</a> for up-to-date citation details.</p> <hr /> <p>This repository is part of the <a href=\"https://glam-workbench.github.io/\">GLAM Workbench</a>.<br /> If you think this project is worthwhile, you might like <a href=\"https://github.com/sponsors/wragge?o=esb\">to sponsor me on GitHub</a>.</p>"
}
9 changes: 9 additions & 0 deletions Explore-harvested-text-files.ipynb
@@ -1399,6 +1399,15 @@
"pygments_lexer": "ipython3",
"version": "3.10.12"
},
+"rocrate": {
+"author": [
+{
+"name": "Sherratt, Tim",
+"orcid": "https://orcid.org/0000-0001-7956-4498"
+}
+],
+"name": "Explore harvested text files"
+},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
9 changes: 9 additions & 0 deletions Exploring-your-TroveHarvester-data.ipynb
@@ -1673,6 +1673,15 @@
"pygments_lexer": "ipython3",
"version": "3.10.12"
},
+"rocrate": {
+"author": [
+{
+"name": "Sherratt, Tim",
+"orcid": "https://orcid.org/0000-0001-7956-4498"
+}
+],
+"name": "Exploring your harvested data"
+},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
2 changes: 1 addition & 1 deletion README.md
@@ -1,6 +1,6 @@
# Trove Newspaper and Gazette Harvester

-Current version: [v2.0.0](https://github.com/GLAM-Workbench/trove-newspaper-harvester/releases/tag/v2.0.0)
+Current version: [v2.0.1](https://github.com/GLAM-Workbench/trove-newspaper-harvester/releases/tag/v2.0.1)

The [Trove Newspaper & Gazette Harvester Harvester](https://pypi.org/project/trove-newspaper-harvester/) makes it easy to download large quantities of digitised articles from Trove's newspapers and gazettes. Just give it a search from the Trove web interface, and the harvester will save the metadata of all the articles in a CSV (spreadsheet) file for further analysis. You can also save the full text of every article, as well as copies of the articles as JPG images, and even PDFs. While the web interface will only show you the first 2,000 results matching your search, the Newspaper & Gazette Harvester will get **everything**.

9 changes: 9 additions & 0 deletions Using-TroveHarvester-to-get-newspaper-articles-in-bulk.ipynb
@@ -423,6 +423,15 @@
"pygments_lexer": "ipython3",
"version": "3.10.12"
},
+"rocrate": {
+"author": [
+{
+"name": "Sherratt, Tim",
+"orcid": "https://orcid.org/0000-0001-7956-4498"
+}
+],
+"name": "Using TroveHarvester to get newspaper and gazette articles in bulk"
+},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
9 changes: 9 additions & 0 deletions display_harvest_results_using_datasette.ipynb
@@ -341,6 +341,15 @@
"pygments_lexer": "ipython3",
"version": "3.10.12"
},
+"rocrate": {
+"author": [
+{
+"name": "Sherratt, Tim",
+"orcid": "https://orcid.org/0000-0001-7956-4498"
+}
+],
+"name": "Display the results of a harvest as a searchable database using Datasette"
+},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
9 changes: 9 additions & 0 deletions harvest-specific-days.ipynb
@@ -533,6 +533,15 @@
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
+},
+"rocrate": {
+"author": [
+{
+"name": "Sherratt, Tim",
+"orcid": "https://orcid.org/0000-0001-7956-4498"
+}
+],
+"name": "Harvesting articles that mention \"Anzac Day\" on Anzac Day"
}
},
"nbformat": 4,
9 changes: 9 additions & 0 deletions newspaper_harvester_app.ipynb
@@ -289,6 +289,15 @@
"pygments_lexer": "ipython3",
"version": "3.10.12"
},
+"rocrate": {
+"author": [
+{
+"name": "Sherratt, Tim",
+"orcid": "https://orcid.org/0000-0001-7956-4498"
+}
+],
+"name": "Trove Newspaper & Gazette Harvester"
+},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {
6 changes: 5 additions & 1 deletion requirements-dev.in
@@ -5,4 +5,8 @@ nbqa
black[jupyter]
isort
flake8
-pre-commit
+pre-commit
+rocrate
+giturlparse
+jupyterlab-code-formatter
+gitpython
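
The `rocrate` blocks added to the notebooks in this commit sit under each notebook's top-level `metadata` key, next to `widgets`. The commit does not show how they were written. As a minimal sketch only, assuming a notebook loaded as a plain dict, a block of the same shape could be set with the standard library alone. The function name `add_rocrate_metadata` is hypothetical and is not taken from the repository.

```python
import json


def add_rocrate_metadata(notebook: dict, name: str, authors: list) -> dict:
    """Set a `rocrate` block under the notebook's `metadata` key.

    Hypothetical helper for illustration; it mirrors the shape of the
    blocks in this diff, not any script in the repository.
    """
    metadata = notebook.setdefault("metadata", {})
    metadata["rocrate"] = {"author": authors, "name": name}
    return notebook


# A stripped-down stand-in for a notebook; a real .ipynb file would be
# read with json.load() and written back with json.dump().
notebook = {"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 5}

add_rocrate_metadata(
    notebook,
    name="Explore harvested text files",
    authors=[
        {
            "name": "Sherratt, Tim",
            "orcid": "https://orcid.org/0000-0001-7956-4498",
        }
    ],
)

# Show the resulting metadata block.
print(json.dumps(notebook["metadata"], indent=1))
```

Because the block is set by key, running the helper again on the same notebook replaces the existing `rocrate` entry rather than duplicating it.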
