Wordy document analyser

A simple example of natural language processing.

Specification

Primary user story

I am an end-user with many documents. I want to see which interesting words occur most frequently, so that I can identify important topics across all documents.

Scenarios

View results

Given documents are parsed
When the user opens the page
Then display words sorted by frequency
And display a sample sentence from each document
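
The "View results" scenario can be sketched in plain Python. This is an illustration only, not the project's implementation: the function name `summarise`, the regular-expression tokenizer and the naive sentence splitter are all assumptions made for this sketch.

```python
import re
from collections import Counter


def summarise(documents):
    """Return (word, count, samples) rows sorted by descending count.

    `documents` maps a document name to its text. Hypothetical helper:
    a fuller implementation would use proper NLP tokenizers.
    """
    counts = Counter()
    samples = {}  # word -> {document name: first sentence containing it}
    for name, text in documents.items():
        # Naive sentence split on ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            for word in re.findall(r"[a-z']+", sentence.lower()):
                counts[word] += 1
                # Keep one sample sentence per word per document.
                samples.setdefault(word, {}).setdefault(name, sentence)
    return [(word, count, samples[word]) for word, count in counts.most_common()]


docs = {
    "X": "I like philosophy. Philosophy is hard.",
    "Y": "Her philosophy is simple.",
}
rows = summarise(docs)
print(rows[0][:2])  # ('philosophy', 3)
```

Each row carries one sample sentence per document, which matches the "sample sentence from each document" requirement.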

Upload a new file

Given the filename has not been used before
When the user uploads the file
Then parse the contents
And display the updated results

Upload an existing file

Given the filename has been used before
When the user uploads the file
Then replace the old contents
And parse the new contents
And display the updated results
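
The two upload scenarios differ only in whether the filename already exists. A minimal sketch, assuming a plain dict stands in for the database and `upload` is a hypothetical helper (the project's own storage code is not shown here):

```python
def upload(store, filename, contents):
    """Add or replace a document, keyed by filename.

    `store` is a plain dict standing in for the database (an assumption
    for this sketch). Returns True if an existing document was replaced.
    """
    replaced = filename in store
    store[filename] = contents  # new contents overwrite the old ones
    return replaced


store = {}
print(upload(store, "notes.txt", "first version"))   # False: new filename
print(upload(store, "notes.txt", "second version"))  # True: old contents replaced
print(store["notes.txt"])                            # second version
```

In Django this pattern is often written with `QuerySet.update_or_create`, although this README does not say whether the project does so.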

Sample output

Word	Count	Samples
philosophy	42	I don't have time for philosophy... (document X)
		Surely this was a touch of... (document Y)
		Still, her pay-as-you-go philosophy... (document Z)
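
Each entry in the Samples column is a truncated sentence followed by its document name. A minimal sketch of that formatting, assuming a hypothetical helper `format_sample` and an arbitrary character limit (the project's actual limit and rendering are not stated here):

```python
def format_sample(sentence, document, limit=32):
    """Truncate `sentence` to `limit` characters and append the document name."""
    if len(sentence) > limit:
        # Cut at the limit, drop trailing spaces, mark the cut with "...".
        sentence = sentence[:limit].rstrip() + "..."
    return f"{sentence} (document {document})"


print(format_sample("The quick brown fox jumps over the lazy dog.", "X", limit=19))
# The quick brown fox... (document X)
```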

Notable features

Simple command to download NLTK natural language data
Stopwords (a, the, and...) are removed
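
Stopword removal can be sketched as a simple filter. The tiny word list and the helper name `interesting_words` below are assumptions for illustration; the project presumably draws a much fuller list from the downloaded NLTK data.

```python
import re

# Tiny illustrative stopword list (an assumption for this sketch);
# a real list, such as the one in NLTK data, is much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "i", "is"}


def interesting_words(text, stopwords=STOPWORDS):
    """Return lowercase words from `text` that are not stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word in words if word not in stopwords]


print(interesting_words("The philosophy of a cat and the dog"))
# ['philosophy', 'cat', 'dog']
```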

Limitations

This site is not production-ready!

Minimal file validation
Cannot specify download location for NLTK data
Requires admin access to delete documents

Out of scope

Access rights and granular permissions
Database instance (i.e. other than SQLite)
Fully offline operation (no additional download steps required)
Integration testing
Languages other than English (American spelling?)
Security-hardened configuration

Next steps

See TODO.md for strategies to add NLP features, achieve high scalability and improve UI responsiveness.

Screenshot

See screenshot.png in the repository root.

Run from Docker

Prerequisites

Docker
docker-compose

Steps

cd simplenlp
docker-compose up
# the first time you run this command,
# it will take a while to build the image

Once the container is running, visit http://localhost:8000 in your browser.

The default credentials for http://localhost:8000/admin/ are username admin, password admin.

Usage

Open http://localhost:8000 in your browser. The results table starts empty.
Choose and upload a text file to add it to the results.
- You will get an error if you attempt to upload a non-text file. Go back and try again.
Sample sentences are truncated. Hover over a sample to see the full sentence. (See screenshot for an example.)
Keep uploading text files to recalculate the results. Duplicate filenames will overwrite existing documents and related results.
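
How the site decides that an upload is not a text file is not described here. One common heuristic, shown as a sketch with a hypothetical helper `is_text`, is to reject data that contains NUL bytes or does not decode as UTF-8:

```python
def is_text(data: bytes) -> bool:
    """Return True if `data` decodes as UTF-8 and contains no NUL bytes."""
    if b"\x00" in data:
        return False  # NUL bytes almost never appear in plain text
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_text("plain café text".encode("utf-8")))  # True
print(is_text(b"\x89PNG\r\n\x1a\n"))               # False: PNG file signature
```

This check is deliberately minimal and would accept any valid UTF-8 data, whatever its file extension.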

Run from source

Prerequisites

Python 3.8
pipenv

Steps

Create a virtual environment for Python:

cd simplenlp
pipenv install
# tested on Ubuntu 20.04 LTS
# if this fails, try deleting Pipfile.lock

Bootstrap Django:

pipenv shell
# activate virtual environment

python manage.py initwordy
# download natural language data (NLTK data)
# this can take 1-2 minutes
# default location is ~/nltk_data

python manage.py migrate
# create schema for SQLite database

python manage.py createsuperuser
# follow prompts to create an admin user

python manage.py runserver
# start Django site

Confirm there are no errors or warnings in the logs.
- If you see a warning, you probably missed the initwordy step above.
See the "Usage" section under "Run from Docker" for more details.
- You will get an error when uploading text files if you missed the initwordy step while bootstrapping Django. Stop the site, run the initwordy command and try again.
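
To check whether the initwordy step was missed, you can look for the downloaded data in the default location. The sketch below is an assumption-laden illustration: the helper `missing_nltk_resources` is hypothetical, and the resource names are guesses at what the project needs, not taken from its code.

```python
from pathlib import Path


def missing_nltk_resources(resources, data_dir=None):
    """Return the resources not found under the NLTK data directory.

    `resources` are paths relative to the data directory. Which resources
    the project needs is an assumption here; adjust to match initwordy.
    """
    root = Path(data_dir) if data_dir else Path.home() / "nltk_data"
    return [name for name in resources if not (root / name).exists()]


# Hypothetical resource list, for illustration only.
missing = missing_nltk_resources(["corpora/stopwords", "tokenizers/punkt"])
if missing:
    print("Missing NLTK data, try: python manage.py initwordy", missing)
else:
    print("NLTK data found")
```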

Testing

Follow the steps below to generate a coverage report.

Line coverage is already high (see coverage.txt), so it is more important to consider case coverage. See wordy/tests.py for the current cases.

pipenv install --dev
# installs developer resources

coverage run --source='.' manage.py test
# runs all Django tests
# branches marked "pragma: no cover" are ignored
# (typically integration-related issues)

coverage report > coverage.txt
# generates summary report as coverage.txt

coverage html
# (optional) generates detailed report as htmlcov/index.html
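
The "pragma: no cover" marker is a coverage.py convention: a line carrying the comment is excluded from the report, and if that line introduces a block, the whole block is excluded. The function below is a hypothetical example, not code from wordy:

```python
def read_text(path):
    """Read a UTF-8 text file, returning an empty string if it is missing."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except FileNotFoundError:
        return ""
    except PermissionError:  # pragma: no cover
        # Environment-specific failure that unit tests cannot easily
        # trigger, so coverage.py leaves this branch out of the report.
        raise


print(repr(read_text("/no/such/file.txt")))  # ''
```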

Code quality

black .
# apply coding conventions

coverage run --source='.' manage.py test
coverage report > coverage.txt
# run tests and generate coverage report

git diff --exit-code
# confirm these commands do not generate changes

If all tasks pass, your changes are ready for submission. Otherwise you need to fix, commit and validate again.

You can invoke these tasks in any CI/CD pipeline. All cleanup and validation tasks should succeed without error or modification.
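
As one possible way to wire these tasks into a pipeline, the sketch below runs them in order and stops at the first failure. The names `CHECKS`, `run_checks` and `main` are assumptions for this example; a shell script or the pipeline's own configuration would work equally well.

```python
import subprocess

# The commands from the steps above, in order.
CHECKS = [
    "black .",
    "coverage run --source='.' manage.py test",
    "coverage report > coverage.txt",
    "git diff --exit-code",
]


def run_checks(commands):
    """Run shell commands in order; return the first that fails, else None."""
    for command in commands:
        # shell=True so that output redirection (>) works as in a terminal.
        if subprocess.run(command, shell=True).returncode != 0:
            return command
    return None


def main():
    """Return a process exit code suitable for a CI job."""
    failed = run_checks(CHECKS)
    if failed:
        print(f"Check failed: {failed}")
        return 1
    print("All checks passed")
    return 0
```

Calling `sys.exit(main())` from the project root, inside the pipenv environment, makes the job fail as soon as any task fails.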

Sharing and contributions

Wordy document analyser
https://gitlab.com/lofidevops/simplenlp
Copyright 2022 David Seaward
SPDX-License-Identifier: GPL-3.0-or-later

Shared under GPL-3.0-or-later. We adhere to the Contributor Covenant 2.0 without modification, and certify origin per DCO 1.1 with a signed-off-by line. Contributions under the same terms are welcome.

For details see:

COPYING.md, full license text
CODE_OF_CONDUCT.md, full conduct text (report via a private ticket)
CONTRIBUTING.md, full origin text (git commit -s)
