marctorsoc/title_detector

Overview

The system detects titles in documents, given the following features for each candidate title:

- text
- is_bold
- is_italic
- is_underlined
- left
- right
- top
- bottom
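
As an illustration of the input described above, here is a minimal sketch of a record holding these eight features. The class name `SectionFeatures`, the helper `from_row`, the field types, and the boolean encoding are assumptions made for this example; they are not taken from the project's code, and the units of the coordinates are not specified here.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SectionFeatures:
    """Hypothetical container for the features listed above (not the project's own class)."""

    text: str
    is_bold: bool
    is_italic: bool
    is_underlined: bool
    left: float
    right: float
    top: float
    bottom: float

    @property
    def width(self) -> float:
        # Horizontal extent of the bounding box; assumes right >= left.
        return self.right - self.left


def _to_bool(value: object) -> bool:
    # Assumption: booleans are stored as "True"/"False" or "1"/"0" strings.
    return str(value).strip().lower() in {"true", "1"}


def from_row(row: Mapping[str, object]) -> SectionFeatures:
    """Build a record from a dict-like row, e.g. one yielded by csv.DictReader.

    Assumption: column names match the feature names listed above.
    """
    return SectionFeatures(
        text=str(row["text"]),
        is_bold=_to_bool(row["is_bold"]),
        is_italic=_to_bool(row["is_italic"]),
        is_underlined=_to_bool(row["is_underlined"]),
        left=float(row["left"]),
        right=float(row["right"]),
        top=float(row["top"]),
        bottom=float(row["bottom"]),
    )
```

A frozen dataclass is used so that a parsed record cannot be modified by accident once it has been read.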

The master branch uses Poetry; an older branch uses setuptools.

Installation

The system requires Python 3 and conda. Recommended step-by-step installation:

  1. Create a new virtual environment using conda

    conda create --name a_name python=3.7

  2. Activate environment

    conda activate a_name

  3. (From source) This is not really an installation; it only sets up the environment with the project's dependencies (this might take a while)

    poetry install

  4. (From wheel; skip if done from source) Install the package (this might take a while)

    pip install title_detector-0.1.0-py3-none-any.whl

  5. (Both cases) Install the spaCy language model

    python -m spacy download en_core_web_sm
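
To confirm that the last step succeeded, one option is to check whether the model package can be found in the active environment. The sketch below uses only the standard library and relies on the fact that spaCy models are installed as regular Python packages; the helper name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found in the active environment."""
    return importlib.util.find_spec(package_name) is not None


# Report whether the language model from step 5 is available.
if is_installed("en_core_web_sm"):
    print("en_core_web_sm is available")
else:
    print("en_core_web_sm is missing; run: python -m spacy download en_core_web_sm")
```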

Usage

From the project root directory (not required if the package was installed from the wheel), run:

title_detector [command] [possible args]

Help can be retrieved by

title_detector --help

and

title_detector [command] --help

Examples

title_detector train --> defaults to the sample train data

title_detector train --max_docs 300

title_detector detect --> defaults to the sample test data

title_detector detect --predicted_data_path sample/train_sections_data_detected.csv

title_detector evaluate --> defaults to the sample test data

title_detector clean --> defaults to the default model output location

Note that when using the sample data, commands must be run from the project root directory so that the data can be found.
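
The commands above can also be chained from a script. The following is a rough sketch, assuming that `title_detector` is on the `PATH` and that running train, detect, and evaluate in that order is the intended workflow (the order is an assumption, not something stated here). The function names `build_commands` and `run_pipeline` are hypothetical; only the subcommands and the `--max_docs` flag come from the examples above.

```python
import subprocess
from typing import List, Optional


def build_commands(max_docs: Optional[int] = None) -> List[List[str]]:
    """Return the argument lists for train, detect and evaluate (assumed order)."""
    train = ["title_detector", "train"]
    if max_docs is not None:
        # --max_docs is taken from the examples above.
        train += ["--max_docs", str(max_docs)]
    return [
        train,
        ["title_detector", "detect"],
        ["title_detector", "evaluate"],
    ]


def run_pipeline(max_docs: Optional[int] = None, project_root: str = ".") -> None:
    """Run each command from the project root so the sample data can be found."""
    for command in build_commands(max_docs):
        # check=True stops the pipeline at the first failing command.
        subprocess.run(command, cwd=project_root, check=True)
```

Calling `run_pipeline(max_docs=300)` from the project root would then run the three commands in sequence with the default sample data.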

Testing

TODO

Author: Marc Torrellas Socastro
2019, July
