Commit

Readme update for first RC
TAHRI Ahmed R committed Aug 27, 2019
1 parent 1519d80 commit dbb3bdc
Showing 1 changed file with 45 additions and 60 deletions.
105 changes: 45 additions & 60 deletions README.md
@@ -21,7 +21,7 @@
> Library that helps you read human* written text from an unknown charset encoding.<br /> Project motivated by `chardet`; I'm trying to resolve the issue by taking another approach.
This project offers you an alternative to **Universal Charset Encoding Detector**, also known as **Chardet**.
Also as of July, August 2019 it's still a beta release.
**First RC available !**

| Feature | [Chardet](https://github.com/chardet/chardet) | Charset Normalizer | [cChardet](https://github.com/PyYoshi/cChardet) |
| ------------- | :-------------: | :------------------: | :------------------: |
@@ -31,12 +31,12 @@ Also as of July, August 2019 it's still a beta release.
| `Reliable` **with** distinguishable standards ||||
| `Free & Open` ||||
| `Native Python` ||||
| `Does not have specific code for specific charset` ||| |
| `Detect spoken language` ||| N/A |

<p align="center">
<img src="https://i.imgflip.com/373iay.gif" alt="Reading Normalized Text" width="226"/><img src="https://image.noelshack.com/fichiers/2019/31/5/1564761473-ezgif-5-cf1bd9dd66b0.gif" alt="Cat Reading Text" width="200"/>

<small>Cats are going to enjoy newly decoded text</small>
<small>I need people to verify whether this lib covers most cases at this stage.</small>

<small>\* : When written, should not be gibberish.</small><br>
<small>\*\* : They clearly use specific code for each specific charset, even though they cover most existing ones</small><br>
@@ -54,13 +54,47 @@ pip install charset_normalizer

## 🚀 Basic Usage

#### Just print out normalized text
### CLI
This package comes with a CLI.

```bash
usage: normalizer [-h] [--verbose] [--normalize] [--replace] [--force]
                  file [file ...]

positional arguments:
  file         Filename

optional arguments:
  -h, --help   show this help message and exit
  --verbose    Display complementary information about file if any.
  --normalize  Permit to normalize input file. If not set, program does not
               write anything.
  --replace    Replace file when trying to normalize it instead of creating a
               new one.
  --force      Replace file without asking if you are sure, use this flag with
               caution.
```
```bash
normalizer ./data/sample.1.fr.srt
```
```
+----------------------+----------+----------+------------------------------------+-------+-----------+
|       Filename       | Encoding | Language |             Alphabets              | Chaos | Coherence |
+----------------------+----------+----------+------------------------------------+-------+-----------+
| data/sample.1.fr.srt |  cp1252  |  French  | Basic Latin and Latin-1 Supplement | 0.0 % |  84.924 % |
+----------------------+----------+----------+------------------------------------+-------+-----------+
```
### Python
*Just print out normalized text*
```python
from charset_normalizer import CharsetNormalizerMatches as CnM
print(CnM.from_path('./my_subtitle.srt').best().first())
```
#### Normalize any text file
*Normalize any text file*
```python
from charset_normalizer import CharsetNormalizerMatches as CnM
try:
@@ -85,14 +119,14 @@ In a way, **I'm brute forcing text decoding.** How cool is that ? 😎
## 🍰 How
- Discard every charset encoding table that cannot fit the binary content.
- Measure chaos, i.e. the mess produced once the content is opened with a corresponding charset encoding.
- Extract the matches with the lowest mess detected.
- Finally, if too many matches are left, measure coherence.
**Wait a minute**, what is chaos/mess and coherence according to **YOU ?**
*Chaos :* I opened hundreds of text files, **written by humans**, with the wrong encoding table. **I observed**, then
**I established** some ground rules about **what is obvious** when **it seems like** a mess.
I know that my interpretation of what is chaotic is very subjective. Feel free to contribute in order to
improve or rewrite it.
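
The steps listed under *How* can be illustrated with a minimal sketch. This is not the library's implementation: the candidate list `CANDIDATES`, the helper names `decodable`, `chaos_ratio` and `best_guesses`, and the toy mess score (share of control, unassigned or private-use characters) are all assumptions made for illustration, using only the Python standard library.

```python
import unicodedata

# Hypothetical shortlist; a real detector would try many more encodings.
CANDIDATES = ["ascii", "utf_8", "cp1252", "cp1251", "shift_jis", "big5"]


def decodable(payload: bytes, encoding: str):
    """Return the decoded text, or None when the bytes cannot fit this encoding."""
    try:
        return payload.decode(encoding)
    except UnicodeDecodeError:
        return None


def chaos_ratio(text: str) -> float:
    """Toy mess score (an assumption, not the project's rules):
    share of control, unassigned or private-use characters."""
    if not text:
        return 0.0
    suspicious = sum(
        1
        for ch in text
        if unicodedata.category(ch) in ("Cc", "Cn", "Co") and ch not in "\r\n\t"
    )
    return suspicious / len(text)


def best_guesses(payload: bytes):
    """Keep the encodings that fit the bytes and share the lowest mess score."""
    scored = []
    for encoding in CANDIDATES:
        text = decodable(payload, encoding)
        if text is not None:  # step 1: discard tables that cannot fit
            scored.append((chaos_ratio(text), encoding))  # step 2: measure mess
    if not scored:
        return []
    lowest = min(score for score, _ in scored)
    # step 3: keep only the matches with the lowest mess
    return [encoding for score, encoding in scored if score == lowest]


if __name__ == "__main__":
    sample = "Le café est prêt, déjà !".encode("cp1252")
    print(best_guesses(sample))
```

Several candidates can end up with the same lowest score, which is the situation the last step (measuring coherence) is meant to resolve; that step is left out of this sketch.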
@@ -110,53 +144,4 @@ Feel free to check [issues page](https://github.com/ousret/charset_normalizer/is
Copyright © 2019 [Ahmed TAHRI @Ousret](https://github.com/Ousret).<br />
This project is [MIT](https://github.com/Ousret/charset_normalizer/blob/master/LICENSE) licensed.
Letter appearances frequencies used in this project © 2012 [Denny Vrandečić](http://denny.vrandecic.de)

## LoC

It is **always possible** to **make a difference** in this world. I was told that it is impossible to propose a real alternative to Chardet / uChardet, conceptually speaking.

*using cloc tool on master branch of each project*

**Chardet** *Python*
```sh

-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                          42            491           1458          36112
-------------------------------------------------------------------------------
SUM:                            42            491           1458          36112
-------------------------------------------------------------------------------

```


**uChardet** *C++*
```sh

-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C++                             51            740           2958           6927
C/C++ Header                    22            286           1039            876
CMake                            4             30              8            234
-------------------------------------------------------------------------------
SUM:                            77           1056           4005           8037
-------------------------------------------------------------------------------

```

**Charset Normalizer** *Python*

```sh

-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                           6            170            155            977
-------------------------------------------------------------------------------
SUM:                             6            170            155            977
-------------------------------------------------------------------------------

```
