Optical Character Recognition for "copy paste" from quarantine VM

We are examining malware on a VM that is cut off from the network and from available drives. I would like to be able to copy and paste some information (hashes/checksums) out of it, along with some Korean filenames to run through Google Translate.

Requirements/setup

I am using Python 3.6 on my local Ubuntu 16.04 machine. I set up a virtual environment and installed the Python packages listed in requirements.txt:

virtualenv -p /home/cas/miniconda/bin/python --no-site-packages ocr
source ocr/bin/activate
pip install -r requirements.txt

The main OCR engine used is tesseract-ocr, which is installed with apt:

sudo apt-get install tesseract-ocr
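A quick way to confirm the install worked before running any OCR is to check that the binary is on the PATH. This is only a sketch; the helper name tesseract_available is made up for illustration and is not part of this repo.

```python
import shutil


def tesseract_available(binary="tesseract"):
    """Return True if the given executable can be found on the PATH."""
    # shutil.which returns the full path to the executable, or None.
    return shutil.which(binary) is not None
```

Calling tesseract_available() from inside the activated virtualenv should return True once the apt package is installed.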

On the virtual machine we have Python 2.7 on Windows. I had nothing to do with that config, but I'm really happy we have a Python interpreter!

Example workflow

1

On the virtual machine I have mounted a USB image with OSFMount.
The filenames are in Korean characters, and I want to investigate them with Google Translate!

2

On the VM, copy the file name into a file called "unicode_raw.txt".
Run get_utf_codes.py to save out the UTF codes for OCR recognition.
Note that this is done on the VM; in this case we are doing OCR on output_ocr.txt.
This is a great place to make the text as clear as possible by changing to a large font in Notepad.
A high-quality image will give the OCR algorithm the best chance at accuracy.

3

On the local machine, take a screenshot of the Unicode representation for OCR analysis.
Run snip_to_text.py on the local machine; the -i flag is the PNG image and -o is the output file.

(ocr) cas@ubuntu:~/working_dir/python_ocr$ python snip_to_text.py -i mp3_chars.png -o mp3_chars_out.txt
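For readers who want to see the shape of such a script, here is a minimal sketch of a -i/-o wrapper around the tesseract command-line tool. It is not the actual snip_to_text.py: the long option names --image and --output, the function names, and the choice to call the tesseract binary through subprocess are all assumptions made for illustration. It relies on tesseract accepting "stdout" as its output base, which sends the recognized text to standard output.

```python
import argparse
import subprocess


def build_command(image_path):
    """Build the tesseract call; 'stdout' as output base prints the text."""
    return ["tesseract", image_path, "stdout"]


def parse_args(argv=None):
    """Parse the -i (input image) and -o (output text file) flags."""
    parser = argparse.ArgumentParser(description="OCR a screenshot to text")
    parser.add_argument("-i", "--image", required=True, help="PNG screenshot")
    parser.add_argument("-o", "--output", required=True, help="text file to write")
    return parser.parse_args(argv)


def ocr_image(image_path):
    """Run tesseract on the image and return the recognized text."""
    # stdout=PIPE (rather than capture_output) keeps this Python 3.6 compatible.
    result = subprocess.run(
        build_command(image_path), stdout=subprocess.PIPE, check=True
    )
    return result.stdout.decode("utf-8")


def main(argv=None):
    """Hypothetical entry point: OCR the -i image and save text to -o."""
    args = parse_args(argv)
    text = ocr_image(args.image)
    with open(args.output, "w", encoding="utf-8") as handle:
        handle.write(text)
```

With this sketch, main(["-i", "mp3_chars.png", "-o", "mp3_chars_out.txt"]) would mirror the command above, provided tesseract is installed and the image exists.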

You should now have text in mp3_chars_out.txt (the -o param). Look it over and make any corrections (for example, I remove some random spaces).

4

Now we need to get back to UTF-8 characters.
The hacky AF solution is to simply print them to the console and to a file as strings in Python (so you don't have to deal with the \ escape chars).
This is done manually in print_utf.py.
This also stores the output in a file called unicode_out.txt.

(ocr) cas@ubuntu:~/working_dir/python_ocr$ python print_utf.py
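The escape-code round trip behind steps 2 and 4 could look roughly like the sketch below. It is not the contents of get_utf_codes.py or print_utf.py: the function names to_codes and from_codes and the \uXXXX text format are assumptions made for illustration. Every character, including ASCII and spaces, is written as an escape, so any whitespace the OCR step introduces can be stripped safely. The code targets the local Python 3 interpreter; the encoding half would need adapting to run on the VM's Python 2.7.

```python
def to_codes(text):
    """Turn each character into an ASCII-only escape such as \\ud55c."""
    parts = []
    for ch in text:
        code_point = ord(ch)
        if code_point <= 0xFFFF:
            parts.append("\\u{:04x}".format(code_point))
        else:
            # Characters outside the Basic Multilingual Plane need 8 digits.
            parts.append("\\U{:08x}".format(code_point))
    return "".join(parts)


def from_codes(ocr_text):
    """Rebuild the original string from OCR'd escape codes."""
    # Real spaces are encoded as \u0020, so all raw whitespace
    # (stray spaces, line breaks from OCR) is noise and can be dropped.
    cleaned = "".join(ocr_text.split())
    return cleaned.encode("ascii").decode("unicode_escape")
```

On the local machine, the result of from_codes could then be written to unicode_out.txt with open("unicode_out.txt", "w", encoding="utf-8"). Misread characters (for example a letter O in place of a zero) still have to be corrected by hand before decoding.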

Finally, you can Google Translate it!


donoghuc/python_ocr
