Optical Character Recognition for "copy paste" from quarantine VM

We are examining malware on a VM that is cut off from the network and from available drives. I would like to be able to "copy paste" some information (hashes/checksums) and some Korean filenames for Google Translate.

Requirements/setup

I am using Python 3.6 on my local Ubuntu 16.04 machine. I set up a virtual environment and installed the Python packages listed in requirements.txt:

virtualenv -p /home/cas/miniconda/bin/python --no-site-packages ocr
source ocr/bin/activate
pip install -r requirements.txt

The main OCR engine used is tesseract-ocr, which is installed with apt:

sudo apt-get install tesseract-ocr

On the virtual machine we have Python 2.7 on Windows. I had nothing to do with that config, but I'm really happy we have a Python interpreter!

Example workflow

1

On the virtual machine I have mounted a USB image with OSFMount.
The filenames are in Korean characters, which I want to investigate with Google Translate!

2

On the VM, copy the filename into a file called "unicode_raw.txt".
Run get_utf_codes.py to save out the UTF codes for OCR.
Note that this is done on the VM; in this case we are doing OCR on output_ocr.txt.
This is a great place to make the text as clear as possible by changing to a large font in Notepad.
A high-quality image gives the OCR algorithm the best chance at accuracy.

3

On the local machine, take a screenshot of the Unicode representation for OCR analysis.
Run snip_to_text.py on the local machine; the -i flag is the PNG image and -o is the output file.

(ocr) cas@ubuntu:~/working_dir/python_ocr$ python snip_to_text.py -i mp3_chars.png -o mp3_chars_out.txt
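The internals of snip_to_text.py are not shown here, so the following is only a minimal sketch of what such a script could look like. It assumes the `tesseract` binary installed above is on the PATH and calls it directly through `subprocess` instead of using the packages from requirements.txt. The function names and the long option names (`--image`, `--output`) are hypothetical; only `-i` and `-o` come from the command above.

```python
import argparse
import subprocess
import sys


def build_command(image_path, lang="eng"):
    """Return the tesseract command that prints recognized text to stdout."""
    # "stdout" as the output base tells tesseract to write to standard output.
    return ["tesseract", image_path, "stdout", "-l", lang]


def run_ocr(image_path, lang="eng"):
    """Run tesseract on one image and return the recognized text."""
    result = subprocess.run(
        build_command(image_path, lang),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout.decode("utf-8")


def main(argv=None):
    parser = argparse.ArgumentParser(description="OCR a screenshot to a text file")
    parser.add_argument("-i", "--image", required=True, help="input PNG screenshot")
    parser.add_argument("-o", "--output", required=True, help="output text file")
    args = parser.parse_args(argv)

    text = run_ocr(args.image)
    with open(args.output, "w", encoding="utf-8") as handle:
        handle.write(text)
    print(text)


# Only start the CLI when arguments were actually passed.
if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```

The default language is English because the screenshot contains hex codes and hashes rather than the Korean characters themselves.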

You should now have text in mp3_chars_out.txt (the -o parameter). Look it over and make any corrections (for example, I removed some random spaces).

4

Now we need to get back to UTF-8 characters.
An admittedly hacky solution is to simply print them to the console and to a file as strings in Python (so you don't have to deal with the \ escape characters).
This is done manually in print_utf.py.
This also stores the output in a file called unicode_out.txt.

(ocr) cas@ubuntu:~/working_dir/python_ocr$ python print_utf.py
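As a possible alternative to pasting the strings into print_utf.py by hand, the round trip could be sketched as below. This is an assumption about the code format, not a description of the actual scripts: it supposes every character is written as a `\uXXXX` escape on the VM (roughly the job of get_utf_codes.py) and rebuilt locally from the OCR output. The sketch is Python 3, while the VM runs Python 2.7, so the encoding half would need adapting there. The function names and the sample filename are hypothetical.

```python
import re


def to_codes(text):
    """Encode every character as a \\uXXXX escape (BMP characters only)."""
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError("character outside the BMP: %r" % ch)
    return "".join("\\u%04x" % ord(ch) for ch in text)


def from_codes(ocr_text):
    """Rebuild the original string from OCR'd \\uXXXX escapes."""
    # OCR tends to insert stray spaces and line breaks. Every real character
    # (including spaces) is escaped, so all whitespace can be dropped safely.
    compact = re.sub(r"\s+", "", ocr_text)
    hex_groups = re.findall(r"\\u([0-9a-fA-F]{4})", compact)
    return "".join(chr(int(group, 16)) for group in hex_groups)


if __name__ == "__main__":
    name = "음악.mp3"  # hypothetical filename, for illustration only
    codes = to_codes(name)
    print(codes)
    print(from_codes(codes) == name)
```

Because whitespace is stripped before decoding, the "random spaces" that OCR introduces do not need to be removed by hand, although misread hex digits still have to be checked manually.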

Finally, you can Google Translate it!
