donoghuc/python_ocr

Optical Character Recognition for "copy paste" from quarantine VM

We are examining malware on a VM that is cut off from the network and from available drives. I would like to be able to copy and paste some information out of it (hashes/checksums) along with some Korean filenames for Google Translate.

Requirements/setup

I am using Python 3.6 on my local Ubuntu 16.04 machine. I set up a virtual environment and installed the Python packages listed in requirements.txt:

virtualenv -p /home/cas/miniconda/bin/python --no-site-packages ocr
source ocr/bin/activate
pip install -r requirements.txt

The main OCR engine used is tesseract-ocr, which is installed with apt:

sudo apt-get install tesseract-ocr
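
Before running any OCR it can help to confirm that the tesseract binary is actually reachable from the virtual environment. The following is a minimal sketch and not part of the repository; the function names find_tesseract and tesseract_version are hypothetical. It only uses the standard library and sticks to calls available in Python 3.6.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract binary, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(path):
    """Return the first line printed by `tesseract --version`."""
    # stderr is merged into stdout because, depending on the tesseract
    # version, the version banner may be written to either stream.
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    binary = find_tesseract()
    if binary is None:
        print("tesseract not found; install it with: sudo apt-get install tesseract-ocr")
    else:
        print(tesseract_version(binary))
```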

On the virtual machine we have Python 2.7 on Windows. I had nothing to do with that configuration, but I'm really happy we have a Python interpreter!

Example workflow

1

  • On the virtual machine I have mounted a USB image with OSFMount.
  • The filenames are in Korean characters, and I want to investigate them with Google Translate!

2

  • On the VM, copy the filename into a file called "unicode_raw.txt".
  • Run get_utf_codes.py to save out the UTF codes for OCR recognition.
  • Note that this is done on the VM; in this case we are doing OCR on output_ocr.txt.
  • This is a great place to make the text as clear as possible by switching to a large font in Notepad.
  • A high-quality image gives the OCR algorithm the best chance at accuracy.

3

  • On the local machine, take a screenshot of the unicode representation for OCR analysis.
  • Run snip_to_text.py on the local machine; the -i flag is the PNG image and -o is the output file.
(ocr) cas@ubuntu:~/working_dir/python_ocr$ python snip_to_text.py -i mp3_chars.png -o mp3_chars_out.txt
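
For readers without the repository at hand, here is a rough sketch of what a script like snip_to_text.py could look like. It is not the repository's actual code. It assumes that requirements.txt provides pytesseract and Pillow, and the long option names --image and --output are invented for illustration (only -i and -o appear above). The OCR imports sit inside the function so the argument parsing can be exercised without the OCR stack installed.

```python
import argparse


def parse_args(argv=None):
    """Parse the -i (input image) and -o (output text file) flags."""
    parser = argparse.ArgumentParser(description="OCR a screenshot into a text file")
    # The long names are hypothetical; the README only documents -i and -o.
    parser.add_argument("-i", "--image", required=True, help="input PNG screenshot")
    parser.add_argument("-o", "--output", required=True, help="output text file")
    return parser.parse_args(argv)


def image_to_text(image_path):
    """Run tesseract on an image and return the recognized text."""
    # Assumed dependencies: Pillow and pytesseract (a wrapper around tesseract-ocr).
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        # Convert to grayscale; a large-font, high-contrast snip matters more
        # than any preprocessing done here.
        gray = img.convert("L")
        return pytesseract.image_to_string(gray)


def main(argv=None):
    args = parse_args(argv)
    text = image_to_text(args.image)
    with open(args.output, "w", encoding="utf-8") as fh:
        fh.write(text)
```

Calling main(["-i", "mp3_chars.png", "-o", "mp3_chars_out.txt"]) would then mirror the command shown above.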

  • You should now have text in mp3_chars_out.txt (the -o parameter). Look it over and make any corrections (for example, I removed some random spaces).

4

  • Now we need to get back to UTF-8 characters.
  • A very hacky solution is simply to print them to the console and to a file as strings in Python (so you don't have to deal with the \ escape characters).
  • This is done manually in print_utf.py.
  • It also stores the output in a file called unicode_out.txt.
(ocr) cas@ubuntu:~/working_dir/python_ocr$ python print_utf.py
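
The encode step on the VM and the decode step on the local machine can also be sketched as a round trip, which avoids pasting strings into a script by hand. This is an illustration only, not what get_utf_codes.py or print_utf.py actually do. It assumes the "utf codes" are written as \uXXXX escapes and that the original filename contains no spaces, so any whitespace in the OCR output can be treated as noise. The names to_codes, clean_ocr and from_codes are hypothetical, and the code targets Python 3, so it would need adapting for the Python 2.7 interpreter on the VM.

```python
import re


def to_codes(text):
    """Replace every non-ASCII character with a \\uXXXX escape.

    Only covers the Basic Multilingual Plane, which includes Korean Hangul.
    """
    return "".join(c if ord(c) < 128 else "\\u%04x" % ord(c) for c in text)


def clean_ocr(text):
    """Drop all whitespace, assuming OCR inserted it and the filename has none."""
    return re.sub(r"\s+", "", text)


def from_codes(ocr_text):
    """Turn \\uXXXX escapes from (cleaned) OCR output back into characters."""
    cleaned = clean_ocr(ocr_text)
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda m: chr(int(m.group(1), 16)),
        cleaned,
    )


if __name__ == "__main__":
    codes = to_codes("노래.mp3")             # made-up example filename
    print(codes)
    print(from_codes("\\ub1 78\\ub798 .mp3"))  # OCR output with stray spaces
```

OCR can still confuse similar glyphs such as 0 and O or 1 and l, so the decoded result is worth checking against the screenshot before translating it.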

  • Finally, you can Google Translate it!

About

Screenshot text on a quarantined VM and be able to use it.
