tesseract-ocr Notes

Kuan

Jun 1, 2020

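Up front: the success rate I ended up with can only be described as abysmal, so this post simply records the steps I followed to train tesseract-ocr on CAPTCHAs.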

Reference: tesseract-ocr/tessdoc (github.com)

Tools used
tesseract 4.x
imagemagick
jTessBoxEditor

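Step 1

Find the CAPTCHAs you want to train on. Quite a few sites use them these days; I simply wrote a crawler to download them. If you plan to fetch thousands or tens of thousands of images, remember to set a delay between requests, or you will get banned.

(Image: example CAPTCHA)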
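The crawler itself is not shown in this post, so the following is only a minimal sketch of the "set a delay between requests" idea. The URL, the file naming, and the function names `fetch_bytes` and `download_captchas` are all hypothetical, not what I actually ran.

```python
import time
import urllib.request
from pathlib import Path

# Hypothetical endpoint; replace with the CAPTCHA URL of the site you are targeting.
CAPTCHA_URL = "https://example.com/captcha.jpg"


def fetch_bytes(url):
    """Download one image and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()


def download_captchas(count, out_dir, delay=1.0, url=CAPTCHA_URL, fetch=fetch_bytes):
    """Save `count` images into out_dir, sleeping `delay` seconds between requests."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for i in range(count):
        path = out / f"captcha-{i:05d}.jpg"
        path.write_bytes(fetch(url))
        saved.append(path)
        if i < count - 1:
            time.sleep(delay)  # throttle so the crawler is less likely to get banned
    return saved
```

The `fetch` parameter is there only so the loop can be tried offline with a stand-in function before pointing it at a real site.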

Step 2

Binarize the CAPTCHA

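# python
import cv2

# cv2.imread returns the channels in BGR order, so convert with BGR2GRAY
img = cv2.imread("xxxx.jpg")
img2 = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# gray values in [160, 255] become 255 (white), everything else becomes 0 (black)
img2 = cv2.inRange(img2, lowerb=160, upperb=255)
cv2.imwrite("xxxx.jpg", img2)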

(Image: example CAPTCHA after binarization)
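To make the thresholding step concrete, here is a small pure-Python sketch of what an inclusive range test with bounds 160 and 255 does to grayscale values. The helper `in_range` and the sample patch are made up for illustration; the real work is still done by `cv2.inRange`.

```python
def in_range(gray, lower=160, upper=255):
    """Return a mask: 255 where lower <= pixel <= upper, else 0."""
    return [[255 if lower <= p <= upper else 0 for p in row] for row in gray]


# A tiny made-up 2x4 grayscale patch (values are illustrative only).
patch = [
    [12, 159, 160, 255],
    [200, 90, 161, 0],
]
mask = in_range(patch)
print(mask)  # [[0, 0, 255, 255], [255, 0, 255, 0]]
```

Light pixels (usually the background) end up white and darker pixels end up black, which is why the threshold of 160 may need tuning for a different CAPTCHA style.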

Step 3

Convert the binarized CAPTCHAs to TIFF files

convert *.jpg -density 300 -depth 8 -background white -alpha off tif/img-%d.tif

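Merge the TIFF files you just generated with jTessBoxEditor

Tools > Merge Tiff
I merged them in groups of 100 images.
The file name follows the pattern [language].[font].exp[num].tif
For example, when training a language named ABC with the Arial font:
ABC.Arial.exp0.tif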
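The grouping and naming convention above can be sketched as follows. This only plans which single-page TIFFs go into which merged file; the merge itself is still done in jTessBoxEditor. The function name `merged_tiff_groups` is hypothetical, and the 250-page input is made up.

```python
def merged_tiff_groups(tif_paths, lang="ABC", font="Arial", group_size=100):
    """Map each merged file name to the list of single-page TIFFs it should contain."""
    groups = {}
    for start in range(0, len(tif_paths), group_size):
        name = f"{lang}.{font}.exp{start // group_size}.tif"
        groups[name] = tif_paths[start:start + group_size]
    return groups


# Made-up input: 250 single-page files named like the output of the convert command.
pages = [f"tif/img-{i}.tif" for i in range(250)]
for name, members in merged_tiff_groups(pages).items():
    print(name, len(members))
```

With 250 pages this yields ABC.Arial.exp0.tif and ABC.Arial.exp1.tif with 100 pages each, and ABC.Arial.exp2.tif with the remaining 50.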

Step 4

Generate the box file > proofread the box file

tesseract ABC.Arial.exp0.tif ABC.Arial.exp0 --psm 6 lstmbox

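Box Editor > Open
Go through the images one by one and correct the Char column.
This is the most time-consuming step: if you are training on hundreds or thousands of images, it takes up nearly a whole day.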
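With several merged TIFFs, the box-generation command has to be repeated once per file. A minimal sketch of that loop, assuming `tesseract` is on the PATH; the helper names `box_command` and `make_boxes` are hypothetical, and the arguments simply mirror the command shown above.

```python
import subprocess
from pathlib import Path


def box_command(tif_path):
    """Build the lstmbox command for one merged TIFF, mirroring the command above."""
    tif = Path(tif_path)
    base = tif.with_suffix("")  # ABC.Arial.exp0.tif -> ABC.Arial.exp0
    return ["tesseract", str(tif), str(base), "--psm", "6", "lstmbox"]


def make_boxes(tif_paths, run=subprocess.run):
    """Run tesseract once per merged TIFF; check=True raises if a call fails."""
    for tif in tif_paths:
        run(box_command(tif), check=True)
```

For example, `make_boxes(["ABC.Arial.exp0.tif", "ABC.Arial.exp1.tif"])` would produce one box file per group, which then still needs to be proofread by hand in the Box Editor.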

Step 5

Generate the lstmf file > generate the lang.training_files.txt file

tesseract ABC.Arial.exp0.tif ABC.Arial.exp0 -l eng --psm 6 lstm.train
echo /absolute/path/to/ABC.Arial.exp0.lstmf > ABC.training_files.txt
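When there is more than one lstmf file, the list file needs one absolute path per line. A small sketch of writing it, assuming all lstmf files sit in a single directory; the function name `write_training_list` is hypothetical.

```python
from pathlib import Path


def write_training_list(lstmf_dir, list_path):
    """Write the absolute path of every .lstmf file in lstmf_dir, one per line."""
    paths = sorted(Path(lstmf_dir).resolve().glob("*.lstmf"))
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

Calling `write_training_list(".", "ABC.training_files.txt")` from the working directory would cover every group in one go instead of one `echo` per file.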

Step 6

Extract the LSTM model from the existing traineddata

combine_tessdata -e /home/eng.traineddata eng.lstm

Step 7

Start training

lstmtraining --model_output /path/output --continue_from eng.lstm --train_listfile ABC.training_files.txt --traineddata /home/eng.traineddata --max_iterations 4000

Step 8

Generate the traineddata file

lstmtraining --stop_training --continue_from output_checkpoint --traineddata /usr/local/share/tessdata/eng.traineddata --model_output ABC.traineddata
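The training and finalizing commands share several paths, and it is easy to let them drift apart. The sketch below builds both argument lists from one set of inputs. It assumes that lstmtraining writes its latest checkpoint to "<model_output>_checkpoint"; the function name `training_commands` and its parameters are hypothetical.

```python
def training_commands(model_output, base_traineddata, lstm_model, list_file,
                      final_traineddata, max_iterations=4000):
    """Return (train_cmd, finalize_cmd) as argument lists with consistent paths."""
    train = [
        "lstmtraining",
        "--model_output", model_output,
        "--continue_from", lstm_model,
        "--train_listfile", list_file,
        "--traineddata", base_traineddata,
        "--max_iterations", str(max_iterations),
    ]
    # Assumption: lstmtraining writes its latest checkpoint to "<model_output>_checkpoint".
    finalize = [
        "lstmtraining", "--stop_training",
        "--continue_from", f"{model_output}_checkpoint",
        "--traineddata", base_traineddata,
        "--model_output", final_traineddata,
    ]
    return train, finalize
```

Either list can be passed to `subprocess.run(cmd, check=True)`; building them together keeps the checkpoint path and the base traineddata path the same in both steps.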

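After all these steps and training on a little over 300 images, I figured I would get at least 40% accuracy.

So I scraped another 500 images to check against, and the accuracy was 8%.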
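How the accuracy is counted matters for a number like this. The sketch below assumes the strictest reading, where a CAPTCHA counts as correct only if the whole recognized string equals the label; the function name `exact_match_accuracy` and the sample data are made up for illustration.

```python
def exact_match_accuracy(predictions, labels):
    """Fraction of samples whose predicted text equals the label exactly."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    # strip() drops the trailing newline that OCR output usually carries
    correct = sum(p.strip() == t for p, t in zip(predictions, labels))
    return correct / len(labels)


# Made-up data: 40 exact matches out of 500 images works out to 8%.
labels = ["AB12"] * 500
preds = ["AB12\n"] * 40 + ["XXXX\n"] * 460
print(f"{exact_match_accuracy(preds, labels):.0%}")  # 8%
```

A per-character measure would give a higher figure for the same model, so it is worth stating which one is being reported.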

Written by Kuan