public class tessFileParser
extends java.lang.Object
Parses a UTF-8 file output by calling SafsTessdll.exe or, alternatively, a UTF-8 file output by Tesseract.exe.
The DLL format file is the result of running TOCR to recognize text in an image. As a text-formatted file, it contains not only every recognized letter but also its coordinates in the image.
Command: SafsTessdll.exe imagefile.gif resultUTF-8.txt eng
Note: SafsTessdll.exe requires tessdll.dll. Both files reside in C:\safs\bin.
Format of the UTF-8 from the DLL:
-------------------------------------------------------------------------
|A[41](132,32)->(149,13)
|p[70](147,37)->(163,18)
|p[70](162,37)->(178,18)
 ---- Char[Unicode](left, bottom)->(right, top)
| ---- \\ space
|<nl> ---- \\ new line
|<para> ---- \\ paragraph
-------------------------------------------------------------------------
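The DLL record layout above can be read with a single regular expression. The following is a minimal sketch, not the parser's actual implementation: the class name DllBoxLine and its methods are hypothetical, and it assumes that the leading "|" in the illustration is only a margin marker, that each record sits on its own line, and that the bracketed Unicode value is hexadecimal (41 for 'A', 70 for 'p').

```java
import java.awt.Rectangle;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of SAFS: reads one DLL-format record.
class DllBoxLine {
    // Char[hexCodePoint](left,bottom)->(right,top)
    private static final Pattern BOX = Pattern.compile(
        "^(.+?)\\[([0-9A-Fa-f]+)\\]\\((\\d+),(\\d+)\\)->\\((\\d+),(\\d+)\\)$");

    // Returns the character bounds with 0,0 at the top-left,
    // or null for lines such as <nl> and <para> that carry no box.
    static Rectangle parse(String line) {
        Matcher m = BOX.matcher(line);
        if (!m.matches()) {
            return null;
        }
        int left = Integer.parseInt(m.group(3));
        int bottom = Integer.parseInt(m.group(4));
        int right = Integer.parseInt(m.group(5));
        int top = Integer.parseInt(m.group(6));
        return new Rectangle(left, top, right - left, bottom - top);
    }

    // Returns the code point from the bracketed field, or -1 if the line has none.
    // Assumption: the value is written in hexadecimal.
    static int codePoint(String line) {
        Matcher m = BOX.matcher(line);
        return m.matches() ? Integer.parseInt(m.group(2), 16) : -1;
    }
}
```

For the first sample record, this sketch yields a rectangle at x=132, y=13 that is 17 pixels wide and 19 pixels high.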
The Tesseract.EXE format file is the result of running TOCR to recognize text in an image. As a text-formatted file, it contains not only every recognized letter but also its coordinates in the image.
Command: tesseract.exe scaledImageFile.tif resultUTF8 -l eng nobatch|batch.nochop makebox
Format of the UTF-8 from the EXE:
-------------------------------------------------------------------------
|A 132 32 149 13
|p 147 37 163 18
|p 162 37 178 18
 char left bottom right top
-------------------------------------------------------------------------
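A record in the EXE layout is five whitespace-separated fields, so it can be split directly. The following is a minimal sketch under that assumption; the class ExeBoxLine is hypothetical and is not the parser's actual code. It again treats the leading "|" in the illustration as a margin marker, and it ignores any fields after the fifth.

```java
// Hypothetical helper, not part of SAFS: reads one EXE-format record.
class ExeBoxLine {
    final String ch;
    final int left;
    final int bottom;
    final int right;
    final int top;

    private ExeBoxLine(String ch, int left, int bottom, int right, int top) {
        this.ch = ch;
        this.left = left;
        this.bottom = bottom;
        this.right = right;
        this.top = top;
    }

    // Expects "char left bottom right top"; returns null for anything else.
    static ExeBoxLine parse(String line) {
        String[] t = line.trim().split("\\s+");
        if (t.length < 5) {
            return null;
        }
        try {
            return new ExeBoxLine(t[0],
                Integer.parseInt(t[1]), Integer.parseInt(t[2]),
                Integer.parseInt(t[3]), Integer.parseInt(t[4]));
        } catch (NumberFormatException e) {
            return null; // a coordinate field was not an integer
        }
    }
}
```

The values are kept exactly as written in the file; no coordinate conversion is applied at this stage.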
Another caveat of the EXE output format is that coordinates are calculated assuming 0,0 is the bottom-left of the image, NOT the top-left. Thus, when converting to screen coordinates, the rect.y value must be recalculated relative to the top of the image.
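One common way to perform this conversion is to subtract the box's top edge from the image height. The sketch below assumes that approach and is not taken from the parser's source; the class BoxFlip, the method toScreen, and the imageHeight parameter are hypothetical, and the caller is assumed to know the height of the image that was passed to tesseract.exe.

```java
import java.awt.Rectangle;

// Hypothetical helper, not part of SAFS: flips a bottom-left-origin box
// into a top-left-origin java.awt.Rectangle.
class BoxFlip {
    // left, bottom, right, top are measured from the bottom-left of the image,
    // so top is the larger of the two y values.
    static Rectangle toScreen(int left, int bottom, int right, int top, int imageHeight) {
        int x = left;
        int y = imageHeight - top;   // distance from the top edge of the image
        int width = right - left;
        int height = top - bottom;
        return new Rectangle(x, y, width, height);
    }
}
```

With made-up values, a box of left=10, bottom=20, right=30, top=50 in an image 100 pixels high maps to a rectangle at x=10, y=50 that is 20 wide and 30 high.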
In addition, versions of Tesseract before r344 (~May 19, 2010) had a bug in which the bounds were always calculated about 11 or 12 pixels off in the y coordinate.
While it sounds more complicated to use the tesseract.exe version, the accuracy of text recognition from this version is much higher than that of the DLL version.
Constructor and Description |
---|
tessFileParser(java.lang.String tessfile) |
Modifier and Type | Method and Description |
---|---|
(package private) java.lang.String | getText() |
java.awt.Rectangle | getTextArea(java.lang.String searchText, int index) Get the area of the Nth instance of searchText that is sought in the parsing result. |
static void | main(java.lang.String[] args) Can be used to unit test. |
(package private) void | setTessFile(java.lang.String tessfile) tessfile must be a UTF formatted text file output by SafsTessdll.exe. |
void setTessFile(java.lang.String tessfile)
tessfile - must be a UTF formatted text file output by SafsTessdll.exe.
java.lang.String getText()
public java.awt.Rectangle getTextArea(java.lang.String searchText, int index)
There are two modes in which the text coordinate information can be extracted: the TessDLL mode and the Tesseract.exe mode. This routine detects which mode was used and deciphers the coordinate information accordingly.
The TessDLL mode returns a normal Rectangle using the standard coordinate system of 0,0 indicating the top-left corner of the search area and all coordinates are relative to that top-left corner.
The Tesseract.exe mode returns a ReverseRectangle because the coordinate system puts 0,0 at the bottom-left corner of the search area and all coordinates are relative to that bottom-left corner.
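The Nth-instance lookup that getTextArea performs can be illustrated independently of either file format. The sketch below is an assumption about how such a lookup might work, not the class's actual code: the class NthMatch and its find method are hypothetical, the recognized text is assumed to be available as one string with one box per character (null for characters such as spaces that have no box), and all boxes are assumed to share one coordinate system.

```java
import java.awt.Rectangle;
import java.util.List;

// Hypothetical helper, not part of SAFS: finds the Nth occurrence of a
// substring and returns the union of the matching character boxes.
class NthMatch {
    static Rectangle find(String text, List<Rectangle> boxes, String searchText, int index) {
        String needle = searchText.trim();      // ignore leading/trailing whitespace
        if (needle.isEmpty() || text.length() != boxes.size()) {
            return null;
        }
        int n = (index <= 0) ? 1 : index;        // index is 1-based; fall back to 1
        int pos = -1;
        for (int i = 0; i < n; i++) {
            pos = text.indexOf(needle, pos + 1);
            if (pos < 0) {
                return null;                     // fewer than n occurrences
            }
        }
        Rectangle area = null;
        for (int i = pos; i < pos + needle.length(); i++) {
            Rectangle b = boxes.get(i);
            if (b == null) {
                continue;                        // character without a box
            }
            area = (area == null) ? new Rectangle(b) : area.union(b);
        }
        return area;
    }
}
```

Because the occurrence count is taken over the recognized text rather than over the coordinates, the same index selects the same instance whichever mode produced the file.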
searchText - string for which to search in tessFileParser; any leading and trailing whitespace will be removed before seeking.
index - starts from 1, specifies to find the Nth instance of searchText. Uses 1 if index<=0.
The index of the text is the same regardless of the coordinate search mode that was used.
public static void main(java.lang.String[] args)
java org.safs.tools.ocr.tesseract.tessFileParser [-f parseFile] [findtext]
parseFile - path to a test ~tempcoor.txt file to process.
findtext - a substring of text to find coordinates for.
args - 3 array items: [-f parseFile] [findtext]
parseFile - path to a test ~tempcoor.txt file to process.
Copyright © SAS Institute. All Rights Reserved.