tessFileParser (SAFS API DOCUMENT)

java.lang.Object
- org.safs.tools.ocr.tesseract.tessFileParser

```
public class tessFileParser
extends java.lang.Object
```
Parses two types of Tesseract output data: a UTF-8 file output by SafsTessdll.exe or, alternatively, a UTF-8 file output by Tesseract.exe.
The DLL format file is the result of running TOCR to recognize text in an image. As a formatted text file, it contains every recognized character along with its coordinates in the image.
Command: SafsTessdll.exe imagefile.gif resultUTF-8.txt eng
Note: SafsTessdll.exe requires tessdll.dll. Both files reside in C:\safs\bin.
Format of the UTF-8 from the DLL:
```
 ------------------------------------------------------------------------- 
 |A[41](132,32)->(149,13)
 |p[70](147,37)->(163,18)
 |p[70](162,37)->(178,18)    ---- Char[Unicode](left, bottom)->(right, top)
 |                           ---- \\ space 
 |<nl>                       ---- \\ new line
 |<para>                     ---- \\ paragraph
 -------------------------------------------------------------------------
 
```
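As an illustration of the DLL format above, the following is a minimal sketch of how one character line could be turned into a `java.awt.Rectangle`. `DllLineSketch` and its `parse` method are hypothetical names and are not part of the SAFS API. The sketch assumes the leading `|` in the listing is only the frame of the illustration and does not appear in the file, and that the bracketed value is a hexadecimal code point (41 and 70 are the hex code points of `A` and `p`).

```java
import java.awt.Rectangle;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the tessFileParser implementation.
class DllLineSketch {
    // Matches e.g. "A[41](132,32)->(149,13)":
    // char, hex code point, (left,bottom)->(right,top).
    private static final Pattern CHAR_LINE = Pattern.compile(
        "^(.)\\[([0-9A-Fa-f]+)\\]\\((\\d+),(\\d+)\\)->\\((\\d+),(\\d+)\\)$");

    /** Returns the box of one character line, or null for space, <nl> and <para> lines. */
    static Rectangle parse(String line) {
        Matcher m = CHAR_LINE.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        int left = Integer.parseInt(m.group(3));
        int bottom = Integer.parseInt(m.group(4));
        int right = Integer.parseInt(m.group(5));
        int top = Integer.parseInt(m.group(6));
        // Top-left origin: "top" is the smaller y value, so it becomes rect.y.
        return new Rectangle(left, top, right - left, bottom - top);
    }

    public static void main(String[] args) {
        System.out.println(parse("A[41](132,32)->(149,13)"));
    }
}
```

Lines that do not match the character pattern are reported as `null` so that a caller can treat them as separators rather than as parse errors.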
The Tesseract.EXE format file is likewise the result of running TOCR to recognize text in an image. As a formatted text file, it contains every recognized character along with its coordinates in the image.
Command: tesseract.exe scaledImageFile.tif resultUTF8 -l eng nobatch|batch.nochop makebox
Format of the UTF-8 from the EXE:
```
 ------------------------------------------------------------------------- 
 |A 132 32 149 13
 |p 147 37 163 18
 |p 162 37 178 18     char left bottom right top
 -------------------------------------------------------------------------
 
```
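The EXE box format above can be read with a simple whitespace split. The following is a minimal sketch under that assumption; `BoxLineSketch`, `BoxChar`, and `parse` are hypothetical names, not part of the SAFS API. The values are kept exactly as they appear in the file, so they remain relative to the bottom-left origin.

```java
// Hypothetical helper for illustration only; not the tessFileParser implementation.
class BoxLineSketch {
    /** One recognized character with its raw (bottom-left origin) coordinates. */
    static final class BoxChar {
        final String ch;
        final int left, bottom, right, top;

        BoxChar(String ch, int left, int bottom, int right, int top) {
            this.ch = ch;
            this.left = left;
            this.bottom = bottom;
            this.right = right;
            this.top = top;
        }
    }

    /** Parses "char left bottom right top"; returns null if the line does not fit. */
    static BoxChar parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length < 5) {
            return null;
        }
        try {
            return new BoxChar(parts[0],
                Integer.parseInt(parts[1]), Integer.parseInt(parts[2]),
                Integer.parseInt(parts[3]), Integer.parseInt(parts[4]));
        } catch (NumberFormatException e) {
            return null; // a coordinate field was not an integer
        }
    }

    public static void main(String[] args) {
        BoxChar c = parse("p 147 37 163 18");
        System.out.println(c.ch + " " + c.left + " " + c.bottom + " " + c.right + " " + c.top);
    }
}
```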
Another caveat of the EXE output format is that coordinates are calculated assuming 0,0 is the bottom-left of the image, NOT the top-left. Thus, when converting to screen coordinates, the rect.y value must be recalculated against the height of the image (see the Returns description of getTextArea).
In addition, versions of Tesseract before r344 (~May 19, 2010) had a bug in which the bounds were always calculated about 11 or 12 pixels off in the y coordinate.
While it sounds more complicated to use the tesseract.exe version, the accuracy of text recognition from this version is much higher than that of the DLL version.
Author:

JunwuMa
MAR 12, 2010 Original Release
MAR 26, 2010 (JunwuMa) Refactoring and update to support UTF-8 format.
OCT 21, 2010 (Carl Nagle) Refactoring allowing tesseract.exe nobatch makebox coordinates.

Constructor Summary

Constructors
Constructor and Description

tessFileParser(java.lang.String tessfile)


Method Summary

Modifier and Type	Method and Description
`(package private) java.lang.String`	`getText()`
`java.awt.Rectangle`	`getTextArea(java.lang.String searchText, int index)` Gets the area of the Nth instance of searchText found in the parsing result.
`static void`	`main(java.lang.String[] args)` Can be used for unit testing.
`(package private) void`	`setTessFile(java.lang.String tessfile)` tessfile must be a UTF-8 formatted text file output by SafsTessdll.exe.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - tessFileParser
```
tessFileParser(java.lang.String tessfile)
```
- Method Detail
  - setTessFile
```
void setTessFile(java.lang.String tessfile)
```
    tessfile must be a UTF-8 formatted text file output by SafsTessdll.exe.
    
    Parameters:
    
    tessfile -
  - getText
```
java.lang.String getText()
```
  - getTextArea
```
public java.awt.Rectangle getTextArea(java.lang.String searchText,
                                      int index)
```
    Gets the area of the Nth instance of searchText found in the parsing result. The returned rectangle represents the area of searchText in the image.
    The text coordinate information can be extracted in two modes: the TessDLL mode and the Tesseract.exe mode. This routine detects which mode was used and deciphers the coordinate information accordingly.
    The TessDLL mode returns a normal Rectangle using the standard coordinate system, in which 0,0 is the top-left corner of the search area and all coordinates are relative to that corner.
    The Tesseract.exe mode returns a ReverseRectangle because its coordinate system puts 0,0 at the bottom-left corner of the search area and all coordinates are relative to that corner.
    
    Parameters:
    
    searchText - the string to search for in the parsed data. Any leading and trailing whitespace is removed before searching.
    
    index - 1-based; specifies which instance (the Nth) of searchText to find. A value of 1 is used if index<=0. The index of the text is the same regardless of the coordinate mode that was used.
    
    Returns:
    
    Rectangle, ReverseRectangle, or null: the area of searchText if found; null if not found. If a ReverseRectangle is returned, the caller is expected to convert the Y coordinates according to the height of the image or search area that was used. For example, if the area was 800 pixels high and the match was found at the "top" of the area at 1,1, the ReverseRectangle would contain coordinates 1,799. To get the true 1,1, the caller would recalculate Y as 800-799 = 1.
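    The Y recalculation described in the Returns text can be sketched as follows. `ReverseRectSketch` and `toTopLeft` are hypothetical names, not part of the SAFS API. The sketch takes a plain `java.awt.Rectangle` as input and assumes, as the 800-pixel example implies, that the returned y value is measured upward from the bottom of the search area.

```java
import java.awt.Rectangle;

// Hypothetical helper for illustration only; not part of the SAFS API.
class ReverseRectSketch {
    /**
     * Converts a rectangle whose y is measured from the bottom of the search
     * area into one whose y is measured from the top, using y' = areaHeight - y.
     */
    static Rectangle toTopLeft(Rectangle reversed, int areaHeight) {
        return new Rectangle(reversed.x, areaHeight - reversed.y,
                             reversed.width, reversed.height);
    }

    public static void main(String[] args) {
        // The example from the text: area 800 pixels high, match reported at 1,799.
        System.out.println(toTopLeft(new Rectangle(1, 799, 10, 5), 800));
    }
}
```

    The width and height are left unchanged because only the origin of the Y axis differs between the two modes.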
  - main
```
public static void main(java.lang.String[] args)
```
    Can be used for unit testing.
    java org.safs.tools.ocr.tesseract.tessFileParser [-f parseFile] [findtext]
    parseFile - path to a test ~tempcoor.txt file to process.
    findtext - a substring of text to find coordinates for.
    
    Parameters:
    
    args - up to 3 array items: [-f parseFile] [findtext]
