Closed
On the command line you would use "tesseract.exe pic1.bmp pic1.txt -psm 4" and put a pic1.uzn file in the current directory.
When I try
Tesseract.TesseractEngine tesseract = new Tesseract.TesseractEngine("....path... tessdata", "eng", Tesseract.EngineMode.Default);
Tesseract.Pix picture = Tesseract.Pix.LoadFromFile(@"...path... pic1.bmp");
Tesseract.Page page = tesseract.Process(picture, Tesseract.PageSegMode.SingleColumn); // corresponds to -psm 4
...
string text = page.GetText();
this leads to an exception in GetText (just as tesseract.exe fails if there is no uzn file).
I therefore assume that the .NET wrapper does not find (or does not search for) the uzn file.
Could you please tell me what to do, or whether this is a bug?
charlesw commented on Jan 24, 2014
Hi michi729,
I wasn't even aware of uzn files before now, which is probably why it doesn't work. Anyway, I've done a little reading and it seems that Tesseract needs to know the input file name (which makes sense, since this is how it finds the uzn file). Do you think it makes sense to add this to the Page class?
Alternatively, I could overload the Process method so it takes the input name as an optional parameter. Do you have any preference? Also, could you kindly provide an example image and the corresponding uzn file, with a brief description of the output you expect, so I can write a test case to verify the implementation? Note that this should of course not contain any confidential or copyrighted material.
michi729 commented on Jan 24, 2014
Hi Charles, thanks for the quick response!
I will get back to you with an example picture as well as uzn file in time.
All the best, Michael
michi729 commented on Jan 24, 2014
Calling "tesseract.exe test.png test -psm 4"
with tesseract.exe, test.png and test.uzn in the same directory produces a test.txt with the content
This is another test
Content of test.uzn:
100 130 200 30 Text
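For readers unfamiliar with uzn files: each line is commonly described as "left top width height label", with coordinates in pixels, so the line above would select a 200x30 zone whose top-left corner is at (100, 130). A minimal sketch of reading such lines, assuming that layout; UznZone and UznParser are hypothetical helper names for illustration and are not part of Tesseract or the .NET wrapper:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical container for one zone; not part of the Tesseract wrapper.
public sealed class UznZone
{
    public int Left;
    public int Top;
    public int Width;
    public int Height;
    public string Label;
}

public static class UznParser
{
    // Parses uzn content, assuming "left top width height label" per line.
    public static List<UznZone> Parse(string content)
    {
        var zones = new List<UznZone>();
        foreach (var rawLine in content.Split('\n'))
        {
            var parts = rawLine.Split(
                new[] { ' ', '\t', '\r' },
                StringSplitOptions.RemoveEmptyEntries);

            // Skip blank or incomplete lines.
            if (parts.Length < 5)
            {
                continue;
            }

            zones.Add(new UznZone
            {
                Left = int.Parse(parts[0], CultureInfo.InvariantCulture),
                Top = int.Parse(parts[1], CultureInfo.InvariantCulture),
                Width = int.Parse(parts[2], CultureInfo.InvariantCulture),
                Height = int.Parse(parts[3], CultureInfo.InvariantCulture),
                Label = parts[4]
            });
        }
        return zones;
    }
}
```

Calling UznParser.Parse on the content of test.uzn would yield a single zone with the label Text.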
charlesw commented on Jan 24, 2014
Thanks, just what I needed.
michi729 commented on Jan 27, 2014
Hi Charles, I am not sure whether this should be added as a parameter. Tesseract itself just replaces the suffix of the current picture's name, i.e. you could take the picture name from the path passed to LoadFromFile. What do you think?
charlesw commented on Jan 27, 2014
In theory yes, but this would only work if the image was loaded from a file. Tesseract actually doesn't work this way; according to my analysis of the source, it relies on the image name being passed as an additional parameter to its ProcessPage routine. It's a pretty simple fix really, so I should have it done sometime tomorrow, assuming no unforeseen issues arise.
michi729 commented on Jan 27, 2014
You are right :-) And thanks for taking the time!
Added support for uzn files - issue #66
charlesw commented on Jan 27, 2014
Just released an updated NuGet package (1.10) that supports uzn files through an optional parameter on Process, as previously discussed. Please note that a PSM of SingleColumn (4) does NOT work due to a bug in Tesseract 3.02 (https://code.google.com/p/tesseract-ocr/issues/detail?id=653); other options do work. This will be resolved once Tesseract 3.03 has been released.
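A minimal sketch of how this might be used, assuming the Process overload takes the input name as a string between the image and the page segmentation mode (check the signature in the package version you install). The paths are hypothetical placeholders, test.uzn is assumed to sit next to test.png, and SingleBlock is chosen only because SingleColumn is reported not to work with Tesseract 3.02:

```csharp
using System;
using Tesseract;

class UznExample
{
    static void Main()
    {
        // Hypothetical paths for illustration; adjust to your setup.
        const string tessdataPath = @"./tessdata";
        const string imagePath = @"./test.png";

        using (var engine = new TesseractEngine(tessdataPath, "eng", EngineMode.Default))
        using (var image = Pix.LoadFromFile(imagePath))
        // Assumed overload: the input name lets Tesseract look for the
        // matching uzn file (here test.uzn) alongside the image.
        using (var page = engine.Process(image, imagePath, PageSegMode.SingleBlock))
        {
            Console.WriteLine(page.GetText());
        }
    }
}
```

Passing the name explicitly means it also works for images that were not loaded from a file, as long as the name points to where the uzn file can be found.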
michi729 commented on Jan 28, 2014
Hi Charles, thank you very much for your fast support :-)