One of the great things about working in the Open Source space is that you sometimes get to work with NGOs such as Liberty Asia. Established in 2011, Liberty Asia is made up of a group of dedicated professionals from different industries who feel strongly that a more effective, coordinated response to slavery is essential, and that leveraging technology available to the corporate sector and providing it to the NGO sector will facilitate this response. As part of this, Liberty Asia is providing a dedicated collaboration platform to NGOs that fight human trafficking in Asia. The main focus of this platform is to facilitate collaboration and information sharing during an investigation.
The challenge that needed to be overcome was to enable text search through the large number of scanned PDF images that are entered into the system during an investigation. Alfresco provides full text search via Solr, but only for content that has text as part of its format, which is not the case for many scanned PDFs. To overcome this, an OCR engine was needed that could be easily integrated into Alfresco and did not carry a high price tag for the NGO.
Seed used the Tesseract OCR engine in conjunction with an Alfresco transformation to meet this requirement. Tesseract is used by Google these days and has been heavily re-engineered by them for Google Drive, so it is very solid. Full text search works by having the Solr engine index any content that can be transformed into plain text. We therefore needed a custom transformation that takes an image PDF and transforms it into plain text; Solr then adds the resulting wordlist to its index, making the image PDF documents searchable. The solution had to take the following into account:
- Regardless of the OCR engine used, asking for a wordlist from an OCR engine is an expensive exercise in terms of infrastructure.
- From a mimetype/format perspective a scanned PDF is the same as a standard PDF, and it may already have been OCR'd as part of the scanning process. For these types of PDFs the solution should not OCR the document again, but should instead extract the embedded wordlist from the PDF.
- Alfresco cannot tell the difference between a scanned PDF and a standard PDF; it's all just the PDF mimetype to Alfresco.
- Tesseract can only OCR one page at a time, from either TIFF or PNG images.
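To illustrate the second and third points: from the shell you can tell a scanned image PDF from one with an embedded text layer because pdftotext (from poppler-utils) produces an empty output file for the former. A minimal sketch, assuming pdftotext is installed; the function name and file names are placeholders, not part of the production script:

```shell
#!/bin/sh
# Sketch only: decide whether a PDF needs OCR by probing for an
# embedded text layer. Assumes pdftotext (poppler-utils) is on PATH.
has_embedded_text() {
    pdf="$1"
    tmp=$(mktemp)
    # pdftotext writes any embedded wordlist to $tmp; a scanned image
    # PDF with no text layer yields an empty file.
    pdftotext -nopgbrk "$pdf" "$tmp" 2>/dev/null
    if [ -s "$tmp" ]; then      # -s: file exists and is non-empty
        rm -f "$tmp"
        return 0                # embedded wordlist found, no OCR needed
    fi
    rm -f "$tmp"
    return 1                    # no text layer, hand the PDF to Tesseract
}

# Usage (placeholder file name):
#   if has_embedded_text scan.pdf; then echo "use embedded wordlist"
#   else echo "run Tesseract OCR"; fi
```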
With this in mind we developed the following solution:
Solution overview
Transformation:
We created a custom PDF-to-text transformation that is called whenever a PDF document is added to the repository. The transformation calls out to a shell script (a Runtime Executable) to create the wordlist.
<bean id="transformer.worker.pdfimg2ocrtxt" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
    <property name="mimetypeService">
        <ref bean="mimetypeService" />
    </property>
    <property name="checkCommand">
        <bean class="org.alfresco.util.exec.RuntimeExec">
            <property name="commandsAndArguments">
                <map>
                    <entry key=".*">
                        <list>
                            <value>ls</value>
                            <value>/opt/ocr/ConvertPDFImage2Text.sh</value>
                        </list>
                    </entry>
                </map>
            </property>
        </bean>
    </property>
    <property name="transformCommand">
        <bean class="org.alfresco.util.exec.RuntimeExec">
            <property name="commandsAndArguments">
                <map>
                    <entry key=".*">
                        <list>
                            <value><BASH-SCRIPT-LOCATION>/ConvertPDFImage2Text.sh</value>
                            <value>${source}</value>
                            <value>${target}</value>
                        </list>
                    </entry>
                </map>
            </property>
            <property name="errorCodes">
                <value>1,2,3</value>
            </property>
        </bean>
    </property>
</bean>
<bean id="transformer.pdfimg2ocrtxt" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer">
    <property name="worker">
        <ref bean="transformer.worker.pdfimg2ocrtxt" />
    </property>
</bean>
Ensure our Transformation is called for PDFs
It should be noted that there is already a registered transformation between PDF and text. Alfresco determines which transformation to run by first round-robining the existing transformations and then, after a number of runs, choosing whichever is faster. Since version 4.2, however, it is also possible to set a priority for a transformer, and this is used to determine which transformation runs first. The standard transformation has a priority of 50, so we set ours to 30, which ensures our custom transformation is run. This is set in alfresco-global.properties.
content.transformer.pdfimg2ocrtxt.priority=30
content.transformer.pdfimg2ocrtxt.extensions.pdf.txt.supported=true
content.transformer.pdfimg2ocrtxt.extensions.pdf.txt.priority=30
content.transformer.pdfimg2ocrtxt.extensions.pdf.txt.maxSourceSizeKBytes.use.index=9999
We also have to set maxSourceSizeKBytes so that only documents below a certain size are transformed. This ensures that very large PDF documents do not cause performance issues when added to the system.
One thing to note: we spent about a day trying to figure out why our transformation was not being called even though we had set the priority, and found that the values in alfresco-global.properties were not being picked up. We therefore had to set the priority via JMX to get the transformation to work.
Runtime Executable Bash Script
The Runtime Executable referenced in the transformation above (ConvertPDFImage2Text.sh) is a bash script, as we are running on Linux. The script is responsible for extracting the wordlist. It first calls pdftotext to see whether a wordlist can be extracted from the PDF directly. This will be the case for standard PDF files, and also for scanned image PDFs where the scanner embedded an OCR wordlist. When we find a wordlist, we write it to the target file of the transformation.
TEMP_PDFTXT_FILE=$TMPDIR/$name/pdftext.txt
echo "running command: pdftotext -nopgbrk $SOURCE $TEMP_PDFTXT_FILE" >> ${LOGFILE}
pdftotext -nopgbrk "$SOURCE" "$TEMP_PDFTXT_FILE"
FILESIZE=$(stat -c%s "$TEMP_PDFTXT_FILE")
echo "Size of $TEMP_PDFTXT_FILE = $FILESIZE bytes." >> ${LOGFILE}
# if the file exists and has a size bigger than 0, use the wordlist as the result of the transformation and exit.
if [ -s "$TEMP_PDFTXT_FILE" ]; then
    echo "Found wordlist in $TEMP_PDFTXT_FILE" >> ${LOGFILE}
    cat "$TEMP_PDFTXT_FILE" >> "${TARGET}"
    rm -rf "$TMPDIR/$name"
    exit 0
fi
In cases where we cannot get a wordlist, we create one using Tesseract. As mentioned above, Tesseract can only OCR one page at a time. We therefore use Ghostscript to break the PDF down into individual pages, and for each of these pages we create a wordlist by calling Tesseract. All of the individual page wordlists are concatenated into one large wordlist, which is written to the target transformation file and is therefore indexed by Solr. Once we are finished processing, we clean up all of the temporary files and folders created by the shell script.
# split the PDF into individual page images
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r300 -dTextAlphaBits=4 -o out_%04d.jpg -f "$SOURCE"
# process each page
for f in *.jpg; do
    # extract the text for this page
    /usr/bin/tesseract "$f" "$TMPDIR/$name/${f%.*}" -l eng
    cat "$TMPDIR/$name/${f%.*}.txt" >> "$TMPDIR/$name/res.txt"
    rm -f "$TMPDIR/$name/${f%.*}.txt"
    rm -f "$f"
done
# combine all pages into ${TARGET}
cat "$TMPDIR/$name/res.txt" >> "${TARGET}"
Note: depending on the Linux flavour and version you are running, you may need to export the locations of Tesseract and its supporting libraries to ensure the script works when run from Alfresco. We included the exports directly in our script, i.e. LD_LIBRARY_PATH, PATH and LD_PRELOAD.
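As an example, the exports at the top of the script could look something like the following. The paths and library name below are placeholders, not our actual values; point them at wherever your distribution installs Tesseract and its libraries, and drop LD_PRELOAD if your system does not need it:

```shell
# Example exports only -- all paths are assumptions; adjust them to
# where Tesseract and its supporting libraries live on your system.
export PATH=/usr/local/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH:-}
# Only needed where the dynamic linker picks up the wrong library;
# the library path here is a placeholder.
export LD_PRELOAD=/usr/local/lib/liblept.so
```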
Conclusion
Using this approach it is possible to make scanned PDF images searchable in Alfresco using an existing open source OCR engine. For Liberty Asia, who provide an information hub to hundreds of disparate NGO investigators with varied scanning devices, the solution will ensure that scanned image PDFs are searchable and can be used in investigations. Hopefully the solution is now one step closer to helping prevent human trafficking.
Acknowledgments: Thanks to Daniel Figucio of Alfresco for pointing us in the right direction and helping troubleshoot issues while developing this solution. Thanks also to Verizon for helping us work through the Tesseract library dependencies on the cloud server instance where this runs.