Some Background Information
Five years ago, just after opening our doors to the public, our office operated like most offices of the time. As part of our provisioning, we had purchased an array of filing cabinets to suit our short-term and long-term document storage and retrieval needs. Within a few months of opening the store, I, and others on the office staff, became weary of digging through those filing cabinets to find the various pieces of paper that employees, vendors, and customers requested. I also noted that we had purchased an all-in-one printer, fax, copier, and scanner. We used every function except scanning; we could not think of a use for it. Late one December evening, a use for the scanner came to mind: we could scan in all of these papers we were filing and then retrieve them digitally. That way they would never get lost, would only rarely be misfiled, and would be far easier to store (digital files take up disk space, not filing cabinets).
The EDNA (Electronic Document Network Access) system was launched 15 December 2005 featuring all of the documents we had filed from our store opening to date. After proving the value of the system, the company put into place a short term document storage program (in case we realized in a couple of months that we had forgotten something) and then changed our document processing procedures to include EDNA. Later, financing documents were added into the mix under the name of FINSTEEN.
Both EDNA and FINSTEEN served the company for nearly six years, until there were simply too many documents (nearly 200,000 of them). At that point the old search engine could not keep up, and the system was put into an “ICU” until it could be revived. With the completion of this project, the old EDNA Search engine was officially retired and the new DRS Search engine was put in its place.
After nearly six years of accessing documents electronically, the choice of going back to paper was not an option. The only reasonable solution was to load all the scanned file names into a database that would be capable of handling the nearly 200,000 files we currently had and tens of thousands more in the future. This would give us, in essence, what we had but in a faster, more stable platform.
In addition to indexing the document names, I also wanted to use Optical Character Recognition (OCR) technology to enable full-text search on those documents. OCR is a slowly developing and unheralded technology that enables computers to convert scanned documents into editable text. Though OCR has limited applications, I felt that in this particular instance our clearly printed documents and standardized formats would be good candidates for the technology. Once the OCR output was indexed, we would be able to find lost and mislabeled documents far more easily.
Knowing that it is often less expensive (and frequently better quality) to purchase an already developed product rather than develop one in house, I started searching for a suitable, existing search engine. The search did not go well. Of the engines I found that were affordable, all could index the document names, but only one would index the contents. Failing to index the documents’ contents would nullify any benefit offered by the OCR process. Though using OCR is not a critical issue, it is one that I feel will strongly benefit us in future ways that we, as yet, cannot foresee.
The one engine I did find that would expose the OCR data in the search required a specialized platform and additional programming. In the interest of simplicity, deploying a specialized platform was a drawback. I have a strong preference for fitting any new search engine into our existing systems (systems that are well suited to the task), as opposed to building a specialized platform.
After a week of searching and testing a variety of search engines, I concluded that we would need to custom build our search engine in order to get the features and flexibility we wanted.
Solving the OCR Issue
The first breakthrough in the project was finding Google’s Tesseract software. Tesseract is free, open source software that reads TIF files and outputs a text file. While other software could do the same thing, Tesseract was the only option (that did not cost thousands of dollars) that could read the files automatically from a script. Every other piece of software required user interaction, a level of overhead that would negate the benefit of the new process.
Running the OCR from a script sacrificed a degree of reliability (though it is still very accurate) but allowed us to benefit from the OCR data without adding any extra labor costs. It also allowed us to keep everything after the manual file naming automated.
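The batch step itself was driven by the standard Tesseract command line, `tesseract input.tif output_base`, which writes the recognized text to `output_base.txt`. The project’s actual scripts were not preserved here, so the following is only an illustrative Python sketch of how one such batch could be assembled; the directory layout is a hypothetical example.

```python
from pathlib import Path

def tesseract_commands(batch_dir):
    """Build one Tesseract invocation per TIF file in batch_dir.

    Each command follows the standard CLI form
    `tesseract input.tif output_base`; Tesseract itself appends
    the .txt extension to the output base name.
    """
    cmds = []
    for tif in sorted(Path(batch_dir).glob("*.tif")):
        out_base = tif.with_suffix("")  # e.g. 10542230_so20110623
        cmds.append(["tesseract", str(tif), str(out_base)])
    return cmds

# In a live batch, each command would be run with
# subprocess.run(cmd, check=True), with no user interaction needed.
```

Because the commands are plain argument lists, the whole batch can run unattended from a scheduled script, which is exactly what made Tesseract viable here.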
I found programming the actual search engine to be a fairly easy and straightforward process. Several online tutorials guided me through the process and even taught me some interesting programming techniques to close obvious security loopholes and prevent simple data breaches. I was grateful for these tutorials because I am not versed enough in PHP to know what security vulnerabilities I might otherwise leave open.
The entire search engine is built into a single, elegant web page that can be accessed directly or, more commonly, through our company intranet site. The previous EDNA Search comprised four separate files, brought together at runtime to appear as a single webpage to the end user (though search results were returned on a separate page). The new single-file format made the new search engine a lot easier to code and, more importantly, to troubleshoot. For the end user, the single page also made basic web browser functionality, such as the Back and Forward buttons, work as expected. This has reduced the frustration of clicking Back and losing all the previous search results.
As I was developing the DRS Search page, I started by mimicking the basic functionality of the EDNA Search page. One of the key features of the old search page was selecting the type of files the user wanted to search: sales or financing. Late one night, as I was pondering the user interface, more specifically how to cram all the options (sales/financing, filename/OCR data) into the search page while keeping it clean and respectable, it struck me: sales documents are always named by their sale order number (e.g. “10542230_so20110623.tif”) and financing documents are always named by the customer name (e.g. “Smith, John wf20110623.tif”). This simple difference meant that whenever a user entered a numerical search they wanted a sales document, and when they entered an alphabetical search they wanted a financing document. I talked with Isabel, who confirmed my understanding, and I set about programming the search engine to search both the sales and the financing documents at the same time.
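The naming convention means the query itself tells you which kind of document the user is after. The original logic lived in PHP; as a minimal illustrative sketch (the function name is my own), the heuristic amounts to:

```python
def search_target(term):
    """Infer which document set a query is aimed at.

    Sales documents are named by numeric sale order
    (e.g. "10542230_so20110623.tif"); financing documents
    by customer name (e.g. "Smith, John wf20110623.tif").
    A query starting with a digit therefore implies sales.
    """
    return "sales" if term[:1].isdigit() else "financing"
```

In practice the engine does not even need to branch on this; it can simply search both sets at once, since a numeric term will only ever match sales filenames and an alphabetic term only financing ones.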
When I initially set up the search database, I had planned to keep the indexes of the financing documents separate from those of the sales documents. There was no particular reason; I simply thought a reason might surface later, and that it would be much easier to combine the two database tables at a later date than to try to separate them. This caused me great frustration when trying to get the search engine to search both tables at the same time, with combined results, sorted alphabetically.
At first I thought I would just “fake” it by searching the sales table and then the financing table. While this was the easiest approach, it doubled the lines of code for the search and created two separate, alphabetically sorted lists that would be confusing whenever there happened to be results in both tables. (The previous system ran separate searches and grouped the results by area, so sales documents were grouped together and financing documents were grouped together under their respective headings.)
Instead of giving up and taking the easy way out, I decided to actually learn some SQL (the database language). The little SQL I already knew was very limited and centered on data extraction; my knowledge of the language was so thin that I would only risk changing two parts of a query, hoping I could then finish working the data over in Excel. I learned SQL. Not enough to be proficient at it, but enough to now understand how the queries I use actually work.
While learning SQL, I picked up two new keywords: LIKE and UNION. For this particular problem, UNION was the magic word. It takes the two tables, temporarily combines them, and allows a single search to be run against them, with the results returned as one alphabetically sorted list. No need to fake it; I learned the real deal.
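The production system used PHP against its own database, so as an illustration only, here is the same UNION idea demonstrated with Python’s built-in sqlite3 module (the table and column names are hypothetical stand-ins):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (filename TEXT)")
cur.execute("CREATE TABLE financing (filename TEXT)")
cur.executemany("INSERT INTO sales VALUES (?)",
                [("10542230_so20110623.tif",),
                 ("10542231_so20110624.tif",)])
cur.executemany("INSERT INTO financing VALUES (?)",
                [("Smith, John wf20110623.tif",)])

# UNION temporarily merges the two tables for the duration of the
# query; a single ORDER BY then sorts the combined result.
rows = cur.execute("""
    SELECT filename FROM sales     WHERE filename LIKE ?
    UNION
    SELECT filename FROM financing WHERE filename LIKE ?
    ORDER BY filename
""", ("%20110623%", "%20110623%")).fetchall()
```

One query, one sorted result list, regardless of which table each hit came from; no duplicated search code and no separate headings needed.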
Prior to delving into SQL, I only knew the WHERE IS statement (a simple equality test, written WHERE column = value). This statement is the silver medal of search: WHERE you find something that IS equal to this thing. In order to make the OCR data useful, I needed something more powerful. With WHERE IS, the vast text collected by the OCR process would be nearly impossible to match against, making it useless.
WHERE LIKE changes all of that. Instead of looking for an exact match, the search looks for a similar match. To humans, navy blue IS blue, but to the computer navy blue is only LIKE blue. Changing this one word opened up the entire range of OCR data to free-form search.
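The difference is easy to see side by side. Again as an illustration only (sqlite3 standing in for the real database, with hypothetical table and column names), an equality test against a column holding a whole page of OCR text finds nothing, while LIKE with `%` wildcards finds the phrase anywhere inside it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE ocr (filename TEXT, contents TEXT)")
cur.execute("INSERT INTO ocr VALUES (?, ?)",
            ("10542230_so20110623.tif",
             "Sales order 10542230 sold to John Smith, navy blue sofa"))

# Equality fails: the column holds the whole OCR text, not the phrase.
exact = cur.execute("SELECT filename FROM ocr WHERE contents = ?",
                    ("navy blue",)).fetchall()

# LIKE with % wildcards matches the phrase anywhere in the text.
similar = cur.execute("SELECT filename FROM ocr WHERE contents LIKE ?",
                      ("%navy blue%",)).fetchall()
```

The `%` wildcards on either side are what turn the equality test into a contains-this-phrase test.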
While the DRS Search page was difficult for me to program, the actual indexer proved impossible. I spent more than a solid week trying to learn, copy, and mimic techniques in hopes of crafting my own piece of software that could read the contents of the OCR text files and add them to the database. After a week of trying, I gave up.
Daniel: Adam, I have a problem.
Adam: What is it?
Daniel: I need some PHP code that can scan a directory and put the filename and contents of each text file into a database. I have tried for more than a week to do this. I just cannot get it.
Adam: Have you tried the GetContents command?
Daniel: I just want it done. Can you do it?
Adam: Yeah, I can have it to you by tomorrow.
Daniel: You are a lifesaver. Send me the bill when it is done.
Two hours later the finished code was in my inbox, and the last, and most difficult, piece of the new search engine was in my possession. I felt like a dirty politician who had just given up winning the election the fair way, but I also felt incredibly relieved: the great burden of programming that very complex code was lifted. Adam’s code provided the perfect amount of flexibility, so I could easily fit it into the project.
The final challenge in the project was automating the process of getting the source files from the various offsite locations, running them through the OCR, indexing them in the database, and putting them in their final resting place.
I started with existing, functional infrastructure used by the EDNA system: the EDNA Trickle Transfer (EDNA TT). EDNA TT was perfect for moving the files from the remote locations to a central one, and I was able to add the code needed to automate the OCR process. This exposed a major deficiency in the process: EDNA TT ran every hour, on the hour, while the time the OCR process took to run varied from several minutes to over an hour, depending on the number of files in the batch. This meant that the old timer mechanism would not work. I considered running daily batches instead of hourly ones, knowing that would guarantee sufficient time, but it would also mean users would have to wait an entire work day before accessing newly added files, an unacceptable wait.
The solution came one day as my roommate was describing how the human body can detect an imbalance and release a hormone that triggers a cascade of other hormones to bring the body back into balance. Then it struck me: I could have the OCR process trigger the indexing process once it was done; once the indexer completed, it could wait for a period of time (I chose two hours) before triggering the OCR process again and completing the loop. This circular process has proven to be a reliable system for giving each process enough time to complete without overlapping.
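The real pipeline was wired together from scripts triggering one another, so the following is only a schematic Python sketch of the control flow (the function names and the cycles parameter are mine): each OCR pass hands off to the indexer the moment it finishes, and only then does the fixed rest period begin, so neither stage can start while the other is still running.

```python
import time

def run_cycle(ocr_step, index_step, rest_seconds=2 * 60 * 60, cycles=None):
    """Chain OCR -> index -> rest -> OCR ... in a loop.

    Because the next stage is triggered by the completion of the
    previous one (rather than by a fixed clock), a slow OCR batch
    simply delays indexing instead of colliding with it.
    """
    completed = 0
    while cycles is None or completed < cycles:
        ocr_step()      # however long this takes, indexing waits for it
        index_step()    # runs immediately after OCR completes
        completed += 1
        if cycles is None or completed < cycles:
            time.sleep(rest_seconds)  # the two-hour rest before the next pass
    return completed
```

This is the hormone-cascade idea in miniature: completion of one step is the signal that releases the next, and the only fixed timer is the rest between full cycles.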