27 June 2011

Document Retrieval System (More than a New Toy)

An essay for COMM 297R, 27 June 2011


Some Background Information
Five years ago, just after opening our doors to the public, our office operated like most offices of the time. As part of our provisioning, we had purchased an array of filing cabinets to suit our short-term and long-term document storage and retrieval needs. Within a few months of opening the store, I and others on the office staff became weary of digging through those filing cabinets to find the various pieces of paper that employees, vendors and customers requested. I also noted that we had purchased an all-in-one printer, fax, copier and scanner. We used all of the functions except scanning; we could not think of a use for it. Late one December evening, a use for the scanner came to mind: we could scan in all of the papers we were filing and then retrieve them digitally. This way they would never get lost, would only rarely get misfiled, and would be easier to store (digital files take up digital space, not filing cabinets).

The EDNA (Electronic Document Network Access) system was launched 15 December 2005 featuring all of the documents we had filed from our store opening to date. After proving the value of the system, the company put into place a short term document storage program (in case we realized in a couple of months that we had forgotten something) and then changed our document processing procedures to include EDNA. Later, financing documents were added into the mix under the name of FINSTEEN.

Both EDNA and FINSTEEN served the company for nearly six years until there were simply too many documents (nearly 200,000 of them). At that point the old search engine could not keep up, and the system was put into an “ICU” until it could be revived. With the completion of this project, the old EDNA Search engine was officially retired and the new DRS Search engine was put in its place.

The Project
After nearly six years of accessing documents electronically, going back to paper was not an option. The only reasonable solution was to load all the scanned file names into a database capable of handling the nearly 200,000 files we currently had and tens of thousands more in the future. This would give us, in essence, what we already had, but on a faster, more stable platform.

In addition to indexing the document names, I also wanted to use Optical Character Recognition (OCR) technology to enable full text search on those documents. OCR is a slowly developing and unheralded technology that enables computers to convert scanned documents into editable text. Though OCR has limited applications, I felt that in this particular instance our clearly printed documents and standardized formats would be good candidates for the technology. Once the OCR text was indexed, we would be able to find lost and mislabeled documents more easily.

Going Shopping
Knowing that it is often less expensive (and frequently better quality) to purchase an already developed product than to develop one in house, I started searching for a suitable existing search engine. The search did not go very well. Every affordable engine I found could index the document names, but all except one could not index the contents. Not indexing the documents’ contents would nullify any benefit offered by the OCR process. Though using OCR is not critical, I feel it will strongly benefit us in ways that we do not yet know.

The one engine I did find that would expose the OCR data in the search required a specialized platform and additional programming. Because I wanted to keep things simple, having to deploy a specialized platform was a drawback. I have a strong preference for fitting any new search engine into our existing systems (which are well suited to the task) rather than building a specialized platform.

After a week of searching and testing a variety of search engines, I concluded that we would need to custom build our search engine in order to get the features and flexibility we wanted.

Solving the OCR Issue
The first breakthrough in the project was finding Google’s Tesseract software. Tesseract is free, open source software that reads TIF files and outputs text files. While other software could do the same thing, Tesseract was the only software (that did not cost thousands of dollars) that could read the files automatically from a script. Every other piece of software required user interaction, a level of overhead that would negate the benefit of the new process.

Running the OCR from a script sacrificed a degree of reliability (though it is still very accurate) but allowed us to benefit from the OCR data without adding any extra labor costs. It also allowed us to keep every step after the manual file naming automated.
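A scripted OCR pass of the kind described above might look like the following Python sketch. It assumes the `tesseract` command-line program is installed and on the PATH; the function names, the folder layout and the "skip files that already have a .txt" rule are illustrative assumptions, not a description of the script actually used.

```python
import subprocess
from pathlib import Path


def build_ocr_command(tif_path: Path) -> list[str]:
    """Build the Tesseract command line for one scanned file.

    The Tesseract CLI takes an input image and an output *base* name,
    then writes the recognized text to "<base>.txt".
    """
    output_base = tif_path.with_suffix("")  # scans/doc.tif -> scans/doc
    return ["tesseract", str(tif_path), str(output_base)]


def ocr_folder(folder: Path) -> list[Path]:
    """Run OCR on every .tif in `folder` that has no .txt beside it yet."""
    written = []
    for tif_path in sorted(folder.glob("*.tif")):
        txt_path = tif_path.with_suffix(".txt")
        if txt_path.exists():
            continue  # assumed rule: already processed in an earlier batch
        # check=True raises CalledProcessError if Tesseract reports a failure
        subprocess.run(build_ocr_command(tif_path), check=True)
        written.append(txt_path)
    return written
```

Because nothing here asks the user a question, the whole pass can be started by another program, which is the property that made Tesseract attractive for this project.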

DRS Search
I found programming the actual search engine to be a fairly easy and straightforward process. Several online tutorials guided me through the process and even taught me some interesting programming techniques to close obvious security loopholes and prevent simple data breaches. I was grateful for these tutorials because I am not versed enough in PHP to know what security vulnerabilities I have.

The entire search engine is built into a single, elegant web page that can be accessed directly or, more commonly, through our company intranet site. The previous EDNA Search consisted of four separate files, brought together at runtime to appear as a single webpage to the end user (though search results were returned in a separate page). The single-file format made the new search engine much easier to code and, more importantly, to troubleshoot. For the end user, the single page also means that basic web browser functions such as the Back and Forward buttons work as expected. This has reduced the frustration of clicking Back and losing all the previous search results.

Learning SQL
As I was developing the DRS Search page, I started by mimicking the basic functionality of the EDNA Search page. One of the key features of the old search page was selecting the type of files the user wanted to search: sales or financing. Late one night, I was pondering the user interface, and more specifically how to cram all the options (sales or financing, filename only or filename plus OCR data) into the search page while keeping it clean and respectable. Then it struck me: sales documents are always named by their sale order number (e.g. “10542230_so20110623.tif”) and financing documents are always named by the customer name (e.g. “Smith, John wf20110623.tif”). This simple difference meant that whenever users entered a numerical search they wanted a sales document, and whenever they entered an alphabetical search they wanted a financing document. I talked with Isabel, who confirmed my understanding, and I set about programming the search engine to search both the sales and the financing documents at the same time.
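The naming rule in this paragraph can be written down as a very small test. The Python sketch below is only an illustration of the observation (the function name and the "look at the first character" rule are assumptions); the search engine itself did not need such a test, because it simply searches both kinds of documents at once.

```python
def document_type(query: str) -> str:
    """Guess which kind of document a search term refers to."""
    term = query.strip()
    if term and term[0].isdigit():
        return "sales"      # e.g. "10542230", a sale order number
    return "financing"      # e.g. "Smith, John", a customer name
```

For example, `document_type("10542230")` returns `"sales"` and `document_type("Smith, John")` returns `"financing"`.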

Initially, when I set up the search database, I had planned to keep the indexes of the financing documents separate from those of the sales documents. I had no particular reason; I simply thought that a reason might come up later and that it would be much easier to combine the two database tables at a later date than to try to separate them. This caused me great frustration in trying to get the search engine to search both tables at the same time, with combined results, sorted alphabetically.

At first I thought I would just “fake” it by searching the sales and then the financing. While this was easiest to do, it doubled the lines of code for the search and created two separate, alphabetically sorted lists that would be confusing if there happened to be search results in both tables. (The previous system would run separate searches and group the results by area, so sales documents were all grouped together and financing documents were all grouped together under their respective headings.)

Instead of giving up and taking the easy way, I decided to actually learn some SQL (the database language). The little SQL I already knew was very limited and centered around data extraction; my knowledge of the language was so limited that I would only risk changing two parts of the query, hoping I could transfer the data in Excel. I learned SQL. Not enough to be proficient at it, but enough to now understand how the queries I use actually work.

In my learning of SQL, I learned two new words: LIKE and UNION. For this particular problem, UNION was the magic word. It took the two tables, temporarily combined them, and allowed a search to be run against them with the results returned as a single, alphabetically sorted list. No need to fake it; I had learned the real deal.
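To make the UNION idea concrete, here is a minimal sketch using Python's built-in sqlite3 module in place of the production database. The table names, the column name and the “Adams, Jane” file are made up for illustration; only the other two file names come from the examples above.

```python
import sqlite3

union_db = sqlite3.connect(":memory:")
union_db.executescript("""
    CREATE TABLE sales (filename TEXT);
    CREATE TABLE financing (filename TEXT);
    INSERT INTO sales VALUES ('10542230_so20110623.tif');
    INSERT INTO financing VALUES ('Smith, John wf20110623.tif');
    INSERT INTO financing VALUES ('Adams, Jane wf20110601.tif');  -- hypothetical
""")

# One query over both tables; ORDER BY sorts the combined result.
rows = union_db.execute("""
    SELECT filename, 'sales' AS source FROM sales
    UNION
    SELECT filename, 'financing' AS source FROM financing
    ORDER BY filename
""").fetchall()

for filename, source in rows:
    print(source, filename)
```

Note that UNION only combines the rows; it is the ORDER BY at the end that produces the single sorted list.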

Prior to my delving into SQL, I only knew the WHERE IS statement (in actual SQL, a WHERE clause with the = operator). This statement is the silver medal in search: WHERE you find something that IS equal to this thing. In order to make the OCR data useful, I needed something more powerful. To WHERE IS, the vast text collected by the OCR process would be nearly impossible to match against, thus making it useless.

WHERE LIKE changes all of that. Instead of looking for an exact match, the search is looking for a similar match. To humans, navy blue IS blue but to the computer navy blue is LIKE blue. Changing this simple word opened up the entire range of OCR data to free form search.
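The difference between an exact match and a LIKE match can be seen in a few lines. This is a sketch using Python's sqlite3 module with a hypothetical `documents` table and made-up OCR text; the `%` signs are the SQL wildcards that let LIKE match text anywhere inside the field.

```python
import sqlite3

like_db = sqlite3.connect(":memory:")
like_db.execute("CREATE TABLE documents (filename TEXT, ocr_text TEXT)")
like_db.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [
        ("a.tif", "Sofa, navy blue, delivered 23 June"),  # made-up sample text
        ("b.tif", "Dining chair, oak"),
    ],
)


def search(term: str, exact: bool) -> list[str]:
    """Return matching file names using either = or LIKE."""
    if exact:
        sql = "SELECT filename FROM documents WHERE ocr_text = ?"
        params = (term,)
    else:
        sql = "SELECT filename FROM documents WHERE ocr_text LIKE ?"
        params = (f"%{term}%",)  # wildcard on both sides: match anywhere
    # The ? placeholders keep user input out of the SQL text itself.
    return [row[0] for row in like_db.execute(sql, params)]
```

Here `search("blue", exact=True)` finds nothing, because no document's text is exactly "blue", while `search("blue", exact=False)` finds `a.tif`. One caveat of this simple version: a `%` or `_` typed by the user would also be treated as a wildcard.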

Giving Up
While the DRS Search page was difficult for me to program, the actual indexer proved impossible. I spent more than a week solid trying to learn, copy and mimic techniques in hopes of crafting my own piece of software that could read the contents of the OCR files and add them into the database. After a week of trying I gave up.
Daniel: Adam, I have a problem.
Adam: What is it?
Daniel: I need some PHP code that can scan a directory and put the filename and contents of each text file into a database. I have tried for more than a week to do this. I just cannot get it.
Adam: Have you tried the GetContents command?
Daniel: I just want it done. Can you do it?
Adam: Yeah, I can have it to you by tomorrow.
Daniel: You are a lifesaver. Send me the bill when it is done.
Two hours later the finished code was in my inbox and the last, and most difficult, piece of the new search engine was in my possession. I felt like a dirty politician who had just given up on winning the election the fair way, but I also felt incredibly relieved: the great burden of programming that very complex code was lifted. Adam’s code provided the perfect amount of flexibility, so I could easily fit it into the project.
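For readers curious what such an indexer involves, the task described to Adam (scan a directory, store each text file's name and contents in a database) can be sketched as follows. This is written in Python with sqlite3 rather than PHP, and it is not Adam's code; the table layout and function name are assumptions made for the example.

```python
import sqlite3
from pathlib import Path


def index_text_files(folder: Path, db: sqlite3.Connection) -> int:
    """Store the name and contents of every .txt file in `folder`.

    Returns the number of files indexed. Re-indexing a file replaces
    its earlier row instead of creating a duplicate.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "filename TEXT PRIMARY KEY, ocr_text TEXT)"
    )
    count = 0
    for txt_path in sorted(folder.glob("*.txt")):
        # OCR output can contain odd bytes; replace them rather than fail.
        text = txt_path.read_text(encoding="utf-8", errors="replace")
        db.execute(
            "INSERT OR REPLACE INTO documents (filename, ocr_text) "
            "VALUES (?, ?)",
            (txt_path.name, text),
        )
        count += 1
    db.commit()
    return count
```

A call such as `index_text_files(Path("ocr_output"), sqlite3.connect("drs.db"))` would index one folder; both names in that call are placeholders.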

Circular Referencing
The final challenge in the project was how to automate the process of getting the source files from the various offsite locations, running them through the OCR, indexing them in the database, and putting them in their final resting place.

I started with existing, functional infrastructure used by the EDNA system, the EDNA Trickle Transfer (EDNA TT). EDNA TT was perfect for moving the files from the remote locations to a central one, and I was able to add the code needed to automate the OCR process. This exposed a major deficiency in the process: EDNA TT ran every hour on the hour, but the time it took the OCR process to run varied from several minutes to over an hour depending on the number of files in the batch. This meant that the old timer mechanism would not work. I thought of running daily batches instead of hourly batches, knowing that would guarantee sufficient time, but then users would have to wait an entire work day before accessing newly added files, an unacceptable wait.

A solution came one day as my roommate was describing how the human body can detect an imbalance and release a hormone that triggers a cascade of hormones to bring the body back into balance. Then it struck me: I could have the OCR process trigger the indexing process once it was done, and once the indexer completed, it could wait for a period of time (I chose two hours) before triggering the OCR process again, completing the loop. This circular process has proven to be a reliable system for giving each process enough time to complete without overlapping.
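The contrast with the hourly timer is easier to see in code. The sketch below collapses the chain of triggers into one Python loop, which is a simplification of the real arrangement (there, each process starts the next); the function and parameter names are assumptions. The key property is the same: the rest period is counted from the moment indexing finishes, so a slow OCR batch can never collide with the next one.

```python
import time
from typing import Callable, Optional


def run_cycle(
    run_ocr: Callable[[], None],
    run_indexer: Callable[[], None],
    rest_seconds: float = 2 * 60 * 60,   # the two-hour rest
    cycles: Optional[int] = None,        # None means loop forever
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run OCR, then indexing, then rest, and repeat."""
    completed = 0
    while cycles is None or completed < cycles:
        run_ocr()            # takes as long as the batch needs
        run_indexer()        # starts only after OCR has finished
        sleep(rest_seconds)  # rest is measured from the end of indexing
        completed += 1
```

The `cycles` and `sleep` parameters exist only so the loop can be tried out without actually waiting two hours.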

24 June 2011

The Storyteller

This is an autobiographical essay in response to the ever so frequent question: what do you do at work?

The warehouse, plain and bland, stands just off the road like a giant fortress. Like all good fortresses, it is not the outside that attracts visitors. Rather, it is the treasure that lies inside it. Outside, the warehouse looks drab with its towering light brown walls and black trimmings, and the office door doesn’t help much: “Employees Only,” the small placard warns those who would open the forbidding black door.

On the other side of the door things change little. The interior walls are a lighter shade of brown, almost taupe; there is no decorative trim, and the walls are mostly empty. In the office, the workers are chatting, sometimes on the phone, sometimes to each other. Such is the humble work place of Daniel. Although he often works from any of the company’s other four locations, this is his chosen sanctuary to work his wonders.

Although the outside is bleak, the inside office is obviously better suited to his creative thinking. The chatter of the customer service representatives provides a gentle background noise, along with people to occasionally socialize with and bounce ideas off of. But it is the massive twenty-foot white board that makes the office ideal. “This,” Daniel says, referring to the white board, “this is the only white board we have that is big enough to unload my brain onto.” Currently the white board is covered with scribbles, notes and a mass of lines and boxes. He uses the white board to stage his “stories”.

Isabel, the Customer Service Manager, refers to him as an “Information God”, a title that makes Daniel laugh—he prefers the title of “Storyteller”. Isabel tells of when he first set up the company’s Customer Service surveys. The Customer Service Office had been hoping, at best, for a spreadsheet that tallied the results and were worried mostly about the ease of gathering the data. Instead of the basic spreadsheet they requested, Daniel delivered what he calls “a beautifully matriculated masterpiece of storytelling.” The survey system provides a friendly entry system for collecting the survey information and a full set of graphs and charts to explain the results, none of which require any technical expertise. “Like any good story, the mechanics are there, but they’re hidden,” Daniel explains.

This is how most of his “stories” work: they feature quick and easy access to the data through charts, graphs, buttons to automatically retrieve up-to-date data, and “smart, dumb” text (complex formulas that output different text based on the data in the spreadsheet, so the responses look smart but really aren’t).

The ease with which people can use his spreadsheets has made them popular in the company, but he says that people should thank his boss for that, not him. “I like a good graph or chart, but the real data is where the best stories are found,” he says, “it was my boss who insisted that I make the stories in the data easier to see.”

He sits hunched over his meticulously clean desk. The wood surface has on it a grand total of five objects: his phone and keys, lying side by side on the left side of the desk, a laptop, a second monitor and a wireless mouse. While his physical desk is nearly empty, his two screens are not. The laptop is cluttered with his email and various informational pages, and the second monitor, hooked up as an extension of the laptop, is filled with a massive spreadsheet. It is this very spreadsheet that he often thinks of when people ask him if he knows Microsoft Excel. “Know it?” he says. “I live in it.”

While he might joke about his knowledge of Excel, compared to most people, he does live in it. And like a monk left to himself to delve into the depths of sacred works, he knows Excel extremely well. “I often laugh when people ask if Excel can do certain things,” he says, laughing at the thought. “I usually tell them, ‘just tell me what you want and I’ll make Excel do it for you’.”

This particular day he is working on one of his most complex “stories”: the company payroll. The numerous windows displayed across his screens are all critical story elements. They are part of a massive revision that he recently released. He explains that each screen has a function and purpose, and while they all together may seem overwhelming, no one else ever sees them all together.

While some might think it an incredible feat, to him it is little more than a documentary. “No one does all the work,” he explains. Some poor soul digs to find random statistics that will be quoted in the voice overs. Another poor soul does the preliminary location and people research. The lucky host goes out and shoots footage with the camera crew. Then yet another host of people come in to cut the footage together and scale the production to the correct level.

Payroll, for him, is no different. The new version (called Blackfin, after the ocean tuna) has simple little spreadsheets and databases into which different departments enter little bits of information. At payroll time a single, bright green button labeled “Extract All” is pressed, and through Excel’s magic all the little spreadsheets and databases are rounded up and processed into the familiar, but ever changing, story of payroll.

Later, after he is done patching the payroll file, he walks to the back of the warehouse and rummages through a mess of old parts. The parts were recently purchased as a lump sum from a business that was closing. He’s not looking for anything in particular; he just wants to get away. His excursions into the warehouse are usually fruitless in themselves, but they allow Daniel to refocus his mind. “I never know what I’ll find back here,” he says. Moments later he coos, “Ooh, these belong in the IT room,” his pitch rising as he wraps a bundle of network cables around his neck. Obviously, he’s done this before.

Satisfied with today’s find, he heads back towards the office. He doesn’t get far before he stops again, this time to pick up a rolling office chair. “The guys keep stealing chairs from the office for their lunchroom,” he explains. Instead of pushing the chair to the office, he sits in it, grabs the arm rests, carefully aims the chair and then, with cables still dangling from his neck, gives a swift kick and sends himself hurtling down the aisle between dining chairs and bedroom mirrors. The noise of the rolling chair can be heard throughout the whole warehouse, just one example of his creative, eccentric nature.

Back in the office, after dropping off the cables and the chair, Daniel examines his white board. He crosses some items off the board, and then taps his marker against it. The next item on his list is to put up a reminder about upcoming network changes. He sits back down at his desk and brings up the company intranet site; he considers the site to be one of his greatest stories. “Everyone in the company uses it every day,” he says with pride in his voice. He starts to tell the story of making the site such a success, and then pauses as he looks around the office. “I’ve never told them the secret of my success.” After a few seconds of typing, he reviews the reminder and, with a satisfied nod, posts it online.

Those secrets are some of his rarest stories. In fact, he boasts, no one person has heard them all, and he intends to keep it that way. “If anyone knew that all I do every day is tell stories,” Daniel says with a smile, “well, they might find a way to live without me.” With that, he goes back to work: reviewing data, looking for connections and recording the stories he finds.

08 June 2011

Covenants, Contracts and Promises

This was a religion assignment that I did not want to do, but was finally convinced to do.

As members of this marvelous church, we are often told, and often comment, that we are “covenant making people”. I have been told this so much in my young life that I first wondered what a covenant was and then, if I made so many of them, where I put them all. Only later in my life did I begin to understand what a covenant was and why covenants were so important to me.

I have often heard covenants described as a two-way promise or a divine contract. I would suggest a covenant is at once the same as and nothing like these concepts. While each of these concepts is helpful, they are also limited by our individual mortal experience. To any individual who was promised something they never received or who entered into a contract that was later broken (or to those who have been the one to break a promise or cancel a contract), the mortal experience pales in comparison to the divine experience of covenants.

Divine covenants do resemble promises and contracts made here on earth in that they are binding agreements entered into by two parties: God, the Father, and an individual child of His. Within these agreements, God promises us certain blessings and rewards based on certain conditions. These conditions are clearly presented, as are the blessings, so that the covenants can be made in full force with no plausible deniability.

Divine covenants are unlike promises and contracts made here on earth in that they are permanently binding based upon our performance; they leave no “wiggle room” for either party and nothing subject to interpretation. Either the child performed the required task, in which case God will give the promised blessing, or the child did not. In this respect, too, divine covenants differ from earthly ones: God fulfills His end of the promise with unfailing exactness.

There are five basic covenants we enter into with the Lord:

1. Baptism
2. Confirmation
3. Priesthood (for the brothers)
4. Endowment
5. Sealing

I wish to point out that, as with all covenants, the Sealing covenant is between an individual and God, not between two individuals.

Accompanying each covenant is a specific ordinance. Just as an earthly contract is not considered valid until some agreed upon ritual or rite is executed, so too with divine covenants. Each ordinance contains a set of specific words and physical motions that signify its performance by one in authority to act in God’s stead and the willingness and understanding of the child to enter into and maintain the covenant. Once completed, these ordinances “activate” the covenant and its force in our lives.

The promised blessings are always based upon our faithfulness and worthiness. There is a separate, inclusive worthiness standard included with each covenant. In this way, God is able to prevent those who partake in an ordinance for the wrong reasons from benefitting from the process. Personal worthiness is an important part of maintaining covenants. It is not enough to simply complete the ordinance; we must strive to maintain the purity required by God in our day-to-day lives.

Worthiness disappears as we partake in activities or lifestyles that are contrary to those outlined within the covenants we have made. For example, with the covenant of baptism we promise to care for the needy, remain chaste and pay tithing. Failure to maintain any one of these will make us unworthy until we repent and begin doing them again. Unworthy activities do not just include failure to perform gospel tasks, but also include participation in unworthy endeavors. These can include the places we choose to visit, the jokes we choose to repeat, the company we choose to keep, the physical actions we choose to do and even the private thoughts we choose to entertain.

It is important, if we expect to retain the blessings promised within a covenant, that we maintain our worthiness. When unworthy activities do occur in our lives, it is important to repent of them as quickly as possible in order to restore ourselves to a worthy state and reinstate our lost blessings.

Worthiness is coupled with justification, a process used to gauge our attempts to remain worthy based on the covenants we have made, our knowledge and understanding of the gospel, and the intents of our hearts. It is through justification that the mercy portion of the gospel is put into action.

A continued improvement of our covenant keeping abilities allows us to access another portion of the atonement: sanctification. Sanctification is the purifying power of the atonement working to make us better. Where justification is concerned with maximizing our blessings now by ensuring we get as many as we deserve, sanctification is concerned with maximizing our blessings in the future by ensuring that we continue to progress to our full potential.

In most cases, justification and sanctification work together to propel us beyond the lowly constraints of our mortal selves by allowing us to attain ever greater heights of spirituality and perfection. They can, however, lead to great disappointment if misunderstood. For example, to believe that upon completion of the sealing ordinance one’s exaltation is assured could lead to a rude shock when, upon death, you are informed that your deviant, unrepented, post-sealing behavior has disqualified you from salvation.

While justification can often enable us to receive blessings beyond what we thought we could qualify for, my general rule is that whenever I think I am justified I have just lost the last excuse to receive the justification. Remember, the gospel plan is about becoming like God, not trying to see how much you get away with.

As we learn to better understand the role of covenants in our lives, the symbolism embedded within the ordinances we use to make those covenants, the grace we receive through justification and the purifying power of sanctification in our lives, we will be better positioned to take advantage of each and use them to progress ever quicker towards our final, celestial goal.

03 June 2011

The Bliss of Surrender

This past week brought the end (or nearly the end) of a project that I have been struggling to complete for three weeks. The project was forced upon me by the failure of a five-year-old search engine (using the term loosely). Had it not been for that failure I would never have touched the project. More to the point, the old program represented one of IT's greatest accomplishments and failings: the replacement of a manual process with a technological one. This meant that the failure of the old search engine was interfering with normal business operations across the company, making its replacement a pressing issue.

For two weeks I struggled with finding, testing and trying to build a replacement without any success. In the first week I found many good solutions that would replace the old search with the same features we currently had, but not allow for any new features or innovations. In the second week of searching, I tried to cobble together various solutions into a custom built solution that would give us the features we had and add some for the future. By the middle of the week I realized that I would need to build the new engine from scratch. By the end of the week I was banging my head into the wall out of frustration. The coding was too hard. On Tuesday I decided to give up and call a friend. A couple of hours and a small fee later, I had the missing cog in the new engine.

More surprising to me than the engine working—things work all the time—was the overwhelming relief that my surrender brought. There was no more useless struggle, no more needing to take long walks after only a short period of working, no more wondering why my code did not work. Instead, with the missing piece in place, the rest of the project just coasted to completion.

This whole experience got me thinking about life, God and C. S. Lewis (in that order). I find that often in life there are things that I cannot do but I know that God can. I guess it is at these times, when I am pressed against my limits (not the imaginary ones that I could push through if I really wanted to, but the actual physical limitations of this corporeal form), that I most need to surrender. As C. S. Lewis implies in his writings, it is when I surrender to God that He can take over and make everything flow.

I do not expect that I will remember this lesson for long, nor do I hope to; how will I know my limits if I give up before finding them? Instead, I hope that the next time I surrender after exhausting myself, it will be as blissful as this time.