Friday, June 21, 2013

Android and Tesseract (Part 1)

Over the past few days I've been playing around with the Tesseract native packages that one can wrap up as a library for Android applications. This library lets you perform optical character recognition on Android mobile devices, which is a rather intriguing concept. The ability to do this has been around for some time (Tesseract itself was open-sourced around 2005-2006). You can read more general information on its history and where it comes from here: http://en.wikipedia.org/wiki/Tesseract_(software). The story is rather interesting: the software was originally written at Hewlett-Packard between the mid-1980s and mid-1990s, and somewhere down the road it ended up in the hands of Google and thereafter became usable on Android. So I figured I'd share a bit of my experience with it in two parts. The first will be a brief overview of the setup, and the next part will cover some sample code I managed to put together for its use.

The setup is a fairly easy task, though it does require a little critical thinking, as there are some problems that can be hard to work through even with community resources. Before you start a project of your own, make sure your IDE (in my case Eclipse) can compile both Java and C++. If you need to add C/C++ support to Eclipse, the CDT (C/C++ Development Tools) can be installed from the Indigo repository through the Help > Install New Software dialog.



Once you have those things, you'll need to go out and download the Tesseract library project files, which you can find here: https://github.com/rmtheis/tess-two. You can either clone the repository or simply download an archive copy; the choice is yours on that front. Once downloaded, you can simply import the project into your IDE environment. When you've finished doing so, make sure the project properties mark it as an Android library. In Eclipse it looks like the below screenshot.



After that you will need to make sure you have set up the Android NDK (http://developer.android.com/tools/sdk/ndk/index.html) for the tess-two project. All you have to do here is unpack the archive somewhere accessible and define the path to it in your IDE. In my Eclipse setup the setting is here under the project properties:




Then you just need to run a build and let the IDE do its work. The build can take some time, so I'd suggest finding something to do while it runs. After the build completes you are ready to use the library in a project. We will go into actually making use of it in the next post, which I hope to have hammered out this coming week.
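As a small teaser of where this is heading, below is a minimal sketch of what using the library looks like once everything builds. It leans on tess-two's TessBaseAPI class; the helper class around it and the data path are placeholders of my own, and it assumes the English trained data (eng.traineddata) has already been copied into a tessdata folder at that path.

--
// Minimal sketch of OCR with tess-two's TessBaseAPI. Assumes eng.traineddata
// already lives in <datapath>/tessdata/ on the device.
import android.graphics.Bitmap;
import com.googlecode.tesseract.android.TessBaseAPI;

public class OcrHelper {
    public String recognize(Bitmap bitmap, String datapath) {
        TessBaseAPI baseApi = new TessBaseAPI();
        baseApi.init(datapath, "eng");        // folder containing tessdata/, plus the language code
        baseApi.setImage(bitmap);             // hand the image to the engine
        String text = baseApi.getUTF8Text();  // run recognition and pull the result
        baseApi.end();                        // release the native resources
        return text;
    }
}
--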


Tuesday, May 21, 2013

Apologies and some news

First off I must apologize for neglecting this blog for a while. Over the past couple of months I have been going through a bit of a professional transition that has left me rather occupied and distracted. I will however make the effort to begin posting weekly once more and perhaps more often should time permit.

The second agenda item is a bit of news. As one may have found out from either LinkedIn or a response to a blog comment I left a couple of days ago, I am no longer working at Trustyd. It is a sad and unfortunate turn of events that led to this point, but I have been away from that particular organization since sometime in April. If you see this and have any questions for me about it, or wish for advice from someone with a deep knowledge of the product, feel free to contact me via email (koch.ryan@gmail.com).

In any event I hope to return to the regularly scheduled posting here within the next day or so. I will try to think of a riveting topic for all of you to enjoy!

Tuesday, February 12, 2013

Hard drives and Pacific disputes

Reading the title, you might wonder what hard drives and territorial disputes in the Pacific Ocean could possibly have to do with one another. As you may recall, we experienced a hard drive shortage due to the flooding in Thailand in 2011, and the reverberations of this can still be felt in prices today to some degree. One thing to note in all of this is that a fair share of the companies with production in Thailand and elsewhere are owned by Japanese firms. With tensions rising between Japan and China over a set of disputed islands, one might wonder if a potential conflict could exacerbate the shortage and drive prices up again.

The dispute is over small islands in the East China Sea, known as the Diaoyu in China and the Senkaku in Japan. Somewhat recently there have been fairly serious incidents in which escalation was a real risk; one example is a Chinese vessel locking radar onto a Japanese warship. The Pacific is full of such disputes, especially considering the nine-dash line map China released showing the territory it sees as rightfully its own.

But what does this have to do with hard drive supplies? An often overlooked chokepoint, and one we ran into during the previous shortage, is the motor. Japan's Nidec, which produces around 80% of the spindle motors used in hard drives, has some portion of its manufacturing based in China. Any conflict could reduce the capacity of that operation and thus limit the number of hard drives available. A conflict could create other problems as well, since computer components are manufactured all over East and Southeast Asia, and a dispute between China and one of those parties could impose significant barriers to trade for its duration. In the end the cost would ultimately be paid by consumers, who would face premiums for technology goods whose supplies are strained.

The next question is how likely all of this is. Personally I believe China and Japan will find that it is not in their best interest to pursue a conflict, and that this is a rather unlikely scenario. It is not in China's interest to become a belligerent power, as that goes against its philosophy of a peaceful rise, which it has been using to assuage the concerns of regional powers. Japan would suffer from losing market access to China and from the loss of manufacturing for companies with plants based there. Ultimately it doesn't look like a conflict would be a positive for either power; however, pride and territorial disputes can make nations act rather irrationally.

Thursday, January 17, 2013

Wing it and start coding

One of the better professional experiences I've had of late is attempting to learn how the Android environment works with regard to developing for it. I was tasked with a project involving the creation of an application which really had a very simple goal. But the task seemed horribly daunting: while I had taken object-oriented programming and messed around with a few languages in the past, I had not tried to develop anything for mobile. Honestly, I have found the best thing to do is just jump in and start.

Seriously, just start planning the project

As with a lot of things, the first step is the hardest. I spent a long time reading through random portions of the Android API documentation (http://developer.android.com/develop/index.html). Eventually, though, one has to actually start designing the project and then coding it. So one day I just started jotting down what the thing was supposed to do on a whiteboard. After writing out each individual task the application needed to achieve, it was easy to break it down into individual methods and classes. After that you have a path, or a checklist, of all the things you need to learn how to do in Java using the Android APIs.

For example I needed to write an app to interpret XML data, store it and then display it to the user on demand, and complete the parse/download on a background thread. So breaking it down the tasks are:

- Download XML data and parse it
- Create storage space for parsed result
- Create some sort of UI to view results stored in a Database
- Start the download/parse on some sort of regular schedule

Those four tasks can then be broken up into individual methods and classes. For example, in using a database to store information I needed to write a database handler to create it, define its schema, and define all the I/O methods (more or less the CRUD stuff). One thing that is interesting to read is Oracle's beginner's guide to Java, which also covers object-oriented thinking since it's that sort of language (link: http://docs.oracle.com/javase/tutorial/java/concepts/).
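To make that a little more concrete, here's a rough sketch of the kind of database handler I mean, built on Android's SQLiteOpenHelper. The table and column names are made up purely for illustration, and a real handler would grow the rest of the CRUD methods.

--
// Sketch of a database handler using SQLiteOpenHelper; names are illustrative only.
import android.content.ContentValues;
import android.content.Context;
import android.database.sqlite.SQLiteDatabase;
import android.database.sqlite.SQLiteOpenHelper;

public class FeedDatabaseHandler extends SQLiteOpenHelper {
    private static final String DB_NAME = "feeds.db";
    private static final int DB_VERSION = 1;

    public FeedDatabaseHandler(Context context) {
        super(context, DB_NAME, null, DB_VERSION);
    }

    @Override
    public void onCreate(SQLiteDatabase db) {
        // Define the schema the first time the database is created
        db.execSQL("CREATE TABLE items (_id INTEGER PRIMARY KEY AUTOINCREMENT, "
                + "title TEXT, pub_date TEXT)");
    }

    @Override
    public void onUpgrade(SQLiteDatabase db, int oldVersion, int newVersion) {
        // Crude upgrade policy: drop the old table and rebuild it
        db.execSQL("DROP TABLE IF EXISTS items");
        onCreate(db);
    }

    // One of the CRUD methods: insert a single parsed item
    public long addItem(String title, String pubDate) {
        ContentValues values = new ContentValues();
        values.put("title", title);
        values.put("pub_date", pubDate);
        return getWritableDatabase().insert("items", null, values);
    }
}
--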

Start using Google to find tutorials for everything

In my experience with this I found that more or less everything I was trying to do had been done by someone else in the past in some form, and was documented. It's actually really easy to search for and then figure out how to write classes and methods for a whole variety of tasks. For example, I needed to figure out how to parse XML, found an excellent tutorial on that portion, and combined it with the lessons learned from a tutorial on SQLite (an embedded database).
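For the curious, the XML side boils down to something like the pull-parser sketch below. The tag name (<title> here) and the class around it are placeholders for whatever feed you happen to be reading.

--
// Bare-bones XML parsing with a pull parser; tag and class names are placeholders.
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserFactory;

public class FeedParser {
    public List<String> parseTitles(InputStream in) throws Exception {
        XmlPullParser parser = XmlPullParserFactory.newInstance().newPullParser();
        parser.setInput(in, null);  // let the parser figure out the encoding
        List<String> titles = new ArrayList<String>();
        int event = parser.getEventType();
        while (event != XmlPullParser.END_DOCUMENT) {
            if (event == XmlPullParser.START_TAG && "title".equals(parser.getName())) {
                titles.add(parser.nextText());  // grab the text inside each <title> element
            }
            event = parser.next();
        }
        return titles;
    }
}
--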

Outside of doing stuff with Android I've also found that Codecademy is a pretty cool place to learn about coding. The interactive projects are actually rather good and certainly do an excellent job of teaching one the way a language works. It's perfect for beginners or someone trying to pick up a new language for kicks. Here's a link: www.codecademy.com

I suppose while this article seems a bit aimless, the point is to share with you that coding is fun and easy to pick up if you look in the right places. The internet is filled with pretty much everything you need, from API docs to SDK docs to tutorials. The best part is most of it is completely free. So go ahead, wing it, and start coding!

Wednesday, December 19, 2012

Linux server performance

In my daily tasks I deal with a lot of Linux servers, and from time to time decide to tweak them for performance reasons, depending on what task they are executing. A lot of the units I'm dealing with tend to be operating a postgres database and some sort of data store for a custom application that is being run (usually via tomcat). I've found that there are three easy things to play around with in order to get the most out of the system, especially if the resources on the box are fairly limited. Those things are the swappiness value, the I/O scheduler, and use of the renice command implemented with a script called via crontab.

Swappiness

The swappiness value is what systems administrators and engineers use to tell the Linux kernel how aggressively it should swap pages of memory out to disk, as opposed to keeping them in RAM. Most default installations set this value to 60, which is supposed to represent a balanced number (the range is 0-100). In my situation, where I'm running a lot of database operations, I've found that a higher value seems to help free up memory for the postgres-related processes where otherwise idle system processes may have been holding on to it. This has been particularly effective on application servers with just barely enough memory to get by.

You can adjust the swappiness value in a couple of ways. For testing, you can change it on the fly by using the following command (via the terminal):

sysctl -w vm.swappiness=<value>

Writing the value directly to /proc/sys/vm/swappiness accomplishes the same thing. Note that neither change survives a reboot; to make it permanent, set vm.swappiness in /etc/sysctl.conf. One should exercise a bit of caution when changing these settings, as it takes some monitoring to make sure you aren't starving vital processes of memory.

I/O Scheduler

CFQ (Completely Fair Queuing)
If my memory is still serving me well, on most Linux distributions this is the default setting. This scheduler serves as a sort of general use setting as it has decent performance on a large number of configurations and uses ranging from servers to desktops. This scheduler attempts to balance resources evenly for multiple I/O requests, and across multiple I/O devices. It's great for things like desktops or general purpose servers.

Deadline
This one is particularly interesting, as it more or less takes five different queues and reorders requests in order to maximize I/O throughput and minimize latency. It attempts to get near real-time results with this method, and it distributes resources in a manner that avoids having any one process lose out entirely. This scheduler seems to be great for things like database servers, assuming the bottleneck in your particular case isn't CPU time.

Noop
This is a particularly lightweight scheduler that reduces CPU overhead by doing little to no sorting of the request queue. It assumes that the device(s) you are using have a scheduler of their own that is optimizing the order of things (a RAID controller or SSD, for example).

Anticipatory
This scheduler uses a slight delay on I/O operations in order to sort them in a manner that is most efficient based on the physical location of the data on disk. This tends to work out well for slower disks, and older equipment. The delay can cause a higher level of latency as well.

In choosing your scheduler you have to consider exactly what the system is doing. In my case, as I stated before, I am administering application/database servers with a fair amount of load, so I've chosen the deadline scheduler. If you'd like to read about these in a bit more detail, check out this Red Hat article (it's old but still has decent information): http://www.redhat.com/magazine/008jun05/features/schedulers/

You can change your scheduler either on the fly by using:
echo <scheduler> > /sys/block/<disk>/queue/scheduler

Or in a more permanent manner (survives reboot) by editing the following file:
/boot/grub.conf
You'll need to add 'elevator=<scheduler>' to the kernel line.

Using renice

Part of what my boxes do is serve up a web interface for users to interact with. When there are other tasks going on and the load spikes, access to this interface can become quite sluggish. In my scenario I'm using tomcat as the web application server, and it launches with a nice value of 0 (the normal user priority; the range is -20 to 19, with lower being more important). The problem with this is that postgres also operates at the same priority, so when it is loaded up with queries the two are on equal footing when fighting for CPU time. In order to improve the quality of the user experience I've decided to set the priority of the tomcat process to -1, allowing it to take CPU time as needed when users interact with the server. I've done this using a rather crude bash script and an entry in crontab (using crontab -e).

The script
--

#!/bin/bash
# grab the PIDs of the tomcat processes; the [t] keeps grep from matching itself
tomcatPids="$(ps -eaf | grep '[t]omcat' | awk '{print $2}')"
# renice takes the PIDs after a lowercase -p; a negative value requires root
renice -1 -p $tomcatPids
--
The crontab entry:
--
*/10 * * * * sh /some/path/here/reniceTomcat
--


The script above uses the ps, grep, and awk commands to pull the tomcat process IDs and then hands them to renice. Note that setting a negative nice value requires root, so the script and its crontab entry need to run as root. The crontab entry just calls the script on a periodic basis to make sure the process stays at that priority; in the case above it runs every 10 minutes, but it can be set to just about any sort of schedule. To read more on how to use cron scheduling check out this article: http://www.debian-administration.org/articles/56.


Thursday, November 29, 2012

Analytics in Columbus?

I read something rather interesting in the paper this morning. It would seem that IBM is putting a new analytics center right in my backyard here in Columbus, Ohio. This is big news for the city, as it's supposed to bring in around 500 new tech jobs as well as add credibility to the region as a tech center. Data/business analytics is a fascinating field, and is certainly worth a gander as it represents something significant for the future of the technology sector.

You see, right now all the talk is about 'big data': how it is stored, where it is served from, how it's collected. But the lingering question that a lot of companies are now answering is 'what do you do with it once it's there?'. Companies such as IBM are taking this data, boiling it down, and using it to formulate strategies, spot patterns of behavior that might be suspect, and see where consumer interest is going. This trend has given rise to a new type of IT job that is an interesting mix of both technological and business savvy.

What all of this means for those of us here in the Midwest is that it's a step toward breaking the assumption that all of the IT talent is on either the East or West Coast of the US. So all in all this should be a positive sign for the economy here, plus it will be interesting to see the sort of talent that Ohio State is able to churn out for this field. To that end the Fisher College is opening up a new graduate program for it, and the college itself is looking into something in the undergrad arena.

There should be some interesting times ahead for the Tech sector in Columbus.

Wednesday, November 14, 2012

Deficit Hawk: A cool federal budget app

So, due to my being a bit of a public policy nerd on top of my enjoyment of technology, I started playing around with an app on the Google Play Store called 'Deficit Hawk'. A friend of mine had suggested it, and I must say it's a neat little app. It takes CBO projections, along with possible choices that cover new revenues as well as cuts, and allows you to attempt to set a budget plan. It's incredibly easy to use, and gives you a nice graph so you can see how you are doing.

The plan I created when messing around with it is below:




----- NEW SPENDING CUTS -----

$-88.0 billion over ten years - Add a Public Plan to the Health Insurance Exchanges

$-88.5 billion over ten years - Apply the Social Security Benefit Formula to Individual Years of Earnings

$-112.0 billion over ten years - Base Social Security Cost-of-Living Adjustments on an Alternative Measure of Inflation

$-2.0 billion over ten years - Charge transactions fees to fund the Commodity Futures Trading Commission

$-4.8 billion over ten years - Drop Wealthier Communities from the Community Development Block Grant Program

$-20.8 billion over ten years - Increase Fees for Aviation Security

$-26.5 billion over ten years - Increase Guarantee Fees Charged by Fannie Mae and Freddie Mac

$-241.2 billion over ten years - Increase the Basic Premium for Medicare Part B to 35 Percent of the Program's Costs

$-85.6 billion over ten years - Limit Highway Funding to Expected Highway Revenues

$-62.4 billion over ten years - Limit Medical Malpractice Torts

$-84.6 billion over ten years - Link Initial Social Security Benefits to Average Prices Instead of Average Earnings|Implement progressive price indexing

$-124.8 billion over ten years - Raise the Age of Eligibility for Medicare to 67

$-119.9 billion over ten years - Raise the Full Retirement Age in Social Security

$-642.0 billion over ten years - Reduce Growth in Appropriations for Agencies Other Than the Department of Defense|Freeze Funding at 2011 Level

$-610.7 billion over ten years - Reduce the Growth in Appropriations for the Department of Defense|Freeze Funding at 2011 Level

$-112.0 billion over ten years - Require Manufacturers to Pay a Minimum Rebate on Drugs Covered Under Medicare Part D for Low-Income Beneficiaries

$-3.6 billion over ten years - Transfer the Tennessee Valley Authority's Electric Utility Functions and Associated Assets and Liabilities


----- NEW REVENUE -----

$309.5 billion over ten years - Accelerate and Modify the Excise Tax on High-Cost Health Care Coverage

$96.1 billion over ten years - Expand Social Security Coverage to Include Newly Hired State and Local Government Employees

$241.4 billion over ten years - Extend the Period for Depreciating the Cost of Certain Investments

$70.9 billion over ten years - Impose a Fee on Large Financial Institutions

$456.8 billion over ten years - Increase the Maximum Taxable Earnings for the Social Security Payroll Tax

$1.2 trillion over ten years - Limit the Tax Benefit of Itemized Deductions to 15 Percent

$48.7 billion over ten years - Raise Tax Rates on Capital Gains


--------

In any case, if you have an interest in public policy and enjoy playing around with neat apps on your phone or tablet, I suggest giving this a go.