Archive

Archive for February, 2009

Splitting the Difference

February 5th, 2009 Chris No comments

From a now not-so-recent Slashdot article:

The company [Volvo] will introduce a concept car based on the S60 this month at the Detroit Auto Show, looking ahead a few years to the goal that by 2020 “no one should be killed or injured in a Volvo car.” The concept car will have forward-looking radar as a proximity sensor, and the ability to brake if a collision is imminent. When the car senses a collision, a light flashes on the windscreen display along with an audible warning. If the driver doesn’t act, the car will brake automatically.

2009-volvo-s60-concept

Now, I consider myself pretty competent behind my wheel.  I consider most other drivers incompetent behind theirs.  But I would trust a human driver over a computer driver any day of the year.  Let me amend that: I would trust a human driver over a computer driver any day of the year when the road system contains any human drivers.  If the entire road system is being controlled by computer, then I think we might be on to something.  But so long as there are any humans driving any of the cars, having computers control any of them can only be trouble.

Humans don’t operate on rules. At least not from a computers’ perspective.  Human behavior has to be regarded as nearly completely random on the road, because the road system is not a closed system.  Any number of things, as far fetched as a plane crash, can happen while you’re driving, and your response is, in a certain respect, random.

I’ve slowly been finding the time to work through a new idea of mine, a data processing application where the focus lies on automation of processes to increase system throughput over large (not necessarily “clean”) data sets.  I started to wonder… exactly HOW much should I be leaving up to the user, and how much can I safely stuff into the black box labeled “super secret algorithm”?

The answer, to me, seemed clearly evident no sooner had I questioned myself.  The answer is quite simple: pass to the system all repetitive tasks.  After all, that’s what programs do, isn’t it?  By definition?  A program exists to take the same discrete set of actions time and time again, and branch here and there when the data changes. This is the very power the developer wields, at it’s most basic.  Where someone has to add the same numbers again and again, we can write a program for that.  Where someone has to verify names on a list against a second list, we can write a program for that.  These are basic examples, but you can easily extend the logic to more targeted and advanced tasks.

For example, where you have an authoritative list of spellings for some type of entity… say.. city names (“Chicago”, “Philadelpia”, “Portland”, etc) there are times when you’re bound to come across misspellings from your input sources. Such as “Philadelpia” in the above list. Since your algorithms may not be intelligent enough to figure out that that is a misspelling of “Philadelphia” the first time it sees it, then that’s something you have to pass off to a human. But here’s where you can pass the buck. By adding “Philadelpia” to a list of known misspellings, and attaching it to “Philadelphia”, even if your application may not be smart enough to pick up the mistake the first time around, you know you won’t be bugging your users about it again. This is a very basic example, but if you take the idea and apply it elsewhere, not only may you find yourself having fun writing some interesting algorithms, but you’ll be slimming down the work your users have to do.

The real trick in this is recognizing a repetitive task when it’s wearing a different costume.

wolfsheep

Categories: Uncategorized Tags:

Loop control

February 4th, 2009 Chris No comments

The STL (Standard Template Library) is an indispensable part of any C++ developers toolkit.  Their table of contents is practically burned into my URL bar.  Lists, vectors, maps, sets – all are such rudimentary data structures that I find it difficult to write an application without any of them.  One of the penalties of the STL though (after you master the gag reflex over its sometimes brutish syntax) are the iterators.

I love the iterators, don’t get me wrong, they are a very handy construct.  But if you’re not careful, they can chew the efficiency right out of your application.  The simplest example is this (supposing you have a vector of unsigned ints called “numbers”):

std::vector::iterator it;
for ( it = numbers.begin(); it != numbers.end(); ++it ) {
  // Do stuff with *it
}

In the above code, you’ll see a highly unexceptional for loop.  What’s wrong with it?  Potentially nothing.  So long as the vector isn’t too big.  Here’s an alternative:

std::vector::iterator it;
std::vector::iterator end = numbers.end();
for ( it = numbers.begin(); it != end; ++it ) {
  // Do stuff with *it
}

The difference is obvious, but it can really be a deal breaker if your containers are large enough.  I think back to this past summer.  I have an application for work that really needs to run as lean as possible.  It was doing okay, but I knew there were bottlenecks.  After consulting gprof, I realized that std::map::end() was chewing up billions of cycles.  Unnecessarily.  After applying the above logic throughout my code, I saw a significant performance increase.

How many cycles are you wasting in your loop control statements?

Update: Sorry for the odd behavior of the arrow brackets in some feed readers. Should be fixed now.

Categories: Uncategorized Tags:

Fuzzy Datetime

February 3rd, 2009 Chris No comments

Judging by traffic statistics, it seems that lot of people are trying to find ‘fuzzy time’ logic. Can’t blame them. So was I.

Well, after some other assorted work, I realized that although my fuzzy_time() function worked pretty well, it didn’t cut it in all cases.  Sometimes, you need a bit more detail than “about five months ago” but aren’t interested in the Unix timestamp of the event, exactly.  So, I came up with a fuzzy_datetime() function, in the spirit of it’s predecessor. Here’s some sample output:

-1 second: 04:31pm
-1 minute: 04:30pm
-1 hour  : 03:31pm
-1 day   : Yesterday at 04:31pm
-3 days  : Saturday (3 days ago) at 04:31pm
-1 week  : January 27th at 04:31pm
-1 month : January 3rd at 04:31pm
-4 months: October 3rd at 04:31pm
-8 months: June 3rd, 2008
-2 years : February 3rd, 2007

Enjoy.

/* Function: fuzzy_datetime
   * Author: Chris Tonkinson
   * Description: This is a function to take a time value, and return a user-friendly format
   *              string such as "Nov 17th (2 days ago)", "July 2nd, 2005", etc.
   * Parameters: $time - Any value acceptable to PHP's strtotime() function
   */
  function fuzzy_datetime( $time ) {
    if ( ( $time = strtotime( $time ) ) == false ) {
      return 'an unknown time';
    }
    define( 'NOW',        time() );
    define( 'ONE_MINUTE', 60 );
    define( 'ONE_HOUR',   ONE_MINUTE*60 );
    define( 'ONE_DAY',    ONE_HOUR*24 );
    define( 'ONE_MONTH',  ONE_DAY*30 );
    define( 'ONE_YEAR',   ONE_MONTH*12 );
 
    // sod = start of day :)
    $day = mktime( 0, 0, 0, date( 'm', $time ), date( 'd', $time ), date( 'Y', $time ) );
    $day_now = mktime( 0, 0, 0, date( 'm', NOW ), date( 'd', NOW ), date( 'Y', NOW ) );
    $yesterday = mktime( 0, 0, 0, date( 'm', NOW-ONE_DAY ), date( 'd', NOW-ONE_DAY ), date( 'Y', NOW-ONE_DAY ) );
    $year = date( 'Y', $time );
    $year_now = date( 'Y', NOW );
    $time_ago = (NOW-$time);
 
    // Show the hour and minute, if $time is during 'today'...
    if ( $day == $day_now ) {
      return date( 'h:ia', $time );
    }
 
    // Show a 'Yesterday at' string for values during 'yesterday'...
    if ( $day == $yesterday ) {
      return 'Yesterday at ' . date( 'h:ia', $time );
    }
 
    // Up to six days ago, we use the name of the day (e.g. 'Monday')...
    if ( $time_ago < ONE_DAY*6 ) {
      return date( 'l', $time ) . ' (' . (($day_now-$day)/ONE_DAY) . ' days ago) at ' . date( 'h:ia', $time );
    }
 
    // If $time occured 'this' calendar year, show a month and day name...
    if ( $time_ago < ONE_MONTH*6 || $year == $year_now ) {
      return date( 'F jS', $time ) . ' at ' . date( 'h:ia', $time );
    }
 
    // The default case, we show the month, day of the month, and year...
    return date( 'F jS, Y', $time );
  }
Categories: Uncategorized Tags:

Don’t be lazy. Don’t use eval().

February 2nd, 2009 Chris 55 comments

There are many advantages to using interpreted languages. One of the obvious benefits of interpreted (as opposed to compiled) languages in my mind is dynamic typing. In some contexts, I don’t care whether foo is a string, and int, or a unique key associative container – I just care that it exists. Every program (whether executable or script, whether binary, bytecode, or ascii text) is simply a set of rules governing data manipulation. Connecting the dots, we can easily see how the power and flexibility of an interpreted language manifests itself. More times than we realize, our “rules” which govern data manipulation are, in themselves, pieces of data. This fundamental principle, when massaged a little, yields the methods by which we are granted one of the trickiest functions in modern language design. The tremendously powerful, absolutely flexible, eval().  I’ll be looking at it from a Python perspective

A mentor and former coworker of mine once warned me of the dangers of eval(). I scrutinized his arguments. I considered his position. Ultimately, the wisdom in his logic won me over, and I have, I’ll admit, looked back.  But only to realize that there really is never a valid excuse to use eval().  And that’s just what they are, excuses.

wtf

With the average modern desktop processor core approaching 2 or 3 GHz, and the average modern desktop RAM meeting or besting 2GB, we are spoiled.  All of us.  We are lulled into a false sense of security by believing that throwing more (and better) hardware at a problem is a sufficient excuse for writing poor code, shoddy algorithms, and overall paying less attention to detail.  Don’t get me wrong, Jeff’s argument is carefully thought out and well presented.  But I take it with a grain of salt.  In fact, lots of salt.  Don’t use fast hardware as an excuse to be lazy.  Where does the habit of eschewing proper paradigm and using your hardware as a crutch stop?  Perhaps none of us will be renowned computer scientists, but if we continue to take the easy way out simply because our hardware allows us to do so – and we teach new programmers to do the same – how many future Turing’s and Knuth’s are we denying the opportunity to get a grasp on what’s really going on and contribute great things to the industry?

The modern interpreter is a maze of logic, grammar, instruction, and contextual analyzation.  A lot more goes on than meets the eye to execute your favorite little script:

print "Hello, World!"

I think that any instructor worth this salt would balk at the prospect of writing that little ditty in this fashion:

print eval( ""Hello, World!"" )

Why?  Because it’s entirely unnecessary.  The eval() function isn’t a free lunch.  You’ve got to load in environmental variables, parse the target code string, in the case of Python and similar languages, compile that text into bytecode, and THEN you’ve got to actually go ahead and execute the code.  There is a much more straightforward way of doing it (previously illustrated) and so for efficiency’s sake, no argument exists to use the second form over the first.

There is also another concern.  A little thing called security.  As I see it, there are two basic methods of eval() invocation.  The first method (call it “A”) involves eval()ing code which all or in part was contributed by a user (e.g. a human).  We’re talking about something like this:

if ( sys.argc &lt; 2 ):
  print "syntax: ", argv[0], "  [arg1, arg2...]"
  exit( -1 )
func = eval( argv[1] )
apply( func, argv[2:] )

Here, we are running arbitrary user generated syntax right in our code.  The second method (call it “B”) is when you run code that is wholely or partly contributed by some nonhuman subsystem, such as a randomization routine, a hashing algorithm, or a MySQL query.  For example:

for record in result: # Where 'result' is a valid MySQL result object
  func = eval( record[0] )
apply( func, record[1:] )

Both methods of using eval() are inextricably unsafe, in particular, both allow arbitrary systems (whether human or not – and more on that later) to load code into Python to be executed along with the rest of the script.  When one system takes code in plain text form, and passes that code to another system for execution, that’s called injection, and it’s exactly what eval() is doing.  A seasoned web developer would never intentionally leave a vehicle by which a user could inject JavaScript using their site, and similarly anyone who uses a database should be well aware of the threat posed by SQL injection.  And yet time and time again developers hardcode injection into their own scripts by using eval().

You might argue that method B is a much safer alternative, since there is no direct human interaction.  While there is some merit to that stance, the key is to realize that those other subsystems (the hashing algorithm, the database, the “secure” source files) are themselves vunerable to attack.  If you’re eval()ing code that came from a database, and someone comprimises that database, there goes the security of your script, and the validity of anything relying upon it’s output.  In effect, method B is only a roundabout way of method A.  And we all know that security through obscurity offers no security at all.

Recently a professor of mine (Dr. Dale Parson, formerly of Bell Labs) set us an assignment (pdf).  We are to take an incomplete script and fill in a few very small holes.  The exercise itself is painfully simple, but his goal is obvious enough.  Dr. Parson wants us to learn the mechanics of apply() and eval() while giving us some overall exposure to the language (Python, that is).  So, strictly from an instructional point of view, I see the exercise as pretty reasonable.  What I find flawed is that it perpetuates the use of eval().

I’d be willing to bet that, upon submission of this project, Dr. Parson (whether manually or via some helper script) will first run the program to see if it even works, and then open the source and check the code.  In that order.  In this assignment, we are prompted to write our own sort function, and then using eval() and apply() teach the script how to take the name of the sort function we’d like to use via command line arguments (e.g. sys.argv).  What would happen, if I were to write a function called mysort() which conformed the the appropriate signature and peformed the requested sorting, but also annihilated the users home directory?  What would stop me?  Not eval().  And if he examined the code first?  I could spend the time to obscure the code enough that, without explicitly looking for my having done so, he wouldn’t notice even if he did examine the code before he ran it.

And it would be his fault for using eval().

Sparing the details, his project guidelines suggest writing the following line of code, in order to get a pointer to the sort function the user wants to use:

sortfunc = eval( sys.argv[1] )

How do you check to make sure sys.argv[1] holds the name of a valid sort function?  Much more straightforward (and tremendously more safe) would be to fabricate a whitelist of acceptable sort functions, like so:

whitelist = { 'mergesort': mergesort, 'splitsort': splitsort, 'mysort': mysort }

And then your apply() call changes from:

apply( sortfunc, arguments )

to:

apply( whitelist[sys.argv[1]], arguments )

Granted, in this situation it’s a case of ’six of one, half a dozen of the other’ since we would be writing the function and adding it to the whitelist, and the professor would still probably run them blindly.  The point is this process is done much faster, tidier, and more secure, than if we would use eval() for it.  Again I stress that the difference is nil in an insignificant school project like this, but using the above example, hopefully it has become evident how leaky eval() really is.

In the past five or so years, I have had what I thought was a legitimate excuse to use eval() two (only two!) times.  Each time, I got up, got a glass of water, took a stroll around the office or my apartment, sat back down, and reworked the script without eval().  And both times, the resulting code was cleaner and more concise from a design standpoint.  And, because I went without eval(), the code was quicker and more secure.

I very much hope that the lecture after our project is due, the professor goes on a nice rant about the evils of eval(), and explains how this was merely an exercise to give us a taste of what kinds of things interpreted languages are capable of (most of my colleagues have only studied C++ and Java up until now), and is totally clear that in a real world application, eval() shouldn’t be touched with a thirty-nine-and-a-half-foot-pole.

grinch_santa

Update (6/12/2009): The second chapter of this discussion: eval( “round 2″ ).

Categories: Uncategorized Tags: