Sunday, October 25, 2009

Faster composite regular expressions

Regular expressions are a powerful tool, but they quickly become too long to be readable. Some people use the //x modifier. I prefer to split them into many smaller regular expressions, for example:

    my $re_num = qr/.../;
    my $re_quoted = qr/.../;
    my $re_value = qr/$re_num|$re_quoted/;
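
To make the idea concrete, here is a small filled-in sketch. The two sub-patterns are illustrative assumptions of mine, not the ones elided above; the point is only that qr// objects keep their own grouping when interpolated into a larger pattern.

```perl
use strict;
use warnings;

# Hypothetical sub-patterns, chosen only for illustration.
my $re_num    = qr/-?\d+(?:\.\d+)?/;       # integer or decimal number
my $re_quoted = qr/"(?:[^"\\]|\\.)*"/;     # double-quoted string with escapes
my $re_value  = qr/$re_num|$re_quoted/;    # either of the above

# Each qr// object is interpolated as its own group, so the
# alternation does not leak into the surrounding ^ and $ anchors.
for my $input ('42', '-3.14', '"a \"b\" c"', 'oops') {
    my $verdict = $input =~ /^$re_value$/ ? 'value' : 'not a value';
    print "$input: $verdict\n";
}
```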

This works just fine. I usually compile them in package scope beforehand and then use them in functions with //o:

    my $re_foo = ...;
    sub foo {
        ...
        if ( /^$re_foo/o ) {
            ...
        }
        ...
    }
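
One thing to keep in mind about //o is that the pattern is built only once, so later changes to the interpolated variable are ignored. The sketch below shows this with two hypothetical helper subs (`match_once` and `match_dynamic` are names made up for the example):

```perl
use strict;
use warnings;

# With //o the pattern is interpolated and compiled on the first call only.
sub match_once {
    my ($pat, $str) = @_;
    return $str =~ /^$pat$/o ? 1 : 0;
}

# Without //o the current value of $pat is used on every call.
sub match_dynamic {
    my ($pat, $str) = @_;
    return $str =~ /^$pat$/ ? 1 : 0;
}

print match_once('\d+', '123'), "\n";        # 1: \d+ is compiled here
print match_once('[a-z]+', 'abc'), "\n";     # 0: still matching against \d+
print match_dynamic('\d+', '123'), "\n";     # 1
print match_dynamic('[a-z]+', 'abc'), "\n";  # 1: new pattern is honoured
```

This is why //o only suits patterns that really are fixed for the lifetime of the program.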

It doesn't matter what exactly you do; the question is how much speed you lose if you need these REs to be dynamic. I decided to run a simple test to understand which one is faster:

    use Benchmark qw(cmpthese);
    my $count = -60;

    my $re = qr/\d+/;
    my $re_pre = qr/^\d+$/;

    cmpthese($count, {
        static => sub { return "123456789" =~ /^\d+$/ },
        o => sub { return "123456789" =~ /^$re$/o },
        no_o => sub { return "123456789" =~ /^$re$/ },
        no_o_pre => sub { return "123456789" =~ $re_pre },
    });

    cmpthese($count, {
        static => sub { return "123456w789" =~ /^\d+$/ },
        o => sub { return "123456w789" =~ /^$re$/o },
        no_o => sub { return "123456w789" =~ /^$re$/ },
        no_o_pre => sub { return "123456w789" =~ $re_pre },
    });

It simply compares four variants: a plain old static regexp, a regexp in a variable with some additions, the same with //o, and finally another RE that already includes all the additions and is used directly, without // around it. Here are the results:

                  Rate     no_o no_o_pre        o   static
    no_o      851115/s       --     -30%     -41%     -47%
    no_o_pre 1222940/s      44%       --     -15%     -24%
    o        1443941/s      70%      18%       --     -11%
    static   1613818/s      90%      32%      12%       --
                  Rate     no_o no_o_pre        o   static
    no_o      923012/s       --     -33%     -37%     -46%
    no_o_pre 1376153/s      49%       --      -6%     -19%
    o        1471770/s      59%       7%       --     -14%
    static   1705241/s      85%      24%      16%       --

The results are consistent with my expectations. I'll try to explain them, but I can't say I know everything about this. In the 'no_o' case perl has to interpolate the pattern on every call and, whenever the resulting string differs from the previous one, compile it again. That per-call work is enough to make it 30-33% slower than the next variant. 'o' and 'no_o_pre' are very close, and I expected something like that. In the 'o' case perl has to compile once at runtime and check the cache on each call. In the 'no_o_pre' case perl has to check each time that the thing on the right-hand side is an RE object. It's probably possible to make the //o case very close to static by rebuilding the op tree, but that would upset some deparse modules. The static case is the fastest, which is understandable.

Should you use this? Yes. All the time? No. For example, suppose you write a parser for Apache logs, not a simple one, but a parser that takes log format strings and builds regular expressions for that particular format. In that case I would think twice about the design and the way REs are used.
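
One possible design for such a parser, sketched under heavy assumptions: build one regex per format string with qr// and cache it, so the per-line match uses an already compiled object (the 'no_o_pre' case from the benchmark). The directive table, the `%compiled` cache and the `regex_for_format` sub are all hypothetical, and real Apache format strings have many more directives than the three shown here.

```perl
use strict;
use warnings;

# Hypothetical mapping from format directives to sub-patterns.
my %re_for = (
    '%h' => qr/(\S+)/,      # remote host
    '%s' => qr/(\d{3})/,    # status code
    '%b' => qr/(\d+|-)/,    # response size, or "-" when empty
);

my %compiled;   # cache: format string => compiled regex

# Build (or fetch from the cache) one regex for a whole format string.
sub regex_for_format {
    my ($format) = @_;
    return $compiled{$format} ||= do {
        my $body = join '\s+',
            map { $re_for{$_} || quotemeta($_) } split ' ', $format;
        qr/^$body$/;
    };
}

my $re     = regex_for_format('%h %s %b');
my @fields = '127.0.0.1 200 512' =~ $re;
print join(',', @fields), "\n";
```

The compilation cost is then paid once per distinct format rather than once per log line.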

Wednesday, October 07, 2009

Easy thing, but useful, strange that nobody implemented it earlier

This post is about Perl, Mason, memory leaks, and hunting them easily in object-oriented applications based on these technologies.

It's no secret that you can cause a memory leak by introducing a reference cycle. It often happens in tree structures when a parent holds references to all its children and each child references its parent.

Perl has weak references that help avoid most of these problems, or you can ask people to call a method to destroy the structure. Developers who post modules on CPAN are usually aware of these solutions and cover this. However, it can be done in different ways and is easy to overlook in the docs.
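
As a minimal sketch of the parent/child cycle described above and of breaking it with a weak reference, consider the following. The `Node` class and the `$destroyed` counter are made up for the example; `weaken` comes from the core Scalar::Util module.

```perl
use strict;
use warnings;

my $destroyed = 0;    # counts DESTROY calls, for demonstration only

{
    package Node;
    use Scalar::Util qw(weaken);

    sub new {
        my ($class, %args) = @_;
        return bless { children => [], parent => undef, %args }, $class;
    }

    sub add_child {
        my ($self, $child) = @_;
        push @{ $self->{children} }, $child;   # strong ref: parent -> child
        $child->{parent} = $self;
        weaken($child->{parent});              # weak ref: child -> parent
        return $child;
    }

    sub DESTROY { $destroyed++ }
}

{
    my $root = Node->new(name => 'root');
    $root->add_child(Node->new(name => 'leaf'));
}   # $root goes out of scope here

print "destroyed: $destroyed\n";   # prints "destroyed: 2"
```

Without the `weaken` call the two objects would keep each other alive, and neither DESTROY would run until the program exits.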

For a long time I used different modules to catch leaks, for example Devel::Leak::Object. It's a really useful module, but I used it with custom patches for better diagnostics.

Recently I had to look into leaks once again and started to wonder how to find a leak that is not reproducible on my machine, but that a customer sees and cannot say which request causes it. I looked at CPAN again and found Devel::LeakGuard::Object, a new incarnation of Devel::Leak::Object with additional ways to instrument reporting.

It was very easy to write a simple memory leak tracer for Mason-based applications as a Mason plugin. So far it has helped me identify three small memory leaks in the Request Tracker software just by enabling this new module in my development environment. The leaks just popped up in the logs during testing.

I hope that MasonX::LeakGuard::Object can help other people as well.