Hierarchical Structures in Python – i.e. folders

The approach I found most appealing for dealing with hierarchical structures is a tree. I think it is pretty straightforward and easy to implement and customize.
First we need a class that defines the nodes of the tree.

class Node:
    def __init__(self, name):
        self.children_list = []  # child Node objects go here
        self.name_str = name

    def add_child(self, node):
        self.children_list.append(node)

This is the basic version of the node class and we are good to go: it has everything we need to create a new node and to store nodes as children, thus building a hierarchy.

So, we build a root node…

root = Node("/")

then we create a second one

first_child = Node("first_child")

and append it to root

root.add_child(first_child)

We could go on like this (create a node, append it to an already existing node as a child) to build up a whole tree.

Now let's assume we have a nice tree and we want to return a JSON representation of the structure.
Since I like to keep things separated, I would make a new file and import the node file.
This new file might look something like this:

def get_tree(node, tree=None):
    # a mutable default argument ([]) would persist between calls,
    # so we create a fresh list for each top-level call instead
    if tree is None:
        tree = []
    tree.append(node.name_str)
    for child in node.children_list:
        subtree = []
        tree.append(subtree)
        get_tree(child, subtree)
    return tree

So there you go, the programmer's best friend: recursion. This returns the tree as a nested list containing a name and, if the node has children, a list of children. This structure can then be serialized to JSON, pickled, or stored.
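
For illustration, here is a minimal usage sketch, assuming the node class and get_tree from above are imported into one namespace:

import json

# build a small tree: / -> first_child -> grandchild
root = Node("/")
first_child = Node("first_child")
root.add_child(first_child)
first_child.add_child(Node("grandchild"))

print(json.dumps(get_tree(root)))
# prints: ["/", ["first_child", ["grandchild"]]]
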
To associate content with a folder there are several options: you could give each node a list of content objects, similar to children_list, or a list of content ids.
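
A sketch of the first option might look like this (content_list and add_content are hypothetical names, not part of the class above):

class Node:
    def __init__(self, name):
        self.children_list = []
        self.content_list = []  # hypothetical: content objects (or their ids)
        self.name_str = name

    def add_child(self, node):
        self.children_list.append(node)

    def add_content(self, content):
        self.content_list.append(content)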


Perl vs Python – Regex

tl;dr

Perl does not outperform Python when it comes to regexes. Worse, Perl's speed drops significantly when the term to match is preceded by “.*”.

I am constantly told that Perl has much better regex performance than Python. When I ask people how they know, they answer with “everybody knows that” or “because it’s native”, or I am shown some obscure benchmarks which seem to test anything but regex performance (hardcoded regex vs. interpolated etc.). I wanted to know for myself, and I wanted to fiddle around with performance analysis since I am dealing with Big-O lately. So, without putting an end to the discussion, and more as a base for discussions with colleagues and friends, here is what I did:

1. I took a large text (Moby Dick at archive.org).

2. I wrote two very small programs, one in Perl and one in Python.

3. I read in the whole file and measured the time (to be able to see whether one program takes longer to read than the other).

4. I ran the code with a regex.

5. I changed the regex and ran them again.

6. I measured with Linux time.

I am, however, not interested in absolute performance (which is machine dependent) but in relative performance.
The versions were
perl 5, version 18, subversion 2 (v5.18.2) built for darwin-thread-multi-2level
and
Python 2.7.6 (default, Jan 17 2014, 15:43:59) [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin

The first two scripts were these

import re

count = 0
with open('mobydick.txt', 'r') as f:
    data = f.read()


#!/usr/bin/perl -w
use utf8;
use strict;
use warnings;

my $string;

open FILE, "<", "mobydick.txt";
$string = join("", <FILE>);
close FILE;

Ran them both and got

python py_regex.py 0,02s user 0,02s system 53% cpu 0,069 total

perl pl_regex.pl 0,01s user 0,02s system 70% cpu 0,047 total

Pretty close. So, I don’t have to concern myself with reading speed in the next measurements.

Then I changed the code to include some regexes. I just counted how many times the word “Pequod” was used.

import re

count = 0
with open('mobydick.txt', 'r') as f:
    data = f.read()

m = re.findall('(Pequod)', data)

for find in m:
    print find
    count += 1

print "%d" % count


#!/usr/bin/perl -w
use utf8;
use strict;
use warnings;

my $count = 0;
my $string;

open FILE, "<", "mobydick.txt";
$string = join("", <FILE>);
close FILE;

my @m = $string =~ /(Pequod)/g;

foreach (@m) {
    print "$_\n";
    $count++;
}

print $count."\n";

Ran them again and got:

Pequod
[...]
Pequod
66
python py_regex.py 0,02s user 0,01s system 89% cpu 0,033 total

And

Pequod
[...]
Pequod
66
perl pl_regex.pl 0,01s user 0,01s system 89% cpu 0,021 total

Okay, that was a little surprising, since in the discussions I had before, “outperforms” was a term used quite often.
Maybe it was just that the regex was simply not complex enough or something…

Change the regex and keep everything else.

m = re.findall('(.*Pequod.*)\s', data)

my @m = $string =~ /(.*Pequod.*)\s/g;

And run it again

the Pequod. Devil-Dam, I do not know the origin of ;
[...]
SLOWLY wading through the meadows of brit, the Pequod
66
python py_regex.py 0,07s user 0,01s system 95% cpu 0,082 total

Not too bad an increase.

the Pequod. Devil-Dam, I do not know the origin of ;
[...]
SLOWLY wading through the meadows of brit, the Pequod
66
perl pl_regex.pl 18,16s user 0,09s system 99% cpu 18,347 total

GOODNESS ME!!

This drop in speed seems to occur when the matching term is preceded by “.*”.  This might be connected to the lack of variable length look-behind, but that is just me speculating.
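
If you want to reproduce the Python side without the shell's time, here is a minimal sketch using the timeit module (assuming mobydick.txt sits in the working directory; the absolute numbers will of course differ per machine):

import re
import timeit

with open('mobydick.txt') as f:
    data = f.read()

# compare the plain pattern against the ".*"-wrapped one
for pattern in [r'(Pequod)', r'(.*Pequod.*)\s']:
    regex = re.compile(pattern)
    seconds = timeit.timeit(lambda: regex.findall(data), number=10) / 10
    print pattern, seconds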

But nonetheless, I wouldn't consider Perl for applications dealing with a lot of text, as I could never be sure not to be left with a regex that leads to performance issues in the system.

 

A sip from Flask

Lately I came to find Django a bit top-heavy for one of my projects, so I chose Flask as a lighter and smaller alternative.
After fiddling with the tutorials for a bit I wanted a setup with several modules. Surprisingly, that wasn't as easy as expected: the snippets and examples showed several options and configurations and… So, this is what worked for me. It may not be the true gospel, but I wanted modules mounted at certain URLs, like mounted apps in Padrino.

This is what I came up with:

    + Project
      -- start.py
      + module1
         -- __init__.py
         -- app.py
      + module2
         -- __init__.py
         -- app.py

So module1 and module2 are two functional units which should answer to specific prefixes (localhost:5000/module1 and localhost:5000/module2), and start.py is the file that runs the whole show.

I used Flask's blueprints to get it all under one roof.

First let’s get the modules to behave like modules. In module1/app.py I added:

     from flask import Blueprint

     app1 = Blueprint('app1', __name__)

     @app1.route('/')
     def index():
         ...

For module2 app.py looks similar except that app1 is changed to app2.

So, now we have the blueprints, but the project does not know about them yet. In fact we don't have an app at all so far. All the nuts and bolts go into start.py:

    from flask import Flask
    from module1.app import app1
    from module2.app import app2

    project = Flask(__name__)
    project.register_blueprint(app1, url_prefix='/path1')
    project.register_blueprint(app2, url_prefix='/path2')

    if __name__ == '__main__':
        project.run()

This is the beauty of blueprints (imho): import the blueprint, register it, and put it on a dedicated path.

Done. Two modules in a Flask application. Run start.py and the two blueprints answer under their respective prefixes.

gitweb – shorty

The Team demanded (or asked nicely) for a graphical overview of all the git repositories, so here is the quick way to do it:

  1. Install gitweb

       sudo apt-get install gitweb
    
  2. Make an empty directory that is the root of all the repositories, e.g. pub/. This is necessary since git has no concept of a root repository holding others.

     mkdir pub/
    
  3. Change owner to the user who owns the repositories

     sudo chown -R git:git pub
    
  4. Now we link the repositories into pub/ (move into pub/ and run)

     ln -s /path/repo1.git rep1
     ln -s /path/repo2.git rep2
    
  5. Now we open /etc/gitweb.conf and point the project root variable to pub/

     $projectroot = "/path/pub";
    

Now http://server/gitweb should show the list of repos. If not, you probably have to edit $projectroot in /usr/share/gitweb/gitweb.cgi too.

git hooks – reel in

In the last post I sketched out a simple Jabber notification script for remote git repositories. There are some things that can be improved there.

First I added an additional argument to exclude the committer from the message queue: I know that I committed, so I don't have to be informed about that later (I updated my GitHub repo). So, I have another argument in the call, but what now?
In pushbot.py there is a dict that holds the name (or email) of the committer as key and the Jabber ID as value.

But that in itself is pretty useless, so we have to tweak the hook a little to pass the name of the committer as a second parameter. This is best achieved using

git log -1

which gives us the last commit entry. Better still, we can add a formatting instruction like this

git log -1 --pretty=format:"%ce"

which gives us the email address of the committing party. I will use this as the key in the pushbot dict holding the Jabber IDs to which the push notification should be sent. I don't use the committer's name here because of formatting hubbub and because I am less likely to run into problems with duplicates.

So, in pushbot.py I will add an email address as a key:

rcps_list = {'email@server': 'jabber@server'}

Now the committer should not receive any messages concerning his own commits. But we could still improve the notification message by using the very same git log statement.

In hooks/post-receive we could generate a more detailed message using

git log -1 --pretty=format:"%cn, %s"

which gives us the name of the committer and the subject line. Insert this into the message and you have a nice push notification with sufficient detail to decide what to do without too much overhead.
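
Putting the pieces together, the hook could call a small Python helper like this sketch (the pushbot path and its argument order are assumptions; adapt them to your checkout):

import subprocess

# email of the last committer: the key to exclude from the recipients
committer = subprocess.check_output(
    ['git', 'log', '-1', '--pretty=format:%ce'])

# committer name and subject line make a more useful message
message = subprocess.check_output(
    ['git', 'log', '-1', '--pretty=format:%cn, %s'])

subprocess.call(['python', '/path/to/pushbot.py', message, committer])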

Git Hook, Line and Sinker

Selfhosting your git repositories is not a bad idea. In fact it is a great idea and it’s pretty simple too.

First you make a new directory on an accessible machine which, by convention, ends in .git, something like /allmycode/repo1.git.

Move into the directory and execute

 git init --bare --shared

Great, we got ourselves a shareable git repository. If you don't want to be the only one working on that repository, and have no intention of making it public either, you should create a user specifically for git operations on the machine you serve your repositories from.
Let's assume your specialized user is called “git”.

You can now add SSH public keys from all parties that should have access to the repos via ssh-copy-id (the keys end up in /home/git/.ssh/authorized_keys) and have nice passwordless access control.

Now we can start to work on the remote repository.
In your local working directory we run

git init

and provide the user information that is used in the commit message

git config --global user.name "Your Name"

git config --global user.email your@email.sth

This was all local, so let's add the information about the remote

git remote add origin git@server:/allmycode/repo1.git

This enables us to push to the remote with the shorter

git push origin master

It is completely viable to add differently labeled remote repositories e.g.

 git remote add github sth@github

and push a specialised branch (without passwords for example) there via

 git push github public

Nice, self-hosted remote repositories! You can start collaborating. And when you do, you might want to automate transferring the newest version to a testing server. You could do this with a cronjob and some copying, or you could use git's very own hooks, to be specific a post-receive hook.
Connect to the remote repository and enter the directory hooks/. Here you find some nice samples, but we want something different. We want a post-receive hook, which means that every time somebody pushes changes to the remote repository this action is called. So we create that hook:

touch post-receive

then we paste in

#!/bin/sh
# check out the pushed tree into the web server's document root
GIT_WORK_TREE=/path/to/serverroot/ git checkout -f

and save. Make it executable and you've made a git hook. Congrats!
Since we have a user named git who owns all the repos on our remote machine, we must add him to the group that controls the webserver paths (www-data or similar) to make the checkout work.

Now every push to the remote repository should trigger a checkout which hopefully makes the newest version available on the webserver.

But let's tweak things a little. Say we want to be notified whenever a commit has been pushed. Email and telephone are viable but time-consuming, and you don't want to, and frankly should not have to, bother. I think Jabber is a great way of getting the information across without spamming the whole team. So I made a little script that sends a message to everybody who cares to give me their Jabber ID. You can get it via

git clone https://github.com/kodekitchen/punobo.git

If you add to the post-receive hook

 python /<path-to-repo>/pushbot.py "Something has been pushed."

not only will your testing/demo/development server automatically be updated, but all listed members of the working group will be informed about it on Jabber.

Business Card with Latex

So, I needed some business cards for a meeting, but I rarely ever need more than 8 or so at a time (yes, I'm aware that they look less classy, but having some printed would have taken too long), so I decided to make some with LaTeX.
So, here is what I have done:

First I declared a XeTeX preamble to be able to use the fonts from my Linux system and to not have to bother with encoding:

 \documentclass[a4paper,11pt]{article}
 \usepackage[cm-default]{fontspec}
 \usepackage{xunicode}
 \usepackage{xltxtra}
 \usepackage{graphicx}
 \setmainfont[Mapping=tex-text]{Ubuntu}
 \setsansfont[Mapping=tex-text]{Ubuntu}
 \setmonofont[Mapping=tex-text]{Cantarell}

Next I got rid of all the elements that come with the article document class by default and redefined the width and height of the paper to match an A4 sheet, along with some other dimensions.

 \pagestyle{empty}
 \setlength{\unitlength}{1mm}
 \setlength{\paperheight}{297mm}
 \setlength{\paperwidth}{210mm}
 \setlength{\oddsidemargin}{-7mm}
 \setlength{\topmargin}{32mm}
 \setlength{\textheight}{280mm}

After that I declared all text elements that should be on the card.

 \newcommand{\bcname}{Caspar David Dzikus}
 \newcommand{\bctitleA}{KodeKitchen Writer}
 \newcommand{\bctitleB}{}
 \newcommand{\bccontactA}{555-555-5555}
 \newcommand{\bccontactB}{caspar@kodekitchen.com}
 \newcommand{\bccontactC}{http://kodekitchen.com}
 \newcommand{\bcsub}{coding and stuff}

The document itself is pretty straightforward: the card is a picture environment which is then repeated ten times (five rows, two columns) inside another picture. To help cut the cards, marks are placed in the corners of each card (which is 80 x 50 mm).

 \begin{document}
 \begin{picture}(170,209)(0,0)
 \multiput(0,0)(0,50){5}{
    \multiput(0,0)(80,0){2}{
       \begin{picture}(80,50)(0,0)
       % marks for cutting
       \put(-1,0){\line(1,0){2}}
       \put(0,49){\line(0,1){2}}
       \put(-1,50){\line(1,0){2}}
       \put(0,-1){\line(0,1){2}}
       \put(80,49){\line(0,1){2}}
       \put(80,-1){\line(0,1){2}}
       \put(79,0){\line(1,0){2}}
       \put(79,50){\line(1,0){2}}

      \put(13,39.5){\textsf{\LARGE\bcname}}
      \put(13,34){\textsf{\scriptsize\bctitleA}}
      \put(13,31){\textsf{\scriptsize\bctitleB}}
      \put(13,24){{\tt\normalsize\bccontactA}}
      \put(13,19){{\tt\normalsize\bccontactB}}
      \put(13,14){{\tt\normalsize\bccontactC}}
      \put(55,8){\textsf{\scriptsize\bcsub}}

      \end{picture}
      }
 }
 \end{picture}

 \end{document}

And this is what you get: a full A4 sheet of ten cards with cutting marks.

Indexing with Elasticsearch and Django

So, every decent webapp needs a search feature? Okay, here we go.

It all starts with downloading Elasticsearch.
After extracting it, start it with

bin/elasticsearch -f

The -f parameter gives you a little output, especially the port and host. By default this would be localhost:9200.

So let’s get to the Django bit.
The first thing to check is whether the model object you want to index for search has one or more foreign key fields.
If so, you might not want to index the ids (it is very unlikely that a user would search for an id).
So what to do? Since data is passed to Elasticsearch as a JSON object, we will use Django's built-in serializer to convert our model object into a JSON object and then pass that on. The serializer provides an option called natural keys, which is enabled by adding

use_natural_keys = True

to the serializers.serialize('json', modelObject) call as a third argument. To successfully use this, the model which the foreign key field references has to be extended by a method natural_key.

As an example, let's say we've got two model classes: one is Product, which has a foreign key field manufacturer referencing a model of that name:

Manufacturer
    name
    address
    website...

Product
    prod_id
    name
    manufacturer <- there it is, a foreign key to the above
    price...

So if we want to index products for search, we may want the manufacturer field to be a name (or a name and address combination etc.). Therefore we define a method natural_key in the Manufacturer class, i.e.:

def natural_key(self):
    return (self.name,)  # note the comma: natural keys are tuples

Thus when serializing a Product the “unsearchable” ID is converted to the manufacturer’s name.
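
To make that concrete: serializing a single product might then yield something shaped like this (all values made up for illustration):

serialized = serializers.serialize('json', [product], use_natural_keys=True)
# roughly: [{"pk": 1, "model": "shop.product",
#            "fields": {"name": "Widget", "manufacturer": ["Acme"], "price": "9.99"}}]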

The general idea now is to pass the object as a serialized string to a function that then does the indexing on its own. Something like this:

...
new_product = Product(...)
new_product.save()
myIndexModule.add_to_index(serializers.serialize('json', [new_product], use_natural_keys=True))

So, now to the indexing itself. I use pyelasticsearch, for no special reason other than that its documentation seemed decent.
The indexer lives in its own module, since I wanted it separated from the rest of the application, and it is pretty short.

from pyelasticsearch import ElasticSearch
import json

ES = ElasticSearch('http://localhost:9200')

def add_to_index(string):
    deserialized = json.loads(string)
    for element in deserialized:
        element_id = element["pk"]
        name = element["model"].split('.')[1]  # strip the app prefix; just cosmetics
        index = name + "-index"
        element_type = name
        data = element["fields"]
        ES.index(index, element_type, data, id=element_id)

That's it. One could certainly do more sophisticated stuff (like plural for the index and singular for the element type, and then do something clever about irregular plurals…) but it does the job.
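
For completeness, querying from Python goes through the same ElasticSearch object; a quick sketch, assuming pyelasticsearch's search method with a Lucene query string (field and index names as above):

# find products whose name field contains "widget"
results = ES.search('name:widget', index='product-index')
for hit in results['hits']['hits']:
    print hit['_source']['name']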

Now let's use Elasticsearch as a datastore for an application.

But why should we do this? Let's assume we have an application with a member and a non-member area. Members can do stuff on a database and non-members cannot. Since you want to keep the database load from users who do not add anything to your service to a minimum, to provide a snappy experience for your members, you don't want them to clog the connection with database requests, and you decide to let Elasticsearch handle that.
And anyway, it’s just for fun 🙂

So the idea is to make an ajax call to Elasticsearch and show the user a list of the last ten products added to the index. In one of your views for non-members you put a JavaScript call like this:

$.getJSON('http://localhost:9200/product-index/_search?sort=added&order=asc&from=0&size=10', function(response){....})

and in the callback you can now start to play around with the fields, like

$.each(response.hits.hits, function(i, item){
     item._source.name
     ...
});

and present them to the users.

Custom authentication in Django

After fiddling with Django's auth app for a while I decided to rather roll my own (I know, why would one do this? Answer: to learn).
It consists of several steps:

  1. registration
  2. activation
  3. adding a password
  4. login

First I created an app for user-management

 $ python manage.py startapp user_management

This gave me the structure to work with.
First I created the user model:

 from django.db import models    
 import bcrypt    

 class User(models.Model):

    email = models.CharField(max_length=100, unique=True)
    firstname = models.CharField(max_length=30)
    lastname = models.CharField(max_length=30)
    password = models.CharField(max_length=128)
    last_login = models.DateTimeField(auto_now=True)
    registered_at = models.DateTimeField(auto_now_add=True)
    core_member = models.BooleanField()
    activation_key = models.CharField(max_length=50, null=True)    

The idea here was to use the email as username and to have it unique. I don't consider usernames a good choice for logins but rather a feature for profiles, but that depends on one's taste, I think.

The registration view is pretty straightforward: I create a RegistrationForm object with fields for email, first and last name.
The activation_key is simply a string of randomly chosen ASCII characters and digits.
Activation itself is just creating a link, sending it, and comparing the random part of the link with the stored string. If they match, the account is marked active and the user can set his/her password. For passwords I store bcrypt hashes in the database (NEVER store plaintext passwords in a database!). This is quite simple to do with the bcrypt module.
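
Generating such a key is a one-liner with the standard library. A sketch (for anything security-critical, prefer random.SystemRandom over the default generator):

import random
import string

def make_activation_key(length=50):
    # 50 random ASCII letters and digits, matching max_length above
    chars = string.ascii_letters + string.digits
    return ''.join(random.choice(chars) for _ in range(length))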

The function for setting the password goes into the model. For this to work I use a classmethod. As the name suggests, this is a method bound to the class, not to an instance of it, which allows getting objects via cls.objects.get(), the classmethod's equivalent of self.something in instance methods.

@classmethod
def set_password(cls, user_id, plain_pass):
    # hash the plaintext password with a freshly generated salt
    secret = bcrypt.hashpw(plain_pass, bcrypt.gensalt())
    user = cls.objects.get(pk=user_id)
    user.password = secret
    user.save()
    return True
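
Setting a password is then a single call, e.g. from the view that handles the password form (the field name is an assumption):

User.set_password(user_id, request.POST['password'])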

The login process itself is done via another classmethod which I named authenticate:

@classmethod
def authenticate(cls, email, password, request):
    user = cls.objects.get(email__exact=email)
    if bcrypt.hashpw(password, user.password) == user.password:
        request.session['user_id'] = user.id
        user.save() # this is to get last_login updated
        return user
    else:
        return None

(In order for this to work you have to enable the session middleware and the session app in settings.py.)

So, a quick rundown.

Since I use the email as a unique identifier for the login, the function expects an email address (used to find the person to authenticate), the plaintext password (e.g. as given from an input field), and the request object to make use of a session. (I use database session handling for development, but there are alternatives described in the Django docs.)

The bcrypt check works because hashpw, given the plaintext password and the stored hash (which contains its own salt), reproduces the stored hash exactly when the password is correct.

After having checked that the user has given the right credentials, I store the user_id in the session, which allows me to get the full set of user information should I need it.

I save the user to trigger the auto_now option of the user model, which updates the last_login field to the current time.

Now with

User.authenticate(email, password, request) 

the user is logged in.
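
A minimal login view built on top of this might look like the following sketch (template and field names are assumptions, and a real view should also catch User.DoesNotExist for unknown emails):

from django.http import HttpResponseRedirect
from django.shortcuts import render
from user_management.models import User

def login(request):
    if request.method == 'POST':
        user = User.authenticate(request.POST['email'],
                                 request.POST['password'], request)
        if user is not None:
            # the session now holds user_id, see authenticate above
            return HttpResponseRedirect('/')
    return render(request, 'login.html')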