Parsing a Newick Tree Using Perl Regex

yo-dawg-regex

A newick tree is a way of representing a tree, with branch lengths and internal node names, as a single string. An example is:

(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);

Which would correspond to the tree below (notice how the braces show the groupings of nodes and the colons act as delimiters for showing branch length):

 

These come up ALOT in computational biology, as a tree is often how we think of the evolutionary history of the natural world (think like your family tree, but across all organisms).

So, one day I decided to write a regex that parses such a newick string. If you’re new to regular expressions (regexs), then I wrote an introductory guide to them hosted here.

Anyway, parsing a newick string appears to be something alot of people on message boards etc. want to do as well, but very few of them receive satisfactory answers. So I had a go at it myself. The result being the two regexs below. Note that I wrote these with the intention of parsing a rooted tree, but $CladeREGEX should work on an unrooted tree as well:

 

##Construct regexes that parse the newick format - these took ages to make!!
our $CladeREGEX;
#Have to predeclare else variable wont be in scope (it calls itself)
#Example string: ^A:0.1,B:0.2,(C:0.3,D:0.4):0.5$
$CladeREGEX = qr{
(?:
  (
     (?: \w+ (?: :\d*\.?\d*(?:[eE][-+][0-9]+)?)? )
     #Match simple clades of one item like: A:0.1
  |
     (:? \( (??{$CladeREGEX}) \) \w* (?: :\d*\.?\d*(?:[eE][-+][0-9]+)?)? )
     #Initialise $CladeREGEX again so as to parse more complex clades, e.g (C:0.3,D:0.4):0.5
  )
   (?: , (??{$CladeREGEX}) )?
)
}x;

 

And ….

 

our $SubcladeRegex;
$SubcladeRegex= qr{
#Example string: ^A:0.1,B:0.2,(C:0.3,D:0.4):0.5$

    \( ((??{$CladeREGEX})) \) (\w*) (?: : (\d*\.?\d* (?:[eE][-+][0-9]+)?  ))?
#Initialise $CladeREGEX again so as to parse more complex clades, e.g (C:0.3,D:0.4):0.5
}x;

 

$CladeREGEX actually does the vast majority of the work here. It takes a newick string and will recursively match to it, working down through nesting levels for as long as necessary and matching any number of items in a clade (separated by commas). $SubcladeRegex is just a way of extracting the node name (if labelled), descendent clade and branch length (if it exists) of the top level of a clade. These all get thrown into the temporary variables $1, $2 and $3. I used it in my script as follows (note that the first set of braces and the semi-colon at the end of the full newick string are stripped off elsewhere in the script). This was then called recursively on each string $DescendentString until there was only a leaf left as the descendent (tested for with another if statement later).

 

if($NewickString =~ m/^$SubcladeRegex$/){

# Node name optional, but a branch length is specified,
#example: (:0.1,:0.2,(:0.3,:0.4):0.5):0.0
#or (A:0.1,B:0.2,(C:0.3,D:0.4):0.5)ROOT:0.0
# OR node name given, but a branch length is not specified,
#example: (A,B,(C,D)) or (A,B,(C,D)E)F (i.e only leaf nodes #labelled and all nodes labelled)

my ($DescendentString,$NodeName,$Branchlength) = ($1,$2,$3);
...

 

The full script that utilises this can be found on my github page in the repository perl-libs-custom/Supfam and the module TreeFuncsNonBP. The function is Newick2Node (I create a hash of all the tree information from the newick string).

I’ve used a whole load of more advanced Perl regex concepts here, such as pre-compiled regexs (qr{}), nested regexs (using  ??{} ) and non-caputing groups ( (:? ) ). Rather than replicate some of the excellent tutorials on the matter, I shall simply link through to one. If you don’t then find the pimp-my-ride featured image amusing, then you are not a massive nerd like me. Well done.

Note that these REGEXs will only work in Perl and will not work in Perl Compatible REGREX Engines in other languages, as there are a whole load of things unique to perl (such as precompiled regexs). Perl gets alot of bad press, but it really doesn’t deserve it.
Citations to this article:

One Response

  1. Adam says:

    If anyone finds this post useful and would like me to put in some code comments and a more detailed explanation of how it works, I’d be only too happy … just say!

Leave a Reply to Adam Cancel reply

*