# Parsing a Newick Tree Using Perl Regex

A newick tree is a way of representing a tree, with branch lengths and internal node names, as a single string. An example is:

(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);

Which would correspond to the tree below (notice how the braces show the groupings of nodes and the colons act as delimiters for showing branch length):

These come up ALOT in computational biology, as a tree is often how we think of the evolutionary history of the natural world (think like your family tree, but across all organisms).

So, one day I decided to write a regex that parses such a newick string. If you’re new to regular expressions (regexs), then I wrote an introductory guide to them hosted here.

Anyway, parsing a newick string appears to be something alot of people on message boards etc. want to do as well, but very few of them receive satisfactory answers. So I had a go at it myself. The result being the two regexs below. Note that I wrote these with the intention of parsing a rooted tree, but $CladeREGEX should work on an unrooted tree as well: ##Construct regexes that parse the newick format - these took ages to make!! our$CladeREGEX;
#Have to predeclare else variable wont be in scope (it calls itself)
#Example string: ^A:0.1,B:0.2,(C:0.3,D:0.4):0.5CladeREGEX = qr{
(?:
(
(?: \w+ (?: :\d*\.?\d*(?:[eE][-+][0-9]+)?)? )
#Match simple clades of one item like: A:0.1
|
(:? $$(??{CladeREGEX})$$ \w* (?: :\d*\.?\d*(?:[eE][-+][0-9]+)?)? )
#Initialise $CladeREGEX again so as to parse more complex clades, e.g (C:0.3,D:0.4):0.5 ) (?: , (??{$CladeREGEX}) )?
)
}x;

And ….

our $SubcladeRegex;$SubcladeRegex= qr{
#Example string: ^A:0.1,B:0.2,(C:0.3,D:0.4):0.5$$$((??{CladeREGEX}))$$ (\w*) (?: : (\d*\.?\d* (?:[eE][-+][0-9]+)? ))? #Initialise$CladeREGEX again so as to parse more complex clades, e.g (C:0.3,D:0.4):0.5
}x;

$CladeREGEX actually does the vast majority of the work here. It takes a newick string and will recursively match to it, working down through nesting levels for as long as necessary and matching any number of items in a clade (separated by commas).$SubcladeRegex is just a way of extracting the node name (if labelled), descendent clade and branch length (if it exists) of the top level of a clade. These all get thrown into the temporary variables $1,$2 and $3. I used it in my script as follows (note that the first set of braces and the semi-colon at the end of the full newick string are stripped off elsewhere in the script). This was then called recursively on each string$DescendentString until there was only a leaf left as the descendent (tested for with another if statement later).

if($NewickString =~ m/^$SubcladeRegex$/){ # Node name optional, but a branch length is specified, #example: (:0.1,:0.2,(:0.3,:0.4):0.5):0.0 #or (A:0.1,B:0.2,(C:0.3,D:0.4):0.5)ROOT:0.0 # OR node name given, but a branch length is not specified, #example: (A,B,(C,D)) or (A,B,(C,D)E)F (i.e only leaf nodes #labelled and all nodes labelled) my ($DescendentString,$NodeName,$Branchlength) = ($1,$2,\$3);
...

The full script that utilises this can be found on my github page in the repository perl-libs-custom/Supfam and the module TreeFuncsNonBP. The function is Newick2Node (I create a hash of all the tree information from the newick string).

I’ve used a whole load of more advanced Perl regex concepts here, such as pre-compiled regexs (qr{}), nested regexs (using  ??{} ) and non-caputing groups ( (:? ) ). Rather than replicate some of the excellent tutorials on the matter, I shall simply link through to one. If you don’t then find the pimp-my-ride featured image amusing, then you are not a massive nerd like me. Well done.

Note that these REGEXs will only work in Perl and will not work in Perl Compatible REGREX Engines in other languages, as there are a whole load of things unique to perl (such as precompiled regexs). Perl gets alot of bad press, but it really doesn’t deserve it.