A newick tree is a way of representing a tree, with branch lengths and internal node names, as a single string. An example is:
(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);
Which would correspond to the tree below (notice how the braces show the groupings of nodes and the colons act as delimiters for showing branch length):
These come up ALOT in computational biology, as a tree is often how we think of the evolutionary history of the natural world (think like your family tree, but across all organisms).
So, one day I decided to write a regex that parses such a newick string. If you’re new to regular expressions (regexs), then I wrote an introductory guide to them hosted here.
Anyway, parsing a newick string appears to be something alot of people on message boards etc. want to do as well, but very few of them receive satisfactory answers. So I had a go at it myself. The result being the two regexs below. Note that I wrote these with the intention of parsing a rooted tree, but $CladeREGEX should work on an unrooted tree as well:
##Construct regexes that parse the newick format - these took ages to make!! our $CladeREGEX; #Have to predeclare else variable wont be in scope (it calls itself) #Example string: ^A:0.1,B:0.2,(C:0.3,D:0.4):0.5$ $CladeREGEX = qr{ (?: ( (?: \w+ (?: :\d*\.?\d*(?:[eE][-+][0-9]+)?)? ) #Match simple clades of one item like: A:0.1 | (:? \( (??{$CladeREGEX}) \) \w* (?: :\d*\.?\d*(?:[eE][-+][0-9]+)?)? ) #Initialise $CladeREGEX again so as to parse more complex clades, e.g (C:0.3,D:0.4):0.5 ) (?: , (??{$CladeREGEX}) )? ) }x;
And ….
our $SubcladeRegex; $SubcladeRegex= qr{ #Example string: ^A:0.1,B:0.2,(C:0.3,D:0.4):0.5$ \( ((??{$CladeREGEX})) \) (\w*) (?: : (\d*\.?\d* (?:[eE][-+][0-9]+)? ))? #Initialise $CladeREGEX again so as to parse more complex clades, e.g (C:0.3,D:0.4):0.5 }x;
$CladeREGEX actually does the vast majority of the work here. It takes a newick string and will recursively match to it, working down through nesting levels for as long as necessary and matching any number of items in a clade (separated by commas). $SubcladeRegex is just a way of extracting the node name (if labelled), descendent clade and branch length (if it exists) of the top level of a clade. These all get thrown into the temporary variables $1, $2 and $3. I used it in my script as follows (note that the first set of braces and the semi-colon at the end of the full newick string are stripped off elsewhere in the script). This was then called recursively on each string $DescendentString until there was only a leaf left as the descendent (tested for with another if statement later).
if($NewickString =~ m/^$SubcladeRegex$/){ # Node name optional, but a branch length is specified, #example: (:0.1,:0.2,(:0.3,:0.4):0.5):0.0 #or (A:0.1,B:0.2,(C:0.3,D:0.4):0.5)ROOT:0.0 # OR node name given, but a branch length is not specified, #example: (A,B,(C,D)) or (A,B,(C,D)E)F (i.e only leaf nodes #labelled and all nodes labelled) my ($DescendentString,$NodeName,$Branchlength) = ($1,$2,$3); ...
The full script that utilises this can be found on my github page in the repository perl-libs-custom/Supfam and the module TreeFuncsNonBP. The function is Newick2Node (I create a hash of all the tree information from the newick string).
I’ve used a whole load of more advanced Perl regex concepts here, such as pre-compiled regexs (qr{}), nested regexs (using ??{} ) and non-caputing groups ( (:? ) ). Rather than replicate some of the excellent tutorials on the matter, I shall simply link through to one. If you don’t then find the pimp-my-ride featured image amusing, then you are not a massive nerd like me. Well done.
Note that these REGEXs will only work in Perl and will not work in Perl Compatible REGREX Engines in other languages, as there are a whole load of things unique to perl (such as precompiled regexs). Perl gets alot of bad press, but it really doesn’t deserve it.
Citations to this article:
- Brian D Foy, Advanced Regular Expressions, Mastering Perl (O’Reilley Media)
- Wikimedia, Newick format Article, Wikipedia
If anyone finds this post useful and would like me to put in some code comments and a more detailed explanation of how it works, I’d be only too happy … just say!