Projects available
Research Experience Placements
A BBSRC Research Experience Placement is available for 2012. Eligible students must be registered on an undergraduate degree programme at a UK university and must not be in their first or final year. Applicants must also be able to demonstrate an outstanding academic record and be strongly motivated to gain first-hand experience in research. Applicants should contact Julian Gough directly. For more information on the scheme, see the BBSRC website.
Bioinformatics PhD projects
All PhD projects are subject to competitive application, and it is expected that only outstanding students will be accepted. There is flexibility in the projects listed below, especially with regard to background and training (e.g. computer science, biology or other natural/physical sciences); any good project proposal will be considered.
- Protein and genome evolution
- Human and mouse interaction networks
- Data-mining for experimental targets
- Functional annotation of SCOP
Available MSc and final year undergraduate projects
Machine Learning 1
This project is to use the power of the statistics of observed protein domain architectures to predict other protein domains, by providing a context that can be used to adjust the probabilities from a score function. If successful, the methodology (and possibly the code, if it is good) will be deployed on the highly accessed SUPERFAMILY website http://supfam.org.
The project could include more or less statistics/probability depending on the student's level of interest, but will definitely require devising an algorithm, working with some existing Perl code and writing more. There will be a data analysis aspect and possibly the design and creation of a suitable benchmark. This project has a very clearly defined starting point but offers depth to be explored by the more enthusiastic student.
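One way to picture "context adjusting the probabilities from a score function" is a simple Bayesian update. The sketch below is only an illustration under assumptions, not the method used on the SUPERFAMILY website: the domain labels, the co-occurrence prior, the pseudocount and all function names are hypothetical.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_counts(architectures):
    """Count how often each ordered pair of domains shares an architecture."""
    counts = Counter()
    for arch in architectures:
        for a, b in permutations(set(arch), 2):
            counts[(a, b)] += 1
    return counts


def context_prior(candidates, context_domain, counts, pseudocount=1.0):
    """Estimate P(candidate | context domain), smoothed with a pseudocount."""
    raw = {c: counts[(context_domain, c)] + pseudocount for c in candidates}
    total = sum(raw.values())
    return {c: v / total for c, v in raw.items()}


def adjust(score_probs, prior):
    """Multiply score-derived probabilities by the context prior and renormalise."""
    joint = {c: score_probs[c] * prior[c] for c in score_probs}
    total = sum(joint.values())
    return {c: v / total for c, v in joint.items()}


# Toy observed architectures (hypothetical domain labels).
architectures = [("A", "B"), ("A", "B"), ("A", "B", "C"), ("C", "D")]
counts = cooccurrence_counts(architectures)

# The score function alone slightly prefers D for the unassigned region...
score_probs = {"B": 0.4, "D": 0.6}
# ...but the protein already contains domain A, which often co-occurs with B.
prior = context_prior(["B", "D"], "A", counts)
posterior = adjust(score_probs, prior)
```

In this toy case the context reverses the ranking: B becomes more probable than D once the known neighbour A is taken into account. A real project would need a more careful prior (for example one that respects domain order) and a benchmark to check that the adjustment actually helps.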
Medical Devices
Over the coming decade there is going to be a revolution that will forever change society: personal medicine. This will come in two principal forms. The first is that all of you will have your genome sequenced in your lifetime. The second is electronic devices that automatically record and sample health-related information throughout your life. This project concerns medical devices that record aspects of your daily life and the computer science behind them: communication between the devices, recording of the data, secure storage of the data, access to and processing of the data, and analysis of the data.
An example of the sort of thing that I am talking about is wireless bathroom scales, e.g. this product. This project could be developed to address any of the issues listed above connected to daily medical recording devices. One suggestion would be to design and implement the Facebook equivalent of personal medicine. People would upload and share their personally recorded data (e.g. weight uploaded from Wi-Fi scales) in a secure (optionally anonymous) environment, allowing them to compare and link their health with other people. It could correlate their own vital statistics with health problems or sporting ability. The user would learn of diseases they are at risk from, lifestyle improvements they could make, sporting abilities they didn't know they had or training regimes that would improve their performance, all tailored to their personal health and lifestyle.
N.B. This project is quite open-ended and would suit a student with a lot of initiative.
Nanopublication (semantic web)
The concept of nanopublications uses ideas from the semantic web to publish the smallest possible units of publication, each comprising a subject, a predicate, an object and provenance. This project is to automatically generate large numbers of nanopublications from data and analysis that exist in the SUPERFAMILY database. No biological knowledge is necessary. This would suit somebody interested in web technologies or in the automatic generation of written material.
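As a rough illustration of the four parts named above, a nanopublication can be modelled as a small record generated from database rows. This is a minimal sketch: the example.org URIs, the `Nanopublication` class, the `from_rows` helper and the one-line output format are all hypothetical and do not follow any particular nanopublication schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nanopublication:
    subject: str
    predicate: str
    obj: str
    provenance: str

    def to_statement(self):
        """Render the assertion as one triple-like line with its provenance."""
        return (f"<{self.subject}> <{self.predicate}> <{self.obj}> . "
                f"# provenance: {self.provenance}")


def from_rows(rows, predicate, provenance):
    """Turn (subject, object) rows, e.g. from an SQL query, into nanopublications."""
    return [Nanopublication(s, predicate, o, provenance) for s, o in rows]


# Hypothetical rows standing in for the result of a database query.
rows = [
    ("http://example.org/protein/P1", "http://example.org/superfamily/S1"),
    ("http://example.org/protein/P2", "http://example.org/superfamily/S1"),
]
pubs = from_rows(
    rows,
    predicate="http://example.org/hasDomainFrom",
    provenance="generated from a database release (hypothetical)",
)
statements = [p.to_statement() for p in pubs]
```

Generating "large numbers" then amounts to running this over every row of a query; the interesting work is choosing the predicates and recording provenance accurately.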
Artificial Life
Computer viruses are in almost every way indistinguishable from biological viruses. If you try to write down a strict definition of life, it is debatable whether or not viruses are actually alive; they are on the border of what is living and what is not. The aim of this project is to try to create more sophisticated forms of life in a digital environment (a computer). To do this we begin by trying to come up with a definition of life that is strict enough to exclude viruses, then find a way to bring into existence within a digital environment some lifeform that satisfies the additional criteria which a computer virus falls short of.
This project should be considered experimental and is very open-ended. It would suit a student who is very intellectual and imaginative, and able to motivate themselves to tackle an abstract problem. Some very basic knowledge of evolution and DNA would be useful. In-depth knowledge of a computer operating system, preferably Linux, would also help; the project may include setting up a cluster of networked machines as the environment.
Databases 1
This project is to normalise a large and complex database. There are two sides to this: a plan must be made for reorganising the tables in the relational database, but this must be done in conjunction with the software and interface that interact with it.
This project requires good knowledge of SQL and will require modifying existing Perl code that serves CGI web pages, and also a data analysis pipeline. This project is quite restricted in that the task is well defined; however, it should be sufficiently technically challenging.
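To make the two sides of the task concrete, here is a toy normalisation using Python's built-in sqlite3 module. It is a sketch under assumptions: the table and column names are invented for illustration and are not taken from the real database. A repeated text column is moved into its own table, and a view preserves the old shape so that existing code reading the flat table could be migrated gradually.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalised starting point: superfamily details repeated on every row.
CREATE TABLE assignment_flat (
    protein TEXT, superfamily_name TEXT, superfamily_class TEXT
);
INSERT INTO assignment_flat VALUES
    ('p1', 'sf_alpha', 'class_1'),
    ('p2', 'sf_alpha', 'class_1'),
    ('p3', 'sf_beta',  'class_2');

-- Normalised form: each superfamily is stored once and referenced by id.
CREATE TABLE superfamily (
    id INTEGER PRIMARY KEY, name TEXT UNIQUE, sf_class TEXT
);
INSERT INTO superfamily (name, sf_class)
    SELECT DISTINCT superfamily_name, superfamily_class FROM assignment_flat;

CREATE TABLE assignment (
    protein TEXT, superfamily_id INTEGER REFERENCES superfamily(id)
);
INSERT INTO assignment (protein, superfamily_id)
    SELECT f.protein, s.id
    FROM assignment_flat f JOIN superfamily s ON s.name = f.superfamily_name;

-- Compatibility view with the old column layout for existing queries.
CREATE VIEW assignment_flat_view AS
    SELECT a.protein, s.name AS superfamily_name, s.sf_class AS superfamily_class
    FROM assignment a JOIN superfamily s ON s.id = a.superfamily_id;
""")

n_superfamilies = conn.execute("SELECT COUNT(*) FROM superfamily").fetchone()[0]
old_rows = sorted(conn.execute("SELECT * FROM assignment_flat").fetchall())
new_rows = sorted(conn.execute("SELECT * FROM assignment_flat_view").fetchall())
```

Checking that the view returns the same rows as the original table is one cheap way to confirm that a reorganisation has not changed what the surrounding software sees.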
Machine Learning 2
This project is to write a multi-class classifier for protein domain sub-families. The aim is to improve on the existing implementation described in [Gough 2006], whilst maintaining the efficiency of the software.
Hidden Markov models are used to assign domains to protein amino acid sequences (not part of this project). The HMMs detect the domain and classify it at the superfamily level; however, many superfamilies are sub-divided into families. Given the superfamily, we would like to classify the domain into one of the known sub-families within the superfamily. As input, the classifier will be given various scores between the unknown domain sequence, which is our query, and individual members of all the sub-families in a given superfamily. The task of the classifier to be written for this project is to combine the information in the scores in such a way as to optimise the classification on real data in a leave-one-out test.
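The leave-one-out setup described above can be sketched as follows. This is only a toy baseline under assumptions, not the implementation in [Gough 2006]: the member names, the scores and the rule for combining scores (best or mean score per sub-family) are hypothetical stand-ins for whatever the project would actually develop.

```python
from collections import defaultdict


def mean(values):
    return sum(values) / len(values)


def classify(query_scores, labels, combine=max):
    """Pick the sub-family whose members' scores combine to the highest value.

    query_scores maps member id -> score against the query;
    labels maps member id -> sub-family.
    """
    by_family = defaultdict(list)
    for member, score in query_scores.items():
        by_family[labels[member]].append(score)
    return max(by_family, key=lambda fam: combine(by_family[fam]))


def leave_one_out_accuracy(pairwise, labels, combine=max):
    """Hold out each member in turn and classify it against the rest."""
    correct = 0
    for held_out in labels:
        query_scores = {m: pairwise[held_out][m] for m in labels if m != held_out}
        if classify(query_scores, labels, combine) == labels[held_out]:
            correct += 1
    return correct / len(labels)


# Toy superfamily with two sub-families (hypothetical members and scores).
labels = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
raw = {("a1", "a2"): 9, ("b1", "b2"): 8,
       ("a1", "b1"): 2, ("a1", "b2"): 3, ("a2", "b1"): 1, ("a2", "b2"): 2}
pairwise = {m: {} for m in labels}
for (x, y), score in raw.items():
    pairwise[x][y] = score
    pairwise[y][x] = score  # scores are treated as symmetric here

acc_max = leave_one_out_accuracy(pairwise, labels, combine=max)
acc_mean = leave_one_out_accuracy(pairwise, labels, combine=mean)
```

On this well-separated toy data both combination rules classify every held-out member correctly. Note that a sub-family with a single member can never be recovered when that member is held out, which a real benchmark would have to account for.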
Human genetic variation
This project is to analyse early data coming out of the 1000 Genomes Project. Within our lifetime most people from rich countries will have their genome sequenced for medical reasons. Genome sequencing is advancing faster than Moore's law for computers, and we will soon have over 1000 human individuals sequenced.
This project is very much a bioinformatics research project. It is very investigative and challenging, and would suit a student who is interested in working in what is possibly one of the most important emerging areas of research this coming decade.
Cloud computing
This project is to set up and run hidden Markov model (HMM) software on a cloud computing system such as Amazon. This will involve learning to use the HMMER3 software, getting it to work on a cloud, and setting up the SUPERFAMILY domain analysis pipeline. If this is successful, it could lead to benchmarking and optimisation of HMMER3 and JackHMMER parameters.
Wikipedia page generation
This project is to generate content for a few thousand Wikipedia (and/or ProtoPedia) pages about 3D protein structure families using data from this research group. The project will require working with SQL databases to extract data and parse it into a format that can be used to generate pages. No knowledge of proteins/biology is required, but an interest in it would make the project more enjoyable.
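The extract-and-generate step might look something like the sketch below, which uses Python's built-in sqlite3 module with an in-memory database. Everything specific here is an assumption made for illustration: the `family` table, its columns, the placeholder rows and the page template are hypothetical and do not reflect the group's actual schema or data.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database standing in for the real one."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE family (id INTEGER PRIMARY KEY, name TEXT, n_structures INTEGER)"
    )
    conn.executemany(
        "INSERT INTO family (name, n_structures) VALUES (?, ?)",
        [("Example family A", 12), ("Example family B", 3)],  # placeholder rows
    )
    return conn


def render_page(name, n_structures):
    """Fill a simple wikitext template (bold title, one section heading)."""
    return (
        f"'''{name}''' is a family of protein domains with "
        f"{n_structures} known 3D structures.\n\n"
        "== Structure ==\n"
    )


conn = build_demo_db()
pages = {
    name: render_page(name, n)
    for name, n in conn.execute("SELECT name, n_structures FROM family ORDER BY name")
}
```

Each value in `pages` is a block of wikitext keyed by page title. Scaling this to a few thousand pages is mostly a matter of writing richer queries and templates, and of reviewing the generated text before it is uploaded anywhere.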
There are several options to extend the project, such as web/text-mining to find scientists who are believed to be expert on individual pages, so that they can be emailed and invited to contribute. Another option is to develop a system for vandalism detection; this has been a problem in a similar previous project to put Rfam data into Wikipedia.
3D protein structure prediction
LiveBench and EVA are web tools both for assessing the performance of different protein structure prediction servers and for providing a collated 'meta'-service. The SUPERFAMILY database is one such structure prediction server. This project is to set up a dynamic web service based on SUPERFAMILY predictions to produce 3D models of proteins, connected to the two tools EVA and LiveBench. This project requires writing scripts that will interact with and control multiple tools and a database, and also communicate automatically with external servers. It may be necessary to produce output as 3D coordinates, which may involve interacting with third-party modelling software.
This project would suit a student who is flexible and comfortable working with several different technologies at once.