Saturday, March 6, 2010

Thoughts on how we can focus ourselves, without biasing the data we are collecting

Another online discussion I had in my first active days on the project was with Diana Matthiesen. She had a lot of useful points for us, and one thing she emphasized in several posts on the Rootsweb forum was the need to keep our data and our project focused and useful.

Many people running larger projects advised that we should make joining something that requires permission first, so that we can keep an eye on whether the data meets our criteria. We plan to do that now. But Diana went further and suggested that we should consider splitting the project up.

She has a point. This project is not only amongst the largest of its type, it is also one of the older ones. When it started there were not many options for people interested in being involved in something like this. These days they can join all kinds of projects which might be more useful for them. Remember we are not a surname project and not a haplogroup project. Many people have joined because of interest in these things.

The project grew quickly, and was split, but things got messy and it was re-merged. Membership criteria were understandably allowed to be loose, but I do not think the intention was to reach a situation where our database is so enormous that it is difficult to handle, and yet more than half of it is made up of people with uncertain links back to the British Isles, including a large number of lineages whose members think they do not have a connection.

But could we really split?

My first online response was a bit negative, but only because Diana's suggestion was to split up by haplogroups. I see two problems with this:-

1. Focused haplogroup studies already exist, and not just here and there, but in a big way.

2. Haplogroup projects are not all equally well publicized and well supported, which in practice leads to them having very different levels of attendance. We want to know which haplotypes were most common compared to each other, and we will never be able to learn this by, for example, putting together a composite of data from all the different haplogroup projects.

There are, however, ways we could split without losing track of our aims:

1. The most obvious, which is now quite likely, is to split into Y DNA and mitochondrial DNA.

2. Another thing we could consider is splitting into a project or projects for people with pedigrees back to a county, and one for people without. (This might raise the question of what the un-pedigreed project is for; or I would say, it would make the question more clear to everyone, because it is already a question within part of the current project. More about that below.)

3. In the longer run, it is possible to consider splitting into regions or counties or any other smaller units. Of course there are already projects for some regions, and some of these have pedigree conditions like we do, but we already intend to synchronize our work with them as much as possible.

But, I hear people asking, wouldn't my point about introducing bias then apply to the counties, just like it does to haplogroups, with some counties attracting more attendance?

Sure, but this is already a big problem, sort of. Keep in mind that the bulk of all volunteer genetic genealogists are from North America, Australasia and so on, and these regions did not get the same amount of immigration from every part of the British Isles to begin with.

The good news is that we can solve this bias problem, because there are decent estimates for the population of Britain and Ireland going back in time, broken down by county and even by parish. For example, even though the censuses genealogists use for Britain start in 1841, simpler counting censuses started in 1801. I plan to keep a running track of our "bias" compared to such data, so it is always in our mind. If we know the West Riding of Yorkshire should be 3% of the total, then even if for us it is 5%, we know the size of the discrepancy and have the situation under reasonable control.
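To make the idea of tracking our "bias" a little more concrete, here is a minimal sketch in Python. It is only an illustration, not something the project actually runs: the function name, the member counts and the "Everywhere else" bucket are all made up for the example, and only the 3% versus 5% figures for the West Riding come from the paragraph above. For each county it compares our share of members with the share the historical population figures say it should have.

```python
def representation_report(sample_counts, expected_share):
    """Compare each county's share of the project with its expected share.

    sample_counts:  {county: number of project members with a pedigree there}
    expected_share: {county: fraction of the historical population (0..1)}

    Returns {county: (observed_share, ratio, weight)}.
    ratio  > 1 means the county is over-represented in the project.
    weight = expected / observed is the factor that would scale the county
    back to its expected share (None if we have nobody from that county).
    """
    total = sum(sample_counts.values())
    report = {}
    for county, expected in expected_share.items():
        observed = sample_counts.get(county, 0) / total
        ratio = observed / expected
        weight = expected / observed if observed > 0 else None
        report[county] = (observed, ratio, weight)
    return report


# Hypothetical numbers: 1000 members in total, 50 of them (5%) from the
# West Riding, which should be about 3% of the total.
sample_counts = {"West Riding of Yorkshire": 50, "Everywhere else": 950}
expected_share = {"West Riding of Yorkshire": 0.03, "Everywhere else": 0.97}

report = representation_report(sample_counts, expected_share)
for county, (observed, ratio, weight) in report.items():
    print(f"{county}: observed {observed:.1%}, ratio {ratio:.2f}, weight {weight:.2f}")
```

With these made-up numbers the West Riding comes out about 1.67 times over-represented, so each West Riding result would count for 0.6 of a result when comparing haplotype frequencies across counties. This kind of simple re-weighting is just one possible way of using the census figures.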

There is another question of course, and I am receiving mails about it. What are we going to do about all the people who have joined over the years, but have no pedigree back to the British Isles? This is not a super urgent problem, but it is bigger than you might think, because the sheer number of them means that all other types of work are put off or made difficult. It is also an ongoing problem. I am receiving questions all the time from people who want to join and are no longer clear what our criteria are.

I think our biggest concern is really just to make sure there is no misunderstanding. I think that in genetic genealogy participation comes in many types. We are a community in a sense, and in terms of learning from this project or helping it, no pedigree is necessary. But putting your data in the database without the required county information does not help achieve any of our aims.

Steve Bird raised the concern in a nice clear way: he is quite confident of his Y-DNA's British Isles background. He just has problems pinning down the exact Birds and the exact counties. For example he has close matches who can trace back, but they are not Birds. He is interested as a researcher himself to make sure his genetic lineage is represented.

I suggested that in cases like this, where you want to make sure your DNA is represented so to speak, the obvious thing to do is to encourage your closest matches to join. After all we do not even really look at the surname. We are not a surname project. For Steve at least, this solution made sense.

...but in any case it is clear that right now we have a situation where, for historical reasons, the aims and joining criteria have become fuzzy and need to be clarified. We'll be clearing things up step by step. We'll post updates here, and there might also be a few mail-outs for any key changes.


  1. I strongly agree with the objective of limiting this project to those with established pedigrees to the British Isles, however, that begs a familiar question: how do you ensure that the pedigree is real and not suspected or based on family rumor?

    When it comes to specifying counties, that may be possible with the male lines, but it is often the case that female surnames are not to be found on old documents, let alone their county of birth or residence.

  2. Good question about quality control. I think at the moment we are taking a simple attitude. Academics often assume people know where their grandparents were born. We assume genealogists can go a few further generations.

    Some will be wrong, but the hope is that the errors won't have a specific bias. A bias is imaginable however. With St Patrick's Day in town in the US, it is a good reminder of how us colonials sometimes WANT to be from a specific place!

  3. Hi,

    My mtHaplogroup/subgroup has been established via FTDNA Lab to be U5b2; PhyloTree is U5b2b2. I have never been able to construct my tree to go back far enough to point to a locality in the British Isles, although I'm quite certain that it came from there in the early 1700's. After guessing this place and that, my latest hypothesis is Scotland. That is a first, although I earlier thought it was Ulster. A Scottish origin for my haplotype could have come via Ulster. On the other hand, the few low level matches I have claim England. If the father of my earliest known female ancestress was Robertson (current speculation; no proof), then that points to Perthshire. And back then, people tended to marry with their own kind. If that was so, then my haplogroup/-type may be aboriginal Pictish. It may also have died out in Scotland and only survives in the USA. As for my own little twig, it is a dead end.