Saturday, July 9, 2011

Let's consider the regions

In the last post I tried to make a start on ordering our data and representing it graphically. This helps define some jobs we have.

First, we need to define regions which give us samples big enough to be statistically meaningful. Second, we need to define haplogroups in the smallest possible clades that are still clear and meaningful. In this post I will just discuss the first challenge.

The obvious aim of the project is to eventually have a big data set for every traditional county in Britain and Ireland. But at this time, at least just using our own data, for some counties we have very little.

Therefore, I have adapted some fairly standard ways of defining regions, joining together counties until our data is rich enough. Unfortunately, for now this means Wales is one region, and Ireland and Scotland are defined reasonably broadly. It will be interesting, of course, to compare our results in these regions to those of the Wales DNA Project, the Ireland Y DNA project, and the Scottish DNA project. Eventually maybe we can work with them and other projects to develop bigger and better descriptions of the genetic diversity of Britain and Ireland.

I've used the so-called "NUTS 3" regions of the Republic of Ireland,

...but the only one I did not merge with several others was the Border Region, which surrounds modern Northern Ireland, which is of course part of the United Kingdom. Also, I re-united all of Tipperary, which today is split into North and South parts. Both parts are in my "Western Ireland".

For Scotland, the best data we have is for the counties south of the narrow part between the Firth of Forth (near Edinburgh) and the mouth of the River Clyde (near Glasgow), the area which includes the bulk of the modern population. I was able to split this into two neat and traditionally meaningful parts. On the east is the area between Edinburgh and England which was settled early by Northumbrian Anglo-Saxons. In modern parlance, this is called the region of Lothian and the Borders. On the west, from Glasgow down to England, was a Welsh-speaking kingdom. In modern terms this region is called Strathclyde (the northern part) and "Dumfries and Galloway".

For some modern Scottish region names see Wikipedia.

For the northern part of Scotland I could only get big data sets by uniting most of the highlands into one region. Perthshire, and the "lowland" shires to its south and east, I have been able to separate out.

For England I've used modern definitions of regions,
...except that I have taken the opportunity, given our data, of inventing one new region which I call "East and North of London". This includes the Thames Valley counties of the South East England Region, and the inland part of the East of England region. This neatly allows us to split out the coastal region of East Anglia, which is of course interesting for anyone interested in trying to find signs of ancient movements of people.

There is also a good East Anglian DNA project we can compare to. I invite help and comments concerning how this project's data compares to those of other projects.

For now I have not yet attempted to develop anything with the various remote islands. We have some data but not much yet. (The Isle of Wight is however part of the region south of London. It is very close to the mainland.)

So here is a map:-

Sunday, July 3, 2011

A first quick effort

Let's see if this works! Please click on the picture below, and start those comments rolling in. What should we do to improve this graph, and what does it mean?

Note: the haplogroup assignments are by FT DNA and include their predictions.

The data here is just our project data, but:-

  1. I combined STR and SNP data (two files that FT DNA's controls create for admins) into one sheet, lining up the data.
  2. I organized the countries and counties and invented the regions, which means I also sometimes corrected what people had down as their COUNTRY of origin, because it is often wrong. A lot of people apparently don't know which country some counties are in, or else they were hedging bets. I have assumed their COUNTY information is correct, because that is the data we always push people to double check in our project.
  3. I ran my own haplogroup prediction using Whit Athey's tool, but I have not used that information much yet.
  4. I removed everyone without a pedigree to a county. Maybe that was the most important step!
  5. I created a frequency table using pivot table functions, which I have e-mailed already to both of you, and a graphic representation of that frequency table, which now appears on the blog. (Two work sheets in this spreadsheet.)
  6. I created a short version of the haplogroup names so that they all line up and look the same, not depending on the SNPs tested.
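For anyone who would rather script these steps than do them in a spreadsheet, here is a minimal sketch in Python. It is not how the table on this blog was made (that was done with pivot tables). The kit numbers, field names, region lookup and haplogroup labels are all invented for illustration, and shortening a haplogroup name by cutting it to its first few characters is only an assumption about one way to make names line up.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows keyed by kit number (not real project data).
str_rows = {
    "K1": {"county": "Kent", "predicted": "R1b1a2"},
    "K2": {"county": "Cork", "predicted": "I2a"},
    "K3": {"county": "", "predicted": "G2a"},  # no pedigree to a county
    "K4": {"county": "Kent", "predicted": "I1"},
}
snp_rows = {
    "K1": {"confirmed": "R1b1a2a1a1b4"},
    "K2": {"confirmed": "I2a2"},
}

# Hypothetical county -> region lookup; the real one covers every county.
REGION_OF = {"Kent": "South of London", "Cork": "Southern Ireland"}


def short_name(haplogroup, length=3):
    """Cut a haplogroup label to a fixed length so deep and shallow
    SNP results fall into the same column (an assumed convention)."""
    return haplogroup[:length]


def merge_rows(str_rows, snp_rows):
    """Line up STR and SNP data by kit, preferring a tested haplogroup
    over a predicted one."""
    merged = []
    for kit, row in str_rows.items():
        snp = snp_rows.get(kit, {})
        haplogroup = snp.get("confirmed") or row["predicted"]
        merged.append({"kit": kit, "county": row["county"],
                       "haplogroup": haplogroup})
    return merged


def frequency_table(rows):
    """Return {region: {short haplogroup: share of that region}}.
    Rows whose county has no region (including blank counties) are dropped."""
    counts = defaultdict(Counter)
    for row in rows:
        region = REGION_OF.get(row["county"])
        if region is None:
            continue
        counts[region][short_name(row["haplogroup"])] += 1
    table = {}
    for region, counter in counts.items():
        total = sum(counter.values())
        table[region] = {h: n / total for h, n in counter.items()}
    return table


table = frequency_table(merge_rows(str_rows, snp_rows))
for region, shares in table.items():
    print(region, shares)
```

The row without a county simply disappears from the table, which is the scripted equivalent of step 4.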
And here is the data:-



And here are the regions I have used, in order to get big enough data sets, of people with pedigrees back to old counties:-



Here are a few first remarks:-
  • G levels highest in Wales and in the northern part of the Republic of Ireland. Remember that people are now saying this is a Neolithic farmer (pre Celtic) marker, based on the relatively large number of G men found in old archaeological sites.
  • I2a in interesting patches: western Ireland, most of Scotland, NE England, and the extremity of SW England, but apparently almost absent in many areas neighbouring on these, like SE Scotland, NW England, and the counties neighbouring the extremity of the SW of England.
  • I2b almost invisible in southern Ireland and Wales, but high frequencies in southern Scotland, northern Ireland and also common in most of England.
  • I1a pretty common everywhere except in western Ireland, but if it is Anglo Saxon you would expect it to be higher in SE Scotland?
But I have to say that the prediction of I haplogroups, both by FT DNA and by Whit's predictor, can probably be improved upon. I have contacted the obvious people: Jim Cullen and Ken Nordtvedt. The I haplogroups perhaps deserve their own post in the near future.

What members should do

Remember:

Make sure that on your FT DNA personal page you have good up-dated information about the country of origin and county of origin of your male line ancestors.

I can not emphasize enough that missing information or inaccurate information in these areas makes it hard to achieve the core aims of the project.

We should actually do something

My thanks to Ken Nordtvedt for writing to ask for data from this project in a specific format he wanted. It is a useful push for action.

It has to be said that we are still in a position of having a big mess of data which is difficult to use. Ideas on how to improve this are welcome.

One big issue is that we have an enormous number of participants who have no known male line connection to the British Isles at all. Many even know they are from somewhere else. The sheer number of such members does make all jobs difficult in my opinion, although I understand that people want to have their Y DNA in the database just to feel a link to the project.

Keep in mind that there is also an even bigger number of people who are members but only believe that their ancestors were from the British Isles, without knowing which country, or perhaps they know which country, but no more. And I am sure all experienced genealogists agree with me that we can expect most of these people to be reporting what are essentially guesses. (Many family stories are just the guesses of a previous generation.)

Anyway, enough complaining. One thing we have long aimed to do is to create haplogroup frequency data in a more user friendly format. I am going to get to work and at least do some preliminary work.

To start with I've just made an Excel sheet where I've deleted all people who have not reported a clear county of origin in Britain and Ireland. That makes it much easier! Only about 1500 people!

So I would like to ask for opinions on how to divide up the populations in terms of haplogroups. Many participants have of course not been tested for any SNPs, while some have been tested for all the latest new ones.

I am supposing I'll need to run part of the data through a prediction program like Whit Athey's. Should I also ignore all people with fewer than a certain number of markers?

Of the approx 1500 people who know a county of origin in their male line, a bit more than half only have a predicted haplogroup the way they appear in the FT DNA data. About 740 have had real SNP tests.
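To make the marker-count question concrete, here is a rough sketch in Python of how the sheet could be split into three piles: people with real SNP tests, people with enough STR markers to be worth running through a predictor, and people with too few. The field names and the cut-off of 37 markers are only assumptions for illustration; where the cut-off should sit is exactly the open question.

```python
def split_by_evidence(rows, min_markers=37):
    """Sort rows into SNP-tested, predictable, and too-short groups.

    min_markers is a hypothetical threshold, not a project rule.
    """
    tested, to_predict, too_short = [], [], []
    for row in rows:
        if row["snp_tested"]:
            tested.append(row)          # a real SNP result beats any prediction
        elif row["markers"] >= min_markers:
            to_predict.append(row)      # enough STRs to try a predictor
        else:
            too_short.append(row)       # set aside for now
    return tested, to_predict, too_short


# Invented sample rows, not real project members.
sample = [
    {"kit": "K1", "markers": 67, "snp_tested": True},
    {"kit": "K2", "markers": 37, "snp_tested": False},
    {"kit": "K3", "markers": 12, "snp_tested": False},
    {"kit": "K4", "markers": 12, "snp_tested": True},
]

tested, to_predict, too_short = split_by_evidence(sample)
print(len(tested), len(to_predict), len(too_short))
```

Note that a person with a real SNP test is kept whatever their marker count, since the threshold only matters for predictions.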

Sunday, March 14, 2010

A snapshot of how Britain was divided up in 1801

A good website for trying to see how populations changed in Britain is the Histpop website. There is a lot of detail, down to parish level. Genealogists of course know about the 1841 census, and the ones every 10 years after that, but the censuses before that are very good if you just want to know numbers, not names. For Britain this started in 1801, for Ireland unfortunately in 1821, which would have been after some pretty big emigrations.

I made the following summary for Wales, Scotland and England. It will be interesting to keep track of this table, comparing it to our own numbers of participants for each county.

Saturday, March 6, 2010

Thoughts on how we can focus ourselves, without biasing the data we are collecting

Another online discussion I've had in my first active days on the project was with Diana Matthiesen. She had a lot of useful points for us, and one thing she emphasized in several posts on the Rootsweb forum was the need to try to keep our data and our project focused and useful.

Many people running larger projects advised that we should make the joining process one where permission needs to be granted first, so that we can keep an eye on the data meeting the criteria we have. We plan to do that now. But Diana went further and suggested that we should consider splitting the project up.

She has a point. This project is not only amongst the largest of its type, it is also one of the older ones. When it started there were not many options for people interested in being involved in something like this. These days they can join all kinds of projects which might be more useful for them. Remember we are not a surname project and not a haplogroup project. Many people have joined because of interest in these things.

The project grew quickly, and was split, but things got messy and it was re-merged. Membership criteria were understandably allowed to be loose, but I think it was not the intention that we would reach a situation where our database is so enormous that it is difficult to handle, and yet it is more than half made up of people with uncertain links back to the British Isles, and even a large number of lineages which think they do not have a connection.

But could we really split?

My first online response was a bit negative, but only because Diana's suggestion was to split up by haplogroups. I see two problems with this:-

1. Focused haplogroup studies already exist, and not just here and there, but in a big way.

2. Haplogroup projects are not all equally well publicized and well supported, which in practice leads to them all having very different attendance. We want to know which haplotypes were most common compared to each other, and we will never be able to learn this by for example putting together a composite of data from all the different haplogroup projects.

There are however actually ways we could split without losing track of our aims:

1. The most obvious, which is now quite likely, is to split into Y DNA and mitochondrial DNA.

2. Another thing we could consider, is splitting into a project or projects for people with pedigrees back to a county, and people without. (This might raise the question of what the un-pedigreed project is for; or I would say, it would make the question more clear to everyone, because it is already a question within part of the current project. More about that below.)

3. In the longer run, it is possible to consider splitting into regions or counties or any other smaller units. Of course there are already some projects for some regions, and some of these have pedigree conditions like we do, but we already intend to synchronize our work as much as possible of course.

But, I hear people asking, wouldn't my point about introducing bias then apply to the counties, just like it does to haplogroups, with some counties attracting more attendance?

Sure, but this is already a big problem, sort of. Keep in mind that the bulk of all volunteer genetic genealogists are from North America, Australasia and so on, and these regions did not get the same amount of immigration from every part of the British Isles to begin with.

The good news is that we can solve this bias problem, because there are decent estimates for the population of Britain and Ireland going back in time, and split into counties and even parishes. For example even though the censuses genealogists use for Britain start in 1841, simpler counting censuses started in 1801. I plan to keep a running track of our "bias" compared to such data, so it is always in our mind. If we know the West Riding of Yorkshire should be 3% of the total, then even if for us it is 5% we have the situation under reasonable control.
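As a rough illustration of what keeping track of our "bias" could look like, here is a small Python sketch that compares each county's share of the project with its share of a census total. Only the 3% and 5% figures for the West Riding come from the example above; the "Elsewhere" row and the function name are made up so that the numbers add up.

```python
def representation_ratio(project_counts, census_population):
    """For each county return (project share) / (census share).

    A value above 1 means the county is over-represented in the project,
    below 1 means under-represented.
    """
    project_total = sum(project_counts.values())
    census_total = sum(census_population.values())
    ratios = {}
    for county, population in census_population.items():
        expected = population / census_total
        observed = project_counts.get(county, 0) / project_total
        ratios[county] = observed / expected
    return ratios


# Illustrative figures only: shares expressed per 100 people.
census = {"West Riding of Yorkshire": 3, "Elsewhere": 97}
project = {"West Riding of Yorkshire": 5, "Elsewhere": 95}

ratios = representation_ratio(project, census)
for county, ratio in ratios.items():
    print(f"{county}: {ratio:.2f}")
```

With these numbers the West Riding comes out at about 1.67 times its expected share, so dividing its counts by that ratio would be one simple way to bring it back into proportion.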

There is another question of course, and I am receiving mails about it. What are we going to do about all the people who have joined over the years, but have no pedigree back to the British Isles? This is not a super urgent problem, but it is bigger than you might think because the sheer number of them means that all other types of work are put off or made difficult. It is also an on-going problem. I am receiving questions all the time from people who want to join and are no longer clear what our criteria are.

I think our biggest concern is really just to make sure there is no misunderstanding. I think that in genetic genealogy participation has many types. We are a community in a sense, and in terms of learning from this project or helping it, no pedigree is necessary. But putting your data in the database without the required county information is not helping achieve any aims.

Steve Bird raised the concern in a nice clear way: he is quite confident of his Y-DNA's British Isles background. He just has problems pinning down the exact Birds and the exact counties. For example he has close matches who can trace back, but they are not Birds. He is interested as a researcher himself to make sure his genetic lineage is represented.

I suggested that in cases like this, where you want to make sure your DNA is represented so to speak, the obvious thing to do is to encourage your closest matches to join. After all we do not even really look at the surname. We are not a surname project. For Steve at least, this solution made sense.

...but in any case it is clear that right now we have a situation where for historical reasons the aims and joining criteria have become fuzzy and need to be clarified. We'll be clearing things up in a step by step way. We'll post updates here and there might also be a few mail outs for any key changes.

Tuesday, March 2, 2010

How is this project aiming to do something new?

First of all thank you to all the people who contributed to several useful discussions around the internet when I announced that I was joining the project's admin team and starting this blog. Some of the points raised touched upon issues I already hoped to discuss at some early point here.

There were two good forum discussions:-

1. At DNA Forums.

2. At the "GENEALOGY-DNA" Rootsweb List. You need to look at multiple threads in both their February and March archives, but most are in February.

Most of the discussion hovered around our own "elephant in the room", which is the historical accumulation of membership we have now who do not meet our aimed-for criteria of having a pedigree link back to a name-able COUNTY within Britain or Ireland. I suppose I'll keep discussing this in everything I say.

A very nice topic for a blog was given by this post, which asked "What information is gained that would not be gained thru ysearch?" Thanks warwick!

First we have to break this question up: are we asking what extra information will be gained by an individual who joins the project, or are we asking as a community what information is gained by the project as a whole, for the community as a whole.

For an individual, especially if your main interest is genealogy, I can not emphasize enough that although this project's existence can and will help you indirectly, the most important thing you need for genealogy is a surname project, or any similar tightly-focused project. Genetic genealogy revolves around those. Do not treat the British Isles Project as a replacement for that. We are working on the big picture, and background information which can support surname projects.

However, as part of a community of people wanting such support information, the project definitely does aim to provide something different to not only ysearch, but also smgf, and yhrd.

...And being different is the key. Collecting data for the British Isles is not the type of job where one can realistically aim to have the best database in every way, so that no-one need ever use another database. All such data-collecting is so fraught with problems that what we really want are several different data collections so that we can compare them, and then get a feeling for where likely biases might exist in one or all of them.

So what we aim to do differently, is as follows (but maybe people can add a few and we can come up with a standard list). The following applies to Y DNA for now...
  • We are collecting detailed SNP results, which SMGF can not easily give you.
  • A large percentage of our data will have more STR markers tested than by SMGF.
  • Our data (or right now I should say the hard core of our data) will also be high quality in terms of having pedigrees back to counties of origin.
  • Ysearch unfortunately contains many doubled-up or dummy haplotypes, as well as errors in terms of marker conversions or ancestor information, and should not be used in raw form.
(Using it in filtered form means a lot of work, and in a sense you could say we are trying to create a filtered and therefore inevitably smaller equivalent of ysearch.)

This means we are going to have results which reflect something closer to the British Isles before industrialization, and going back towards the relative stability of the Middle Ages. (See the first post.)