Sons of the Indus: The Indians

Sep 26, 2021

The past decade has seen a revolution in archaeogenetics. Access to autosomal DNA from our prehistoric hominid forefathers has served as the final piece of the puzzle in many a question about our own ethno-cultural genesis. This is partly because sequencing a genome has become far cheaper since the early days of the Human Genome Project (HGP), falling from about $100 million per genome in 2001 to just under $1000 by around 2016, and partly because of improvements in recovering aDNA (ancient DNA) from samples. The literature on this has piled up since 2015 and is now vast. We must make use of it to understand the origins of the Indian people. This post simply collates and tabulates the recent discoveries. The title will justify itself once the post is read in full.

It’s beyond the scope of this post to explain all of human prehistory, but we can note one pattern. The agricultural (Neolithic) revolution, around 11,000 ybp (years before present, i.e. before 1950 CE), launched waves of migrants who had developed farming, domesticated animals and started forming sedentary societies into regions previously occupied by purely nomadic aboriginal hunter-gatherer (HG) societies, the descendants of the Paleolithic Homo sapiens of each region. The farmers eventually replaced the HGs or mixed with them, spreading agriculture, technology, language, culture and religion as part of the “Neolithic Package”. We see this happen in Europe, with farmers from Asia Minor replacing the aboriginal HGs starting ~7000 BCE, and all over Asia (Yellow River farmers expanding into Tibet; Yangtze River farmers expanding into SE Asia and carrying the Austroasiatic languages to the region; or Japan, with a three-way admixture of Yayoi and Kofun period farmers with Jomon-like HGs).

India was no exception here. The aboriginal HGs of the country are now termed the “AASI” (Ancient Ancestral South Indian) and are presumably the descendants of the Paleolithic Homo sapiens of the region. Their lineage is deeply diverged from other Eurasian lineages and is very closely related to the HG populations of SE Asia (Hoabinhian HG, Vietnam), the aboriginals of Australia and the group that gave rise to the indigenous populations of the Andaman Islands (Onge, Jarawa). It’s important to note that AASI is a “ghost population”: its ancestry and cladal position are inferred and reconstructed, but we don’t currently have any AASI whole-genome sequences. As a result, the Onge are often used as a proxy for Indian HGs; sometimes the Paniya (a South Indian tribal group with little West Eurasian ancestry) might fit even better, as this paper suggests. Keep in mind that Indian HGs are not the same group that spawned the Onge or even the SEA HGs, only closely related to them. So at best we have imperfect proxies or reconstructions of supposed AASI ancestry. This will do for now. The harsh climate of India, which degrades aDNA, and the shoddy state of Indian archaeology make sampling much harder; add in the cremation practices of Hinduism and you can begin to see why this is a big problem for us. AASI ancestry is found in most Indian groups along a North-South cline and along a caste-based cline that transcends and even breaks regional boundaries. So we have horizontal as well as vertical differentiation. Later on, we will see the same pattern when it comes to Steppe Pastoralist ancestry in India.

Fig 2. A putative facial reconstruction of an early Holocene hunter gatherer from Uttar Pradesh. The Sarai Nahar Rai man. This must have been what the AASI in that region looked like.

We now come to the second piece of the pie: the farmers. Who were the carriers of the “Neolithic Package” to India? The earliest Neolithic sites in the Indian subcontinent are at Mehrgarh (7000 BCE). It has long been thought that Neolithic migrants from Iran brought winter-rainfall crop farming (wheat, barley) to India and adapted it locally to the monsoon season before it spread across the country. This notion, however, has been refuted by recent genetic research that sequenced a female genome from Rakhigarhi (I6113) in India, belonging to the mature Indus Valley Civilization (IVC), and found that it completely lacked any Iranian Zagros farmer ancestry. This means the Iran-HG-like component of the IVC lineage diverged from Iranian HGs more than 10,000 ybp and comes from a presently unsampled population that might always have been indigenous to the region. It also means farming was likely developed independently in India.

Fig 3. Chronological representation of Iran-like ancestry and its relation to the Indus-Valley Cline.

These early farmers in the NW of the subcontinent mixed with the aboriginal HGs (AHG, i.e. the AASI) to form the cline along which the Harappan population lay (65-90% Iran-like HG + 10-35% AHG). This would have been our ancestry in the mature IVC period (2600-2000 BCE).

Fig 4. A reconstruction of a Harappan woman from the mature Indus Valley Civilization site at Rakhigarhi, Haryana. (4500 YBP). Most Indians receive nearly 45-55% of their ancestry from the Indus Valley people.

We now come to the third (but not the final) piece of the pie: the Steppe Pastoralist ancestry. This ancestry is also pretty ubiquitous in Indians but varies from as low as 3% to as high as 40% of total admixture in different groups. It likely entered the subcontinent around 2000-1800 BCE, in the Middle-Late Bronze Age (MLBA), through the inner mountain corridor from Central Asia, leading to multiple admixture events that culminated in the Steppe migrants mixing with the IVC locals to form the modern Indian cline (1300-1200 BCE). Steppe is a vague description, so we must be specific. Narasimhan et al. 2019 use a cluster that they label ‘Central_Steppe_MLBA’ to model the Steppe admixture in Indians, and this cluster contains individuals who harbor ancestry similar to the group of Indo-Aryans that must have come to India. This Central_Steppe_MLBA group is different from the Andronovo or Sintashta. The admixture breakdown of this group is Yamnaya_EMBA (Early to Middle Bronze Age) ancestry at around 65-70%; European_EarlyFarmer at 25-30%, which is further broken down into Eastern HG (Baltic) + Anatolian farmer; and, as the final component, West Siberian Hunter Gatherer (WSHG) ancestry at around 5-8%. Let us pause and be very clear here. The Steppe component actually introduces 4 different basal lineages into the Indian subcontinent. This can be rationalized by the fact that a group of Yamnaya steppe herders entered Europe, mixed with the European aboriginals (European_EarlyFarmer) and then moved to Central Asia. On their way to India, they mixed further with Siberian HGs, probably in the Pamir corridor, and took on an additional lineage.
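To make the breakdown concrete, here is a minimal sketch of how a group's total Steppe share splits into the component lineages listed above. It uses the midpoints of the quoted ranges, which do not sum to exactly 100%, so the sketch renormalizes them; that step, the function names and the 20% example group are my own illustrative assumptions and do not come from the paper.

```python
# Midpoints of the ranges quoted above for the Central_Steppe_MLBA cluster.
# They do not sum to exactly 1, so they are renormalized (an assumption).
STEPPE_MIDPOINTS = {
    "Yamnaya_EMBA": 0.675,          # quoted as 65-70%
    "European_EarlyFarmer": 0.275,  # quoted as 25-30%
    "WSHG": 0.065,                  # quoted as 5-8%
}


def normalize(parts):
    """Rescale shares so that they sum to 1."""
    total = sum(parts.values())
    return {name: share / total for name, share in parts.items()}


def steppe_contribution(steppe_fraction, breakdown=STEPPE_MIDPOINTS):
    """Split a group's total Steppe share into its component lineages."""
    shares = normalize(breakdown)
    return {name: steppe_fraction * share for name, share in shares.items()}


if __name__ == "__main__":
    # Hypothetical group with 20% Steppe ancestry (illustrative, not measured).
    for name, value in steppe_contribution(0.20).items():
        print(f"{name}: {value:.1%}")
```

The point is only that a headline “Steppe %” is itself a bundle: most of it is Yamnaya-like, with smaller farmer and WSHG slices riding along.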

Fig. 5. A reconstruction of a Yamnaya man from Ishkinovka in the Southern Urals (5300-4700 YBP). Most upper-caste Indians get anywhere from 10-30% of their ancestry from the Yamnaya people. YBP stands for ‘Years Before Present’, i.e. before 1950 CE.

Fig. 6. A reconstruction of a Sintashta man from the Bulanovo Cemetery in the Urals (4400-3700 YBP). Most upper-caste Indians get anywhere from 20-40% of their ancestry from Sintashta-like people.

All credit must go to this group for the reconstructions posted here. On Twitter, they can be followed here.

When they came to India, they mixed with the Harappans to give us a total of 6 different lineages that formed the post-Harappan population.

Is this all? Not so. We must look at two additional lineages that are present in many Indians today. The first is found chiefly in some Northern (and NE) Indians. This is Tibetan Neolithic ancestry (Chokhopani, late Neolithic, 2700 YBP), which likely derives from ancient admixture with the Sino-Tibetan tribesmen of Ladakh and Nepal. It is a minor lineage in most of India and barely even detectable in most populations. Notable exceptions are Bengalis, Assamese, Nepalese and some Kashmiris. The second is Austroasiatic ancestry, which is found only in tribal groups such as the Munda and Ho in eastern India. This is because these groups are descendants of Austroasiatic migrants who arrived in India around 3500-3000 YBP.

Fig 7. A picture of Tibetan people. They speak Sino-Tibetan languages and most Eastern Indians get anywhere from 0-10% of their ancestry from Neolithic Tibet. Trace amounts are present in populations like Kashmiris as well. *

* The exception here are people from Nagaland, Manipur, Arunachal Pradesh and the rest of North-Eastern India (barring Assam). These NE Indians are almost entirely Tibeto-Burman in ancestry, and as expected, they speak Sino-Tibetan languages.

We now have a good picture of the ancestral lineages that make up the Indian subcontinent. What are the proportions of these lineages in modern Indians like? Let’s look at the data provided by Vagheesh Narasimhan in his most recent breakthrough paper which you can view here.

1) North West India + Pakistan

This data is taken directly from Narasimhan’s paper. It gives us a k = 3 admixture chart. As we can see, the Northernmost populations have around 55-65% Indus_Valley ancestry, 20-30% Steppe_MLBA ancestry and 10-20% AASI ancestry. The generic Punjabi sample has around 33% AASI ancestry and was collected from non-Jatt, non-Khatri migrants in the UK. This might be reflective of the backward-caste admixture of Punjab (it actually matches the Punjabi_Lahore samples perfectly). Amongst the other NW Indian groups in this dataset, the Khatri have the highest Steppe admixture at around 27% and the lowest AASI admixture at 13.8%. However, if we include the neighboring Pakistani populations, then the Kalash have the highest Steppe admixture in NWIP and the lowest AASI (but only by a marginal difference, 1-3% at most).

This dataset and model however does not look at a few things, and there are some caveats here IMO.

1) Tibetan ancestry which I thought important to check in the Northern populations.

2) Narasimhan’s model also misses out on some very important North Western groups such as the Ror of Haryana, the Kohistani, the Kamboj and the Tibetans of Leh-Ladakh.

3) Possible BMAC ancestry in NW Indian groups. Now, I am aware that Narasimhan et al. say they rule out BMAC ancestry in subcontinentals, but the issue is that the NW populations do not give passing qpAdm models in their own paper, as mentioned in the supplementary material. So it may be possible that the 3-way model works for 90% of Indian ethnic groups but fails for NWI&P groups, some of whom may need BMAC ancestry to be modelled accurately. This would make some sense as well, considering the geography. See the attached picture below. Surely, Iranics such as Pathans cannot be modelled without BMAC ancestry (found in all Iranians)?

Hence, we will look at a different G25 model (my own) this time and see the difference.

At first glance, the model seems to work quite fine. You may ask how most of these populations have 0% Onge here but showed 5-10% Onge in Narasimhan’s model. The answer is simple. Narasimhan’s model above used a single sample, I8726, for the Indus ancestry. This was the most West-Eurasian-shifted sample and had 92% Iran-like N/HG ancestry and only 8% Onge-like ancestry. We are using all the IVC samples we have (Sis2 + Gonur, a total of 11), which lie on a cline that is on average around 60% Indus/Iran-like farmer + 30% Onge + 5% Tyumen (WSHG) + 5% Anatolian Farmer (debatable).
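The bookkeeping behind this is simple enough to sketch. The Onge-like ancestry a model implies is the explicit Onge term plus whatever Onge rides inside the chosen IVC source. The two Onge-within-source fractions below are the ones quoted above; the example fits for the population are hypothetical numbers of my own, chosen only to show the effect.

```python
def total_onge(explicit_onge, ivc_share, onge_within_ivc):
    """Onge-like ancestry = explicit Onge term + Onge carried inside the IVC source."""
    return explicit_onge + ivc_share * onge_within_ivc


ONGE_IN_I8726 = 0.08      # single West-Eurasian-shifted sample, as quoted above
ONGE_IN_IVC_CLINE = 0.30  # average of the 11 IVC-cline samples, as quoted above

# Hypothetical population: both fits are illustrative, not values from either model.
with_i8726 = total_onge(explicit_onge=0.10, ivc_share=0.60,
                        onge_within_ivc=ONGE_IN_I8726)
with_cline = total_onge(explicit_onge=0.00, ivc_share=0.65,
                        onge_within_ivc=ONGE_IN_IVC_CLINE)

print(f"I8726 source: {with_i8726:.1%} total Onge-like")
print(f"cline source: {with_cline:.1%} total Onge-like")
```

So a row that reads 0% Onge under the cline source can still carry more Onge-like ancestry in total than a row that reads 10% Onge under I8726. The explicit Onge column is only comparable between models that share the same Indus source.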

Let us try another model for the groups that give problematic fits (Kalash, Yusufzai, Uthmankhel, Tarkalani; the last three are all Pakistani Pashtuns, by the way). This time we will add in Gonur_BA1 (BMAC/Turan BA). These values make a lot more sense for Iranic groups such as Pashtuns, but I am not convinced about the Kalash.

Does an Indo-Aryan group, even an isolated one, truly have 31% BMAC ancestry? What if I model them with I8726, the sample that peaks in Indus farmer-like ancestry and that Narasimhan used? What would the result be? Voila, this seems to give us the best possible fit. I think this tells us that the Kalash are the product of a group of Indus farmers that initially had very low AASI ancestry. This group then mixed with a Steppe source that perhaps carried BMAC ancestry (unfortunately we have no such samples to test the theory), or perhaps the BMAC was a later intrusion. Whatever the case, the Kalash are best modeled with I8726.
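For readers wondering what a “better fit” means here: as I understand it, G25-style modelling tools look for the mix of source coordinates that lands closest to the target in PCA space, and the leftover distance is the fit. Below is a minimal two-source sketch of that idea. The 3-dimensional coordinates are made up for illustration (real G25 coordinates have 25 dimensions), and actual tools handle many sources at once, so treat this as the principle and not as the procedure behind the models above.

```python
import math


def fit_two_sources(target, source_a, source_b):
    """Return (weight_a, distance) minimizing |target - (w*a + (1-w)*b)|, 0 <= w <= 1."""
    diff_ab = [a - b for a, b in zip(source_a, source_b)]
    diff_tb = [t - b for t, b in zip(target, source_b)]
    denom = sum(d * d for d in diff_ab)
    if denom == 0:
        raise ValueError("sources are identical")
    # Closed-form least-squares weight, clamped to a valid proportion.
    w = sum(x * y for x, y in zip(diff_tb, diff_ab)) / denom
    w = min(1.0, max(0.0, w))
    mix = [w * a + (1 - w) * b for a, b in zip(source_a, source_b)]
    return w, math.dist(target, mix)


# Hypothetical coordinates, purely illustrative.
target = [0.7, 0.3, 0.0]
source_a = [1.0, 0.0, 0.0]
source_b = [0.0, 1.0, 0.0]
source_c = [0.0, 0.0, 1.0]

w_ab, d_ab = fit_two_sources(target, source_a, source_b)
w_ac, d_ac = fit_two_sources(target, source_a, source_c)
print(f"a+b: {w_ab:.0%} a, distance {d_ab:.3f}")
print(f"a+c: {w_ac:.0%} a, distance {d_ac:.3f}")
```

Swapping one source for another changes both the proportions and the residual distance, which is why the choice between Gonur_BA1, the averaged IVC cline and I8726 matters so much for the Kalash.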

With this we come to a few conclusions about the genetics of North West Indians & Pakistanis.

On average they are 60-65% Indus and 25-30% Steppe, with trace amounts of Onge or Tibetan ancestry.
The ethnically Tibetan Balti people of Gilgit-Baltistan and Ladakh have 26% Tibetan ancestry.
The Rors of Haryana have around 40% Steppe ancestry, the highest in South-Central Asia barring Yaghnobi Tajiks (who are 45% Steppe).
Pakistanis from middle-lower castes have around 15% higher Onge ancestry than the neighboring Pakistani Jats, Gujjars, Pashtuns etc.
Kashmiris, Pashtuns, Kohistanis and Kalash all have some Tibetan ancestry, 4-5% at the most. This makes sense considering the close contact of Dardic Indo-Aryans with aboriginals like the Kiratas; in Pashtuns it might be East Eurasian admixture from Mongols or Turkics.
Pakistani Pashtuns cannot be modeled without BMAC ancestry. This is probably even more true of Afghan Pashtuns, who would likely require Chalcolithic Iranian admixture.

The Baltis are a Tibetan group, which explains their elevated Tibetan ancestry. The Ror are likely just an isolated group of Indo-Aryan pastoralists that maintained endogamy and hence have higher Steppe admixture. I had at one point considered that they might be mixed with Indo-Kushans, but I no longer think that very likely. The Punjabi from Lahore likely represent the average admixture of the lower castes (non-Jat, non-Ashraf) of the region.

2) Northern India + Central India + West India

While there are too many populations from this region to get into it, hopefully this provides a good overview of the bulk of Indo-Aryan speaking ethnic groups of India. One can notice how savarṇa castes (first 4 castes) generally have much lower AASI ancestry than the scheduled castes (dalits) of the region. The Lohana (an upper caste community) have the lowest AASI in the region at 8.9% and a solid 25% Steppe_Herder ancestry. Gujarati Brahmins are similar to the Lohana in this regard. The Bhumihar Brahmins of Bihar have the highest steppe ancestry in the region at 28.3%.

3) Southern India + Deccan

In Southern India, we start noticing that Steppe ancestry collapses dramatically across every caste group while AASI rises fast, reaching as high as 65% in the Palliyar of Tamil Nadu. The Brahmins of South India have similar but slightly lower levels of Steppe ancestry than North Indian Brahmins, and they remain the only group in the region with 15-20% Steppe ancestry. One notable exception is the Coorghi, or Kodavas, who have relatively low AASI ancestry at 25.8% coupled with pretty high IVC ancestry at 64.4%. The Kodavas were speculated to be “Scythians” in the colonial era due to their robust Caucasoid phenotype and martial tradition in the Kannada country. However, the likely answer is that the Kodava are direct descendants of the late Indus migrants from Gujarat who moved into the Deccan and Karnataka.

I took the liberty of modelling the Iyer Brahmins of Tamil Nadu, as Narasimhan’s paper did not deal with them. For some reason, Tamil Brahmins have a pretty poor fit on a 3-way model of Indus, Onge and Steppe. When one uses an Onge-rich South Indian caste like the Paniya as a proxy for their AASI ancestry, the fit is much better.

These are the results I got.

The highest of the Iyer/TamBrahm samples seemed to have around 17% Steppe ancestry while the lowest had 10% Steppe ancestry. Do keep in mind that Paniya is 60% Onge, 40% IVC in itself, but IVC is 35% Onge as well. In total, Tamil Brahmins would be 15% Steppe, 32% Onge and 53% Indus farmer/Iran Ganj Dareh.
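The totals above come from unpacking proxies that are themselves mixtures. A minimal sketch of that unpacking follows. The Paniya and IVC compositions are the ones quoted above, while the top-level fit (15% Steppe, 6% Paniya, 79% IVC) is a hypothetical set of inputs I picked because it expands to roughly the quoted totals; it is not the actual model output.

```python
PANIYA = {"Onge": 0.60, "IVC": 0.40}     # composition quoted above
IVC = {"Onge": 0.35, "Iran_like": 0.65}  # composition quoted above
DEFINITIONS = {"Paniya": PANIYA, "IVC": IVC}


def expand(model, definitions):
    """Recursively replace proxy sources with their own components."""
    out = {}
    for source, share in model.items():
        if source in definitions:
            inner = expand(definitions[source], definitions)
            for name, value in inner.items():
                out[name] = out.get(name, 0.0) + share * value
        else:
            out[source] = out.get(source, 0.0) + share
    return out


# Hypothetical top-level fit, chosen for illustration only.
model = {"Steppe": 0.15, "Paniya": 0.06, "IVC": 0.79}
basal = expand(model, DEFINITIONS)

for name, value in basal.items():
    print(f"{name}: {value:.0%}")
```

Note how little Paniya is needed: because the IVC source already carries 35% Onge, most of the final Onge figure comes through the Indus component and not through the Paniya proxy.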

4) Eastern India + Nepal

Contrary to popular stereotypes, Nepalese Brahmins (A, A2) generally have fairly high Steppe ancestry (20-25%) and pretty low Tibeto-Burman ancestry (6-8%). Nepali Bahuns (B, C) have higher Tibeto-Burman ancestry but can be a mix (see the Ba48 sample that is more like Nepali Brahmins) and non-UC Nepalis (E, F, Newar, Kusunda, Gurung, Magar, Sherpa, Tharu, Tamang) are pretty much mostly Tibeto-Burmans who originally spoke Sino-Tibetan languages.

Bengali Brahmins have slightly lower Steppe ancestry but much higher AASI ancestry with a slight 2% Tibetan component (as expected). Bangladeshi Bengalis have very high AASI ancestry at nearly 60% and a higher 6% Tibeto-Burman component. The Brahmins and non-Brahmin Bengalis are very far apart. Unfortunately, no Kayastha or Vaidya samples are available to us yet so we cannot test for all Bhadralok. Manipuri Brahmins are big outliers with 43% Tibeto-Burman and 11% South Chinese Farmer ancestry.

5) Eastern Indian Tribals

Juang, Asur, Bhumij, Gond and Santhal are chiefly Austroasiatic-speaking tribal groups. They had pretty poor fits with Onge but could be modelled successfully as a mixture of Khmers (an Austroasiatic people from Cambodia) + Paniyas (an ASI-like population with no Steppe). The Gond are Dravidian speakers, but it has long been suspected that they originally spoke an Austroasiatic language, and this shows in the DNA.

I feel we have now achieved our purpose and understood the origins of Indian ancestral lineages as well as their relative proportions in different ethnocultural groups across the country. The title of the post should make a lot of sense now. Indus ancestry is by far the most prevalent ancestry in the entire Indian subcontinent and generally makes up anywhere from 1/3 to 2/3 of our genome. It is normal for some Indian groups to have as much as 70% Indus ancestry and even at the lowest we have at least 35% Indus ancestry everywhere except the extremes of Sino-Tibetan India or in the Austroasiatic tribal groups.

Keeping this in mind, the name of our Republic is fitting.

Shiṣṭagoṣṭhī | Collegium Eruditi
