Abstract

Between 1914 and 1919 more than a hundred thousand New Zealand men (and a small number of women) enlisted to serve in the New Zealand Expeditionary Force to aid the Allied forces in their fight against Germany. By analysing biographical information available from the Auckland War Memorial Museum Cenotaph database I will find out about those who served, especially the 18,000 who died: where, how, and when they died, what patterns can be found in their deaths, and what generalisations can be made about who was more likely to live or die through the war.

Introduction

The First World War was one of the major events of the twentieth century. More than 120,000 New Zealanders enlisted and 18,000 died. My grandfather and his two brothers served in the NZEF and left behind a stack of letters written home during the war. These letters led to my interest in the experiences of those who served. My grandfather was wounded in the chest by shell fragments three days after entering the trenches, rendering him unfit for further service (and perhaps saving him from a grim death in New Zealand’s deadliest battle). He lived to marry the widow of his brother, who had survived months as a water carrier near the front lines, only to return to New Zealand and die of influenza five years after the war ended. Their younger brother was a brilliant scholar whose recurring illnesses kept him away from the front lines, leading to a job offer as an Army instructor, then a scholarship at Cambridge University. Through sheer good luck these three brothers lived through the war, but two of their cousins were not so lucky. One died of disease contracted at the Trentham training camp and was buried before leaving New Zealand. His younger brother was caught up in the battle of Passchendaele and died from the effects of gas poisoning. Learning about the diverse stories of these men showed me that there was no “typical” war story. The reasons men went to war, the places they served and the experiences they had that led to their survival – or their death – were as varied as the men themselves.

The Data Source

The Auckland War Memorial Museum Cenotaph Database pulls together information about New Zealand’s service personnel from many different sources. The site was originally conceived to memorialise those who died at war, but has since expanded to include all service personnel who have since died. It purports to include “almost all of those who served in WWI”. The data is searchable through a front-end on the museum website (http://www.aucklandmuseum.com/war-memorial/online-cenotaph/search) where it is assembled into an individual web page for each person who served. It is also available in raw format through an API (http://api.aucklandmuseum.com/).

Workflow

Accessing the Data

The cenotaph data is accessible through an API using either a simple string search or a SPARQL query. As I am unfamiliar with the SPARQL query language (and since queries of this type have greater restrictions on use) I decided to use a simple search for the string “World War I, 1914-1918”. In theory this should be the value of the “war” field for all of those who served in World War One. In fact the search returned 96,203 results – short of the 103,188 results obtained using the web site’s front-end. I’m not sure why there is a discrepancy in this number (if anything I would expect a general string search to return more results than a search on a single field). The front-end doesn’t offer the full results of a search in a downloadable format, so I had to make do with the 96,203 results from the API. These were returned in JSON format. I fetched them in page sizes of 1000 and saved them as local JSON files (the code for this is contained in the notebook “Download JSON”).
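The paging logic can be sketched roughly as follows. This is a minimal sketch, not the code from the “Download JSON” notebook: the actual API request is not shown in this write-up, so `fetch_page` is a hypothetical callable standing in for it, and the file naming is an assumption.

```python
import json
import math
from pathlib import Path


def download_pages(fetch_page, total_results, out_dir, page_size=1000):
    """Fetch results page by page and save each page as a local JSON file.

    fetch_page(offset, size) is a hypothetical stand-in for the real API
    request; it must return a JSON-serialisable object.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    n_pages = math.ceil(total_results / page_size)
    paths = []
    for page in range(n_pages):
        offset = page * page_size
        # The last page may hold fewer than page_size records
        size = min(page_size, total_results - offset)
        records = fetch_page(offset, size)
        path = out_dir / f"page_{page:03d}.json"
        with open(path, "w", encoding="utf-8") as f:
            json.dump(records, f)
        paths.append(path)
    return paths
```

With 96,203 results and a page size of 1000 this would write 97 files, the last holding 203 records.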

Converting to CSV

The JSON structure returned was hierarchical, so the next step was to flatten these files so they could be loaded into a dataframe (this was carried out in the notebook “Import from JSON”). I used a convenient library called FlattenJSON to convert the lists inside the JSON file into individual fields. This created an extremely unwieldy number of columns (one for every potential value of a list), so at this stage I also filtered the columns down to a subset that looked like it could be useful or interesting. I then saved these flattened tables as individual CSV files.
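To illustrate what flattening does, here is a small hand-written sketch. It is not the library's implementation, and the record is made up; the `_0`, `_1` suffixes are chosen to match column names such as `serviceNumber_0` that appear later in this notebook.

```python
def flatten(obj, parent_key="", sep="_"):
    """Flatten nested dicts and lists into a single-level dict."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
            items.update(flatten(value, new_key, sep))
    elif isinstance(obj, list):
        # Each list position becomes its own field: name_0, name_1, ...
        for i, value in enumerate(obj):
            new_key = f"{parent_key}{sep}{i}" if parent_key else str(i)
            items.update(flatten(value, new_key, sep))
    else:
        items[parent_key] = obj
    return items


# Made-up record, for illustration only
record = {
    "familyName": "Example",
    "serviceNumber": ["1/234", "5/678"],
    "embarkation": [{"rankOnEmbarkation": "Private"}],
}
flat = flatten(record)
```

Because every list position becomes a separate column, a single person with many list entries widens the whole table, which is why the column count grows so quickly.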

Data wrangling and cleaning

This was carried out and documented in the notebook “Data Wrangling”. I first assembled the individual CSV files into one dataframe, then went through each set of fields and determined the volume and quality of data, whether any cleaning or converting needed to take place and whether or not the data would potentially be useful for analysis. A number of field sets that had multiple values were separated out into “normalized” dataframes, reducing the number of redundant columns (which were often mostly filled with NaN values). This drastically reduced the overall memory requirement. This set of dataframes was then re-saved as CSV files, ready for analysis in this notebook.
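The normalisation step might look something like the sketch below. The table is made up and the exact approach in the “Data Wrangling” notebook may differ; the idea is to move a set of sparse, repeated columns into a long table with one row per value and an `index` column pointing back to the main record.

```python
import pandas as pd

# Made-up wide table: one column per possible list position
wide = pd.DataFrame(
    {
        "familyName": ["A", "B", "C"],
        "serviceNumber_0": ["1/1", "2/2", "3/3"],
        "serviceNumber_1": [None, "2/9", None],
        "serviceNumber_2": [None, None, None],
    }
)

value_cols = [c for c in wide.columns if c.startswith("serviceNumber_")]

# One row per (record, value); 'index' points back to the row in `wide`
service_numbers = (
    wide[value_cols]
    .reset_index()
    .melt(id_vars="index", var_name="position", value_name="serviceNumber")
    .dropna(subset=["serviceNumber"])
    .sort_values(["index", "position"])
    .reset_index(drop=True)
)

# The main table no longer needs the sparse columns
main = wide.drop(columns=value_cols)
```

The long table stores only the values that exist, so the mostly empty columns disappear from the main dataframe.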

Quality of the Data and Assumptions

Considering the dataset has been assembled from many varied sources, it is inevitable that the quality is extremely variable. Transcription errors are evident throughout the data and I have picked these up as I have gone as best as I can.

One big question I had about the data is why a reasonable chunk of records is completely missing. Out of around 120,000 who enlisted, only 73,000 or so records remain after the data wrangling/cleaning phase. While this is still more than enough to proceed with analysis, there is a question of which records are missing and why. If they represent a particular type of record this could easily invalidate some of the conclusions. However since I have no knowledge of why these records are missing I will proceed with the assumption that they are randomly distributed through the data. This assumption applies to any field I have analysed where there is a large amount of missing data. Often it is unclear whether missing data implies there is semantically no value for that field or whether the value is unknown.

The biggest disappointment is in the date of birth field, which has values for only 5,700 or so records. The age of soldiers would seem to be a defining characteristic of their experience at war and it is a pity that this is accessible for such a small proportion of the total.

The most complete part of the data is the death fields. Of the 18,000 who died because of the war, 15,000 deaths are recorded in the data, with well-used fields for place, age and cause of death. For this reason I chose to focus mostly on this aspect.
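A quick way to quantify this kind of completeness, and to probe the assumption that missing values are spread randomly, is sketched below on made-up records. The column names follow the ones used later in this notebook; the numbers are purely illustrative.

```python
import numpy as np
import pandas as pd

# Made-up records, for illustration only
toy = pd.DataFrame(
    {
        "diedDuringWar": [True, True, False, False],
        "birthDate": ["1890-01-01", np.nan, np.nan, np.nan],
        "placeOfDeath": ["France", "Belgium", np.nan, np.nan],
    }
)

# Share of records with a value in each field
completeness = toy.notna().mean().sort_values(ascending=False)

# Is birthDate recorded equally often for those who died and those who lived?
birth_known_by_group = toy["birthDate"].notna().groupby(toy["diedDuringWar"]).mean()
```

If a field is recorded much more often for one group than the other, the assumption that the missing values are randomly distributed deserves more caution for that field.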

In [1]:
import matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from pylab import rcParams
In [2]:
# Set some Pandas options as you like
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 40)
pd.set_option('display.max_rows', 300)

rcParams['figure.figsize'] = (12, 10)
rcParams['font.size'] = 20
rcParams['axes.facecolor'] = 'white'

#this line enables the plots to be embedded into the notebook
%matplotlib inline
matplotlib.style.use('ggplot')

Main Analysis

In [3]:
df = pd.read_csv('datasets/ww1.csv', parse_dates=['enlistment_ww1_dateOfEnlistment','dateOfDeath','birthDate'], index_col=0, low_memory=False)

Analysing ages

What is the distribution of the ages of the soldiers who died?

In [4]:
df[df['diedDuringWar']]['ageAtDeath'].describe()
Out[4]:
count    9841.000000
mean       27.427802
std         6.229984
min        16.000000
25%        23.000000
50%        26.000000
75%        31.000000
max        64.000000
Name: ageAtDeath, dtype: float64
In [5]:
sns.color_palette("deep")
sns.set(style="darkgrid")
_ = sns.boxplot(df[df['diedDuringWar']]['ageAtDeath'], orient="h")
_ = plt.xlim(10,70)
_ = plt.title("Distribution of age at death", fontsize=20)
In [6]:
_ = df[df['diedDuringWar']]['ageAtDeath'].hist(bins=46, facecolor="gray", figsize=(12,5))
_ = plt.xlim(15,64)
_ = plt.xlabel("Age at Death")
_ = plt.ylabel("Count")
_ = plt.title("Histogram of age at death")

The age at death data is right skewed with a median of 26 and a mean of 27.4. Most of the soldiers who died during the war were young men, the majority under thirty. A handful of outliers were older than their mid-forties.

Why were the men who died mostly young? It could be the case that younger men are more reckless than older men – ready to face the action head-on. Older men could be more cautious and concerned with self-preservation.

However these ages may just reflect the ages of the soldiers as a whole. It makes sense that the majority of men who volunteered were quite young, as they would have had more of a thirst for adventure and fewer responsibilities to tie them to home. Although conscription applied to men aged between 20 and 45, single men were the first to be conscripted; married men were not called up until October 1917 and men with children not until April 1918. (https://nzhistory.govt.nz/war/recruiting-and-conscription)

The oldest person to die in the war was in his sixties. Who was he and how did he die?

In [7]:
df[df['ageAtDeath']>60][['firstName','familyName','serviceNumber_0','causeOfDeath','placeOfDeath']]
Out[7]:
       firstName          familyName  serviceNumber_0  causeOfDeath     placeOfDeath
35329  Ferdinand Campion  Batchelor   3/313            Died of Disease  Dunedin/Otago/New Zealand

Ferdinand Campion Batchelor was a respected medical practitioner who served in Egypt and died on his return to New Zealand.

https://www.teara.govt.nz/en/1966/batchelor-ferdinand-campion

Is the distribution of the ages of men who died different from those who didn’t?

Since age at death can’t be compared between men who died and men who didn’t die during the war (obviously those who didn’t die during the war have no value for this field) I’ll instead compare the enlistment ages between the two groups.

Unfortunately there is much less data on the ages of the men who didn’t die during the war, so I’d first like to make sure there is enough data for both groups to compare.

In [8]:
df[['diedDuringWar','enlistment_ww1_ageAtEnlistment']].groupby(['diedDuringWar']).count()
Out[8]:
               enlistment_ww1_ageAtEnlistment
diedDuringWar
False          2204
True           1145

This is a small proportion of the total, but enough to compare the two groups. I am assuming that records with this field are randomly distributed through the total population and not representative of a particular subset (though I am aware this assumption could be incorrect).

In [9]:
df[['enlistment_ww1_ageAtEnlistment','diedDuringWar']].groupby('diedDuringWar').describe()
Out[9]:
                      enlistment_ww1_ageAtEnlistment
diedDuringWar
False          count  2204.000000
               mean   27.829401
               std    7.303854
               min    15.000000
               25%    22.000000
               50%    26.000000
               75%    32.000000
               max    60.000000
True           count  1145.000000
               mean   26.421834
               std    6.304570
               min    14.000000
               25%    21.000000
               50%    25.000000
               75%    30.000000
               max    55.000000

The summary statistics seem to indicate that men who died were indeed younger than those who didn’t: they have lower mean, median and quartile values. A boxplot and histogram will give a more visual representation of this information.

In [10]:
sns.color_palette("deep")
sns.set(style="darkgrid")
_ = sns.boxplot(x="enlistment_ww1_ageAtEnlistment", y="diedDuringWar", data=df, orient='h')
_ = plt.ylabel('Died During War')
_ = plt.xlabel('Age at Enlistment')
_ = plt.title("Boxplots of ages of those who died during war and those who survived", fontsize=15)
In [11]:
g = sns.FacetGrid(df, col="diedDuringWar", height=6, sharey=False)
_ = g.map(plt.hist, "enlistment_ww1_ageAtEnlistment", bins=20)

These plots show that in general men who died during the war were younger when they enlisted than those who lived through the war. The distribution of the ages of men who lived through the war has a longer tail and more outliers implying that older men were more likely to live through the war.
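One way to check whether a difference like this could plausibly be due to chance is a permutation test. This is not part of the original analysis; the sketch below is a generic two-sided permutation test on the difference in means, shown on made-up ages rather than the real data.

```python
import numpy as np


def permutation_test_mean_diff(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means between a and b."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        # Shuffle the group labels by shuffling the pooled values
        shuffled = rng.permutation(pooled)
        diff = shuffled[: len(a)].mean() - shuffled[len(a):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    # Add one to numerator and denominator so the p-value is never exactly zero
    p_value = (count + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up enlistment ages, for illustration only
died = np.arange(20, 40)
survived = died + 10
observed_diff, p = permutation_test_mean_diff(died, survived, n_permutations=2000)
```

On the real data the two inputs would be the non-missing enlistment ages for each value of `diedDuringWar`.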

Is this because younger men are more reckless and have less regard for their own lives? Before coming to that conclusion, it is worth considering this in more depth. There could be other influencing factors on age at enlistment that lead to a higher death rate.

Age and Rank

A third variable that might have an effect on both age and death rate is rank. Presumably older men would hold higher ranks, and higher ranked men may be less likely to die (higher ranked officers may be less likely to be directly involved in battle).

To compare rank I will use the embarkation data, as this gives the most complete information on rank. I’ll compare the ages using the age at embarkation field. Each soldier may have multiple embarkations, as some would have returned to New Zealand on leave then embarked a second time (or more). I’ll look at the soldier’s age at first embarkation, which generally occurred several months after enlistment.

Not every soldier who died has embarkation data – some died during their training in New Zealand. We need to be aware that by using the embarkation fields we are actually excluding those who died before they left New Zealand.

The embarkations have already been ordered by date, so by removing the duplicates of the index (which is an index into the dataframe so represents individual records) we are left with each person’s first embarkation.
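The approach relies on the rows already being in date order. A small made-up example shows the idea, with an explicit sort added in case the ordering cannot be taken for granted:

```python
import pandas as pd

# Made-up embarkations; 'index' identifies the person, as in the real table
toy = pd.DataFrame(
    {
        "index": [7, 3, 7, 3, 9],
        "embarkation_embarkationDate": pd.to_datetime(
            ["1917-02-01", "1915-06-10", "1914-10-16", "1916-01-05", "1918-03-20"]
        ),
    }
)

# Sort by person and date, then keep the earliest row for each person
first = toy.sort_values(["index", "embarkation_embarkationDate"]).drop_duplicates(
    subset=["index"], keep="first"
)
```

Without the sort, `drop_duplicates` would simply keep whichever row for each person happened to come first in the file.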

In [12]:
embarkations_df = pd.read_csv('datasets/ww1_embarkations.csv', index_col=0, parse_dates =['embarkation_embarkationDate'])

first_embarkations = embarkations_df.drop_duplicates(subset=['index'])

Looking at the rank field – how many separate ranks are there?

In [13]:
len(first_embarkations['embarkation_rankOnEmbarkation'].drop_duplicates())
Out[13]:
101

The ten most common ranks:

In [14]:
first_embarkations['embarkation_rankOnEmbarkation'].value_counts().head(10)
Out[14]:
Private              33150
Rifleman              8176
Trooper               7843
Corporal              3158
Lance Corporal        2962
Gunner                2653
Sergeant              2620
Sapper                1805
Driver                1285
Second Lieutenant      767
Name: embarkation_rankOnEmbarkation, dtype: int64

The rank entries look fairly consistent. There are 101 different ranks being used, but these mostly reflect different rank names in the different divisions. Since I want to be able to compare the ranks by their level I’ll only use New Zealand Army ranks (these are by far the most common), and assign each a rank order, grouping the four bottom ranks together then using ascending numbers for ascending ranks.

In [15]:
rank = ['Gunner','Trooper','Sapper','Private','Lance Corporal','Corporal','Sergeant','Second Lieutenant','Lieutenant',
       'Captain','Major','Lieutenant Colonel','Colonel','Major General','Lieutenant General']

rank_levels = pd.DataFrame({'rank': rank, 'rank_order': [1,1,1,1,2,3,4,5,6,7,8,9,10,11,12]})
rank_levels
Out[15]:
    rank                rank_order
0   Gunner              1
1   Trooper             1
2   Sapper              1
3   Private             1
4   Lance Corporal      2
5   Corporal            3
6   Sergeant            4
7   Second Lieutenant   5
8   Lieutenant          6
9   Captain             7
10  Major               8
11  Lieutenant Colonel  9
12  Colonel             10
13  Major General       11
14  Lieutenant General  12

I’ll use an inner join to merge the first_embarkations with rank_levels – this will add rank order to first embarkations data frame and remove the extraneous ranks.

In [16]:
embarkations_mainranks = first_embarkations.merge(rank_levels, left_on='embarkation_rankOnEmbarkation', right_on='rank')

Is there a difference in age at embarkation between the different ranks?

In [17]:
sns.color_palette("deep")
sns.set(style="darkgrid")

_ = sns.boxplot(x="embarkation_ageAtEmbarkation", y="rank_order", data=embarkations_mainranks, orient="h", palette="deep")
_ = plt.title("Boxplots of embarkation age by rank order", fontsize=20)

From rank 1 up to rank 6 the ages are very similar, though there is a general trend of the minimum ages increasing and a slight increase in the medians. Ranks 7 through 10 show a definite increase in age with rank level, along with an increase in spread. The middle halves of the data rise significantly with each higher rank.

There is an unfortunate lack of data for the very highest ranks (only 1 data point for rank 11 – Major General, and none for 12 – Lieutenant General). My conclusion would be that age does generally increase with increase in rank – not as much in the lower ranks, but significantly from Captain upwards.

Is there a difference in age at death between the ranks?

In [18]:
merged_embarkations = pd.merge(df, embarkations_mainranks, left_index=True, right_on='index', how="left")
In [19]:
sns.color_palette("deep")
sns.set(style="darkgrid")

_ = sns.boxplot(x="ageAtDeath", y="rank_order", data=merged_embarkations[merged_embarkations['diedDuringWar']], orient="h", palette="deep")
_ = plt.title("Boxplots of death age by rank order", fontsize=20)

This does in fact show a very similar pattern of increasing age with rank (excepting the first three ranks). This lends evidence to the theory that the rank of soldiers may be an important influence in the overall trend of age of death.

Does the death rate vary between the ranks?

A stereotype of war is of highly ranked men sending the lower ranked recruits into the fray while they wait the battle out in safety. If this is true we should see a reduction in death rate for higher ranked soldiers.

I’ll add another field called simply “died”, which is an integer cast from the diedDuringWar boolean field. This is an easy way of calculating death rates (the mean value is equivalent to the death rate).
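The reason this works is that the mean of a column of 0s and 1s is just the proportion of 1s. A tiny made-up example:

```python
import pandas as pd

# Made-up records, for illustration only
toy = pd.DataFrame(
    {
        "rank_order": [1, 1, 1, 1, 5, 5],
        "diedDuringWar": [True, False, False, False, True, True],
    }
)
toy["died"] = toy["diedDuringWar"].astype(int)

# One of four died at rank 1, both died at rank 5
rates = toy.groupby("rank_order")["died"].mean()
```

Here `rates` holds 0.25 for rank 1 and 1.0 for rank 5, which are exactly the shares of deaths in each group.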

In [20]:
df['died'] = df['diedDuringWar'].astype(int)
merged_embarkations['died'] = merged_embarkations['diedDuringWar'].astype(int)

First I’ll examine the total numbers embarked of each rank compared to the numbers who died.

In [21]:
_ = plt.figure(figsize=(12,8))
_ = plt.xticks(rotation=90)
_ = plt.title("Counts of those served and died by rank", fontsize=20)
_ = sns.countplot(y="rank", data=merged_embarkations, order=rank, color="Gray")
_ = sns.countplot(y="rank", data=merged_embarkations[merged_embarkations['diedDuringWar']], order=rank, color="Black")

It is interesting to visualize the distributions of the different ranks, but because of the wildly different total counts this plot isn’t very useful in determining whether the death rates vary between the ranks.

Next I’ll plot death rates for each rank by averaging the “died” field.

In [22]:
plt.figure(figsize=(12,8))
plt.title("Death rates by rank", fontsize=20)
_ = sns.barplot(x="rank_order", y="died", data=merged_embarkations, color="gray")

There is no obvious trend in this plot. The death rates are very similar up to rank 4 (Sergeant). Surprisingly rank 5 and 6 (Second Lieutenant and Lieutenant) have higher death rates, then there is a marked decrease for rank 7 (Captain). Rank 8, 9 and 10 (Major, Lieutenant Colonel, Colonel) have very similar death rates though slightly lower than ranks 1-4.

It is worth noting that ranks up to 4 (Sergeant) are non-commissioned ranks, while 5 (Second Lieutenant) and above are commissioned officers. My understanding is that commissioned officers were professionals recruited into the Army while the non-commissioned officers were mostly volunteers or conscripted men who had been promoted to a higher rank through good conduct. This may explain why ranks 1 through 4 show a different pattern to 5 and above.

Second Lieutenant and Lieutenant (rank 5 and 6) are the lowest of the commissioned ranks – these men may have been responsible for leading companies into battle. It is interesting that their death rates are markedly higher than even the lowest ranked soldiers. This could indicate that these men took their leadership roles seriously and were willing to risk themselves for the men they were leading.

The higher ranked men probably had less of a hands-on role in battle, but these rates show that they were by no means completely removed from the dangers of war. These men probably would have also served for much longer than the volunteers of ranks 1-4 (many of whom joined up mid-way through the war), so may have been more likely to be involved in a higher number of deadly battles and hence have more chance of being killed.

Is there any interaction between age, rank and probability of dying during the war?

First I’ll look at the three age fields together – age at enlistment, embarkation and death, dropping the records that don’t have both an age of enlistment and embarkation.

In [23]:
rank_ages = pd.pivot_table(merged_embarkations.dropna(subset=['enlistment_ww1_ageAtEnlistment',
                                                                    'embarkation_ageAtEmbarkation'
                                                                    ])
                           ,index='rank_order', values=['enlistment_ww1_ageAtEnlistment',
                                                                    'embarkation_ageAtEmbarkation',
                                                                    'ageAtDeath'], 
                           aggfunc=np.median, columns="diedDuringWar")
rank_ages
Out[23]:
               enlistment_ww1_ageAtEnlistment  embarkation_ageAtEmbarkation  ageAtDeath
diedDuringWar  False  True                     False  True                   False  True
rank_order
1              24.0   23.0                     24     23.5                   NaN    25.0
2              22.0   22.5                     22     23.5                   NaN    24.0
3              23.5   22.0                     24     23.0                   NaN    24.0
4              24.0   23.5                     24     24.0                   NaN    27.0
5              25.0   28.0                     26     28.0                   NaN    28.5
6              24.0   26.0                     24     26.0                   NaN    27.0
7              31.0   32.5                     31     32.5                   NaN    35.0
8              38.5   38.0                     40     38.0                   NaN    38.5
9              49.0   55.0                     49     55.0                   NaN    56.0
10             57.5   NaN                      58     NaN                    NaN    NaN
In [24]:
rcParams['figure.figsize'] = 12,8
_ = rank_ages.plot(kind="line", color=['green','green','gray','gray','black','black'], style=['-','--','-','--','-','--'])

_ = plt.ylabel("Median Age")
_ = plt.xlabel("Rank Level")
_ = plt.legend(loc=2,title="Age at..., Died During War")

This clearly shows a pattern of increasing age with rank. The ages are fairly even up to rank 4 (Sergeant), then increase at rank 5 (Second Lieutenant) and decrease for rank 6 (Lieutenant). This decrease is unexpected and may indicate that there is something unusual about this rank that I don’t understand (one possibility is that outstanding leaders could have been promoted straight to this rank, short-cutting the lower ranks). As expected, the ages then climb steeply as the rank gets higher.

The dotted lines on this graph indicate the median age of men who died during the war. These are generally higher (with the exception of rank level 3 – Corporal) than for the men who didn’t die. This is a strong indication that rank does have an influence on the ages of men who died.

Looking at just the age at embarkation in more detail:

In [25]:
plt.figure(figsize=(12,8))
_ = sns.pointplot(x="rank_order", y="embarkation_ageAtEmbarkation", hue="diedDuringWar", 
              data=merged_embarkations.dropna(subset=['embarkation_ageAtEmbarkation']), linestyles=['-','--'], ci=None)

This pattern is slightly different to the plot above, probably reflecting the increase in amount of data (this data includes records that don’t have an enlistment age).

The main trend for men who died during the war is that they were generally older the higher their rank. However in ranks 5 and 6 (Second Lieutenant and Lieutenant) the men who lived were actually younger than those who died. These are the lowest of the commissioned ranks so many of these younger men may have been very inexperienced and given less responsibility, which kept them out of the main action.

We have seen that rank is a strong influence on age and that men in higher ranks were slightly less likely to die. To answer the question of whether younger men were more likely to die because they were more reckless, my overall judgment is that rank does explain some, but not all, of the effect of age on death rate. It might be somewhat true that younger men were more reckless, but it is also somewhat true that they were generally lower ranked and hence more likely to be involved in conflict.
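A more direct way to separate the two effects would be to compare death rates by age within each rank group. This was not done in the analysis above; the sketch below shows the mechanics on made-up records, with the age bands and the commissioned/other split chosen purely for illustration.

```python
import numpy as np
import pandas as pd

# Made-up records, for illustration only
toy = pd.DataFrame(
    {
        "embarkation_ageAtEmbarkation": [19, 22, 24, 28, 33, 35, 41, 23, 27, 38],
        "rank_order": [1, 1, 1, 2, 5, 6, 8, 1, 5, 7],
        "died": [1, 0, 1, 0, 1, 0, 0, 0, 1, 0],
    }
)

# Band the ages: [0, 25), [25, 35), [35, 100)
toy["age_band"] = pd.cut(
    toy["embarkation_ageAtEmbarkation"],
    bins=[0, 25, 35, 100],
    right=False,
    labels=["under 25", "25-34", "35+"],
)
# Rank order 5 (Second Lieutenant) and above are commissioned
toy["rank_group"] = np.where(toy["rank_order"] >= 5, "commissioned", "other ranks")

# Death rate for each combination of age band and rank group
rates = toy.groupby(["age_band", "rank_group"], observed=True)["died"].mean().unstack()
```

If the age effect disappeared within each rank group, rank would explain it; if younger men still died at higher rates within the same group, something else is at work.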

Date of embarkation and survival rate

What effect did the date of embarkation have on chances of surviving the war?

Another reason for the difference between the ages of those who died and didn’t die might be that the men who enlisted earlier in the war were more likely to have died (simply because they were at war longer). I can test this by looking at boxplots of the years that the men were born (rather than their age).

In [26]:
sns.color_palette("deep")
sns.set(style="darkgrid")
_ = sns.boxplot(x=df['birthDate'].dt.year, y=df["diedDuringWar"], orient='h')
_ = plt.ylabel('Died During War')
_ = plt.xlabel('Year of Birth')
_ = plt.xlim(1840, 1905)
_ = plt.title("Year of birth of those who survived the war and those who died", fontsize=15)

These distributions are almost identical apart from a few outliers (representing much older men who didn’t die).

We can get an idea of the effect of enlisting early on the death rate by comparing the distributions of numbers of embarkations by date for the men who died and didn’t die during the war.

In [27]:
embarkation_dates = merged_embarkations[merged_embarkations['embarkation_embarkationDate'].dt.year >= 1914]['embarkation_embarkationDate'].dropna()
merged_embarkations.loc[:,'days_from_1914'] = (embarkation_dates - pd.to_datetime("1914-01-01")).dt.days
In [28]:
g = sns.FacetGrid(merged_embarkations, col="diedDuringWar", height=6, sharey=False)
_ = g.map(plt.hist, "days_from_1914", bins=30)

The distribution for the men who didn’t die reflects the embarkation distribution as a whole. Many men enlisted almost straight away and were sent away very early in the war. Embarkations then dropped off before picking up again around day 1,000 (counted from 1 January 1914) – this corresponds to late 1916, just after conscription was introduced.
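The day counts on these plots are measured from 1 January 1914 rather than from the outbreak of war, so converting them back to calendar dates is a useful sanity check. In this sketch the war start date (Britain's declaration of war on 4 August 1914) is my own addition for comparison.

```python
import pandas as pd

origin = pd.Timestamp("1914-01-01")  # the reference date used for days_from_1914
date_at_day_1000 = origin + pd.Timedelta(days=1000)

# Britain's declaration of war, added here for comparison
war_start = pd.Timestamp("1914-08-04")
days_into_war = (date_at_day_1000 - war_start).days
```

Day 1,000 on the axis falls on 27 September 1916, which is 785 days after the declaration of war.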

The distribution for men who did die is much more right skewed. More of the men who died enlisted in the first half than the second half of the war.

Next I’ll look at death rates by date of embarkation by grouping the embarkation date and taking the mean of the “died” field.

In [29]:
embarkation_series = merged_embarkations[['died','embarkation_embarkationDate']].groupby(merged_embarkations['embarkation_embarkationDate']).mean()

embarkation_series.head()
Out[29]:
                             died
embarkation_embarkationDate
1899-10-21 00:00:00          0.0
1912-08-01                   1.0
1914-01-01                   0.4
1914-08-04                   0.0
1914-08-11                   1.0

Those first two dates are way before the war started, so I’ll delete them.

In [30]:
embarkation_series.drop(pd.Timestamp('1899-10-21 00:00:00'), inplace=True)
embarkation_series.drop(pd.Timestamp('1912-08-01 00:00:00'), inplace=True)

Resample into quarters to create a time series.

In [31]:
quarterly_embarkations = embarkation_series.resample('Q').mean()
In [32]:
rcParams['figure.figsize'] = 15, 8
_ = plt.plot(quarterly_embarkations)

Death rates for those who embarked up until December 1916 were high, though variable. The middle of 1915 was an especially bad time to embark (astoundingly, over half died), but those who embarked towards the end of that year fared much better (a death rate of around 20%). From December 1916 onwards the death rates started to fall. Fewer than 10% of those who embarked between mid 1917 and the start of 1918 died. The peak in mid 1918 is surprising. By this time the war was coming to an end (it ended on 11 November 1918). Maybe these “fresh” troops were used extensively in the final push through France, replacing the troops who had already spent many months in the trenches.

So men who enlisted earlier were generally more likely to die than those who enlisted later, though the relationship between time served and death rate is not as clear cut as I would have anticipated. It seems that more factors were involved than just an increasing probability of being killed with each passing day.
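One caveat about the quarterly figures: they are averages of per-date averages, so a sailing date with a handful of men counts as much as one with thousands. A pooled rate, where every person counts equally, can differ noticeably. The made-up example below shows the two calculations side by side; it is a sketch of the distinction, not a re-analysis of the real data.

```python
import pandas as pd

# Made-up data: one man embarks on 1 April and dies, nine embark on 1 May and survive
toy = pd.DataFrame(
    {
        "embarkation_embarkationDate": pd.to_datetime(
            ["1915-04-01"] * 1 + ["1915-05-01"] * 9
        ),
        "died": [1] + [0] * 9,
    }
)
dates = toy["embarkation_embarkationDate"]

# Mean of per-date means: each sailing date counts equally
per_date = toy.groupby(dates)["died"].mean()
mean_of_means = per_date.groupby(per_date.index.to_period("Q")).mean()

# Pooled rate: each person counts equally
pooled = toy.groupby(dates.dt.to_period("Q"))["died"].mean()
```

For this toy quarter the mean of means is 0.5 while the pooled death rate is 0.1, so extreme quarterly values may partly reflect dates with very few embarkations.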

Were younger men more reckless than older men?

If younger men were more reckless than older men we should see a pattern of younger men dying sooner after joining up.

To test whether this is the case I’ll create a field that records the number of days between first embarkation and date of death.

In [33]:
deaths_df = df[df['diedDuringWar']]
embarkation_death = deaths_df.merge(first_embarkations, left_index=True, right_on="index")[['index','embarkation_embarkationDate',
                                                                                           'dateOfDeath','ageAtDeath',
                                                                                          'enlistment_ww1_ageAtEnlistment',
                                                                                          'embarkation_ageAtEmbarkation']]
embarkation_death['embarkationToDeath'] = (embarkation_death['dateOfDeath'] - embarkation_death['embarkation_embarkationDate']).dt.days

Examining the field to test for the validity of the data:

In [34]:
embarkation_death['embarkationToDeath'].describe()
Out[34]:
count    14913.000000
mean       444.418628
std        314.677184
min       -300.000000
25%        210.000000
50%        333.000000
75%        601.000000
max       1881.000000
Name: embarkationToDeath, dtype: float64

The minimum value of -300 indicates there are some data errors here. How many negative values are there?

In [35]:
len(embarkation_death[embarkation_death['embarkationToDeath']<0])
Out[35]:
3

Only 3 records have a negative number of days until death, so I’ll proceed, but remove those values.

In [36]:
embarkation_death.loc[embarkation_death['embarkationToDeath'] < 0, 'embarkationToDeath'] = np.nan

First I’ll look at the overall distribution of days from embarkation to death.

In [37]:
_ = embarkation_death['embarkationToDeath'].plot(kind="hist", bins=50, color="gray")
_ = plt.xlabel('Number of days from Embarkation to Death')
_ = plt.title("Distribution of number of days from embarkation to death", fontsize=20)

This is a right skewed distribution with a long right tail. The median of 333 days indicates that half of the troops in this data died within about 11 months of leaving New Zealand. However a quarter of those who died survived longer than 19 months (600 days). A handful survived longer than 4 years. The shape of this plot suggests this data could be a good fit for a gamma distribution.
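As a rough check on the gamma suggestion, the shape and scale can be estimated from the mean and standard deviation by the method of moments. This is only a sketch: it uses the summary statistics printed in Out[34] (which still include the three negative values), and a proper fit would use the cleaned data and a maximum likelihood routine.

```python
# Summary statistics copied from the describe() output in Out[34]
mean_days = 444.418628
std_days = 314.677184

variance = std_days ** 2

# Method of moments for a gamma distribution:
#   mean = shape * scale, variance = shape * scale**2
shape = mean_days ** 2 / variance
scale = variance / mean_days  # in days
```

This gives a shape of roughly 2 and a scale of a little over 220 days. A shape above 1 corresponds to a distribution that rises to a peak before tailing off, which matches the histogram.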

Next I’ll plot the relationship between age at death and number of days from embarkation until death by using a scatter plot.

In [38]:
_ = plt.figure(figsize=(12,8))
_ = plt.scatter(embarkation_death['ageAtDeath'], embarkation_death['embarkationToDeath'],  color="black", alpha=0.5)
_ = plt.xlabel('Age at Death')
_ = plt.ylabel('Number of days from Embarkation to Death')
_ = plt.xlim(14,70)
_ = plt.ylim(0,2000)

Visually, there does look to be a general increasing trend between age and number of days to death, but this is probably related to the increasing minimum ages, which is just a consequence of the troops getting older the longer they had been at war. Plotting days to death against age at enlistment would show a clearer result.

In fact I’ll plot it against age at embarkation since there are more results for this field and the first embarkation date would be quite closely related to enlistment date. This would exclude deaths that happened before embarkation, but those deaths would not have been related to battle so I feel there is justification in excluding them.

First I’ll plot age at enlistment against age at embarkation to test my assumption that there is a direct relationship between these two fields.

In [39]:
_ = plt.scatter(merged_embarkations['enlistment_ww1_ageAtEnlistment'], merged_embarkations['embarkation_ageAtEmbarkation'], color="black", alpha=0.1)
_ = plt.xlabel("Age at Enlistment")
_ = plt.ylabel("Age at First Embarkation")
_ = plt.title("Relationship between age at enlistment and first embarkation", fontsize=15)

As expected there is a very strong relationship here, with only a few outliers (some of which are obviously bad data since it is unlikely embarkation age would be younger than enlistment age).

In [40]:
_ = plt.figure(figsize=(12,8))
_ = plt.scatter(embarkation_death['embarkation_ageAtEmbarkation'], embarkation_death['embarkationToDeath'], color="black", alpha=0.5)
_ = plt.ylabel('Number of days from Embarkation to Death')
_ = plt.xlabel('Age at Embarkation')
_ = plt.ylim(0,1800)
_ = plt.xlim(14,60)
_ = plt.title("Relationship between age at embarkation and number of days from embarkation to death", fontsize=15)

There doesn’t seem to be any strong relationship between these two fields. There is certainly a high concentration of younger soldiers who were killed not long after their embarkation date, but this reflects the right skewness of both the ages and days to death data.
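
A rank correlation would put a number on this visual impression, and it suits skewed data better than a Pearson correlation. The following is a minimal sketch on made-up data: the toy frame borrows the column names used above, but its values are randomly generated and independent by construction, so the result says nothing about the real records.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ages and days to death drawn independently.
rng = np.random.default_rng(1)
n = 2000
toy = pd.DataFrame({
    "embarkation_ageAtEmbarkation": rng.integers(18, 46, size=n),
    "embarkationToDeath": rng.gamma(shape=1.5, scale=280.0, size=n),
})

# Spearman's rho compares ranks, so long right tails do not dominate it.
rho = toy.corr(method="spearman").loc[
    "embarkation_ageAtEmbarkation", "embarkationToDeath"
]
print(f"Spearman rho: {rho:.3f}")
```

A value near zero on the real embarkation_death frame would support the reading that there is no strong relationship between the two fields.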

An improvement to this would be to only include deaths that occurred as a result of the fighting (excluding disease and sickness related deaths for example). I’ll come back to this later when I examine cause of death.

Place of death

What patterns are there in the places where casualties occurred?

First I need to organise the place of death field into manageable categories.

In [41]:
df['placeOfDeath'].value_counts().head(10)
Out[41]:
France                          2540
Somme/Northern France/France    2327
Gallipoli/Turkey                2085
Ypres/Belgium                   1930
Belgium                         1469
Bapaume/France                   523
Havrincourt/France               395
Le Cateau/France                 395
New Zealand                      362
Palestine/Middle East/Asia       324
Name: placeOfDeath, dtype: int64

The field is generally presented as a series of localities separated by slashes, running from most to least specific, like town/country or town/region/country. Unfortunately there appear to be some inconsistencies and generalisations (for instance just “France” with no more specific information for a large number of records).

I’ll split the field into its most general part (“region”, the last item) and the next most general part (“place”, the second-to-last item).

In [42]:
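def last_item(alist):
    """Pop and return the last item of a list; NaN or an empty list gives NaN."""
    if alist != alist: # Check for NaN (NaN is never equal to itself)
        return alist
    if len(alist) > 0:
        return alist.pop() # Removes the item, so the list is modified in place
    else:
        return np.nan

places = df['placeOfDeath'].str.split('/')
# The order of these two calls matters: the first pop takes the region (last item),
# the second pop then takes the place (the item that was second to last).
regions = places.apply(last_item)
place_names = places.apply(last_item)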

df['placeOfDeath_region'] = regions
df['placeOfDeath_place'] = place_names
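
The same split can also be written without modifying the lists in place, by indexing from the end with the pandas .str accessor. This is an alternative sketch rather than the approach used above, and the sample frame is hypothetical, built from a few values seen in the placeOfDeath counts.

```python
import numpy as np
import pandas as pd

# Hypothetical sample using values seen in the placeOfDeath counts.
sample = pd.DataFrame({
    "placeOfDeath": [
        "Somme/Northern France/France",
        "Ypres/Belgium",
        "France",
        np.nan,
    ]
})

parts = sample["placeOfDeath"].str.split("/")

# .str[-1] is the last item (region); .str[-2] is the item before it (place).
# An out-of-range index or a missing value yields NaN rather than an error.
sample["placeOfDeath_region"] = parts.str[-1]
sample["placeOfDeath_place"] = parts.str[-2]
print(sample)
```

Because nothing is popped from the lists, the two assignments can be run in either order and rerun safely.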

Set an “Unspecified” value in the place field where placeOfDeath only gives a general region.

In [43]:
df.loc[df['placeOfDeath_region'].notnull() & df['placeOfDeath_place'].isnull(), 'placeOfDeath_place'] = "Unspecified"

A categorical column will allow the places to be sorted by country order instead of name.

In [44]:
place_order = df[['placeOfDeath_place','placeOfDeath_region']].sort_values(by='placeOfDeath_region')['placeOfDeath_place'].drop_duplicates().dropna()

df['placeOfDeath_place'] = pd.Categorical(df['placeOfDeath_place'], categories=place_order, ordered=False)
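
To illustrate what the categorical column buys, here is a small sketch on a hypothetical four-row frame. The column names place and region are shortened stand-ins for the real fields, and a stable sort is used so that places within the same region keep their original order.

```python
import pandas as pd

# Hypothetical miniature of the place and region columns.
toy = pd.DataFrame({
    "place": ["Ypres", "Bapaume", "Gallipoli", "Messines"],
    "region": ["Belgium", "France", "Turkey", "Belgium"],
})

# Category order follows region, not the alphabetical order of place names.
order = toy.sort_values(by="region", kind="stable")["place"].drop_duplicates()
toy["place"] = pd.Categorical(toy["place"], categories=order, ordered=False)

# Sorting a categorical column uses the category order defined above.
sorted_places = toy.sort_values("place")["place"].tolist()
print(sorted_places)
```

The Belgian places come out together ahead of Bapaume and Gallipoli, whereas a plain string sort would have put Bapaume first.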

Find places that have a death count of more than 100, display in a table and plot.

In [45]:
place_counts = df[['placeOfDeath_region','placeOfDeath_place','diedDuringWar']].groupby(['placeOfDeath_region','placeOfDeath_place']).count()
place_counts[place_counts['diedDuringWar'] > 100]
Out[45]:
                                        diedDuringWar
placeOfDeath_region  placeOfDeath_place
Asia                 Middle East                  329
Belgium              Unspecified                 1469
                     Ypres                       1956
                     Messines                     208
Egypt                Unspecified                  242
England              Unspecified                  131
France               Unspecified                 2540
                     Bapaume                      523
                     Northern France             2336
                     Le Cateau                    395
                     Havrincourt                  395
                     Le Quesnoy                   120
New Zealand          Unspecified                  362
                     Wellington                   179
Turkey               Gallipoli                   2183
United Kingdom       Unspecified                  258
In [46]:
_ = plt.figure(figsize=(12,8))
_ = place_counts[place_counts>100].dropna().unstack().plot(kind="bar", stacked=True, 
                                                           color=['gray','#4C72B0',
                                                                   '#55A868','#c44e52',
                                                                   '#8172b2','#ccb974','#64b5cd','#4a5468','#4c7054',
                                                                  '#895d5e',
                                                                  '#421cc0'])
_ = plt.title("Death counts by region", fontsize=20)

Overall, France had by far the highest number of deaths, followed by Belgium and Turkey. The three specific places that stand out as having a high number of deaths are Ypres, Northern France and Gallipoli. These correspond with the three major campaigns of the war: Gallipoli, Flanders Fields and The Somme. It is perhaps surprising that a large number of deaths occurred in New Zealand (and in particular Wellington). The main training camp of Trentham was situated in Upper Hutt, so Wellington has probably been given as the location of the deaths that occurred there. I will explore this further later.

When and where did the most deaths occur?

I’ll use a strip plot to plot death by date in places where 100 or more deaths occurred.

In [47]:
war_deaths = df.reset_index()[df.reset_index()['diedDuringWar']][['dateOfDeath',
                                                                  'placeOfDeath_region',
                                                                  'placeOfDeath_place',
                                                                  'causeOfDeath',
                                                                  'ageAtDeath','index']].dropna(how="all")
war_deaths['placeOfDeath_full'] = war_deaths['placeOfDeath_place'].astype("string").str.cat(war_deaths['placeOfDeath_region'], sep=", ")

war_deaths.sort_values(by="placeOfDeath_region", inplace=True)

most_death_places = war_deaths['placeOfDeath_full'].value_counts()[war_deaths['placeOfDeath_full'].value_counts() > 100].index.values
war_deaths_subset = war_deaths[war_deaths['placeOfDeath_full'].isin(most_death_places)]
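
The filter above calls value_counts() twice on the same column. A slightly tidier sketch computes the counts once and reuses them. The toy frame and the threshold of 1 are hypothetical stand-ins for war_deaths and the cutoff of 100 used in this section.

```python
import pandas as pd

# Hypothetical stand-in for war_deaths with only the combined place column.
toy_deaths = pd.DataFrame({
    "placeOfDeath_full": ["Ypres, Belgium"] * 3
    + ["Gallipoli, Turkey"] * 2
    + ["Wellington, New Zealand"]
})

threshold = 1  # the notebook uses 100

# Count each place once, then keep the places above the threshold.
counts = toy_deaths["placeOfDeath_full"].value_counts()
frequent_places = counts[counts > threshold].index

subset = toy_deaths[toy_deaths["placeOfDeath_full"].isin(frequent_places)]
print(subset["placeOfDeath_full"].value_counts())
```
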
In [48]:
_ = plt.figure(figsize=(12,8))
_ = sns.stripplot(x=war_deaths_subset['dateOfDeath'], y=war_deaths_subset['placeOfDeath_full'], jitter=True, linewidth=0, 
                  size=5, alpha=.03, color="black")
_ = plt.title('Date spread of deaths by place', fontsize=20)

This plot clearly shows the pattern of deaths according to the place they occurred, with darker regions indicating times of higher concentrations of deaths. Long and deadly campaigns like Gallipoli (April 1915 to January 1916) and the Battle of the Somme (July to November 1916) can be clearly identified. The darkest splotches highlight the most intense conflicts and their length: the Battle of Messines (Ypres and Belgium, mid 1917), the Battle of Passchendaele (October 1917), the Battle of Bapaume (August 1918) and the short but intense battles that liberated the French towns of Le Cateau, Le Quesnoy and Havrincourt. The pattern for France (with no specific place) shows almost continuous deaths from early in 1916, gradually dropping off, then heightening as the push was made through France in 1918. The deaths in New Zealand and Wellington show a gradual trickle that becomes slightly more concentrated from the start to the finish of the war, with a short burst late in 1918, which may relate to a disease outbreak. The deaths in the UK show several more concentrated patches that may also be disease outbreaks.

Cause of death

What is the distribution of cause of death?

A quick look at the Cause of Death field:

In [49]:
war_deaths['causeOfDeath'].value_counts()
Out[49]:
Killed in Action                                                                            9456
Died of wounds                                                                              3497
Died of Disease                                                                             1272
Died of Sickness                                                                             378
Accidental Death                                                                             272
Died after Discharge from Wounds Inflicted or Disease Contracted while on Active Service      93
Killed on Active Service                                                                      29
Killed or died while a Prisoner of War                                                        17
Suicide                                                                                        5
Court Martialled, Executed, Pardoned                                                           5
Died from Natural Causes                                                                       4
Name: causeOfDeath, dtype: int64

The data in this field is very consistently entered.

I’m not sure about some of these categorisations though. Why is there a distinction between disease and sickness? What does Killed on Active Service mean if it’s not being killed in action or accidentally? What are natural causes if it’s not sickness or disease?

Let’s look at a plot of the counts of the different causes of death.

In [50]:
_ = plt.figure(figsize=(12,8))
_ = plt.xticks(rotation=90)
_ = sns.countplot(y="causeOfDeath", data=df, color="Gray")
_ = plt.title("Counts of death by cause", fontsize=20)

Unsurprisingly, Killed in Action was by far the most common way to die, followed by Died of wounds. However, a surprisingly large number also died from sickness and disease, as well as from accidents.

I’ll filter by the most common cause of death values in order to produce a cleaner plot against place of death.

In [51]:
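# Assign the result back rather than using inplace=True on a selected column,
# which recent pandas versions may apply to a temporary copy instead.
war_deaths['causeOfDeath'] = war_deaths['causeOfDeath'].replace('Died after Discharge from Wounds Inflicted or Disease Contracted while on Active Service','Died after Discharge')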
cause_subset = war_deaths['causeOfDeath'].value_counts().head(6).index.values
In [52]:
_ = plt.figure(figsize=(12,8))
_ = plt.xticks(rotation=90)
_ = sns.countplot(x="placeOfDeath_region", hue="causeOfDeath", data=war_deaths[war_deaths['causeOfDeath'].isin(cause_subset)], 
              order=['New Zealand','England','United Kingdom','Turkey','France','Belgium','Egypt','Asia'], palette="deep")
_ = plt.title("Cause of death by region", fontsize=20)
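
Because the regions differ so much in total deaths, raw counts can hide the mix of causes within a smaller region. A possible numeric companion to the plot is a cross-tabulation normalised by row, sketched here on a hypothetical six-row frame that borrows the real column names and category labels but not the real counts.

```python
import pandas as pd

# Hypothetical rows; the real frame would be war_deaths filtered to cause_subset.
toy = pd.DataFrame({
    "placeOfDeath_region": [
        "France", "France", "France",
        "New Zealand", "New Zealand", "Turkey",
    ],
    "causeOfDeath": [
        "Killed in Action", "Killed in Action", "Died of wounds",
        "Died of Disease", "Accidental Death", "Killed in Action",
    ],
})

# normalize="index" turns each row into proportions that sum to 1,
# so regions with very different totals can be compared directly.
shares = pd.crosstab(
    toy["placeOfDeath_region"], toy["causeOfDeath"], normalize="index"
)
print(shares.round(2))
```

Each row then reads as the share of that region's deaths attributed to each cause.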