Abstract
Between 1914 and 1919 more than 120,000 New Zealand men (and a small number of women) enlisted to serve in the New Zealand Expeditionary Force to aid the Allied forces in their fight against Germany. By analysing biographical information available from the Auckland War Memorial Museum Cenotaph database I will find out about those who served, especially the 18,000 who died: where, how, and when they died, what patterns can be found in their deaths, and what generalisations can be made about who was more likely to live or die through the war.
Introduction
The First World War was one of the major events of the twentieth century. More than 120,000 New Zealanders enlisted and 18,000 died. My grandfather and his two brothers served in the NZEF and left behind a stack of letters written home during the war. These letters led to my interest in the experiences of those who served. My grandfather was wounded in the chest by shell fragments three days after entering the trenches, rendering him unfit for further service (and perhaps saving him from a grim death in New Zealand’s deadliest battle). He lived to marry the widow of his brother, who had survived months as a water carrier near the front lines, only to return to New Zealand and die of influenza 5 years after the war ended. Their younger brother was a brilliant scholar whose recurring illnesses kept him away from the front lines, leading to a job offer as an Army instructor, then a scholarship at Cambridge University. Through sheer good luck these three brothers lived through the war, but two of their cousins were not so lucky. One died of disease contracted at the Trentham training camp and was buried before leaving New Zealand. His younger brother was caught up in the Battle of Passchendaele and died from the effects of gas poisoning. Learning about the diverse stories of these men showed me that there was no “typical” war story. The reasons men went to war, the places they served and the experiences they had that led to their survival – or their death – were as varied as the men themselves.
The Data Source
The Auckland War Memorial Museum Cenotaph Database pulls together information about New Zealand’s service personnel from many different sources. The site was originally conceived to memorialise those who died at war, but has since expanded to include all service personnel who have since died. It purports to include “almost all of those who served in WWI”. The data is searchable through a front-end on the museum website (http://www.aucklandmuseum.com/war-memorial/online-cenotaph/search) where it is assembled into an individual web page for each person who served. It is also available in raw format through an API (http://api.aucklandmuseum.com/).
Workflow
Accessing the Data
The cenotaph data is accessible through an API using either a simple string search or a SPARQL query. As I am unfamiliar with the SPARQL query language (and since queries of this type have greater restrictions on use) I decided to use a simple search for the string “World War I, 1914-1918”. In theory this should be the value of the “war” field for all of those who served in World War One. In fact the search returned 96,203 results – short of the 103,188 results obtained using the web site’s front-end. I’m not sure why there is a discrepancy in this number (if anything I would expect a general string search to return more results than a search on a single field). The front-end doesn’t offer the full results of a search in a downloadable format, so I had to make do with the 96,203 results from the API. These were returned in JSON format. I fetched them in page sizes of 1000 and saved them as local JSON files (the code for this is contained in the notebook “Download JSON”).
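The paging step described above can be sketched as follows. This is a minimal illustration rather than the contents of the “Download JSON” notebook: `fetch_page` is a hypothetical callable standing in for the real HTTP request to the museum API, and the file naming is an assumption.

```python
import json
import tempfile
from pathlib import Path


def download_pages(fetch_page, out_dir, page_size=1000):
    """Save every page returned by fetch_page as its own JSON file.

    fetch_page(offset, size) is a hypothetical callable that returns a list
    of records, and an empty list once the results are exhausted.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    offset, saved = 0, []
    while True:
        records = fetch_page(offset, page_size)
        if not records:
            break
        path = out_dir / f"results_{offset:06d}.json"
        path.write_text(json.dumps(records))
        saved.append(path)
        offset += len(records)
    return saved


# Stand-in for the real API call: 2,500 fake records served in pages.
fake_records = [{"id": i} for i in range(2500)]


def fake_fetch(offset, size):
    return fake_records[offset:offset + size]


with tempfile.TemporaryDirectory() as tmp:
    files = download_pages(fake_fetch, tmp)
    page_sizes = [len(json.loads(p.read_text())) for p in files]

print(page_sizes)  # [1000, 1000, 500]
```

Keeping each page as a separate file means an interrupted download can be resumed from the last saved offset instead of starting again.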
Converting to CSV
The JSON structure returned was hierarchical, so the next step was to flatten these files so they could be loaded into a dataframe (this was carried out in the notebook “Import from JSON”). I used a convenient library called FlattenJSON to convert the lists inside the JSON file into individual fields. This created an extremely unwieldy number of columns (one for every potential value of a list), so at this stage I also filtered the columns down to a subset that looked like they could be useful or interesting. I then saved these flattened tables as individual CSV files.
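To show what flattening does to a nested record, here is a hand-rolled sketch. It is not the library's code, and the sample record is invented; the underscore-joined naming is an assumption based on column names such as `serviceNumber_0` that appear later in this notebook.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into a single-level dict.

    Keys are joined with underscores and list positions become numeric
    suffixes, e.g. {"serviceNumber": ["12/345"]} -> {"serviceNumber_0": "12/345"}.
    """
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    flat = {}
    for key, value in items:
        name = f"{prefix}_{key}" if prefix else str(key)
        flat.update(flatten(value, name))
    return flat


# Invented record with one list of scalars and one list of objects.
record = {
    "familyName": "Smith",
    "serviceNumber": ["12/345"],
    "embarkation": [
        {"rankOnEmbarkation": "Private"},
        {"rankOnEmbarkation": "Corporal"},
    ],
}
flat_record = flatten(record)
print(flat_record)
```

Every extra list element adds another set of columns, which is why the flattened tables become so wide.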
Data wrangling and cleaning
This was carried out and documented in the notebook “Data Wrangling”. I first assembled the individual CSV files into one dataframe, then went through each set of fields and determined the volume and quality of data, whether any cleaning or converting needed to take place and whether or not the data would potentially be useful for analysis. A number of field sets that had multiple values were separated out into “normalized” dataframes, reducing the number of redundant columns (which were often mostly filled with NaN values). This drastically reduced the overall memory requirement. This set of dataframes was then re-saved as CSV files, ready for analysis in this notebook.
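The normalizing step can be illustrated on a toy table. This is only a sketch of the idea: the wide column names (`embarkation_0_date`, `embarkation_1_date`) are hypothetical, and the real wrangling notebook may have done this differently. The `index` column plays the same role as the `index` column in the embarkations file used later.

```python
import pandas as pd

# Toy wide table: one row per person, one column per possible embarkation.
wide = pd.DataFrame(
    {
        "familyName": ["Smith", "Jones", "Brown"],
        "embarkation_0_date": ["1914-10-16", "1916-05-01", None],
        "embarkation_1_date": ["1917-02-03", None, None],
    }
)

parts = []
for i in range(2):
    col = f"embarkation_{i}_date"
    part = wide[[col]].rename(columns={col: "embarkation_date"})
    part["index"] = wide.index  # pointer back to the person's row
    parts.append(part)

# Stack the per-position slices and drop the empty slots.
embarkations = (
    pd.concat(parts, ignore_index=True)
    .dropna(subset=["embarkation_date"])
    .sort_values(["index", "embarkation_date"])
    .reset_index(drop=True)
)

# The main table no longer needs the mostly-empty repeated columns.
people = wide.drop(columns=["embarkation_0_date", "embarkation_1_date"])
print(embarkations)
```

Six mostly-empty cells in the wide table become three populated rows in the long one, which is where the memory saving comes from.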
Quality of the Data and Assumptions
Considering the dataset has been assembled from many varied sources, it is inevitable that the quality is extremely variable. Transcription errors are evident throughout the data and I have corrected these as best I can as I have gone.
One big question I had about the data is why a reasonable chunk of records is completely missing. Out of around 120,000 who enlisted, only 73,000 or so records remain after the data wrangling/cleaning phase. While this is still more than enough to proceed with analysis, there is a question of which records are missing and why. If they represent a particular type of record this could easily invalidate some of the conclusions. However since I have no knowledge of why these records are missing I will proceed with the assumption that they are randomly distributed through the data. This assumption applies to any field I have analysed where there is a large amount of missing data. Often it is unclear whether missing data implies there is semantically no value for that field or whether the value is unknown.
The biggest disappointment is in the date of birth field, which has values for only 5,700 or so records. The age of soldiers would seem to be a defining characteristic of their experience at war and it is a pity that this is accessible for such a small proportion of the total.
The most complete part of the data is the death fields. Out of 18,000 who died because of the war, 15,000 of these deaths are recorded in the data, with well-used fields for place, age and cause of death. For this reason I chose to mostly focus on this aspect.
import matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rcParams
# Set some Pandas options as you like (full option names avoid ambiguous matches)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 40)
pd.set_option('display.max_rows', 300)
rcParams['figure.figsize'] = (12, 10)
rcParams['font.size'] = 20
rcParams['axes.facecolor'] = 'white'
#this line enables the plots to be embedded into the notebook
%matplotlib inline
matplotlib.style.use('ggplot')
Main Analysis
df = pd.read_csv('datasets/ww1.csv', parse_dates=['enlistment_ww1_dateOfEnlistment','dateOfDeath','birthDate'], index_col=0, low_memory=False)
Analysing ages
What is the distribution of the ages of the soldiers who died?
df[df['diedDuringWar']]['ageAtDeath'].describe()
sns.color_palette("deep")
sns.set(style="darkgrid")
_ = sns.boxplot(df[df['diedDuringWar']]['ageAtDeath'], orient="h")
_ = plt.xlim(10,70)
_ = plt.title("Distribution of age at death", fontsize=20)
_ = df[df['diedDuringWar']]['ageAtDeath'].hist(bins=46, facecolor="gray", figsize=(12,5))
_ = plt.xlim(15,64)
_ = plt.xlabel("Age at Death")
_ = plt.ylabel("Count")
_ = plt.title("Histogram of age at death")
The age at death data is right skewed with a median of 26 and a mean of 27.4. Most of the soldiers who died during the war were young men, the majority under thirty. A handful of outliers were older than their mid-forties.
Why were the men who died mostly young? It could be the case that younger men are more reckless than older men – ready to face the action head-on. Older men could be more cautious and concerned with self-preservation.
However these ages may just reflect the ages of the soldiers as a whole. It makes sense that the majority of men who volunteered were quite young as they would have had more of a thirst for adventure and fewer responsibilities to tie them to home. Although conscription applied to men aged between 20 and 45, single men were the first to be conscripted; married men were not called up until October 1917 and men with children not until April 1918. (https://nzhistory.govt.nz/war/recruiting-and-conscription)
The oldest person to die in the war was in his sixties. Who was he and how did he die?
df[df['ageAtDeath']>60][['firstName','familyName','serviceNumber_0','causeOfDeath','placeOfDeath']]
Ferdinand Campion Batchelor was a respected medical practitioner who served in Egypt and died on his return to New Zealand.
https://www.teara.govt.nz/en/1966/batchelor-ferdinand-campion
Is the distribution of the ages of men who died different from those who didn’t?
Since the age at death field can’t be compared between men who died and didn’t die during the war (obviously those who didn’t die during the war have no value for this field) I’ll instead compare the enlistment ages between the two groups.
Unfortunately there is much less age data for the men who didn’t die during the war, so I’d first like to make sure there is enough data for both groups to compare.
df[['diedDuringWar','enlistment_ww1_ageAtEnlistment']].groupby(['diedDuringWar']).count()
This is a small proportion of the total, but enough to compare the two groups. I am assuming that records with this field are randomly distributed through the total population and not representative of a particular subset (though I am aware this assumption could be incorrect).
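One way to get a feel for how uneven the coverage is would be to compare the share of records that have an enlistment age in each group. The sketch below uses invented toy values, not the real data, and it cannot prove that the missing values are random; it only shows whether one group is much better covered than the other.

```python
import numpy as np
import pandas as pd

# Toy data: enlistment age is missing for some men in each group.
toy = pd.DataFrame(
    {
        "diedDuringWar": [True, True, True, True, False, False, False, False],
        "enlistment_ww1_ageAtEnlistment": [21, 24, np.nan, 30, 22, np.nan, np.nan, np.nan],
    }
)

# Share of records in each group that have an enlistment age recorded.
coverage = (
    toy["enlistment_ww1_ageAtEnlistment"]
    .notnull()
    .groupby(toy["diedDuringWar"])
    .mean()
)
print(coverage)
```

A large gap between the two shares, as in this toy example, would be a reason to treat the comparison between the groups with extra caution.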
df[['enlistment_ww1_ageAtEnlistment','diedDuringWar']].groupby('diedDuringWar').describe()
The summary statistics seem to indicate that men who died were indeed younger than those who didn’t: they have lower mean, median and quartile values. A boxplot and histogram will give a more visual representation of this information.
sns.color_palette("deep")
sns.set(style="darkgrid")
_ = sns.boxplot(x="enlistment_ww1_ageAtEnlistment", y="diedDuringWar", data=df, orient='h')
_ = plt.ylabel('Died During War')
_ = plt.xlabel('Age at Enlistment')
_ = plt.title("Boxplots of ages of those who died during war and those who survived", fontsize=15)
g = sns.FacetGrid(df, col="diedDuringWar", height=6, sharey=False)  # 'size' was renamed to 'height' in newer seaborn
_ = g.map(plt.hist, "enlistment_ww1_ageAtEnlistment", bins=20)
These plots show that in general men who died during the war were younger when they enlisted than those who lived through the war. The distribution of the ages of men who lived through the war has a longer tail and more outliers implying that older men were more likely to live through the war.
Is this because younger men are more reckless and have less regard for their own lives? Before coming to that conclusion, it is worth considering this in more depth. There could be other factors influencing age at enlistment that lead to a higher death rate.
Age and Rank
A third variable that might have an effect on both age and death rate is rank. Presumably older men would hold higher ranks, and higher ranked men may be less likely to die (higher ranked officers may be less likely to be directly involved in battle).
To compare rank I will use the embarkation data, as this gives the most complete information on rank. I’ll compare the ages using the age at embarkation field. Each soldier may have multiple embarkations as some would have returned to New Zealand for a leave then embarked a second time (or more). I’ll look at the age of the soldier’s first embarkation, which generally occurred several months after their enlistment.
Not every soldier who died has embarkation data – some died during their training in New Zealand. We need to be aware that by using the embarkation fields we are actually excluding those who died before they left New Zealand.
The embarkations have already been ordered by date, so by removing the duplicates of the index (which is an index into the dataframe so represents individual records) we are left with each person’s first embarkation.
embarkations_df = pd.read_csv('datasets/ww1_embarkations.csv', index_col=0, parse_dates =['embarkation_embarkationDate'])
first_embarkations = embarkations_df.drop_duplicates(subset=['index'])
Looking at the rank field – how many separate ranks are there?
len(first_embarkations['embarkation_rankOnEmbarkation'].drop_duplicates())
The ten most common ranks:
first_embarkations['embarkation_rankOnEmbarkation'].value_counts().head(10)
The rank entries look fairly consistent. There are 101 different ranks being used but these are mostly related to different rank names in the different divisions. Since I want to be able to compare the ranks by their level I’ll only use New Zealand Army ranks (these are by far the most common), and assign each a rank order, grouping the four bottom ranks together then using ascending numbers for ascending ranks.
rank = ['Gunner','Trooper','Sapper','Private','Lance Corporal','Corporal','Sergeant','Second Lieutenant','Lieutenant',
'Captain','Major','Lieutenant Colonel','Colonel','Major General','Lieutenant General']
rank_levels = pd.DataFrame({'rank': rank, 'rank_order': [1,1,1,1,2,3,4,5,6,7,8,9,10,11,12]})
rank_levels
I’ll use an inner join to merge the first_embarkations with rank_levels – this will add rank order to first embarkations data frame and remove the extraneous ranks.
embarkations_mainranks = first_embarkations.merge(rank_levels, left_on='embarkation_rankOnEmbarkation', right_on='rank')
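The effect of the inner join can be checked on toy data. This is a small sketch with invented rows; “Rifleman” stands in for any rank that is not in the lookup table. A left join with `indicator=True` is one way to see which rows the inner join silently drops.

```python
import pandas as pd

# Toy embarkations, including one rank that is not in the lookup table.
toy_embarkations = pd.DataFrame(
    {
        "index": [0, 1, 2, 3],
        "embarkation_rankOnEmbarkation": ["Private", "Rifleman", "Sergeant", "Trooper"],
    }
)
toy_levels = pd.DataFrame(
    {"rank": ["Private", "Trooper", "Sergeant"], "rank_order": [1, 1, 4]}
)

# merge() defaults to how="inner": rows without a matching rank are dropped.
matched = toy_embarkations.merge(
    toy_levels, left_on="embarkation_rankOnEmbarkation", right_on="rank"
)

# A left join with indicator=True shows which rows would be lost.
audit = toy_embarkations.merge(
    toy_levels,
    left_on="embarkation_rankOnEmbarkation",
    right_on="rank",
    how="left",
    indicator=True,
)
dropped = audit.loc[audit["_merge"] == "left_only", "embarkation_rankOnEmbarkation"].tolist()
print(len(matched), dropped)
```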
Is there a difference in age at embarkation between the different ranks?
sns.color_palette("deep")
sns.set(style="darkgrid")
_ = sns.boxplot(x="embarkation_ageAtEmbarkation", y="rank_order", data=embarkations_mainranks, orient="h", palette="deep")
_ = plt.title("Boxplots of embarkation age by rank order", fontsize=20)
From rank 1 up to rank 6 the ages are very similar, though there is a general trend of the minimum ages increasing and a slight increase in the medians. Ranks 7 through 10 show a definite increase in age with rank level, along with an increase in spread. The middle halves of the data significantly increase with each higher rank.
There is an unfortunate lack of data for the very highest ranks (only 1 data point for rank 11 – Major General, and none for 12 – Lieutenant General). My conclusion would be that age does generally increase with increase in rank – not as much in the lower ranks, but significantly from Captain upwards.
Is there a difference in age at death between the ranks?
merged_embarkations = pd.merge(df, embarkations_mainranks, left_index=True, right_on='index', how="left")
sns.color_palette("deep")
sns.set(style="darkgrid")
_ = sns.boxplot(x="ageAtDeath", y="rank_order", data=merged_embarkations[merged_embarkations['diedDuringWar']], orient="h", palette="deep")
_ = plt.title("Boxplots of death age by rank order", fontsize=20)
This does in fact show a very similar pattern of increasing age with rank (excepting the first three ranks). This lends evidence to the theory that the rank of soldiers may be an important influence in the overall trend of age of death.
Does the death rate vary between the ranks?
A stereotype of war is of highly ranked men sending the lower ranked recruits into the fray while they wait the battle out in safety. If this is true we should see a reduction in death rate for higher ranked soldiers.
I’ll add another field called simply “died”, which is an integer cast from the diedDuringWar boolean field. This is an easy way of calculating death rates (the mean value is equivalent to the death rate).
df['died'] = df['diedDuringWar'].astype(int)
merged_embarkations['died'] = merged_embarkations['diedDuringWar'].astype(int)
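As a quick check of why the mean of a 0/1 column is a rate, here is the same cast applied to six invented records: the mean is simply the number of ones divided by the number of rows.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "rank": ["Private", "Private", "Private", "Private", "Captain", "Captain"],
        "diedDuringWar": [True, False, False, False, True, False],
    }
)
toy["died"] = toy["diedDuringWar"].astype(int)

overall_rate = toy["died"].mean()                   # 2 deaths out of 6 men
rate_by_rank = toy.groupby("rank")["died"].mean()   # deaths / men within each rank
print(overall_rate)
print(rate_by_rank)
```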
First I’ll examine the total numbers embarked of each rank compared to the numbers who died.
_ = plt.figure(figsize=(12,8))
_ = plt.xticks(rotation=90)
_ = plt.title("Counts of those served and died by rank", fontsize=20)
_ = sns.countplot(y="rank", data=merged_embarkations, order=rank, color="Gray")
_ = sns.countplot(y="rank", data=merged_embarkations[merged_embarkations['diedDuringWar']], order=rank, color="Black")
It is interesting to visualize the distributions of the different ranks, but because of the wildly different total counts this plot isn’t very useful in determining whether the death rates vary between the ranks.
Next I’ll plot death rates for each rank by averaging the “died” field.
plt.figure(figsize=(12,8))
plt.title("Death rates by rank", fontsize=20)
_ = sns.barplot(x="rank_order", y="died", data=merged_embarkations, color="gray")
There is no obvious trend in this plot. The death rates are very similar up to rank 4 (Sergeant). Surprisingly rank 5 and 6 (Second Lieutenant and Lieutenant) have higher death rates, then there is a marked decrease for rank 7 (Captain). Rank 8, 9 and 10 (Major, Lieutenant Colonel, Colonel) have very similar death rates though slightly lower than ranks 1-4.
It is worth noting that ranks up to 4 (Sergeant) are non-commissioned ranks, while 5 (Second Lieutenant) and above are commissioned officers. My understanding is that commissioned officers were professionals recruited into the Army while the non-commissioned officers were mostly volunteers or conscripted men who had been promoted to a higher rank through good conduct. This may explain why ranks 1 through 4 show a different pattern to 5 and above.
Second Lieutenant and Lieutenant (rank 5 and 6) are the lowest of the commissioned ranks – these men may have been responsible for leading companies into battle. It is interesting that their death rates are markedly higher than even the lowest ranked soldiers. This could indicate that these men took their leadership roles seriously and were willing to risk themselves for the men they were leading.
The higher ranked men probably had less of a hands-on role in battle, but these rates show that they were by no means completely removed from the dangers of war. These men probably would have also served for much longer than the volunteers of ranks 1-4 (many of whom joined up mid-way through the war), so may have been more likely to be involved in a higher number of deadly battles and hence have more chance of being killed.
Is there any interaction between age, rank and probability of dying during the war?
First I’ll look at the three age fields together – age at enlistment, embarkation and death, dropping the records that don’t have both an age of enlistment and embarkation.
rank_ages = pd.pivot_table(merged_embarkations.dropna(subset=['enlistment_ww1_ageAtEnlistment',
'embarkation_ageAtEmbarkation'
])
,index='rank_order', values=['enlistment_ww1_ageAtEnlistment',
'embarkation_ageAtEmbarkation',
'ageAtDeath'],
aggfunc=np.median, columns="diedDuringWar")
rank_ages
rcParams['figure.figsize'] = 12,8
_ = rank_ages.plot(kind="line", color=['green','green','gray','gray','black','black'], style=['-','--','-','--','-','--'])
_ = plt.ylabel("Median Age")
_ = plt.xlabel("Rank Level")
_ = plt.legend(loc=2,title="Age at..., Died During War")
This clearly shows a pattern of age increasing with rank. The ages are fairly even up to rank 4 (Sergeant), then increase at rank 5 (Second Lieutenant) and decrease for rank 6 (Lieutenant). This decrease is unexpected and may indicate that there is something unusual about this rank that I don’t understand (one possibility is that outstanding leaders could have been promoted straight to this rank, short-cutting the lower ranks). As expected, the ages then climb steeply as the rank gets higher.
The dashed lines on this graph indicate the median age of men who died during the war. These are generally higher (with the exception of rank level 3 – Corporal) than for the men who didn’t die. This is a strong indication that rank does have an influence on the ages of men who died.
Looking at just the age at embarkation in more detail:
plt.figure(figsize=(12,8))
_ = sns.pointplot(x="rank_order", y="embarkation_ageAtEmbarkation", hue="diedDuringWar",
data=merged_embarkations.dropna(subset=['embarkation_ageAtEmbarkation']), linestyles=['-','--'], ci=None)
This pattern is slightly different to the plot above, probably reflecting the increase in amount of data (this data includes records that don’t have an enlistment age).
The main trend for men who died during the war is that they were generally older the higher their rank. However in ranks 5 and 6 (Second Lieutenant and Lieutenant) the men who lived were actually younger than those who died. These are the lowest of the commissioned ranks so many of these younger men may have been very inexperienced and given less responsibility, which kept them out of the main action.
We have seen that rank is a strong influence on age and that men in higher ranks were slightly less likely to die. To answer the question of whether younger men were more likely to die because they were more reckless, my overall judgment is that rank does explain some, but not all, of the effect of age on death rate. It might be somewhat true that younger men were more reckless, but it is also somewhat true that they were generally lower ranked and hence more likely to be involved in conflict.
Date of embarkation and survival rate
What effect did the date of embarkation have on chances of surviving the war?
Another reason for the difference between the ages of those who died and didn’t die might be that the men who enlisted earlier in the war were more likely to have died (simply because they were at war longer). I can test this by looking at boxplots of the years that the men were born (rather than their age).
sns.color_palette("deep")
sns.set(style="darkgrid")
_ = sns.boxplot(x=df['birthDate'].dt.year, y=df["diedDuringWar"], orient='h')
_ = plt.ylabel('Died During War')
_ = plt.xlabel('Year of Birth')
_ = plt.xlim(1840, 1905)
_ = plt.title("Year of birth of those who survived the war and those who died", fontsize=15)
These distributions are almost identical apart from a few outliers (representing much older men who didn’t die).
We can get an idea of the effect of enlisting early on the death rate by comparing the distributions of numbers of embarkations by date for the men who died and didn’t die during the war.
embarkation_dates = merged_embarkations[merged_embarkations['embarkation_embarkationDate'].dt.year >= 1914]['embarkation_embarkationDate'].dropna()
merged_embarkations.loc[:,'days_from_1914'] = (embarkation_dates - pd.to_datetime("1914-01-01")).dt.days
g = sns.FacetGrid(merged_embarkations, col="diedDuringWar", height=6, sharey=False)  # 'size' was renamed to 'height' in newer seaborn
_ = g.map(plt.hist, "days_from_1914", bins=30)
The distribution for the men who didn’t die reflects the embarkation distribution as a whole. Many men enlisted almost straight away and were sent away very early on in the war. This dropped off, then picked up again around day 1000 (counted from 1 January 1914), which corresponds to late 1916, just after conscription was introduced.
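The day counts on these histograms are measured from 1 January 1914 rather than from the outbreak of war, so it helps to convert a day number back to a calendar date. A small worked example using only the standard library (the 4 August 1914 date is Britain's declaration of war on Germany):

```python
from datetime import date, timedelta

origin = date(1914, 1, 1)        # the zero point used for days_from_1914
war_declared = date(1914, 8, 4)  # Britain's declaration of war on Germany

day_1000 = origin + timedelta(days=1000)
days_into_war = (day_1000 - war_declared).days

print(day_1000)       # 1916-09-27
print(days_into_war)  # 785
```

So day 1000 on the x-axis falls in late September 1916, a little over two years into the war.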
The distribution for men who did die is much more right skewed. More of the men who died enlisted in the first half than the second half of the war.
Next I’ll look at death rates by date of embarkation by grouping the embarkation date and taking the mean of the “died” field.
embarkation_series = merged_embarkations[['died']].groupby(merged_embarkations['embarkation_embarkationDate']).mean()
embarkation_series.head()
Those first two dates are way before the war started, so I’ll delete them.
embarkation_series.drop(pd.Timestamp('1899-10-21 00:00:00'), inplace=True)
embarkation_series.drop(pd.Timestamp('1912-08-01 00:00:00'), inplace=True)
Resample into quarters to create a time series.
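# The 'how' argument was removed from resample(); recent pandas releases also prefer the alias 'QE' over 'Q'
quarterly_embarkations = embarkation_series.resample('Q').mean()
rcParams['figure.figsize'] = 15, 8
_ = plt.plot(quarterly_embarkations)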
Death rates for those who embarked up until December 1916 were high, though variable. The middle of 1915 was an especially bad time to embark (astoundingly, over half died), but those who embarked towards the end of that year had a much lower death rate (20%). From December 1916 onwards the death rates started lowering. Fewer than 10% of those who embarked between mid 1917 and the start of 1918 died. The peak in mid 1918 is surprising. By this time the war was coming to an end (the war ended 11 November 1918). Maybe these “fresh” troops were used extensively in the final push through France, replacing the troops who had already spent many months in the trenches.
So men who enlisted earlier were generally more likely to die than those who enlisted later, though the relationship between time served and death rate is not as clear cut as I would have anticipated. It seems that more factors were involved than just an increasing probability of being killed with each passing day.
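One caveat about the quarterly series: averaging the per-date rates and then averaging those within each quarter gives every sailing date equal weight, however many men were aboard. The toy sketch below (invented dates and outcomes) contrasts that with a rate computed over individuals. Grouping on `dt.to_period("Q")` is an alternative to `resample` chosen here for the illustration, not the method used above.

```python
import pandas as pd

# Toy embarkations: one big sailing and one tiny sailing in the same quarter.
toy = pd.DataFrame(
    {
        "embarkation_embarkationDate": pd.to_datetime(
            ["1915-04-17"] * 8 + ["1915-06-12"] * 2 + ["1915-08-14"] * 4
        ),
        "died": [1, 1, 0, 0, 0, 0, 0, 0] + [1, 1] + [1, 0, 0, 0],
    }
)

# Mean of per-date means: every sailing date counts equally.
per_date = toy.groupby("embarkation_embarkationDate")["died"].mean()
unweighted = per_date.groupby(per_date.index.to_period("Q")).mean()

# Mean over individuals: every person counts equally.
quarter = toy["embarkation_embarkationDate"].dt.to_period("Q")
weighted = toy.groupby(quarter)["died"].mean()

print(unweighted)
print(weighted)
```

In the second quarter of 1915 the two approaches disagree (0.625 against 0.4) because the two-man sailing with a 100% death rate counts as much as the eight-man sailing in the first calculation. Very small sailings could produce spikes of this kind in a real series.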
Were younger men more reckless than older men?
If younger men were more reckless than older men we should see a pattern of younger men dying sooner after joining up.
To test whether this is the case I’ll create a field that records the number of days between first embarkation and date of death.
deaths_df = df[df['diedDuringWar']]
embarkation_death = deaths_df.merge(first_embarkations, left_index=True, right_on="index")[['index','embarkation_embarkationDate',
'dateOfDeath','ageAtDeath',
'enlistment_ww1_ageAtEnlistment',
'embarkation_ageAtEmbarkation']]
embarkation_death['embarkationToDeath'] = (embarkation_death['dateOfDeath'] - embarkation_death['embarkation_embarkationDate']).dt.days
Examining the field to test for the validity of the data:
embarkation_death['embarkationToDeath'].describe()
The minimum value of -300 indicates there are some data errors here. How many negative values are there?
len(embarkation_death[embarkation_death['embarkationToDeath']<0])
Only 3 records have a negative number of days until death, so I’ll proceed but set those values to NaN.
embarkation_death.loc[embarkation_death['embarkationToDeath'] < 0, 'embarkationToDeath'] = np.nan
First I’ll look at the overall distribution of days from embarkation to death.
_ = embarkation_death['embarkationToDeath'].plot(kind="hist", bins=50, color="gray")
_ = plt.xlabel('Number of days from Embarkation to Death')
_ = plt.title("Distribution of number of days from embarkation to death", fontsize=20)
This is a right skewed distribution with a long right tail. The median of 333 days indicates that half of the troops in this data died within about 11 months of leaving New Zealand. However a quarter of those who died survived longer than 19 months (600 days). A handful survived longer than 4 years. The shape of this plot suggests this data could be a good fit for a gamma distribution.
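As a rough illustration of how the gamma suggestion could be followed up, the sketch below estimates gamma parameters by the method of moments (shape = mean² / variance, scale = variance / mean). The sample is synthetic, drawn with an arbitrarily chosen shape of 1.5 and scale of 300 days; it is not the real embarkationToDeath data. A maximum-likelihood fit such as `scipy.stats.gamma.fit` would be the more usual choice.

```python
import random
import statistics


def gamma_moments(values):
    """Method-of-moments estimates (shape, scale) for a gamma distribution."""
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values, mu=mean)
    return mean ** 2 / variance, variance / mean


# Synthetic stand-in for the days-to-death field (NOT the real data).
rng = random.Random(0)
sample = [rng.gammavariate(1.5, 300) for _ in range(5000)]

shape, scale = gamma_moments(sample)
print(round(shape, 2), round(scale, 1))
```

With the real field, the fitted density could be overlaid on the histogram to judge by eye whether the gamma shape is a reasonable description.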
Next I’ll plot the relationship between age at death and number of days from embarkation until death by using a scatter plot.
_ = plt.figure(figsize=(12,8))
_ = plt.scatter(embarkation_death['ageAtDeath'], embarkation_death['embarkationToDeath'], color="black", alpha=0.5)
_ = plt.xlabel('Age at Death')
_ = plt.ylabel('Number of days from Embarkation to Death')
_ = plt.xlim(14,70)
_ = plt.ylim(0,2000)
Visually, there does look to be a general increasing trend of number of days to death with age, but this is probably related to the increasing minimum ages, which is actually just a consequence of the troops getting older the longer they have been at war. Plotting days to death against age at enlistment would show a clearer result.
In fact I’ll plot it against age at embarkation since there are more results for this field and the first embarkation date would be quite closely related to enlistment date. This would exclude deaths that happened before embarkation, but those deaths would not have been related to battle so I feel there is justification in excluding them.
First I’ll plot age at enlistment against age at embarkation to test my assumption that there is a direct relationship between these two fields.
_ = plt.scatter(merged_embarkations['enlistment_ww1_ageAtEnlistment'], merged_embarkations['embarkation_ageAtEmbarkation'], color="black", alpha=0.1)
_ = plt.xlabel("Age at Enlistment")
_ = plt.ylabel("Age at First Embarkation")
_ = plt.title("Relationship between age at enlistment and first embarkation", fontsize=15)
As expected there is a very strong relationship here, with only a few outliers (some of which are obviously bad data since it is unlikely embarkation age would be younger than enlistment age).
_ = plt.figure(figsize=(12,8))
_ = plt.scatter(embarkation_death['embarkation_ageAtEmbarkation'], embarkation_death['embarkationToDeath'], color="black", alpha=0.5)
_ = plt.ylabel('Number of days from Embarkation to Death')
_ = plt.xlabel('Age at Embarkation')
_ = plt.ylim(0,1800)
_ = plt.xlim(14,60)
_ = plt.title("Relationship between age at embarkation and number of days from embarkation to death", fontsize=15)
There doesn’t seem to be any strong relationship between these two fields. There is certainly a high concentration of younger soldiers who were killed not long after their embarkation date, but this reflects the right skewness of both the ages and days to death data.
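The visual impression could be backed up with a number. Because both fields are right skewed, a rank correlation (Spearman's rho) is a reasonable choice, and it can be computed as the ordinary Pearson correlation of the ranks. The sketch below uses a handful of invented values to show the calculation, not the real data.

```python
import numpy as np
import pandas as pd

# Invented values; rows with a missing age are dropped before ranking.
toy = pd.DataFrame(
    {
        "embarkation_ageAtEmbarkation": [19, 21, 23, 26, 30, 35, np.nan],
        "embarkationToDeath": [400, 90, 650, 120, 300, 510, 200],
    }
).dropna()

# Spearman's rho is the Pearson correlation of the ranks.
rho = toy["embarkation_ageAtEmbarkation"].rank().corr(toy["embarkationToDeath"].rank())
print(round(rho, 3))
```

A value near zero on the real fields would support the reading that there is no strong monotonic relationship between age at embarkation and days to death.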
An improvement to this would be to only include deaths that occurred as a result of the fighting (excluding disease and sickness related deaths for example). I’ll come back to this later when I examine cause of death.
Place of death
What patterns are there in the places where deaths occurred?
First I need to organise the place of death field into manageable categories.
df['placeOfDeath'].value_counts().head(10)
The field is generally presented as a series of localities separated by “/”, from more specific to more general (town then country, for example). Unfortunately there look to be some inconsistencies and generalisations (for instance just “France” with no more specific information for a large number of records).
I’ll split the field into the most general (“region”, the last item) and the next most general (“place”, the second-to-last item) fields.
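def last_item(alist):
    # Missing values (NaN) are not lists, so pass them through unchanged
    if not isinstance(alist, list):
        return alist
    if len(alist) > 0:
        # pop() removes the item, so a second call returns the next-to-last one
        return alist.pop()
    else:
        return np.nan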
places = df['placeOfDeath'].str.split('/')
regions = places.apply(last_item)
place_names = places.apply(last_item)
df['placeOfDeath_region'] = regions
df['placeOfDeath_place'] = place_names
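The helper above relies on `pop()` mutating each list, so the order of the two `apply` calls matters. For comparison, here is a sketch of a non-mutating alternative using the `.str` accessor on the split lists. The sample values are made up to match the “place/region” shape described above.

```python
import numpy as np
import pandas as pd

# Made-up examples: a specific place, a bare region and a missing value.
sample = pd.Series(["Ypres/Belgium", "France", np.nan])
parts = sample.str.split("/")

region = parts.str[-1]  # last item: the most general locality
place = parts.str[-2]   # second-to-last item, NaN when there is none

print(pd.DataFrame({"place": place, "region": region}))
```

Because nothing is mutated, the two lines can be run in either order and re-run safely.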
Set an “Unspecified” value in the place field where placeOfDeath only gives a general region.
df.loc[df[df['placeOfDeath_region'].notnull() & df['placeOfDeath_place'].isnull()].index,'placeOfDeath_place'] = "Unspecified"
A categorical column will allow the places to be sorted by country order instead of name.
place_order = df[['placeOfDeath_place','placeOfDeath_region']].sort_values(by='placeOfDeath_region')['placeOfDeath_place'].drop_duplicates().dropna()
df['placeOfDeath_place'] = pd.Categorical(df['placeOfDeath_place'], categories=place_order, ordered=False)
Find places that have a death count of more than 100, display in a table and plot.
place_counts = df[['placeOfDeath_region','placeOfDeath_place','diedDuringWar']].groupby(['placeOfDeath_region','placeOfDeath_place']).count()
place_counts[place_counts['diedDuringWar'] > 100]
_ = plt.figure(figsize=(12,8))
_ = place_counts[place_counts>100].dropna().unstack().plot(kind="bar", stacked=True,
color=['gray','#4C72B0',
'#55A868','#c44e52',
'#8172b2','#ccb974','#64b5cd','#4a5468','#4c7054',
'#895d5e',
'#421cc0'])
_ = plt.title("Death counts by region", fontsize=20)
Overall France had by far the highest number of deaths, followed by Belgium and Turkey. The three specific places that stand out as having a high number of deaths are Ypres, Northern France and Gallipoli. These correspond with the three major campaigns of the war – Gallipoli, Flanders Fields and The Somme. It is perhaps surprising that a large number of deaths occurred in New Zealand (and in particular Wellington). The main training camp of Trentham was situated in Upper Hutt, so Wellington has probably been given as the location of the deaths that occurred there. I will explore this further later.
When and where did the most deaths occur?
I’ll use a strip plot to plot death by date in places where 100 or more deaths occurred.
war_deaths = df.reset_index()[df.reset_index()['diedDuringWar']][['dateOfDeath',
'placeOfDeath_region',
'placeOfDeath_place',
'causeOfDeath',
'ageAtDeath','index']].dropna(how="all")
war_deaths['placeOfDeath_full'] = war_deaths['placeOfDeath_place'].astype("string").str.cat(war_deaths['placeOfDeath_region'], sep=", ")
war_deaths.sort_values(by="placeOfDeath_region", inplace=True)
most_death_places = war_deaths['placeOfDeath_full'].value_counts()[war_deaths['placeOfDeath_full'].value_counts() > 100].index.values
war_deaths_subset = war_deaths[war_deaths['placeOfDeath_full'].isin(most_death_places)]
_ = plt.figure(figsize=(12,8))
_ = sns.stripplot(x=war_deaths_subset['dateOfDeath'], y=war_deaths_subset['placeOfDeath_full'], jitter=True, linewidth=0,
size=5, alpha=.03, color="black")
_ = plt.title('Date spread of deaths by place', fontsize=20)
This plot clearly shows the pattern of deaths according to the place they occurred, with darker regions indicating times of higher concentrations of deaths. Long and deadly campaigns like Gallipoli (April 1915 – January 1916) and the Battle of the Somme (July to November 1916) can be clearly identified. The darkest splotches highlight the most intense conflicts and their length – the Battle of Messines (Ypres and Belgium, mid 1917), the Battle of Passchendaele (October 1917), the Battle of Bapaume (August 1918) and the short but intense battles that liberated the French towns of Le Cateau, Le Quesnoy and Havrincourt. The pattern for France (with no specific place) shows almost continuous deaths from early in 1916, gradually dropping off, then heightening as the push was made through France in 1918. The deaths in New Zealand and Wellington show a gradual trickle becoming slightly more concentrated from the start to the finish of the war, with a short burst late in 1918, which may relate to a disease outbreak. The deaths in the UK show several more concentrated patches that may also be disease outbreaks.
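The dark patches can also be put into numbers by counting deaths per place per calendar month. This is only a sketch on invented rows; the column names follow the notebook, but the dates and places are made up for illustration.

```python
import pandas as pd

# Toy records; values are invented
toy = pd.DataFrame({
    "dateOfDeath": pd.to_datetime(
        ["1917-10-04", "1917-10-12", "1917-10-12", "1916-09-15", "1916-09-27"]),
    "placeOfDeath_full": ["Ypres, Belgium"] * 3 + ["Somme, France"] * 2,
})

# Count deaths per place per calendar month, largest first
monthly = (toy
           .assign(month=toy["dateOfDeath"].dt.to_period("M"))
           .groupby(["placeOfDeath_full", "month"])
           .size()
           .sort_values(ascending=False))

# The top row corresponds to the darkest patch on a strip plot
deadliest_place, deadliest_month = monthly.index[0]
```

Applied to `war_deaths_subset`, the first few rows of `monthly` would list the place and month combinations with the most deaths.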
Cause of death
What is the distribution of cause of death?
A quick look at the Cause of Death field:
war_deaths['causeOfDeath'].value_counts()
The data in this field is very consistently entered.
I’m not sure about some of these categorisations though. Why is there a distinction between disease and sickness? What does Killed on Active Service mean if it’s not being killed in action or accidentally? What are natural causes if it’s not sickness or disease?
Let’s look at a plot of the counts of the different causes of death.
_ = plt.figure(figsize=(12,8))
_ = plt.xticks(rotation=90)
_ = sns.countplot(y="causeOfDeath", data=df, color="Gray")
_ = plt.title("Counts of death by cause", fontsize=20)
Unsurprisingly, Killed in Action was by far the most common way to die, followed by Died of wounds. However, a surprisingly large number also died from sickness and disease, as well as from accidents.
I’ll filter by the most common cause of death values in order to produce a cleaner plot against place of death.
# Assign the result back rather than using inplace=True on a selected column
war_deaths['causeOfDeath'] = war_deaths['causeOfDeath'].replace('Died after Discharge from Wounds Inflicted or Disease Contracted while on Active Service','Died after Discharge')
cause_subset = war_deaths['causeOfDeath'].value_counts().head(6).index.values
_ = plt.figure(figsize=(12,8))
_ = plt.xticks(rotation=90)
_ = sns.countplot(x="placeOfDeath_region", hue="causeOfDeath", data=war_deaths[war_deaths['causeOfDeath'].isin(cause_subset)],
order=['New Zealand','England','United Kingdom','Turkey','France','Belgium','Egypt','Asia'], palette="deep")
_ = plt.title("Cause of death by region", fontsize=20)
Turkey, France and Belgium’s deaths were primarily from being killed in action, though a significant number of deaths from disease occurred in France. Egypt and Asia show up as areas of conflict, but with less involvement from New Zealand troops than the main theatres of the war. In fact more died from sickness in Egypt than from being killed in action. New Zealand shows a reasonably large number of deaths from both sickness and disease. A significant number of accidental deaths also occurred in New Zealand – more than England, in fact. Presumably these were accidents in training, though this is surprising as the men would have spent more time training (and on leave) in England than in New Zealand.
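Comparing regions by raw counts is hard when the totals differ so much, so a table of within-region proportions can help. The sketch below uses `pd.crosstab` with `normalize="index"` on invented rows; the cause labels are illustrative and may not match the exact category names in the data.

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "placeOfDeath_region": ["France", "France", "France",
                            "Egypt", "Egypt", "New Zealand"],
    "causeOfDeath": ["Killed in Action", "Killed in Action", "Died of Disease",
                     "Died of Sickness", "Killed in Action", "Died of Sickness"],
})

# Each row sums to 1: the share of each cause within a region
shares = pd.crosstab(toy["placeOfDeath_region"], toy["causeOfDeath"],
                     normalize="index")
```

With the real data, `war_deaths` would take the place of `toy`.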
Now I’ll categorise the “Died of wounds” and “Killed in Action” deaths together and compare the deaths caused by battle to non-battle related deaths on a date strip plot, grouped by place of death.
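# Work on a copy so the new column is not assigned to a slice of war_deaths
war_deaths_subset = war_deaths_subset.copy()
war_deaths_subset['battle_death'] = war_deaths_subset['causeOfDeath'].isin(['Died of wounds','Killed in Action'])
# catplot is the current seaborn name for factorplot; height sets the facet size
_ = sns.catplot(y="placeOfDeath_full", x="dateOfDeath", col="battle_death", data=war_deaths_subset, kind="strip",
jitter=True, linewidth=0, height=6, alpha=.05, color="black")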
Non-battle-related deaths (most of these were from sickness and disease) were a constant background to the war, especially in the battle zones of France and Belgium (though not to the same extent in Northern France and Gallipoli). The stream of deaths in England isn’t too surprising, as men who became sick on the battlefields would often have been shipped back to the English hospitals. The small number of deaths from wounds in New Zealand would also have been men who died, after returning to New Zealand, from wounds inflicted in battle. It’s surprising to see such a steady stream of non-battle deaths in New Zealand.
Disease outbreaks can be seen on the plot on the left as dark patches. These are noticeable in late 1918 in France, the UK and shortly after in New Zealand, and would have been related to the Spanish flu epidemic (the troop staging camp at Etaples in France has been proposed as a possible origin of the 1918 flu pandemic, according to the Wikipedia page: https://en.wikipedia.org/wiki/1918_flu_pandemic#Hypotheses_about_source_).
I’m unsure of what the dark patch in early 1918 in the UK signifies. The flu didn’t reach Britain until May 1918, so this must have been an outbreak of another disease (unfortunately the medical notes for these deaths don’t give any more information than “Died of Disease”).
Deaths in New Zealand
What caused New Zealand deaths?
I’ll look more closely at the New Zealand deaths.
nz_deaths = war_deaths[war_deaths['placeOfDeath_region']=='New Zealand'][['dateOfDeath','causeOfDeath','placeOfDeath_full','placeOfDeath_place']]
_= plt.figure(figsize=(12,8))
_= plt.xticks(rotation=90)
_= sns.countplot(y="causeOfDeath", hue="placeOfDeath_full", data=nz_deaths[nz_deaths['causeOfDeath'].isin(cause_subset)],
hue_order=['Unspecified, New Zealand','Auckland, New Zealand','Wellington, New Zealand',
'Wellington Region, New Zealand', 'Canterbury, New Zealand', 'Otago, New Zealand',
'Manawatu-Wanganui, New Zealand'], palette="deep")
_= plt.title("Counts of places of death in New Zealand by cause", fontsize=20)
The large number of deaths from sickness in Wellington (as opposed to disease in other parts of the country) stands out in this plot. A large number of these probably occurred at the Trentham training camp, so maybe this is due to a difference in classification of deaths between the army and general medical practitioners.
nz_deaths_subset = nz_deaths[nz_deaths['causeOfDeath'].isin(cause_subset)]
_ = sns.stripplot(x=nz_deaths_subset['dateOfDeath'], y=nz_deaths_subset['causeOfDeath'], jitter=True,
linewidth=0, size=5, alpha=.5, color="black")
_ = plt.title("Date spread of New Zealand deaths by cause of death", fontsize=20)
Disease has a fairly even spread across the dates with a concentration in late 1918. Sickness has several short bursts of deaths, also with a large concentration in late 1918. Accidental Death and Died after Discharge also show a concentration at this time. It seems likely that this is related to a disease epidemic, but it’s strange that accidental deaths should have increased during an epidemic. Is there another reason for this sudden jump in death numbers?
The medical notes may provide more clues.
medical_df = pd.read_csv('datasets/ww1_medical.csv', index_col=0)
medical_df.head()
The medical notes record lots of different information, but for now I am interested in the information they give about cause of death.
medical_df[medical_df['medical'].str.contains('Cause')]['medical'].value_counts()
For each cause of death there is a corresponding medical note that gives some more information about the cause.
medical_df[medical_df['medical'].str.contains('Cause')][['medical','medicalNotes']].dropna().head(8)
In some cases no extra information is given, but others can be very specific. This is a “free text” field, so some further work will be needed to extract analysable information from it.
I’ll start with the Sickness and Disease categories to get a better idea of what kinds of illnesses soldiers died from and whether these varied with the place they died.
sickness_notes = medical_df[medical_df['medical'].str.contains('Died of Sickness/Cause of Death')]['medicalNotes'].dropna()
# split each note by word, then flatten it into a list
nested = [item.split(' ') for item in sickness_notes.str.lower().str.replace(r'[^\w ]', '', regex=True).values]
sickness_words = [item for sublist in nested for item in sublist]
filter_words = ['illness','disease','sickness','died','of','from','and','following','followed','an','at','the','after','whilst',
'his','ill','acute','discharge','or','to',
'a','had','have','auckland','hospital','food','on','nzef','by']
sickness_words = pd.Series(sickness_words)
# filter out some common words
word_counts = sickness_words[~sickness_words.isin(filter_words)].value_counts()
word_counts[word_counts > 3].index.values
disease_notes = medical_df[medical_df['medical'].str.contains('Died of Disease/Cause of Death')]['medicalNotes'].dropna()
# split each note by word, then flatten it into a list
nested = [item.split(' ') for item in disease_notes.str.lower().str.replace(r'[^\w ]', '', regex=True).values]
disease_words = [item for sublist in nested for item in sublist]
filter_words = ['illness','disease','sickness','died','of','from','and','following','followed','an','at','the','after','whilst',
'his','ill','acute','discharge','or','to',
'a','had','have','auckland','hospital','food','on','nzef','by','spanish']
disease_words = pd.Series(disease_words)
# filter out some common words
word_counts = disease_words[~disease_words.isin(filter_words)].value_counts()
word_counts[word_counts > 3].index.values
Immediately I can see that Died of Disease is more to do with medical conditions such as heart failure and cancer and not usually related to infectious disease (which is actually categorised as Sickness) – although there is some overlap, for instance pneumonia has been listed as both sickness and disease.
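The word counting used for both categories could also be wrapped in one small helper so the same steps are not written out twice. This is a sketch using only the standard library; the helper name `count_words`, the example notes and the short stop-word set are all invented for illustration.

```python
import re
from collections import Counter

# Invented example notes, not taken from the data
notes = ["Died of influenza and pneumonia.",
         "Pneumonia following measles",
         "Heart failure"]
stop_words = {"died", "of", "and", "following"}

def count_words(texts, stop_words):
    """Count lower-cased words across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"\w+", text.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts

word_counts = count_words(notes, stop_words)
```

Calling `count_words(sickness_notes, set(filter_words))` and then the same for `disease_notes` would give two counters that can be compared directly.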
Some of the notes contain multiple conditions, but I’m just going to find the first that matches the list. I’ll order the list so that the actual sicknesses come first and the complications come later (influenza will be matched before pneumonia for example).
(I would like to be able to analyse multiple matches, but that would add complications that I don’t have time to work through.)
sicknesses = ['influenza', 'measles', 'enteric', 'typhoid', 'tuberculosis',
'malaria','pleurisy','pneumonia','broncho','dysentery','bronchial',
'peritonitis','jaundice','poisoning','fever', 'heart failure','cancer','meningitis','stroke','coronary','carcinoma','cardiac','appendicitis',
'bronchopneumonia','haemorrhage','myocarditis','leukemia']
merged_medical = war_deaths.merge(medical_df, on="index")
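def sickness_match(notes):
    # Missing notes have no condition to match
    if pd.isnull(notes):
        return np.nan
    # Return the first condition in the ordered list that appears in the note
    lowered = notes.lower()
    for word in sicknesses:
        if word in lowered:
            return word
    return np.nan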
merged_medical['sickness'] = merged_medical['medicalNotes'].apply(sickness_match)
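The multiple-match analysis mentioned above might be less complicated than it sounds if each note is allowed to return a list of conditions and the lists are then expanded to one row per condition with `DataFrame.explode`. This is only a sketch on invented notes; `all_matches`, the short `conditions` list and the toy frame are hypothetical.

```python
import pandas as pd

# Shortened, illustrative list of conditions
conditions = ["influenza", "measles", "pneumonia"]

# Invented notes
toy = pd.DataFrame({"medicalNotes": ["Influenza followed by pneumonia",
                                     "Measles", None, "Drowned"]})

def all_matches(notes):
    """Return every listed condition found in the note (possibly none)."""
    if not isinstance(notes, str):
        return []
    lowered = notes.lower()
    return [word for word in conditions if word in lowered]

toy["conditions"] = toy["medicalNotes"].apply(all_matches)

# One row per (death, condition); notes with no match drop out
long_form = toy.explode("conditions").dropna(subset=["conditions"])
condition_counts = long_form["conditions"].value_counts()
```

In the long form a death with two conditions is counted twice, so these counts describe mentions of a condition rather than numbers of deaths.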
Now I’ll plot these as a strip plot to see if there are any patterns in when these occurred by date, and compare the New Zealand deaths against the deaths that occurred overseas.
merged_medical['nz_death'] = merged_medical['placeOfDeath_region']=="New Zealand"
# catplot is the current seaborn name for factorplot; height sets the facet size
_ = sns.catplot(y="sickness", x="dateOfDeath", col="nz_death", data=merged_medical, kind="strip",
linewidth=0, height=6, alpha=.1, color="black")
Although this is only a sample of the diseases that killed the soldiers (not all deaths had a specific cause recorded in the medical notes), we can see some patterns here. The influenza epidemic lasted much longer overseas than it did in New Zealand, with a concentration of influenza and pneumonia deaths from mid 1918 to early 1919, which signals the Spanish ‘flu epidemic (pneumonia was a common complication of influenza). In the New Zealand data the epidemic is very prominent and marks a sudden, severe but short outbreak in late 1918. Influenza was nearly absent from the New Zealand data before this time, which suggests that the disease was carried to New Zealand on the returning troop ships. Influenza had long been a common disease in the country, but this particular strain was remarkable in that it was deadliest to young adults, which helps explain why so many of the returning troops fell victim to it.
Overseas, pneumonia was a regular killer right through the war, with several periods of increased cases. These could be related to periods when conditions in the trenches were worst (wet and cold), though the last upsurge was undoubtedly due to the influenza epidemic. Enteric fever (another name for typhoid and paratyphoid fever) and dysentery were common killers during the second half of 1915 (when the Gallipoli campaign was in progress), but the numbers were greatly reduced for the rest of the war. Typhoid spreads through contaminated food and water and can be carried by flies, which were a notorious nuisance in the trenches of Gallipoli.
In New Zealand a measles outbreak can be seen in mid 1915 (this is what killed my grandfather’s cousin, who died at Trentham in June 1915). Meningitis was a regular killer through the war both in New Zealand and overseas, its spread related to the close quarters of the troops’ living conditions.
Why were there increased numbers of accidental deaths in New Zealand during the influenza epidemic?
Looking at the specific causes of accidental deaths for dates of death during late 1918 could answer this question.
nz_medical = merged_medical[merged_medical['nz_death']]
nz_medical[(nz_medical['medical'] == 'Accidental Death/Cause of Death') &
nz_medical['dateOfDeath'].between('1918-08-01','1918-12-01')]['medicalNotes'].drop_duplicates()
There is in fact only one unique entry in the medical notes field for this cause of death and time period. I’ll access an individual value to expand the string out.
nz_medical[(nz_medical['medical'] == 'Accidental Death/Cause of Death') &
nz_medical['dateOfDeath'].between('1918-08-01','1918-12-01')]['medicalNotes'].loc[13291]
Oddly it seems that “accidental” cause of death actually includes disease contracted during training. This definitely explains the jump in “accidental” deaths during this period. A large number of deaths under this category are probably related to influenza. This indicates that the true number of influenza deaths is under-represented in the data.
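One way to estimate how many influenza deaths are hidden under the accidental category would be to flag accidental deaths whose notes mention influenza or pneumonia. The sketch below runs on invented rows; the note texts are made up, and the choice of search terms is an assumption about how such deaths were described.

```python
import pandas as pd

# Invented rows; the note texts are not taken from the data
toy = pd.DataFrame({
    "medical": ["Accidental Death/Cause of Death"] * 3
               + ["Died of Sickness/Cause of Death"],
    "medicalNotes": ["Influenza contracted in camp", "Drowned", None, "Influenza"],
})

is_accidental = toy["medical"] == "Accidental Death/Cause of Death"
# na=False treats missing notes as not mentioning the illness
mentions_flu = toy["medicalNotes"].str.contains(
    "influenza|pneumonia", case=False, na=False)

hidden_flu = toy[is_accidental & mentions_flu]
```

Notes that give no specific illness would still be missed, so any count found this way is a lower bound.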
Embarkation age and days to death
Is age at embarkation related to the number of days until death (revisited)?
Now that I have the Cause of Death data accessible I will re-plot the age at embarkation against number of days until death but using only deaths with cause of death as “Killed in Action” or “Died of wounds”. In this way I can better answer the question of whether younger soldiers were more likely to be killed in battle sooner.
embarkation_cause = war_deaths.merge(embarkation_death, on="index")[['index',
'embarkation_embarkationDate',
'dateOfDeath_x','dateOfDeath_y',
'embarkation_ageAtEmbarkation',
'causeOfDeath','embarkationToDeath']].dropna()
battle_deaths = embarkation_cause[(embarkation_cause['causeOfDeath'] == 'Killed in Action')
| (embarkation_cause['causeOfDeath'] == 'Died of wounds')]
g=sns.FacetGrid(battle_deaths, hue="causeOfDeath", height=6, aspect=1.2, palette={'Killed in Action':'red','Died of wounds':'blue'})
g.map(plt.scatter, "embarkation_ageAtEmbarkation","embarkationToDeath", s=20, edgecolor="w", alpha=0.6).add_legend()
_ = plt.ylabel('Number of days from Embarkation to Death')
_ = plt.xlabel('Age at Embarkation')
_ = plt.ylim(0,1600)
_ = plt.xlim(14,60)
_ = plt.title("Relationship between age at embarkation and number of days from embarkation to death", fontsize=15)
Visually this doesn’t give any improved evidence of a relationship between age at embarkation and days until death but I will run a regression model to make sure.
import statsmodels.formula.api as smf
the_data = battle_deaths[['embarkation_ageAtEmbarkation','embarkationToDeath']]
#generate x-axis values spanning the observed range of ages at embarkation
x = pd.DataFrame({'embarkation_ageAtEmbarkation': np.linspace(the_data['embarkation_ageAtEmbarkation'].min(), the_data['embarkation_ageAtEmbarkation'].max(), len(the_data['embarkation_ageAtEmbarkation']))})
mod = smf.ols(formula='embarkationToDeath ~ 1 + embarkation_ageAtEmbarkation',
data=the_data.dropna()).fit()
#plot the actual data
plt.scatter(the_data['embarkation_ageAtEmbarkation'], the_data['embarkationToDeath'], s=20, alpha=0.6, color="black")
plt.xlabel('Age at Embarkation'); plt.ylabel('Days from Embarkation to Death')
#render the regression line by predicting the ys using the generated model from above
plt.plot(x.embarkation_ageAtEmbarkation, mod.predict(x), 'b-', label='Linear $R^2$=%.2f' % mod.rsquared, alpha=0.9)
#give the figure a meaningful legend
plt.legend(loc='upper left', framealpha=0.5, prop={'size':'small'})
plt.title("Relationship between age and days to death", fontsize=20)
mod.summary()
A very small R-squared and a negative slope show that there is certainly no increasing relationship between Age at Embarkation and Days from Embarkation to Death.
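To show what a very small R-squared means, the sketch below fits a straight line to synthetic data in which days to death is generated independently of age, then computes R-squared from the residuals. The data are random numbers, not the war records, and `np.polyfit` stands in for the statsmodels fit used above.

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(18, 45, size=500)
days = rng.exponential(300, size=500)  # independent of age by construction

# Least-squares line: days = intercept + slope * age
slope, intercept = np.polyfit(age, days, deg=1)
predicted = intercept + slope * age

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((days - predicted) ** 2)
ss_tot = np.sum((days - days.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# For a single predictor, R-squared equals the squared correlation
r = np.corrcoef(age, days)[0, 1]
```

An R-squared near zero says the fitted line explains almost none of the variation in days to death, whichever direction the slope happens to take.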
However, since the data is right skewed, a log transformation will help to make the data more symmetrical, which will help to reveal any relationships.
_ = plt.scatter(battle_deaths['embarkation_ageAtEmbarkation'].apply(np.log), battle_deaths['embarkationToDeath'].apply(np.log), s=20, alpha=0.6, color="black")
_ = plt.ylabel('log(Number of days from Embarkation to Death)')
_ = plt.xlabel('log(Age at Embarkation)')
_ = plt.title("Relationship between age at embarkation and number of days from embarkation to death (log transformed)", fontsize=15)
The points on this plot appear to be randomly scattered, so there is no evidence of any relationship. I conclude that the age of soldiers at embarkation had no apparent effect on how long those who died in battle survived.
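A rank correlation offers another check that does not depend on the log transformation, because taking logs does not change the order of the values. The sketch below computes a Spearman correlation as the Pearson correlation of the ranks, on synthetic data with invented column names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic, independent columns standing in for age and days to death
toy = pd.DataFrame({
    "age": rng.uniform(18, 45, size=300),
    "days": rng.exponential(300, size=300) + 1,
})

# Spearman correlation = Pearson correlation of the ranks
spearman_raw = toy["age"].rank().corr(toy["days"].rank())

# Logs preserve order, so the ranks and the result are unchanged
spearman_log = np.log(toy["age"]).rank().corr(np.log(toy["days"]).rank())
```

A value close to zero on the real `battle_deaths` columns would support the same conclusion as the scatter plots.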
Women at war
My final set of questions is about the approximately 500 women who served with the New Zealand Expeditionary Force.
How old were they?
women = df[df['female']]
df_with_first_embarkation = pd.merge(df, first_embarkations, left_index=True, right_on='index', how="left")
_ = sns.boxplot(x="embarkation_ageAtEmbarkation", y="female", data=df_with_first_embarkation, orient='h')
_ = plt.ylabel('Is Female')
_ = plt.xlabel('Age at Embarkation')
The women who served were significantly older than the men, with a much smaller spread of ages. They were nearly all between 25 and 45, and half of them were between about 28 and 35. It seems that recruitment targeted a very specific age range of women.
What did they do?
all_embarkations_women = pd.merge(women, embarkations_df, left_index=True, right_on="index")
_ = plt.figure(figsize=(12,8))
_ = plt.xticks(rotation=90)
_ = sns.countplot(data=all_embarkations_women, x='embarkation_rankOnEmbarkation', palette="deep")
Nearly all served as nurses, but surprisingly at least one was a motor driver!
Who was the woman who worked as a motor driver?
all_embarkations_women[all_embarkations_women['embarkation_rankOnEmbarkation'] == 'Motor Driver'][['firstName','familyName']]
Beatrice Enid Bell served as an ambulance driver in England and France. She was the daughter of Sir Francis Bell (the first New Zealand-born Prime Minister).
http://navymuseum.co.nz/worldwar1/people/wren-enid-bell/
What an amazing story her war experiences would be!
How many died? Where and how?
sum(women['diedDuringWar'])
18 died in total. What proportion is that of the women who served?
float(sum(women['diedDuringWar'])) / float(len(women))
How does this compare to the overall proportion who died?
float(sum(df['diedDuringWar'])) / float(len(df))
A total of 18 women died during the war, 3% of those who served. This is much lower than the overall proportion of 21%, but it is higher than I would have expected considering the women would have been kept well away from the front lines.
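Since `diedDuringWar` is a boolean column, the same proportions can be read off as group means, because the mean of a boolean column is the fraction of True values. A small sketch on invented rows:

```python
import pandas as pd

# Invented rows; column names follow the notebook
toy = pd.DataFrame({
    "female": [True, True, False, False, False, False],
    "diedDuringWar": [False, True, True, True, True, False],
})

# Proportion who died within each group, and overall
death_rate_by_group = toy.groupby("female")["diedDuringWar"].mean()
overall_rate = toy["diedDuringWar"].mean()
```

On the real data, `df.groupby('female')['diedDuringWar'].mean()` would give the proportions for women and men side by side.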
_ = plt.figure(figsize=(12,8))
_ = plt.xticks(rotation=90)
_ = sns.countplot(y="causeOfDeath", data=women)
It’s not surprising that Sickness and Disease feature prominently here, since as nurses they would have been exposed to many infectious diseases. However, the large number of accidental deaths is surprising. I discovered above that this can sometimes mean disease, but I’d like to look into that more. One woman committed suicide, which is shocking and very sad.
Why did such a large number of women die from accidental death?
_ = plt.figure(figsize=(10,6))
_ = plt.xticks(rotation=90)
_ = sns.countplot(y="placeOfDeath_region", hue="causeOfDeath", data=women[['placeOfDeath_region','causeOfDeath']].dropna())
That so many accidental deaths occurred in the Aegean Sea makes me wonder if these were related to a single incident. I’ll look at the medical notes for those deaths to find out.
aegean_death_notes = pd.merge(women[['causeOfDeath','dateOfDeath']][women['placeOfDeath_region'] == 'Aegean Sea'],
medical_df[medical_df['medical']=='Accidental Death/Cause of Death'], left_index=True, right_on="index")
aegean_death_notes[['dateOfDeath','medicalNotes']]
These nurses all drowned on the 23rd of October 1915. This is a tragic story I was not previously aware of. A web search gives details of the sinking of the Marquette, a transport ship torpedoed by a German submarine.
https://nzhistory.govt.nz/page/new-zealand-nurses-lost-in-marquette-sinking
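The same question (do several deaths share a single incident?) could be asked of the whole dataset by counting deaths per region per day and keeping the combinations with more than one death. This sketch uses invented rows apart from the Marquette date given above.

```python
import pandas as pd

# Toy rows; only the 1915-10-23 date comes from the text above
toy = pd.DataFrame({
    "dateOfDeath": pd.to_datetime(["1915-10-23"] * 3 + ["1917-02-01"]),
    "placeOfDeath_region": ["Aegean Sea"] * 3 + ["England"],
})

# Deaths per region per day; clusters are days with more than one death
deaths_per_day = toy.groupby(["placeOfDeath_region", "dateOfDeath"]).size()
clusters = deaths_per_day[deaths_per_day > 1]
```

For the men a much higher threshold would be needed, since many died on the same day in the major battles.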
Conclusion and Discussion
While it’s interesting to describe patterns and trends, a unique feature of this data is the stories that lie behind the anomalies and outliers. Every soldier had a unique experience of war, but some experiences were remarkable. An alternative approach to this data would be to hunt out and document these stories – for instance, who were the soldiers who died almost immediately after embarkation, or served the entire war only to die of disease shortly after their return?
I have taken a fascinating stroll (or maybe a brisk stride) through this data but have really only scratched the surface of the insights it could offer on those who served. I have found some generalities (those who died were younger than those who lived), found no evidence for a theory (that younger men were more reckless on the battlefield) and found patterns in where and when the casualties occurred. But perhaps the most interesting findings have been the serendipitous discoveries of unusual and interesting stories hidden in the data – insights into the individual experiences of just a few of the numerous men and women who served and gave their lives for our country.