HNG Stage 0 Task - Data Analysis


Technical Report


Introduction

In this article, I explore the Titanic dataset to analyze the survival rate of passengers who were on the ship when it sank. The dataset can be found on Kaggle using this link. The review presents observations from a quick analysis of the dataset and highlights the key trends I noticed at first glance.

Observation

  • Initial Analysis

    Import pandas and load dataset

    I started off by importing the pandas library for data manipulation and analysis, then loaded train.csv into a DataFrame named train_data.

      import pandas as pd
    
      train_data = pd.read_csv("train.csv")
    

    Then using the .head() method, I displayed the first 10 rows of the dataset for a first glance.

      train_data.head(10)
    

    The result shows that the dataset has 12 columns, namely PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin and Embarked.
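    Besides .head(), the column list can also be checked programmatically. The sketch below is only an illustration: it builds a hypothetical three-row frame named sample with the same 12 column names (the values are made up, not taken from train.csv) and reads the shape, names and types from it.

```python
import pandas as pd

# Hypothetical three-row sample with the same 12 column names as the dataset.
# The values are made up for illustration only.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, None],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Ticket": ["T1", "T2", "T3"],
    "Fare": [7.25, 71.28, 7.92],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", "S"],
})

# shape is (rows, columns); columns holds the column labels
n_rows, n_cols = sample.shape
column_names = sample.columns.tolist()

print(n_rows, n_cols)
print(column_names)
print(sample.dtypes)  # one dtype per column
```

    Running the same three attributes on train_data would report the real row count and column types.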

    Drop columns not needed

    I then dropped the columns not needed for this task: Name, SibSp, Parch, Ticket, Cabin and Embarked. I viewed the first few rows of the data to confirm the change.

      columns_to_drop = ['Name', 'Ticket', 'Cabin', 'Embarked', 'SibSp', 'Parch']
      clean_train_data = train_data.drop(columns = columns_to_drop)
    
      clean_train_data.head()
    

    Find missing values

    I searched for missing values in the data and replaced them. Of the remaining columns, only Age had missing values. I computed the median of the Age column and replaced all missing values with it.

      clean_train_data.isnull().sum()
    

      median_age = clean_train_data['Age'].median()
      median_age
    

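      # Assign the result back: fillna(inplace=True) on a selected column
      # may not update the DataFrame in recent pandas versions
      clean_train_data['Age'] = clean_train_data['Age'].fillna(median_age)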
      clean_train_data.head()
    

    Check for missing values again to confirm the change took effect.

      clean_train_data.isnull().sum()
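
    To illustrate why the median can be a safer fill value than the mean, here is a minimal sketch on a hypothetical Age column (the numbers are invented, not taken from the dataset). A single outlier pulls the mean up, while the median stays close to the typical value.

```python
import pandas as pd

# Hypothetical ages with two missing values and one outlier (80)
ages = pd.Series([20.0, 22.0, None, 24.0, 80.0, None], name="Age")

# Both statistics skip missing values by default
mean_age = ages.mean()      # (20 + 22 + 24 + 80) / 4
median_age = ages.median()  # midpoint of 22 and 24

# fillna returns a new Series, so keep the result
filled = ages.fillna(median_age)
missing_after = filled.isnull().sum()

print(mean_age, median_age)
print(filled.tolist())
print(missing_after)
```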
    

    Further Analysis and Survival Rate

    Find total number of males on board per the dataset.

      total_males = clean_train_data['Sex'].value_counts()['male']
      total_males
    

    There were 577 males in the dataset.

    Find total number of females on board per the dataset.

      total_females = clean_train_data['Sex'].value_counts()['female']
      total_females
    

    There were 314 females in the dataset.

    Find total number of Survivors.

    From this point, I decided to find out the total number of people who survived the unfortunate sinking of the Titanic.

      total_survivors = clean_train_data['Survived'].sum()
      total_survivors
    

    Out of the 891 passengers in the dataset, only 342 survived. This shows that the survival rate was lower than the death rate.
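
    The comparison can be made explicit by turning the two counts above into rates. This is a small sketch using only the figures 891 and 342; the survived Series is a stand-in I built to show that the mean of a 0/1 column gives the same ratio.

```python
import pandas as pd

total_passengers = 891
total_survivors = 342

survival_rate = total_survivors / total_passengers
death_rate = 1 - survival_rate

print(f"Survival rate: {survival_rate:.1%}")
print(f"Death rate: {death_rate:.1%}")

# Stand-in column: 342 ones (survived) and 549 zeros (did not survive).
# The mean of a 0/1 column equals the share of ones.
survived = pd.Series([1] * total_survivors + [0] * (total_passengers - total_survivors))
rate_from_mean = survived.mean()
```

    On the real data, the same rate would come from clean_train_data['Survived'].mean().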

    Survivors by Gender

    I also analyzed the Survived and Sex columns to see what portion of the survivors were male and what portion were female.

      survivors = clean_train_data[clean_train_data['Survived'] == 1]
      survivors_by_gender = survivors['Sex'].value_counts()
      survivors_by_gender
    

    From the above code snippet and output, we observe that of the 342 people who survived, 233 were female and 109 were male. Because there were fewer females (314) than males (577) on board, yet more female survivors, females had a much higher chance of survival than males.
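
    Survivor counts alone can mislead when the groups differ in size, so it helps to divide each count by the number of passengers of that gender. The sketch below uses only the four counts reported above; the variable names are my own.

```python
import pandas as pd

# Counts reported earlier in this article
on_board = pd.Series({"female": 314, "male": 577})
survivors = pd.Series({"female": 233, "male": 109})

# Division aligns on the index labels, giving a rate per gender
rate_by_gender = survivors / on_board

print(rate_by_gender.round(3))
```

    This gives roughly 0.742 for females and 0.189 for males.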

    Survival Rate by Gender

    Now it would be interesting to see how a passenger's gender affected their rate of survival using visualizations. I'll start by importing the seaborn and matplotlib.pyplot libraries as sns and plt respectively.

      import seaborn as sns
      import matplotlib.pyplot as plt
    

    Calculate survival rate by Gender

      survival_by_gender = clean_train_data.groupby('Sex')['Survived'].mean()
      survival_by_gender
    

    Plot a graph to visualize passengers' survival rate by gender.

      plt.figure(figsize=(3, 4))
      sns.barplot(x=survival_by_gender.index, y=survival_by_gender.values, palette='viridis')
      plt.title('Survival Rates by Gender')
      plt.xlabel('Gender')
      plt.ylabel('Survival Rate')
      plt.show()
    

    Survival Rate by Passenger Class

    Next, I will visualize passengers' survival rate by their class on the Titanic.

    Calculate survival rate by Class

      survival_by_class = clean_train_data.groupby('Pclass')['Survived'].mean()
      survival_by_class
    

    Plot the graph to visualize the rate of a Passenger's survival based on their Class.

      plt.figure(figsize=(3, 3))
      sns.barplot(x=survival_by_class.index, y=survival_by_class.values, palette='viridis')
      plt.title('Survival Rates by Passenger Class')
      plt.xlabel('Passenger Class')
      plt.ylabel('Survival Rate')
      plt.show()
    

Conclusion

The observation and simple analysis of the Titanic dataset provide insight into patterns of passenger survival based on Sex and Class. The economic standing of passengers, as reflected by their class, had an impact on their rate of survival: first-class passengers had the highest survival rate. The gender of a passenger also played a huge role in their survival, as the majority of the survivors were female and females survived at a much higher rate than males.

This article was assigned as an onboarding task in the HNG 11 Internship. I would recommend joining the internship using this link. If you want a certificate at the end, use this link instead.