Getting Started with Pandas DataFrame

A DataFrame is a spreadsheet like data structure. We can think of it as a collection of rows and columns. This row-column structure is useful for many different kinds of data. The most widely used DataFrame implementation in Python is from the Pandas package. First we will learn how to create DataFrames. We will also learn how to do some basic data analysis with them. Finally, we will compare the DataFrame to the ndarray data structure and learn why data frames are useful in other packages such as Seaborn.

How to Create a DataFrame

There two major ways to create a DataFrame. We can directly call DataFrame() and pass it data in a dictionary, list or array. Alternatively we can use several functions to load data from a file directly into a DataFrame. While it is very common in data science to load data from file, there are also many occasions where we need to create DataFrame from other data structures. We will first learn how to create a DataFrame from a dictionary.

import pandas as pd
d = {"Frequency": [20, 50, 8],
     "Location": [2, 3, 1],
     "Cell Type": ["Interneuron", "Interneuron", "Pyramidal"]}
row_names = ["C1", "C2", "C3"]
df = pd.DataFrame(d, index=row_names)
print(df)

"""
    Frequency  Location    Cell Type
C1         20         2  Interneuron
C2         50         3  Interneuron
C3          8         1    Pyramidal
"""

In our dictionary the keys are used as the column names. The data under each key then becomes the column. The row names are defined separately by passing a collection to the index parameter of DataFrame. We can get column and row names with the columns and index attributes.

df.columns
# Index(['Freq (Hz)', 'Loc (cm)', 'Cell Type'], dtype='object')
df.index
# Index(['C1', 'C2', 'C3'], dtype='object')

We can also change column and row names through those same attributes.

df.index = ["Cell_1", "Cell_2", "Cell_3"]
df.columns = ["Freq (Hz)", "Loc (cm)", "Cell Type"]
"""
        Freq (Hz)  Loc (cm)    Cell Type
Cell_1         20         2  Interneuron
Cell_2         50         3  Interneuron
Cell_3          8         1    Pyramidal
"""

These names are useful because they give us a descriptive way of indexing into columns and rows. If we use indexing syntax on the DataFrame, we can get individual columns.

df['Freq (Hz)']
"""
Cell_1    20
Cell_2    50
Cell_3     8
Name: Freq (Hz), dtype: int64
"""

Row names are not found this way and using a row key will raise an error. However, we can get rows with the df.loc attribute.

df['Cell_1']
# KeyError: 'Cell_1'
df.loc['Cell_1']
"""
Freq (Hz)             20
Loc (cm)               2
Cell Type    Interneuron
Name: Cell_1, dtype: object
"""

We could also create a DataFrame from other kinds of collections that are not dictionaries. For example we can use a list.

d = [[20, 2, "Interneuron"],
     [50, 3, "Interneuron"],
     [8, 1, "Pyramidal"]]
column_names = ["Frequency", "Location", "Cells"]
row_names = ["C1", "C2", "C3"]
df = pd.DataFrame(d, columns=column_names, index=row_names)
print(df)
"""
    Frequency  Location        Cells
C1         20         2  Interneuron
C2         50         3  Interneuron
C3          8         1    Pyramidal
"""

In that case there are no dictionary keys that could be use to infer the column names. This means we need to pass the column_names to the columns parameter. Mostly anything that structures our data in a two-dimensional way can be used to create a DataFrame. Next we will learn about functions that allow us to load different file types as a DataFrame.

Loading Files as a DataFrame

The list of file types Pandas can read and write is rather long and you can find it here. I only want to cover the most commonly used .csv file here. They have the particular advantage that they can also be read by humans, because they are essentially text files. They are also widely supported by a variety of languages and programs. First, let’s create our file. Because it is a text file, we can write a literal string to file.

text_file = open("example.csv", "w")
text_file.write(""",Frequency,Location,Cell Type
                 C1,20,2,Interneuron
                 C2,50,3,Interneuron
                 C3,8,1,Pyramidal""")
text_file.close()

In this file columns are separated by commas and rows are separated by new lines. This is what .csv means, it stands for comma-separated values. To load this file into a DataFrame we need to pass the file name and which column contains the row names. Pandas assumes by default that the first row contains the column names.

df = pd.read_csv("example.csv", index_col=0)
print(df)
"""
     Frequency  Location    Cell Type
 C1         20         2  Interneuron
 C2         50         3  Interneuron
 C3          8         1    Pyramidal
"""

There are many more parameters we can specify for read_csv in case we have a file that is structured differently. In fact we can load files that have a value delimiter other than the comma, by specifying the delimiter parameter.

text_file = open("example.csv", "w")
text_file.write("""-Frequency-Location-Cell Type
                 C1-20-2-Interneuron
                 C2-50-3-Interneuron
                 C3-8-1-Pyramidal""")
text_file.close()
df = pd.read_csv("example.csv", index_col=0, delimiter='-')
print(df)
"""
     Frequency  Location    Cell Type
 C1         20         2  Interneuron
 C2         50         3  Interneuron
 C3          8         1    Pyramidal
"""

We specify '-' as the delimiter and and it also works. Although the function is called read_csv it is not strictly bound to comma separated values. We can also skip rows, columns and specify many more options you can learn about from the documentation. For well structured .csv files however, we need very few arguments as shown above. Next we will learn how to do basic calculations with the DataFrame.

Basic Math with DataFrame

A variety of functions such as df.mean(), df.median() and df.std() are available to do basic statistics on our DataFrame. By default they all return values per column. That is because columns are assumed to contain our variables (or features) and each row contains a sample.

df.mean()
"""
Freq (Hz)    26.0
Loc (cm)      2.0
dtype: float64
"""

df.median()
"""
Freq (Hz)    20.0
Loc (cm)      2.0
dtype: float64
"""

df.std()
"""
Freq (Hz)    21.633308
Loc (cm)      1.000000
dtype: float64
"""

One big advantage of the column is that within a column the data type is clearly defined. Within a row on the other hand different data types can exist. In our case we have two numeric types and a string. When we call these statistical methods, numeric types are ignored. In our case that is 'Cell Type'. Technically we can also use the axis parameter to calculate these statistics for each sample but this is not always useful and has to again ignore one of the columns.

df.mean(axis=1)
"""
C1    11.0
C2    26.5
C3     4.5
dtype: float64
"""

We can also use other mathematical operators. They are applied element-wise and their effect will depend on the data type of the value.

print(df * 3)
"""
         Frequency  Location                      Cell Type
 C1         60         6  InterneuronInterneuronInterneuron
 C2        150         9  InterneuronInterneuronInterneuron
 C3         24         3        PyramidalPyramidalPyramidal
"""

Often times these operations make more sense for individual columns. As explained above we can use indexing to get individual columns and we can even assign new results to an existing or new column.

norm_freq = df['Frequency'] / df.mean()['Frequency']
norm_freq
"""
 C1    0.769231
 C2    1.923077
 C3    0.307692
Name: Frequency, dtype: float64
"""
df['Norm Freq'] = norm_freq
print(df)
"""
     Frequency  Location    Cell Type  Norm Freq
 C1         20         2  Interneuron   0.769231
 C2         50         3  Interneuron   1.923077
 C3          8         1    Pyramidal   0.307692
"""

If you are familiar with NumPy, most of these DataFrame operations will seem very familiar because they mostly work like array operations. Because Pandas builds on NumPy, most NumPy functions (for example np.sin) work on numeric columns. I don’t want to go deeper and instead move on to visualizing DataFrames with Seaborn.

Seaborn for Data Visualization

Seaborn is a high-level data visualization package that builds on Matplotlib. It does not necessarily require a DataFrame. It can work with other data structures such as ndarray but it is particularly convenient with DataFrame. First, let us get a more interesting data set. Luckily Seaborn comes with some nice example data sets and they conveniently load into Pandas DataFrame.

import seaborn as sns
df = sns.load_dataset('iris')
type(df)
# pandas.core.frame.DataFrame
print(df)
"""
     sepal_length  sepal_width  petal_length  petal_width    species
0             5.1          3.5           1.4          0.2     setosa
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
3             4.6          3.1           1.5          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica

[150 rows x 5 columns]
"""

print(df.columns)
"""
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')
"""

The Iris data set contains information about different species of iris plants. It contains 150 samples and 5 features. The 'species' feature tells us what species a particular sample belongs to. The names of those columns are very useful when we structure our plots in Seaborn. Let’s first try a basic bar graph.

sns.set(context='paper',
        style='whitegrid',
        palette='colorblind',
        font='Arial',
        font_scale=2,
        color_codes=True)
fig = sns.barplot(x='species', y='sepal_length', data=df)

We use sns.barplot and we have to pass our DataFrame to the data parameter. Then for x and y we define which column name should appear there. We put 'species' on the x-axis so that is how data is aggregated inside the bars. Setosa, versicolor and virginica are the different species. The sns.set() function defines multiple parameters of Seaborn and forces a certain style on the plots that I personally prefer. Bar graphs have grown out of fashion and for good reason. They are not very informative about the distribution of their underlying values. I prefer the violin plot to get a better idea of the distribution.

fig = sns.violinplot(x='species', y='sepal_length', data=df)

We even get a small box plot within the violin plot for free. Seaborn works its magic through the DataFrame column names. This makes plotting more convenient but also makes our code more descriptive than it would be with pure NumPy. Our code literally tells us, that 'species' will be on the x-axis.

Summary

We learned that we can create a DataFrame from a dictionary or another kind of collection. The most important features are the column and row names. Columns organize features and rows organize samples by convention. We can also load files into a DataFrame. For example we can use read_csv to load .csv or other text based files. We can also use methods like df.mean() to get basic statistics of our DataFrame. Finally, Seaborn is very useful to visualize a DataFrame.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s