# Introduction to t-SNE in Python with scikit-learn

t-SNE (t-distributed stochastic neighbor embedding) is a popular dimensionality reduction technique. We often havedata where samples are characterized by n features. To reduce the dimensionality, t-SNE generates a lower number of features (typically two) that preserves the relationship between samples as good as possible. Here we will learn how to use the scikit-learn implementation of t-SNE and how it achieves dimensionality reduction step by step.

## How to use t-SNE with scikit-learn

We will start by performing t-SNE on a part of the MNIST dataset. The MNIST dataset consists of images of hand drawn digits from 0 to 9. Accurately classifying each digit is a popular machine learning challenge. We can load the MNIST dataset with sklearn.

from sklearn.datasets import fetch_openml
import numpy as np
import matplotlib.pyplot as plt

X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)

# Randomly select 1000 samples for performance reasons
np.random.seed(100)
subsample_idc = np.random.choice(X.shape[0], 1000, replace=False)
X = X[subsample_idc,:]
y = y[subsample_idc]

# Show two example images
fig, ax = plt.subplots(1,2)
ax[0].imshow(X[11,:].reshape(28,28), 'Greys')
ax[1].imshow(X[15,:].reshape(28,28), 'Greys')
ax[0].set_title("Label 3")
ax[1].set_title("Label 8")


By default, the MNIST data we fetch comes with 70000 images. We randomly select 1000 of those to make this demonstration faster. Each image consists of 784 pixels and they come as a flat one dimensional array. To display them as an image we reshape them into a 28×28 matrix. The images are in X and their labels in y.

X.shape
# (1000, 784)
# 1000 Samples with 784 features
y.shape
# (1000,)
# 1000 labels
np.unique(y)
# array(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], dtype=object)
# The 10 classes of the images


Using t-SNE on this data is shockingly easy thanks to scikit-learn. We simply import the TSNE class, pass it our data and then fit.

from sklearn.manifold import TSNE
import pandas as pd
import seaborn as sns

# We want to get TSNE embedding with 2 dimensions
n_components = 2
tsne = TSNE(n_components)
tsne_result = tsne.fit_transform(X)
tsne_result.shape
# (1000, 2)
# Two dimensions for each of our images

# Plot the result of our TSNE with the label color coded
# A lot of the stuff here is about making the plot look pretty and not TSNE
tsne_result_df = pd.DataFrame({'tsne_1': tsne_result[:,0], 'tsne_2': tsne_result[:,1], 'label': y})
fig, ax = plt.subplots(1)
sns.scatterplot(x='tsne_1', y='tsne_2', hue='label', data=tsne_result_df, ax=ax,s=120)
lim = (tsne_result.min()-5, tsne_result.max()+5)
ax.set_xlim(lim)
ax.set_ylim(lim)
ax.set_aspect('equal')


Considering that we did not specify any parameters except n_components, this looks pretty good. Before we dive into the parameters, we will go through t-SNE step by step and take some looks under the hood of the scikit-learn implementation.

## The Distance Matrix

The first step of t-SNE is to calculate the distance matrix. In our t-SNE embedding above, each sample is described by two features. In the actual data, each point is described by 728 features (the pixels). Plotting data with that many features is impossible and that is the whole point of dimensionality reduction. However, even at 728 features, each point is a certain distance apart from every other point. There are many different distance metrics that make sense but probably the most straightforward one is the euclidean distance.

The above definition of euclidean distance for two features extends to n features (p1,p2,p2,…,pn). Once again we can use scikit-learn to calculate the euclidean distance matrix. Because a distance matrix of the unsorted samples doesn’t look like much, we also calculate it after sorting the samples by label.

from sklearn.metrics import pairwise_distances
y_sorted_idc = y.argsort()
X_sorted = X[y_sorted_idc]

distance_matrix = pairwise_distances(X,
metric='euclidean')

distance_matrix_sorted = pairwise_distances(X_sorted,
metric='euclidean')

fig, ax = plt.subplots(1,2)
ax[0].imshow(distance_matrix, 'Greys')
ax[1].imshow(distance_matrix_sorted, 'Greys')
ax[0].set_title("Unsorted")
ax[1].set_title("Sorted by Label")


When the samples are sorted by label, squareish patterns emerge in the distance matrix. White means smaller euclidean distances. This is reassuring. After all, we would expect the drawing of a one to be more similar to other drawings of a one than to a drawing of a zero. That’s what the squares near the white diagonal represent. t-SNE tries to roughly preserve the distances between samples but it does not work on the raw distances. It works on the joint probabilities, bringing us to our second step.

## Joint Probabilities

The distance matrix tells us how far samples are apart. The joint probabilities tell us how likely it is that samples choose each others as “neighbors”. The two are of course related, because nearby samples should have a higher chance to be neighbors than further apart samples. The t-distribution is defined by its mean, the degrees of freedom and the scale (σ). For our t-SNE purposes we set the mean to 0 (at 0, samples are exactly at the same place). The degrees of freedom are set to the number of components minus one. That’s one in our case, since we want two components. The last free parameter is sigma and it is important because it determines how wide the tails of the distribution are.

This is where the perplexity parameter comes in. The user chooses a perplexity value (recommended values are between 5 and 50) and based on the perplexity, t-SNE then chooses sigmas that satisfy that perplexity. To understand what this means, consider the first row of our distance matrix. It tells us the distance of our first point to each other point and as we transform that row with the t-distribution we get our own distribution P. The perplexity is defined 2H(P) where H is the Shannon entropy. Different values for sigma will results in different distributions, which differ in entropy and therefore differ in perplexity.

from scipy.stats import t, entropy

x = distance_matrix[0,1:]
t_dist_sigma01 = t(df=1.0, loc=0.0, scale=1.0)
t_dist_sigma10 = t(df=1.0, loc=0.0, scale=10.0)
P_01 = t_dist_sigma01.pdf(x)
P_10 = t_dist_sigma10.pdf(x)

perplexity_01 = 2**entropy(P_01)
perplexity_10 = 2**entropy(P_10)

dist_min = min(P_01.min(), P_10.min())
dist_max = max(P_01.max(), P_10.max())
bin_size = (dist_max - dist_min) / 100
bins = np.arange(dist_min+bin_size/2, dist_max+bin_size/2, bin_size)
fig, ax = plt.subplots(1)
ax.hist(P_01, bins=bins)
ax.hist(P_10, bins=bins)
ax.set_xlim((0, 1e-6))
ax.legend((r'$\sigma = 01; Perplexity =$' + str(perplexity_01),
r'$\sigma = 10; Perplexity =$' + str(perplexity_10)))


Above we can see what happens to the joint probability distribution as we increase sigma. With increasing sigma the entropy increases and so does the perplexity. t-SNE performs a binary search for the sigma that produces the perplexity specified by the user. This means that the perplexity controls the chance of far away points to be chosen as neighbors. Therefor, perplexity is commonly interpreted as a measure for the number of samples neigbors. The default value for perplexity is 30 in the sklearn implementation of t-SNE. Instead of implementing our own binary search we will take a shortcut to calculating the joint probabilities. We will use sklearns internal function to do it.

from sklearn.manifold import _t_sne

perplexity = 30  # Same as the default perplexity
p = _t_sne._joint_probabilities(distances=distance_matrix,
desired_perplexity = perplexity,
verbose=False)


As a small note, our joint probabilities are no longer a matrix. You may have noticed that the distance matrix is symmetric along one of its diagonals and the diagonal is all zeros. So we only keep the upper triangular of the matrix in a flat array p. That’s all we need to move from joint probabilities to the next step.

## Optimize Embedding with Gradient Descent

Now that we have the joint probabilities from our high dimensional data, we want to generate a low dimensional embedding with just two features that preserves the joint probabilities as good as possible. First we need to initialize our low dimensional embedding. By default, sklearn will use a random initialization so that’s what we will use. Once we initialized our embedding, we will optimize it using gradient descent. This optimization is at the core of t-SNE and we will be done afterwards. To achieve a good embedding, t-SNE optimizes the Kullback-Leibler divergence between the joint probabilites of the data and their embedding. It is a measure for the similarity of two distributions. The sklearn TSNE class comes with its own implementation of the Kullback-Leibler divergence and all we have to do is pass it to the _gradient_descent function with the initial embedding and the joint probabilities of the data.

# Create the initial embedding
n_samples = X.shape[0]
n_components = 2
X_embedded = 1e-4 * np.random.randn(n_samples,
n_components).astype(np.float32)

embedding_init = X_embedded.ravel()  # Flatten the two dimensional array to 1D

# kl_kwargs defines the arguments that are passed down to _kl_divergence
kl_kwargs = {'P': p,
'degrees_of_freedom': 1,
'n_samples': 1000,
'n_components':2}

embedding_init,
0,
1000,
kwargs=kl_kwargs)

# Get first and second TSNE components into a 2D array
tsne_result = embedding_done[0].reshape(1000,2)

# Convert do DataFrame and plot
tsne_result_df = pd.DataFrame({'tsne_1': tsne_result[:,0],
'tsne_2': tsne_result[:,1],
'label': y})
fig, ax = plt.subplots(1)
sns.scatterplot(x='tsne_1', y='tsne_2', hue='label', data=tsne_result_df, ax=ax,s=120)
lim = (tsne_result.min()-5, tsne_result.max()+5)
ax.set_xlim(lim)
ax.set_ylim(lim)
ax.set_aspect('equal')


And that’s it. It doesn’t look identical to the t-SNE we did above because we did not seed the initialization of the and the gradient descent. In fact, your results will look slightly different if you follow this guide. The sign of success you are looking for are similarly well defined clusters. We could have gone a bit deeper here and there. For example we could have written our own implementations of the Kullback-Leibler divergence or gradient descent. I’ll leave that for another time. Here are some useful links if you want to dig deeper into t-SNE or its sklearn implementation.

# Smoothing Data by Rolling Average with NumPy

Time series data often comes with some amount of noise. One of the easiest ways to get rid of noise is to smooth the data with a simple uniform kernel, also called a rolling average. The title image shows data and their smoothed version. The data is the second discrete derivative from the recording of a neuronal action potential. Derivatives are notoriously noisy. We can get the result shown in the title image with np.convolve

import numpy as np

kernel_size = 10
kernel = np.ones(kernel_size) / kernel_size
data_convolved = np.convolve(data, kernel, mode='same')


Convolution is a mathematical operation that combines two arrays. One of those arrays is our data and we convolve it with the kernel array. During convolution we center the kernel at a data point. We multiple each data point in the kernel with each corresponding data point, sum up all the results and that is the new data point at the center. Let’s look at an example.

data = [2, 3, 1, 4, 1]
kernel = [1, 2, 3, 4]
np.convolve(data, kernel)
# array([ 2,  7, 13, 23, 24, 18, 19,  4])


For this result to make sense you must know, that np.convolve flips the kernel around. So step by step the calculations go as follows:

[4, 3, 2, 1]  # The flipped kernel
x
[2, 3, 1, 4, 1]  # The data
2= 2
[2]

[4, 3, 2, 1]
x  x
[2, 3, 1, 4, 1]
4+ 3= 7
[2, 7]

[4, 3, 2, 1]
x  x  x
[2, 3, 1, 4, 1]
6+ 6+ 1=13
[2, 7, 13]

[4, 3, 2, 1]
x  x  x  x
[2, 3, 1, 4, 1]
8+ 9+ 2+ 4= 23
[2, 7,13,23]
# This continues until the arrays stop touching. You get the idea.


One thing you’ll notice is that the edges are problematic. There is really no good way to avoid that. Data points at the edges only see part of the kernel but the mode parameter defines what should happen at the edges. I prefer the 'same' mode because it means that the new array will have the same shape as the original data, which makes plotting easier. However, if you start to use more complicated kernels, the edges might become virtually useless. In that case, mode should be 'valid'. Then, the values at the edges that did not see the entire kernel are discarded. The output array is smaller in shape than the input array.

data = [2, 3, 1, 4, 1]
kernel = [1, 2, 3, 4]
np.convolve(data, kernel, mode='valid')
array([23, 24])


The default behavior you saw above is called 'full'. It keeps all data points, so the output array is larger in shape than the input array. You might also have noticed that the size of the kernel is very important. Actually, we need to divide the array of ones by its length. Can you guess what would happen if we forgot about dividing it?

If you guessed that the signal would become larger in magnitude you guessed right. We would be summing up all data points in the kernel. By dividing it we ensure that we take the average of the data points. But the kernel size is even more important. If we make the kernel larger the outcome changes dramatically.

kernel_size = 10
kernel = np.ones(kernel_size) / kernel_size
data_convolved_10 = np.convolve(data, kernel, mode='same')

kernel_size = 20
kernel = np.ones(kernel_size) / kernel_size
data_convolved_20 = np.convolve(data, kernel, mode='same')

plt.plot(data_convolved_20)
plt.plot(data_convolved_10)
plt.legend(("Kernel Size 10", "Kernel Size 20"))


The larger we make the kernel, the smaller sharp peaks become. The peaks are also shifted in time. To be specific, a rolling mean is a low-pass filter. This means that is leaves low frequency signals alone, while making high frequency signals smaller. Sharp increases in the data have a high frequency. If we make the kernel larger, the filter attenuates high frequency signals more. This is exactly how the rolling average works. It gets rid of high frequency noise. It also means that we must be careful not to distort the signal too much with the rolling average filter.

# Threshold Detection in NumPy

Many signals are easily detected by their size. We will learn how to detect the indices where signals cross a threshold with NumPy. These are our practice signals.

import numpy as np
import matplotlib.pyplot as plt
data = np.array([0, 0, 0, 5, 5, 5, 5, 0, 0, 0, 0, 4, 4, 4, 0, 0, 0])
plt.plot(data, marker='o')


We will perform two simple steps to detect the threshold crossings: 1. Make the data binary, in a way that they are true when larger than the threshold and false when lower or equal. 2. Take the difference of the binary signal. This gives us a boolean array that is true when the threshold was crossed. We can combine those steps into one line.

threshold = 2
threshold_crossings = np.diff(data > threshold, prepend=False)


Plotting shows us that threshold_crossings is true after the threshold was crossed.

plt.plot(data2, marker='o')
plt.plot(thr_crossings, marker='o')
plt.legend(("Data", "Threshold Crossings"))


To get the indices of the threshold crossings we can use np.argwhere(), which returns the true indices from a boolean array.

np.argwhere(threshold_crossings)[:,0]
# array([ 3,  7, 11, 14], dtype=int64)


Threshold crossings occur at 3, 7, 11 and 14. Sometimes we only need the upward or downward crossings. We can simply isolate those by slicing the indiced array.

np.argwhere(threshold_crossings)[::2,0]  # Upward crossings
# array([ 3, 11], dtype=int64)
np.argwhere(thr_crossings)[1::2,0]  # Downward crossings
# array([ 7, 14], dtype=int64)


Sometimes we want to find the point before the threshold is crossed, rather than after. There is one simple trick in np.diff instead of setting prepend=False, we set append=False.

threshold = 2
post_threshold_crossings = np.diff(data > threshold, prepend=False)
pre_threshold_crossings = np.diff(data > threshold, append=False)
plt.plot(data2, marker='o')
plt.plot(post_threshold_crossings, marker ='o')
plt.plot(pre_threshold_crossings, marker='o')
plt.legend(("Data", "Post Crossings", "Pre Crossings"))


Make sure to check out the documentation for np.diff and bonus points if you can figure out why exactly this works. Detecting threshold crossings is an easy but important part in most of my analysis pipelines and now you can do it too.

# Comparisons and Logic Functions in NumPy

• We can compare arrays with scalars and other arrays using Pythons standard comparison operators
• There are two different boolean representations of an array. arr.all is True if all elements are True, arr.any is True if any element is True.
• To perform logical functions we need the special NumPy functions np.logical_and, np.logical_or, np.logical_not, np.logical_xor

## Introduction

Logic functions allow us to check if logical statements about our arrays are true or false. Luckily, logic functions are very consistent with other array functions. They are performed element-wise by default and they can be performed on specific axes.

## Comparing Arrays and Scalars

The same way we can use the arithmetic operators we can use all the logical operators: >, >=, <,< <=, ==, !=,

import numpy as np
arr = np.array([-1, 0, 1])
arr > 0
# array([False, False,  True])
arr >= 0
array([False,  True,  True])
arr < 0
# array([ True, False, False])
arr <= 0
# array([ True,  True, False])
arr == 0
# array([False,  True, False])
arr != 0
# array([ True, False,  True])


## Comparing Arrays with Arrays

The result of a logic operation is a boolean array containing binary values of True or False. This particular case shows us logic operations of arrays with scalars. All boolean operations are performed element-wise, so each element is compared against the scalar. We can also compare arrays to arrays if their shape allows it.

arr_one = np.array([-1, 0, 1])
arr_two = np.array([1, 0, -1])
arr_one > arr_two
# array([False, False,  True])
arr_one >= arr_two
# array([False,  True,  True])
arr_one < arr_two
# array([ True, False, False])
arr_one <= arr_two
# array([ True,  True, False])
arr_one == arr_two
# array([False,  True, False])
arr_one != arr_two
array([ True, False,  True])


## Truth Value of Arrays and Elements

Sometimes we need to check if an array contains any elements that are considered True in a boolean context. While the boolean value of array elements is well defined, the truth value of an entire array is not defined.

arr = np.array([0, 1])
bool(arr[0])
# False
bool(arr[1])
# True
bool(arr)
# ValueError: The truth value of an array with more
# than one element is ambiguous. Use a.any() or a.all()


NumPy error messages are great. This one is so great that it even tells us which method we need to use to get at the truth value of an array. We can use arr.any() to find out if any of the elements evaluate to True or arr.all() to find out if all elements are True

arr_one = np.array([0, 0])
arr_one.any()
# False
arr_one.all()
# False
arr_two = np.array([0, 1])
arr_two.any()
# True
arr_two.all()
# False
arr_three = np.array([1, 1])
arr_three.any()
# True
arr_three.any()
# True


This can be useful to find out whether an array is empty

arr = np.array([])
arr.any()
# False


## Logical Operations

Finally, we need to look at four more logical operations: and, or, not & xor.
Unfortunately we can’t just use the Python keywords. The reason is in the error message above: “ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()”. It is ambiguous because NumPy does not know if we want to perform the operation element-wise or if we want to perform the operation on the truth value of the array. NumPy does not try to guess which one we mean, so it throws the error. To get these logical functions we need to call some more explicit NumPy functions.

arr_one = np.array([0,1,1])
arr_two = np.array([0,0,1])
np.logical_and(arr_one, arr_two)
# array([False, False,  True])
np.logical_or(arr_one, arr_two)
# array([False,  True,  True])
np.logical_not(arr_one)
# array([ True, False, False])
np.logical_xor(arr_one, arr_two)
# array([False,  True, False])


## Summary

We are now well equipped to deal with arrays. We can compare arrays with scalar values and other arrays using the the standard comparison operators. We can also perform logical operations on arrays with the special NumPy functions (logical_and, logical_or, logical_not and logcal_xor). Finally we can get two different boolean values of an arrays using arr.all and arr.any.

# NumPy Array Data Type

• Any array has a data type (dtype)
• The dtype determines what kind of data is stored in the array
• Not all operations work for all dtypes

## Introduction to Data Types

Having a data type (dtype) is one of the key features that distinguishes NumPy arrays from lists. In lists, the types of elements can be mixed. One index of a list can contain an integer, another can contain a string. This is not the case for arrays. In an array, each element must be of the same type. This gives the array some of its efficiency, because operations can know in advance, what kind of data they will find in each element simply by looking up the data type. At the same time it makes arrays slightly less flexible, because some operations are undefined for some data types and we cannot assign any kind of data to an array. But how does NumPy decide what data type an array should have in the first place?

## Guessing or Defining the dtype

So far we were able to create arrays effortlessly without knowing what dtype even means. That is because NumPy will just take a guess, what the dtype should be, based on the input it gets for the array.

arr = np.array([4, 3, 2])
arr.dtype
# dtype('int32')
arr = np.array([4, 3.0, 2])
arr.dtype
# dtype('float64')
arr = np.array([4, '3', 2])
arr.dtype
# dtype('<U11')


In the first case, each element of the list we pass to the array constructor is an integer. Therefore, NumPy decides that the dtype should be integer (32 bit integer to be precise). In the second case, one of the elements (3.0) is a floating-point number. Floats are a more complex data type in Python, which means that all other data types have to follow the more complex one. Therefore, all elements of the array are converted to floats and are stored with the dtype float64. Strings are an even more complex dtype. Because ‘3’ is a string in the final example, the dtype becomes ‘<U11’. U stands for unicode, a type of string encoding and the number indicates the length of the string. In all three cases NumPy guesses the dtype according to the content of the list. This works well most of the time but we can also explicitly define the dtype.

arr = np.array([4, 3, 2], dtype=np.float)
arr.dtype
# dtype('float64')
arr = np.array([4, 3, 2], dtype=np.str)
arr.dtype
# dtype('<U1')
arr = np.array([4, 3, 2], dtype=np.bool)
arr.dtype
# dtype('bool')


Converting arrays to other dtypes can be necessary because some operations will not work on arrays of mixed types. A dtype that is particularly problematic is the np.object dtype. It is the most flexible dtype but it can cause a lot of problems for both experts and beginners.

## np.object and the Curse of Flexibility

Most dtypes are very specific. They let you know if the array contains a number (np.int, np.float) or a string (all unicode ‘U’ dtypes). Not so much np.object. It tells you that whatever is inside the array is a thing. Because everything is an object anyway. This can make an array as flexible as a list. Anything can be stored. That is also where the problems come in.

arr = np.array([[3,2,1],[2,5]])
arr.dtype
# dtype('O')  # 'O' means object
arr + 5
# TypeError: can only concatenate list (not "int") to list


Suddenly, the plus operation between an array and a scalar fails. What went wrong? Starting from the top, NumPy decides to assign the dtype of np.object to arr because the nested list entries have different lengths. Think of it this way: this array can neither be a (2, 3) nor a(2, 2) array of dtype integer. Therefore, NumPy makes it a (2,) array of dtype object. So the array contains two lists, the first one is of length 3 and the second one of length 2. NumPy generally turns anything that is more complex than a string into np.object. A list is one of those that gets turned into np.object. The error then occurs because the plus operation is not defined for a list with an integer. But that also means, that the operation will work, if the objects contained in the array so happen to work with the operation.

arr = np.array([3,2,1], dtype=np.object)
arr.dtype
dtype('O')
arr + 5
array([8, 7, 6], dtype=object)


This is one of the main problem of the np.object dtype. Operations work only sometimes and to know if an operation will work, each element has to be checked. With other dtypes, we know which operations will work just by just looking at it.

## Summary

The dtype is one of the concepts that is closely related to the internal workings of NumPy. There is a lot that could be said about the details but effective beginners only need to remember a few points. First, the dtype determines what is stored in the array. All elements of an array have to conform to a specific type and dtype tells us which one. Second, NumPy guesses the dtype based on the literal data unless we specify which dtype we want. Guessing works most of the time but sometimes explicit types conversion is necessary. Third, operations that we know and love from numeric types (np.int, np.float) may not work on other types (np.str, np.obect). This is particularly annoying for beginners. If you have hard to debug errors, find out what dtype your arrays actually have.

# Array Indexing with NumPy

• Indexing is used to retrieve or change elements of a an array
• Slice syntax (start:stop:step) gets a range of elements
• Integer and boolean arrays can get an arbitrary set of elements

## Introduction to Array Indexing

Indexing is an important feature that allows us to retrieve and reassign specific parts on an array. You probably already know the basics of indexing from Python lists and tuples. You can index into NumPy arrays the same way you index into those sequences but NumPy indexing comes with many extra features we will learn about here. First, lets look at single value indexing.

## Single Value Indexing

We can use indexing to get single (scalar) values from an array. Indexing is always done with square brackets and we always start counting at 0.

import numpy as np
arr = np.arange(10,15)
arr
array([10, 11, 12, 13, 14])
arr[0]
10
arr[4]
14


Note that single value indexing does not return an array with a single entry but rather a numpy integer. To get a single value from a multi dimensional array we need to use multiple indices that are separated by commas.

arr = np.arange(20)
arr = arr.reshape((2,2,5))
arr
array([[[ 0,  1,  2,  3,  4],
[ 5,  6,  7,  8,  9]],

[[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]]])
arr[0,0,1]
1
arr[1,0,4]
14


I recommend this way of indexing but you can also use multiple square brackets like you would for Python sequences.

arr
array([[[ 0,  1,  2,  3,  4],
[ 5,  6,  7,  8,  9]],

[[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]]])
arr[0][0][0]
0
arr[1][0][4]
14


We can also use indexing to reassign elements of an array.

arr = np.arange(10,15)
arr
# array([10, 11, 12, 13, 14])
arr[1] = 20
arr
# array([10, 20, 12, 13, 14])


## Slice Indexing

To retrieve a single value, our indices need to resolve all dimensions of the array and arrive at a single value. Whenever one dimension remains unspecified, we get an array (array view technically).

arr = np.array([[[ 0,  1,  2,  3,  4],
[ 5,  6,  7,  8,  9]],
[[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]]])
arr[0, 1]
array([5, 6, 7, 8, 9])
arr[1]
array([[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]])


To take an entire dimension we can use the colon.

arr = np.array([[[ 0,  1,  2,  3,  4],
[ 5,  6,  7,  8,  9]],
[[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]]])
arr[0, :, 0]
array([0, 5])
arr[:, 0, 0]
array([ 0, 10])


The colon is very useful for indexing in general, because it allows us to take a slice of values instead of a single value. The syntax of the slice follows start:stop:step. If we leave out start, the slice starts at 0. If we leave out stop, it goes to the end of the dimension. If we leave out step, the step defaults to 1.

arr = np.array([[[ 0,  1,  2,  3,  4],
[ 5,  6,  7,  8,  9]],
[[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]]])
arr[0, 0, 1:5:2]
# array([1, 3])
arr[0, 0, 1:4]
# array([1, 2, 3])
arr[0, 0, 1:]
# array([1, 2, 3, 4])
arr[0, 0, :3]
# array([0, 1, 2])
arr[0, 0, :]
# array([0, 1, 2, 3, 4])


## Index Array

So far we learned that we can use integers and slices for indexing. Now we learn that we can also use arrays to index into an array. When we use an array to index, that array has to either contain integers or boolean values. Lets take a look at integer array indexing first.

arr = np.arange(10,50,3)
idc = np.arange(5)
idc.dtype
dtype('int32')
arr[idc]
array([10, 13, 16, 19, 22])
idc = np.arange(5,8)
arr[idc]
array([25, 28, 31])
idc = np.array([1,2,4])
arr[idc]
array([13, 16, 22])


Note that in the examples where we generate index arrays with arange, we could achieve the same result with a slice as shown above and save one line of code. Integer arrays are most useful when they are generated by a process that is more complicated than the arange method. One example is the np.argwhere method we will learn more about in a later post.

## Boolean Array

Boolean arrays also deserve at least one post of their own but here I will give you a teaser. We only want to retrieve those values, that satisfy a larger than condition.

arr = np.array([[[ 0,  1,  2,  3,  4],
[10, 11, 12, 13, 14]],
[[5,  6,  7,  8,  9],
[15, 16, 17, 18, 19]]])
boolean_idc = arr > 10
boolean_idc
array([[[False, False, False, False, False],
[False,  True,  True,  True,  True]],

[[False, False, False, False, False],
[ True,  True,  True,  True,  True]]])
arr[boolean_idc]
array([11, 12, 13, 14, 15, 16, 17, 18, 19])



## Summary

We learned that indexing is useful to retrieve values and reassign parts of an array. There are several ways to index. First, we can use single integers to get to an element of a certain dimension. We can also use slices with the colon syntax start:stop:step to get at a sequence of elements. Furthermore, there are two advances indexing techniques, where we can use arrays containing integers or booleans to find an arbitrary collection of elements.

• Broadcasting is triggered when an arithmetic operation is done on two arrays of different shape
• The goal of broadcasting is to make both arrays the same shape by performing transformations on the shape of the smaller array
• Once arrays have the same shape, the operation is applied element-wise
• If the arrays cannot be broadcast an error is raised

When we try to add two arrays together with the plus operator, addition is performed element-wise. That means, each element is added to a corresponding element is the other array. However, this only works when both arrays have the same shape. If two arrays have different shapes, a process called broadcasting tries to resolve the difference between the arrays by performing a series of transformations on the shape of the array with lower dimensionality. To understand broadcasting we need to understand the steps broadcasting performs. Lets look at a quick example.

import numpy as np
arr_one = np.array([[4, 3, 2, 5, 6, 2],
[30, 34, 1, 50, 60, 56],
[22, 34, 32, 21, 12, 6]])
arr_two = np.array([1, 10, 20, 30, 40, 50])
arr_one.shape
(3, 6)
arr_two.shape
(6,)
arrs_plus = arr_one + arr_two
arrs_plus
array([[  5,  13,  22,  35,  46,  52],
[ 31,  44,  21,  80, 100, 106],
[ 23,  44,  52,  51,  52,  56]])


This one works despite both arrays having different shapes, even different number of elements. This next one does not work under seemingly similar circumstances.

arr_one = np.array([[4, 3, 2, 5, 6, 2],
[30, 34, 1, 50, 60, 56],
[22, 34, 32, 21, 12, 6]])
arr_two = np.array([4, 40, 20])
arr_one.shape
(3, 6)
arr_two.shape
(3,)
arrs_plus = arr_one + arr_two
ValueError: operands could not be broadcast together with shapes (3,6) (3,)


What happened here? In the first example we add an array of shape (6,) to an array of shape (3, 6) and it works. In the second example we add an array of shape (3,) to a (3, 6) array and get an error. In both examples, the arrays have different shapes. Therefor, broadcasting is triggered and to understand what happens we need to understand the broadcasting sequence. Lets first work through the working example.

# Broadcasting rules in order
"""
Rule #1: The array with fewer dimensions is broadcast to match
Rule #2: Array shapes are aligned to the right.
(3, 6)
(,6)
Rule #3: All array dimensions must be equal or one
Here 6 is equal to six, so we don't get an error and continue
Rule #4: Array dimensions are expanded in the leftward direction
(6, 3)
(1, 3)
Rule #5: Array dimensions of size 1 are duplicated to match.
(6, 3)
(6, 3)
Done. Array operation can now be executed element wise.
"""


These are the five broadcasting rules that are followed in order. They explain why operations between a (3,) and a (3, 6) array fail. After aligning both array shapes we encounter a problem. 3 does not equal 6, so rule #3 is violated and gives us the ValueError. There are several ways to make this operation work. However, it is best to first understand array indexing before delving into those. We will learn about array indexing in the next post.

## Summary

If it weren’t for broadcasting we would have to manually convert the shape of arrays so that they are equal before we can perform arithmetic operations on them. Luckily, we learned that broadcasting always happens when we want to perform an operation on two arrays of different shapes. It tries to resolve the difference in shape by performing a series of steps on the smaller array. When broadcasting finishes successfully the operation can be performed element-wise. If broadcasting fails an error is raised.

# NumPy Arrays and Shape

• The same values can be stored in arrays with different shapes
• Array methods can perform different operations depending on the array shape
• The methods .reshape and .flatten change the shape of an array

## Introducing Array Shape

Any array has a shape and the shape of an array is important for what kind of operations we can perform. Array shape is sometimes hard to imagine, even for experienced programmers so let’s just look at some code.

import numpy as np
my_array = np.array([3, 2, 5, 6, 3, 4])
my_array.shape
(6,)
my_array_reshaped = my_array.reshape((2,3))
my_array_reshaped.shape
(2, 3)
my_array_reshaped
array([[3, 2, 5],
[6, 3, 4]])


Here we create an array with 6 elements and my_array.shape tells us that these 6 elements are arranged in a single dimension that has a length of 6. We then reshape the array with its .reshape method into an array with two rows and three columns. This doesn’t look immediately useful but imagine we did an experiment under control and experimental condition with three replicates each. You’d clearly want a structure that represents this. Also, we went from a vector to a matrix with just one line of code. The most important part of array shape is that we can perform array methods only on specific dimensions. To do so we just need to pass the axis argument.

my_array = np.array([[3, 2, 5],
[6, 3, 4]])
dim0_sum = my_array.sum(axis=0)
dim0_sum
array([9, 5, 9])
dim1_sum = my_array.sum(axis=1)
dim1_sum
array([10, 13])


Remember that we start out with a (2, 3) array, 2 rows and 3 columns. When we call sum(axis=0) on that array the 0th dimension is eliminated. The array goes from a (2, 3) shape to a (3, ) shape. It does so by calculating the sum across the 0th dimension. Likewise, when we pass sum(axis=1) the 1st dimension gets eliminated in the same way and the array becomes a (2, ) array. The same concept works of course for arrays of any dimension. But lets get back to array shapes. An array cannot be converted to any shape its shape and limit the shapes it can take.

my_array = np.arange(30)  # A (30,) array
my_array_reshaped = my_array.reshape((5,6))
my_array_reshaped.shape
(5, 6)
my_array_reshaped = my_array.reshape((5,7))
ValueError: cannot reshape array of size 30 into shape (5,7)


Converting from (30,) to (5, 7) didn’t work for one simple reason. 5 times 7 is 35, not 30. In other words, the new array has more elements than the original array and NumPy will not just invent new elements to make reshaping work. If the number of elements checks out, we can reshape not only to two-dimensional arrays but to any dimension.

my_array = np.arange(30)  # A (30,) array
my_array_reshaped = my_array.reshape((5, 2, 3))
my_array_reshaped.shape
(5, 2, 3)
my_array_reshaped
array([[[ 0,  1,  2],
[ 3,  4,  5]],

[[ 6,  7,  8],
[ 9, 10, 11]],

[[12, 13, 14],
[15, 16, 17]],

[[18, 19, 20],
[21, 22, 23]],

[[24, 25, 26],
[27, 28, 29]]])


Of course we can also reshape from higher to lower dimensions.

my_array = np.array([[3, 2, 5],
[6, 3, 4]])
my_array_reshaped = my_array.reshape((6,))
my_array_reshaped.shape
(6,)
my_array.shape
(2, 3)


If you want combine all dimensions into one single dimension, you can use the .flatten method.

my_array = np.arange(30)  # A (30,) array
my_array_reshaped = my_array.reshape((5, 2, 3))
my_array_flattened = my_array_reshaped.flatten()
my_array_flattened.shape
(30,)


## Why we need array shapes

We saw how to manipulate array shape and how array methods can use the shape of an array. Lets think a bit about the real world usage of array shape. Let’s say you are working on an image processing project. You are lucky and the images are already pre-processed in a way that each image has 64 pixels in both dimensions. So each image is an array of shape (64, 64) but your dataset consists of 1000 images. So you want your dataset to be stored as a (1000, 64, 64) array. But then your image processing project becomes a volume processing project. So each volume has 100 slices. So you need a (1000, 100, 64, 64) array. But wait. You are actually working on video files. There are 20000 frames for each volume. So you need a (1000, 20000, 100, 64, 64) array. It is rare that you will have to go beyond five dimensions, but you can. In several fields it is very easy to end up with five dimensional arrays (think fMRI).

## Summary

Here we learned that the shape of an array is useful to store high dimensional data meaningfully and to have array methods operate only on specific dimensions. The .reshape method is important to change the shape of an existing array and the .flatten method can collapse an array into a single dimension. In the next blog post we will learn about broadcasting. Broadcasting is a mechanisms that is triggered whenever we perform an arithmetic operation on two arrays of different shapes (dimensionality). If two arrays have identical shape the operation is performed element-wise. If they have different shapes broadcasting performs a series of transformations on the lower dimensional array to make both arrays identical in shape and finally perform the operation element-wise.

# Arithmetic Operations in NumPy

• NumPy arrays come with many useful methods
• All arithmetic operations that are used on arrays are performed element-wise
• NumPy code is almost always faster than native Python (.append is a notable exception)

NumPy arrays are so useful because they allow us to do math on them very efficiently. For example, NumPy arrays come with many useful methods. One such method is the sum method, which calculates the sum of all values in the array

import numpy as np
my_array = np.array([4, 3, 1])
my_array.sum()
8


There are many other methods like this and they are extremely useful. Here is a list of the most commonly used methods.

my_array = np.array([4, 3, 1])
my_array.sum()  # Calculate the sum array values
8
my_array.mean()  # Calculate the mean of array values
2.6666666666666665
my_array.std()  # Calculate the standard deviation of array values
1.247219128924647
my_array.max()  # Find the maximum value
4
my_array.min()  # Find the minimum value
1


To learn about all array methods you can call the dir() function on any array, which will list all its methods. Alternatively you can check out the documentation for the array https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html

Another useful property of arrays is that they do math when they appear together with any of the arithmetic operators (+, -, *, /, **, //, %).

my_array = np.array([4, 3, 1])
my_array_plus = my_array + 2
my_array_plus
array([6, 5, 3])


Here, the array appeared together with a scalar value, the single number 2. That number was added to each value. However, we can do the same thing with two arrays, if the have the same shape.

array_one = np.array([4, 3, 1])
array_two = np.array([1, 2, 4])
array_plus_array = array_one + array_two
array_plus_array
array([5, 5, 5])


In this case, addition is again performed element-wise. Each element in array_one is added to a corresponding element in array_two. The fact that the array performs useful math in this context might seem unremarkable but remember how the native Python list behaves.

list_one = [4, 3, 1]
list_two = [1, 2, 4]
list_plus_list = list_one + list_two
list_plus_list
[3, 2, 1, 1, 2, 4]
array_plus_array = np.array(list_one) + np.array(list_two)
array_plus_array
array([5, 5, 5])


If you are in full numerical computation mode this behavior of list might seem stupid to you. But remember: Python is a general purpose programming language and list is a general purpose container to store a sequence of objects. There could be anything in those lists and addition might not be a meaningful operation for those objects. This behavior always works, a list can be concatenated to another list regardless of the objects they store. That’s why we have NumPy. Python has to implement objects in a way that suits its general purpose. NumPy implements behavior in a way that we would expect while we do numerical stuff.

### A word on performance

This is one of the rare occasions where it is worthwhile to talk about performance. When you are getting started, I strongly recommend against thinking too much about performance. Write functioning code first, then worry about readability, maintainability, reproducibility etc. etc. and worry about performance last (trust me on this one). But some of you will be working with large amounts of data and you will be delighted to hear that NumPy is much faster than native Python.

my_array = np.random.rand(100000)  # A large array with 100000 elements
my_list = list(my_array)
timeit sum(my_list)
18.1 ms ± 801 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
timeit my_array.sum()
90.3 µs ± 6.86 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


The native Python version of sum is orders of magnitude slower than the NumPy version. You might have noticed that I created a very large array to demonstrate this. Actually the performance difference will increase with increasing array size, you can verify this for yourself. The take home message here is that whenever you can replace native Python with NumPy, you gain performance. But don’t worry about optimizing your NumPy code. One exception is the .append method, but more on that later.

### Summary

We learned two essential things and one kind of interesting side-note. The first essential lesson is that arrays come with many methods that allow us to do useful math. We learned some of those methods and as you keep working with NumPy those will become second nature. The second thing we learned is that arithmetic operators are applied element-wise to arrays. This means that a scalar value is applied to each element in an array and whenever two arrays of the same shape appear together with an operator each element is applied to each corresponding element. We will learn the details of array shapes in the next blog post. Finally, we also learned that NumPy code is almost always much faster than native Python code. This is good to know. However, especially in the beginning you should focus on anything but performance.

# Getting Started with NumPy

• The array is the central NumPy object
• Pass any sequence type object to the np.array() constructor to create an array
• Use functions like np.zeros, np.arange and np.linspace to create arrays
• Use np.random to create arrays with randomly generated values

NumPy is a Python package for numerical computing. Python is not specifically designed to deal with large amounts of data but NumPy can make data analysis both more efficient and readable than it is with pure Python. Without NumPy, we would simply store numbers in a list and perform operations on those numbers by looping through the list. NumPy brings us an object called the array, which is essential to anything data related in Python and most other data analysis packages in one way or another build on the NumPy array. Here we will learn several ways to create NumPy arrays but first let’s talk about installing NumPy.

## Setting up NumPy

I highly recommend installing Python with a data science platform such as https://www.anaconda.com/ that comes with NumPy and other science critical packages.
To find out if you already have NumPy installed with your distribution try to import it

import numpy as np


If that does not work, try to install NumPy with the package installer for Python (pip) by going to your commdand line. There try:

pip install numpy

Finally, you can take a look at the docs for installation instructions. https://scipy.org/install.html

## Three ways to create arrays

Now let’s create our first array. An array is a sequence of numbers so we can convert any Python sequence to an array. One of the most commonly used Python sequence is the list. To convert a Python list to an array we simply pass a list to the numpy array constructor

import numpy as np
my_list = [4, 2, 7, 9]
my_array = np.array(my_list)


This creates a NumPy array with the entries 4, 2, 7, 9. We can do the same with a tuple.

my_tuple = (4, 2, 7, 9)
my_array = np.array(my_list)


Of course we can also convert nested sequences to arrays and it works exactly the same way.

my_nested_list = [[4, 2, 7, 9], [3, 2, 5, 8]]
my_array = np.array(my_nested_list)


This is the first way to create arrays. Pass a sequence to the np.array constructor. The second way is to use numpy functions to create arrays. One such function is np.zeros.

zeros = np.zeros((3, 4))
np.array([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]])


np. zeros gives us an array where each entry is 0 and it requires one argument: the shape of the array we want to get from it. Here we got an array with three rows and four columns, because we pass it the tuple (3, 4). This function is useful if you know how many values you need (the structure) but you do not know which values should be in there yet. So you can pre-initialize an all zero array and then assign the actual values to the array as you compute them. Another array creation function is called np.arange

arange= np.arange(5, 30, 2)
arange
array([ 5,  7,  9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29])


np.arange gives us a sequence starting at 5, stopping at 30 (not including 30) and going in steps of 2. This function is very useful to generate sequences that can be used to index into another array. We will learn more about indexing in a future blog post. A function very similar to np.arange is np.linspace.

linspace = np.linspace(5, 29, 13)
linspace
array([ 5.,  7.,  9., 11., 13., 15., 17., 19., 21., 23., 25., 27., 29.])


Instead of taking the step size between values, linspace takes the number of entries in the output array. Also, the final value is inclusive (29 is the final value). Finally the third way to generate numpy arrays is with the np.random module. First, lets look at np.random.randint

randint = np.random.randint(5, 30, (3,4))
array([[26, 17, 26, 24],
[20, 16, 29, 25],
[25, 21, 26, 26]])


This creates an array containing random integers between 5 and 30 (non-inclusive) with 3 rows and 4 columns. If you try this code at home the values of your array will (most probably) look different but the shape should be the same. Finally lets look at np.random.randn

randn = np.random.randn(4,5)  # Random numbers from normal distribution
randn
array([[-2.34229894, -1.43985814, -0.51260701, -2.58213476,  1.61196437],
[-0.69767456, -0.0950676 , -0.22415381, -0.90219875,  0.33513859],
[ 0.56432586, -1.62877834, -0.60056852,  1.37310251, -1.20494281],
[-0.20589457,  1.34870661, -0.89139339, -0.40300812, -0.15703367]])


np.random.randn gives us an array with numbers randomly drawn from the standard normal distribution, a gaussian distribution with mean of 0 and variance 1. Each argument we pass to the function creates another dimension. In this case we get 4 rows and 5 columns.

## Summary

We learned how to create arrays, the central NumPy object. Working with NumPy means to work with arrays and now that we know how to create them we are well prepared to get working. In the next blog post we will take a look at some of the basic arithmetic functions we can perform on arrays and show that they are both more efficient and readable than Python builtin functions.