NumPy Arrays and Shape

  • The same values can be stored in arrays with different shapes
  • Array methods can perform different operations depending on the array shape
  • The methods .reshape and .flatten change the shape of an array

Introducing Array Shape

Every array has a shape, and the shape determines which operations we can perform on it. Array shape can be hard to picture, even for experienced programmers, so let’s look at some code.

import numpy as np
my_array = np.array([3, 2, 5, 6, 3, 4])
my_array.shape
(6,)
my_array_reshaped = my_array.reshape((2,3))
my_array_reshaped.shape
(2, 3)
my_array_reshaped
array([[3, 2, 5],
       [6, 3, 4]])

Here we create an array with 6 elements, and my_array.shape tells us that these 6 elements are arranged in a single dimension of length 6. We then use the .reshape method to turn it into an array with two rows and three columns. This might not look immediately useful, but imagine we ran an experiment with a control and an experimental condition and three replicates each. You’d clearly want a structure that represents this. Also, we went from a vector to a matrix with just one line of code. The most important consequence of array shape is that we can apply array methods along specific dimensions only. To do so we just need to pass the axis argument.

my_array = np.array([[3, 2, 5],
                     [6, 3, 4]])
dim0_sum = my_array.sum(axis=0)
dim0_sum
array([9, 5, 9])
dim1_sum = my_array.sum(axis=1)
dim1_sum
array([10, 13])

Remember that we start out with a (2, 3) array: 2 rows and 3 columns. When we call sum(axis=0) on that array, the 0th dimension is eliminated. The array goes from a (2, 3) shape to a (3,) shape. It does so by calculating the sum across the 0th dimension, that is, by adding up the two rows within each column. Likewise, when we call sum(axis=1) the 1st dimension is eliminated in the same way and the result is a (2,) array. The same concept of course works for arrays with any number of dimensions. But let’s get back to array shapes. An array cannot be converted to just any shape: its number of elements limits the shapes it can take.

my_array = np.arange(30)  # A (30,) array
my_array_reshaped = my_array.reshape((5,6))
my_array_reshaped.shape
(5, 6)
my_array_reshaped = my_array.reshape((5,7))
ValueError: cannot reshape array of size 30 into shape (5,7)

Converting from (30,) to (5, 7) didn’t work for one simple reason: 5 times 7 is 35, not 30. In other words, the new array would have more elements than the original array, and NumPy will not invent new elements to make reshaping work. If the number of elements checks out, we can reshape not only to two-dimensional arrays but to any number of dimensions.

my_array = np.arange(30)  # A (30,) array
my_array_reshaped = my_array.reshape((5, 2, 3))
my_array_reshaped.shape
(5, 2, 3)
my_array_reshaped
array([[[ 0,  1,  2],
        [ 3,  4,  5]],

       [[ 6,  7,  8],
        [ 9, 10, 11]],

       [[12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23]],

       [[24, 25, 26],
        [27, 28, 29]]])
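
The axis argument introduced earlier works the same way on this three-dimensional array. As a small sketch (the variable names are only illustrative), summing along one axis removes that axis from the shape:

```python
import numpy as np

my_array = np.arange(30).reshape((5, 2, 3))

# Summing along an axis removes that axis from the shape.
sum_axis0 = my_array.sum(axis=0)  # collapses the 5 blocks
sum_axis1 = my_array.sum(axis=1)  # collapses the 2 rows in each block
sum_axis2 = my_array.sum(axis=2)  # collapses the 3 columns in each row

print(sum_axis0.shape)  # (2, 3)
print(sum_axis1.shape)  # (5, 3)
print(sum_axis2.shape)  # (5, 2)
```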

Of course we can also reshape from higher to lower dimensions.

my_array = np.array([[3, 2, 5],
                     [6, 3, 4]])
my_array_reshaped = my_array.reshape((6,))
my_array_reshaped.shape
(6,)
my_array.shape
(2, 3)

If you want to combine all dimensions into one single dimension, you can use the .flatten method.

my_array = np.arange(30)  # A (30,) array
my_array_reshaped = my_array.reshape((5, 2, 3))
my_array_flattened = my_array_reshaped.flatten()
my_array_flattened.shape
(30,)
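
One detail worth knowing is that .flatten returns a copy of the data. The sketch below, with made-up variable names, shows that changing the flattened array leaves the original alone. It also shows reshape(-1), where -1 asks NumPy to work out the length of the single dimension itself:

```python
import numpy as np

original = np.arange(6).reshape((2, 3))
flattened = original.flatten()

# .flatten returns a copy, so writing to it leaves the original untouched.
flattened[0] = 99

print(flattened[0])    # 99
print(original[0, 0])  # 0

# reshape(-1) lets NumPy infer the length of the single dimension.
inferred = original.reshape(-1)
print(inferred.shape)  # (6,)
```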

Why we need array shapes

We saw how to manipulate array shape and how array methods can use the shape of an array. Let’s think a bit about real-world uses of array shape. Say you are working on an image processing project. You are lucky and the images are already pre-processed so that each image has 64 pixels in both dimensions. Each image is therefore an array of shape (64, 64), but your dataset consists of 1000 images, so you want to store it as a (1000, 64, 64) array. But then your image processing project becomes a volume processing project. Each volume has 100 slices, so you need a (1000, 100, 64, 64) array. But wait. You are actually working on video files. There are 20000 frames for each volume, so you need a (1000, 20000, 100, 64, 64) array. It is rare that you will have to go beyond five dimensions, but you can. In several fields it is very easy to end up with five-dimensional arrays (think fMRI).
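
To make the image example a bit more concrete, here is a minimal sketch with a much smaller stand-in dataset. The 10 random images and all variable names are assumptions for illustration only. Because the image index is its own axis, we can choose whether to average within each image or across images:

```python
import numpy as np

# A small stand-in dataset: 10 images of 64 x 64 pixels (random values).
n_images = 10
dataset = np.random.rand(n_images, 64, 64)

# Averaging over the two pixel axes leaves one mean value per image.
mean_per_image = dataset.mean(axis=(1, 2))

# Averaging over the image axis leaves one mean image.
mean_image = dataset.mean(axis=0)

print(mean_per_image.shape)  # (10,)
print(mean_image.shape)      # (64, 64)
```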

Summary

Here we learned that the shape of an array is useful for storing high-dimensional data meaningfully and for having array methods operate only on specific dimensions. The .reshape method changes the shape of an existing array, and the .flatten method collapses an array into a single dimension. In the next blog post we will learn about broadcasting. Broadcasting is a mechanism that is triggered whenever we perform an arithmetic operation on two arrays of different shapes. If two arrays have identical shapes, the operation is performed element-wise. If they have different shapes, broadcasting performs a series of transformations on the lower-dimensional array to make both arrays identical in shape and then performs the operation element-wise.
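
As a small preview of broadcasting, and only a sketch with illustrative names, adding a (3,) array to a (2, 3) array works because the shorter array is reused for every row:

```python
import numpy as np

matrix = np.array([[3, 2, 5],
                   [6, 3, 4]])   # shape (2, 3)
row = np.array([10, 20, 30])     # shape (3,)

# The (3,) array is stretched across both rows of the (2, 3) array.
result = matrix + row

print(result.shape)     # (2, 3)
print(result.tolist())  # [[13, 22, 35], [16, 23, 34]]
```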

Arithmetic Operations in NumPy

  • NumPy arrays come with many useful methods
  • All arithmetic operations that are used on arrays are performed element-wise
  • NumPy code is almost always faster than native Python (np.append is a notable exception)

NumPy arrays are so useful because they allow us to do math on them very efficiently. For example, NumPy arrays come with many useful methods. One such method is the sum method, which calculates the sum of all values in the array.

import numpy as np
my_array = np.array([4, 3, 1])
my_array.sum()
8

There are many other methods like this and they are extremely useful. Here is a list of the most commonly used methods.

my_array = np.array([4, 3, 1])
my_array.sum()  # Calculate the sum of array values
8
my_array.mean()  # Calculate the mean of array values
2.6666666666666665
my_array.std()  # Calculate the standard deviation of array values
1.247219128924647
my_array.max()  # Find the maximum value
4
my_array.min()  # Find the minimum value
1

To learn about all array methods you can call the dir() function on any array, which will list all its attributes and methods. Alternatively, you can check out the documentation for the array: https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html
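
The output of dir() is long because it also includes special names that start with an underscore. One way to make it easier to read, sketched here with an illustrative variable name, is to keep only the public names:

```python
import numpy as np

my_array = np.array([4, 3, 1])

# dir() returns a list of strings; drop the names starting with "_".
public_names = [name for name in dir(my_array) if not name.startswith("_")]

print(len(public_names))
print("sum" in public_names, "mean" in public_names)  # True True
```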

Another useful property of arrays is that they do math when they appear together with any of the arithmetic operators (+, -, *, /, **, //, %).

my_array = np.array([4, 3, 1])
my_array_plus = my_array + 2
my_array_plus
array([6, 5, 3])

Here, the array appeared together with a scalar value, the single number 2. That number was added to each value. However, we can do the same thing with two arrays, if they have the same shape.

array_one = np.array([4, 3, 1])
array_two = np.array([1, 2, 4])
array_plus_array = array_one + array_two
array_plus_array
array([5, 5, 5])

In this case, addition is again performed element-wise. Each element in array_one is added to a corresponding element in array_two. The fact that the array performs useful math in this context might seem unremarkable but remember how the native Python list behaves.

list_one = [4, 3, 1]
list_two = [1, 2, 4]
list_plus_list = list_one + list_two
list_plus_list
[4, 3, 1, 1, 2, 4]
array_plus_array = np.array(list_one) + np.array(list_two)
array_plus_array
array([5, 5, 5])

If you are in full numerical computation mode, this behavior of list might seem stupid to you. But remember: Python is a general-purpose programming language and list is a general-purpose container for storing a sequence of objects. There could be anything in those lists, and addition might not be a meaningful operation for those objects. Concatenation, on the other hand, always works: a list can be concatenated to another list regardless of the objects they store. That’s why we have NumPy. Python has to implement objects in a way that suits its general purpose. NumPy implements the behavior we would expect when doing numerical work.
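
The same contrast shows up with the other operators. The sketch below uses illustrative variable names: multiplying a list repeats it, while every operator applied to an array works element-wise.

```python
import numpy as np

my_list = [4, 3, 1]
my_array = np.array(my_list)

# Lists repeat, arrays multiply element-wise.
list_times_two = my_list * 2
array_times_two = (my_array * 2).tolist()
print(list_times_two)   # [4, 3, 1, 4, 3, 1]
print(array_times_two)  # [8, 6, 2]

# The remaining operators are element-wise as well.
print((my_array ** 2).tolist())  # [16, 9, 1]
print((my_array // 2).tolist())  # [2, 1, 0]
print((my_array % 2).tolist())   # [0, 1, 1]
print((my_array / 2).tolist())   # [2.0, 1.5, 0.5]
```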

A word on performance

This is one of the rare occasions where it is worthwhile to talk about performance. When you are getting started, I strongly recommend against thinking too much about performance. Write functioning code first, then worry about readability, maintainability, reproducibility etc. etc. and worry about performance last (trust me on this one). But some of you will be working with large amounts of data and you will be delighted to hear that NumPy is much faster than native Python.

my_array = np.random.rand(100000)  # A large array with 100000 elements
my_list = list(my_array)
%timeit sum(my_list)
18.1 ms ± 801 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit my_array.sum()
90.3 µs ± 6.86 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The native Python version of sum is orders of magnitude slower than the NumPy version. You might have noticed that I created a very large array to demonstrate this. In fact, the performance difference grows with array size; you can verify this for yourself. The take-home message is that whenever you can replace native Python with NumPy, you usually gain performance. But don’t worry about optimizing your NumPy code. One exception is np.append, but more on that later.
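
If you are not working in IPython, the standard library timeit module can run the same comparison. The following is only a sketch: the helper name compare_sum_timings and the chosen sizes are assumptions, and the exact numbers will differ from machine to machine.

```python
import timeit

import numpy as np


def compare_sum_timings(size, repeats=10):
    """Return (list_seconds, array_seconds) for summing `size` values."""
    my_array = np.random.rand(size)
    my_list = list(my_array)
    list_seconds = timeit.timeit(lambda: sum(my_list), number=repeats)
    array_seconds = timeit.timeit(lambda: my_array.sum(), number=repeats)
    return list_seconds, array_seconds


# Print how many times slower the built-in sum is for each size.
for size in (1_000, 10_000, 100_000):
    list_seconds, array_seconds = compare_sum_timings(size)
    print(size, list_seconds / array_seconds)
```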

Summary

We learned two essential things and one kind of interesting side note. The first essential lesson is that arrays come with many methods that allow us to do useful math. We learned some of those methods, and as you keep working with NumPy they will become second nature. The second thing we learned is that arithmetic operators are applied element-wise to arrays. This means that a scalar value is combined with each element in an array, and whenever two arrays of the same shape appear together with an operator, the operator is applied to each pair of corresponding elements. We will learn the details of array shapes in the next blog post. Finally, we also learned that NumPy code is almost always much faster than native Python code. This is good to know. However, especially in the beginning you should focus on anything but performance.

Getting Started with NumPy

  • The array is the central NumPy object
  • Pass any sequence type object to the np.array() constructor to create an array
  • Use functions like np.zeros, np.arange and np.linspace to create arrays
  • Use np.random to create arrays with randomly generated values

NumPy is a Python package for numerical computing. Python is not specifically designed to deal with large amounts of data, but NumPy can make data analysis both more efficient and more readable than it is with pure Python. Without NumPy, we would simply store numbers in a list and perform operations on those numbers by looping through the list. NumPy brings us an object called the array, which is essential to anything data-related in Python; most other data analysis packages build on the NumPy array in one way or another. Here we will learn several ways to create NumPy arrays, but first let’s talk about installing NumPy.

Setting up NumPy

I highly recommend installing Python with a data science platform such as https://www.anaconda.com/ that comes with NumPy and other science critical packages.
To find out if you already have NumPy installed with your distribution, try to import it:

import numpy as np

If that does not work, try to install NumPy with the package installer for Python (pip) from your command line:

pip install numpy

Finally, you can take a look at the docs for installation instructions. https://scipy.org/install.html
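
To check the installation from a script without triggering an ImportError, one option is the standard library importlib module. This is just a sketch, and the variable name numpy_spec is illustrative:

```python
import importlib.util

# find_spec returns None when the package cannot be found.
numpy_spec = importlib.util.find_spec("numpy")

if numpy_spec is None:
    print("NumPy is not installed")
else:
    import numpy as np
    print("NumPy version:", np.__version__)
```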

Three ways to create arrays

Now let’s create our first array. An array is a sequence of numbers so we can convert any Python sequence to an array. One of the most commonly used Python sequences is the list. To convert a Python list to an array we simply pass the list to the NumPy array constructor.

import numpy as np
my_list = [4, 2, 7, 9]
my_array = np.array(my_list)

This creates a NumPy array with the entries 4, 2, 7, 9. We can do the same with a tuple.

my_tuple = (4, 2, 7, 9)
my_array = np.array(my_tuple)

Of course we can also convert nested sequences to arrays and it works exactly the same way.

my_nested_list = [[4, 2, 7, 9], [3, 2, 5, 8]]
my_array = np.array(my_nested_list)
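
The nesting of the input sequence decides the shape of the resulting array. As a quick sketch with illustrative variable names, we can inspect the .shape and .ndim attributes of both results:

```python
import numpy as np

my_list = [4, 2, 7, 9]
my_nested_list = [[4, 2, 7, 9], [3, 2, 5, 8]]

flat_array = np.array(my_list)
nested_array = np.array(my_nested_list)

# A flat list gives one dimension, a list of lists gives two.
print(flat_array.shape, flat_array.ndim)      # (4,) 1
print(nested_array.shape, nested_array.ndim)  # (2, 4) 2
```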

This is the first way to create arrays: pass a sequence to the np.array constructor. The second way is to use NumPy functions to create arrays. One such function is np.zeros.

zeros = np.zeros((3, 4))
zeros
array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
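
A typical use of np.zeros is to reserve space first and fill in the values later. The sketch below is only an illustration; the squares being computed and the variable names are made up for the example.

```python
import numpy as np

n_values = 5
squares = np.zeros(n_values)  # placeholder values, all 0.0

# Fill the array one entry at a time as the values are computed.
for i in range(n_values):
    squares[i] = i ** 2

print(squares.tolist())  # [0.0, 1.0, 4.0, 9.0, 16.0]
```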

np.zeros gives us an array where each entry is 0, and it requires one argument: the shape of the array we want to get from it. Here we got an array with three rows and four columns, because we passed it the tuple (3, 4). This function is useful if you know how many values you need (the structure) but do not yet know which values should be in there. You can pre-initialize an all-zero array and then assign the actual values to it as you compute them. Another array creation function is called np.arange.

arange = np.arange(5, 30, 2)
arange
array([ 5,  7,  9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29])

np.arange gives us a sequence starting at 5, stopping at 30 (not including 30) and going in steps of 2. This function is very useful to generate sequences that can be used to index into another array. We will learn more about indexing in a future blog post. A function very similar to np.arange is np.linspace.

linspace = np.linspace(5, 29, 13)
linspace
array([ 5.,  7.,  9., 11., 13., 15., 17., 19., 21., 23., 25., 27., 29.])

Instead of taking the step size between values, linspace takes the number of entries in the output array. Also, the final value is inclusive (29 is the final value). Finally, the third way to generate NumPy arrays is with the np.random module. First, let’s look at np.random.randint.

randint = np.random.randint(5, 30, (3, 4))
randint
array([[26, 17, 26, 24],
       [20, 16, 29, 25],
       [25, 21, 26, 26]])

This creates an array with 3 rows and 4 columns containing random integers from 5 (inclusive) up to 30 (exclusive). If you try this code at home the values of your array will (most probably) look different but the shape should be the same. Finally, let’s look at np.random.randn.

randn = np.random.randn(4,5)  # Random numbers from normal distribution
randn
array([[-2.34229894, -1.43985814, -0.51260701, -2.58213476,  1.61196437],
       [-0.69767456, -0.0950676 , -0.22415381, -0.90219875,  0.33513859],
       [ 0.56432586, -1.62877834, -0.60056852,  1.37310251, -1.20494281],
       [-0.20589457,  1.34870661, -0.89139339, -0.40300812, -0.15703367]])

np.random.randn gives us an array with numbers randomly drawn from the standard normal distribution, a Gaussian distribution with a mean of 0 and a variance of 1. Each argument we pass to the function adds another dimension. In this case we get 4 rows and 5 columns.
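
If you want the same random values on every run, for example to reproduce a result, you can set a seed before drawing. This is a minimal sketch; the seed value 42 is an arbitrary choice and the variable names are illustrative.

```python
import numpy as np

np.random.seed(42)  # 42 is an arbitrary choice
first_draw = np.random.randint(5, 30, (3, 4))

np.random.seed(42)  # resetting the seed replays the same sequence
second_draw = np.random.randint(5, 30, (3, 4))

print(np.array_equal(first_draw, second_draw))       # True
print(first_draw.min() >= 5, first_draw.max() < 30)  # True True
```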

Summary

We learned how to create arrays, the central NumPy object. Working with NumPy means working with arrays, and now that we know how to create them we are well prepared to get started. In the next blog post we will take a look at some of the basic arithmetic operations we can perform on arrays and show that they are both more efficient and more readable than Python built-in functions.