What’s in a Name?

  • To store an object, we assign it a name using the equal sign
  • Names must not start with a number and they cannot contain special characters (except for underscores)
  • When you choose names, be consistent and descriptive

Names

You have probably heard of variables. In mathematics, a variable is a placeholder for something that is not fully defined just yet. In most programming languages, we say: “We define the variable”. In Python we rarely talk about variables. We instead talk about names. We would say: “We assign a name”. Specifically, we assign a name to an object. To do so, we use the equal sign. While the vocabulary is slightly different, the result is very similar in all programming languages. Let’s look at an example.

my_name = 'Daniel'
my_name
# 'Daniel'

Here we assign the name my_name to the string object 'Daniel'. From now on we can use the name to refer to the object. We can perform operations on the name and Python will replace the name with the object it refers to.

my_name = 'Daniel'
my_name * 5
# 'DanielDanielDanielDanielDaniel'

The multiplication operator applited to a string concatenates that array multiple times. Here we get the same string five times. Note, that we use the name, instead of the literal object. We can also reassign the same name to a different object.

my_name = 'Daniel'
my_name
# 'Daniel'
my_name = 5 * my_name
my_name
# 'DanielDanielDanielDanielDaniel'

First, my_name refers to 'Daniel', then it refers to the result of the operation
my_name * 5, which is 'DanielDanielDanielDanielDaniel'. When we choose a name, there are very few rules that we must follow. However, there are many more rules that we definitely should follow. We take a look at the mandatory rules first.

Naming Rules

When we choose a name, there are many things to consider. But there are also some things that we are not allowed to do. For example a name cannot start with a number.

1st_name = 5
# SyntaxError: invalid syntax
first_name = 5
first_name
# 5
name_1 = 5
name_1
# 5

In other places of the name, numbers are allowed. Special characters are not allowed at any place of a name.

name$one = 5
# SyntaxError: invalid syntax

The only special character that is allowed is the underscore. Underscores are conventionally used in Python to structure names visually.

division_and_multiplication = 5 / 10 * 2
division_and_multiplication
# 1.0

Besides the visually structuring names, underscores also have special meaning when leading a name. For example, all objects in Python have special double underscore methods. We will learn all about them later.

Good Naming Practices

If you are just getting started with Python, you probably have a lot of things on your mind. Actually, if you follow the above rules, you can choose any name you want. However, it is very important to learn good naming practices in Python. Rule number one is consistency. Avoid wildly different ways to visually structure names within the same project. Don’t start with my_name and end with MyName. If you stay consistent within your projectsyou are already ahead of the curve. For beginners I recommend making names all lower case and structuring only using underscores. Also avoid numbers. Write name_one instead of name_1. Also, don’t be afraid of making names long. It is more important names are descriptive than short. length_of_segment is much better than los, even if there is more typing involved. Don’t be afraid to type. Be afraid of coming back to your code one year after you wrote it and trying to read completely unreadable, meaningless names.

Those are enough rules for now. If you want to adopt a perfect style right from the beginning you should read the official style guide for python called PEP8. There are generally no laws about Python style but this is the most authoritative document on coding style in Python and many people reading your code will expect you to follow the guidelines in this document. If you follow the few rules I describe above, you are already following the most important practices regarding naming.

Summary

To store objects we assign them a name. We can do that with the equal sign. The name is on the left, the object is on the right. name = object Needless to say, we’ll be spending a lot of time assigning names. Once a name is assigned, we can use the name instead of the object. Python will replace the name with the object in any operation. There are very few mandatory rules when choosing a name. For one, names cannot start with a number. Second, special characters, except for the underscore (_) are forbidden in names. There are some more rules that are good to follow. The most important one is to be consistent. Do not arbitrarily change your style within a project. If you do this you are already doing well.

Doing the Math with Python

  • We can use Python as a basic calculator for addition, subtraction, multiplication, division and exponent
  • We can use the equal sign to save results
  • The two major numeric types are floating point (float) and integer (int)

Python as a Calculator

Computers are good at math and so is Python. We can use Python like a calculator with the standard arithmetic operators for addition (+), subtraction (-), multiplication (*), division (/) and exponent (**).

An IPython console running in Spyder. Shown are the standard arithmetic operators for addition, subtraction, multiplication, division and exponent.

When we combine multiple operators on one line they are executed in the order we know from school. First comes the exponent. Then multiplication and division. Finally addition and subtraction.

2 ** 2 * 5  # Exponent precedes multiplication
# 20
2 ** 2 / 5  # Exponent precedes division
# 0.8
2 * 2 + 5  # Multiplication precedes addition
# 9
2 * 2 - 5  # Multiplication precedes subtraction
# -1
2 / 2 + 5  # Division precedes addition
# 6.0
2 / 2 - 5  # Division precedes subtraction
# 0.2857142857142857

If we want to change the order of operation, we can use parentheses. Let’s use them to see how the results above would look if the order of operation was reversed.

2 ** (2 * 5)
# 1024
2 ** (2 / 5)
# 1.3195079107728942
2 * (2 + 5)
# 14
2 * (2 - 5)
# -6
2 / (2 + 5)
# 0.2857142857142857
2 / (2 - 5)
# -0.6666666666666666

Parentheses can do many different things in Python, which we will talk about later. When used with arithmetic operators they change the order of operation. Whatever appears in parentheses is resolved first. Many calculators have a basic memory function and we can easily save results in Python.

Saving Results

To save results of mathematical operations we can use the equal (=) operator. On the left we put a name that we want to assign the result to and on the right we put the operation that we want to save.

save = 5 + 7
save
# 12

In this example we assign the result 12 to the name save. Once we save a result under a name we can reuse it in an arithmetic expression. If a name stores a number it will just be evaluated as a number when it appears with a mathematical operator.

save = 5 + 7
10 + save
# 22

The equal operator is essential in Python. Here we are using it to save the result of a mathematical operation. Later we will learn that it is used to assign any kind of object to a name. A name is what allows us to refer to an object. Objects are very important in Python. We will learn all about that later but even in the seemingly simple realm of numbers there are different kinds of objects

Floating Point Numbers and Integers

You may have noticed that sometimes numbers have a dot as a decimal delimiter and some numbers don’t. That is because numbers can be floating point numbers (float for short) or integers (int for short). Those are different numerical types. What kind of number we get depends on the arithmetic operation used and the types of numbers involved. To create a specific type of number we can simply add the dot (.) to the number. To check what kind of object we created we can use the type() function.

integer = 3
type(integer)
# int
floating = 3.0
type(floating)
# float

So 3.0 generates a float and 3 generates an integer. In Python3 (the Python you should be using!) the division operator (/) always generates a floating point number, regardless of the numbers involved.

div = 4 / 2
div
# 2.0
type(div)
# float

In this case, both numbers are integers (four and two) and four can be divided by two without remainder. Division returns a float anyway. With multiplication, addition and subtraction, the situation is slightly more complicated. The type of the result depends on the type of the numbers involved. If one of the numbers is a float, the result will be float.

integer = 4 * 2
integer
# 8
type(integer)
# int
floating = 4 * 2.0
floating
# 8.0
type(floating)
# float

For now this seems like a pointless excercise. Mathematically there is no difference between 8.0 and 8. Later we will learn why the types of objects matters. Even the difference between a floating eight and an integer eight can be important. For example, when we index into a list, the number we use as an index must be an integer.

Summary

We learned about the most important mathematical operators, addition, subtraction, multiplication, division and exponent. When those operators appear together, operator precedence decides which operation is performed first. The precedence order is the same we learned in school. We can use parentheses to change the order of operation. Operations in parentheses are executed earlier. Furthermore, we learned that we can save results by assigning a name to them. We can assign a name with the equal operator. Finally, we learned that there are different numeric types. There are integers and floating point numbers. Why the difference between them matters will become more clear later.

The Scientists Guide to Python

  • Install Python as part of a data science platform such as Anaconda
  • Use an integrated development environment like Spyder (Installs with Anaconda)
  • Don’t reinvent the wheel, find out what already exists in your field

Python and Science

Python has been adopted with open arms by many scientific fields. A lot of its success is directly linked to the success of NumPy (the Python package for numerical computing) and the scientific Python community (SciPy community). If you want to understand the history of this success story, I recommend reading the recent nature methods paper which gives some strong background about the SciPy community. However, you don’t need to read the paper to use Python or any of the scientific packages.

A downside of Python is that it is a general purpose programming language. Python is used for anything you can think of. Web development, GUI design, game development, data mining, machine learning, computervision and so on. This can be intimidating to beginners. Especially to scientists who have a very specific task they want to automate. Compare this situation to MATLAB. Its purpose is very specific and is even part of the name. It is the MATrix LABoratory. It deals very efficiently with matrices and is generally good at math stuff. Both very valuable and important to most scientists. MATLAB is specifically designed to serve the science and engineering communities. Python on the other hand is for everyone, which can be a weakness in the beginning but can quickly turn into its greatest strength. There are no privileged tasks in Python, they are all handled well (if there is a great community behind the task maintaining it). The purpose of this guide is to ease some of the hurdles scientists face when diving into Python and highlight the many advantages.

The first problem we face is called package management. When we download pure Python from the official website (python.org), we get the core Python functionality. Python in itself does not include most of the features we depend on as scientists. Pythons functionality is organized in so called packages that we need to install if they are not built into Python. Alternatively we can install a pre-made Python distribution that includes scientific packages. This is my preferred way to installing Python and I strongly recommend beginners to start with Anaconda.

Installing Python with Anaconda

Anaconda is a data science platform that installs Python with most of the packages we need as scientists. It is open-source and completely free. When we install Anaconda we get lots of stuff. Most importantly we get Python, NumPy, Matplotlib, Conda and Spyder. We could install those things ourselves but this way we can have them without ever worrying about package management. We already talked about core Python and NumPy. Matplotlib is a package that allows us to plot our data. Conda is a package manager that allows us to update installed packages and install new packages. Finally, Spyder is an integrated development environment (IDE). Spyder stands for Scientific PYthon Development EnviRonment and it is exactly what we need to get started.

The Anaconda navigator. A graphical user interface that helps us find the main components of Anaconda. One of those is the Spyder IDE which you can start by simply clicking launch.

An IDE is extremely useful. For example, it helps us by highlighting important parts of our code visually. By running Spyder you have just setup your own IDE. This is your kingdom now. You will be able to do amazing things here and you should celebrate this moment properly. However, as you celebrate, my duty will be to guide you through the most important parts of this graphical user interface.

Spyder (Scientific PYthon Development EnviRonment). On the left we have the editor. Here we write scripts. Scripts are collections of code that can be execute all together in series when we hit the run button (green arrow). In the lower right is an IPython console. This is where scripts are executed interactively when we run them. We can also type code here and execute it immediately

Lets start with the interactive console in the lower right. Here, IPython is running and it is awaiting your commands. You can type code and run it by hitting enter. You can use it like a fancy calculator. Try 2+2, 2-2, 2*2, 2/2, 2**2. It’s all there. The interactive console is the perfect place to try out commands and see what they do. We can define variables here and import packages. Luckily we installed Anaconda, so we have NumPy already available. The conventional way to import NumPy is import numpy as np. To learn all about NumPy, find my NumPy blog series.

On the left is the editor. This is our code notebook. Here we edit files that store our code. Those files are commonly called scripts. When you hit enter here, code is not immediately executed. You are just moving to a new line. To execute the code here we can hit the green arrow (play button) above. When we do so, all lines of code in the script are executed from top to bottom. When the script finishes, the objects created in the script still exist in the interactive console. This is very useful to debug the script. We can use the interactive console to take a close look at what the script did.

The Power of Scripts

Most of your work will be about writing scripts that do something useful. Many times, there are other ways to solve the same task. Many things can be done by hand, clicking through the graphical user interface of another program. Sometimes this is still the best way, but there are many advantages to writing a script.

A script can be much faster than a human, especially on repeat tasks. When we have to rename 10 files in a certain way, we might decide that it is not worth to write a script. Let’s just do it quickly by hand. But the same task could come back later. This time with 100 files that require renaming. The most important thing is to pick your battles. We cannot automate everything. We will have to make tough decisions.

Another advantage of scripts is that they are a protocol of the analysis. If we can write a script that takes care of everything and is capable of analyzing the entire dataset, we know exactly how the analysis was performed. When we analyze manually we can achieve the same by very carefully writing down everything we do on each data point. However, this can easily fail if something about the analysis becomes important afterwards and it is not part of the protocol. Scripts are also more easily shared than a protocol because the script language (in our case Python) is less ambiguous.

Sometimes even the script fails as a definitive protocol. This is the case when we use packages that change functions. Lets assume we use a function called fastAnalysis from a fictional package called fancyCalc. The way fastAnalysis works changed between version 1.1 and 1.2 slightly. We only know what our script did if we remember which version we used at the time we ran the script. Here another advantage of scripts shows. We can pretty quickly run the same analysis again with different fancyCalc versions. If we get the same result as before this version is probably the one we ran previously. Manual analysis is usually much harder to repeat. Imagine you spent the last 2 month analyzing 200 samples. You had to normalize each sample for its corresponding baseline. How easy would it be to do the same analysis by hand with a slightly larger baseline interval? In a script this would usually involve changing a single number. By hand it could easily lead to another 2 months of work.

The Scripting Workflow

We talked at length about the advantages of scripts but how do we actually write one? For starters we need to know what we want to do. Then we need to look online for a function that we hope can achieve our goal. Then we try out the function. If it does what we want we move on to the next task. Usually we have one pretty large, intimidating task, such as “analyze this whole data set”. Then this task falls into smaller tasks. For example we will have to load our data, do preprocessing, extract regions of interest, quantify and so on. Those tasks fall apart into even smaller tasks. For preprocessing one of those tasks might be to subtract the background. Ideally, if we make our tasks small enough we will be less intimidated and we will find a function that performs this task.

In the age of the internet, programming is easy enough. Unfortunately, to become effective at it we will need to do some learning and get practice. Our goal is to become effective scripters. So we will need to learn three things: 1. We need to learn some basic Python. This will be easy, I promise. 2. We need to learn how to think computationally about our tasks. Only then will be be able to effectively divide entire projects into smaller, manageable tasks. 3. We will learn how to find, evaluate and use task specific packages quickly. Most tasks have already been solved by other people and we never want to reinvent the wheel (unless we have a vision for a really cool wheel that is much better than all the other wheels that already exist). My blog will lead you through all three steps.

What’s next?

The best thing you can do next is to start coding. Today. The amount of different programming languages, distributions of those languages, integrated development environments and resources to learn all of those is immense and can lead to real decision paralysis. You can spend forever trying to decide how to start but you already have everything you need. Install Anaconda, launch Spyder and just let lose. If you are not sure if Anaconda is right for you, install a distribution you think is more suitable. Install pure Python if you are feeling adventurous and just start hacking away. If you are not convinced Python is the best language for you, install another language. If you don’t want to learn with this blog, use another resource. There are so many great ones. Just do it. So now you need to get your hands dirty.

In future blog posts I will go through the three points I think are necessary to become effective scripters. I will start with basic Python and I will collect those blog posts here. I’d be happy if you want to join me.

Doing the Math with Python

First, we take a look at basic math with Python. We learn the basic arithmetic operators, parentheses and how to save the results by assigning a name.

Comparisons and Logic Functions in NumPy

  • We can compare arrays with scalars and other arrays using Pythons standard comparison operators
  • There are two different boolean representations of an array. arr.all is True if all elements are True, arr.any is True if any element is True.
  • To perform logical functions we need the special NumPy functions np.logical_and, np.logical_or, np.logical_not, np.logical_xor

Introduction

Logic functions allow us to check if logical statements about our arrays are true or false. Luckily, logic functions are very consistent with other array functions. They are performed element-wise by default and they can be performed on specific axes.

Comparing Arrays and Scalars

The same way we can use the arithmetic operators we can use all the logical operators: >, >=, <,< <=, ==, !=,

import numpy as np
arr = np.array([-1, 0, 1])
arr > 0
# array([False, False,  True])
arr >= 0
array([False,  True,  True])
arr < 0
# array([ True, False, False])
arr <= 0
# array([ True,  True, False])
arr == 0
# array([False,  True, False])
arr != 0
# array([ True, False,  True])

Comparing Arrays with Arrays

The result of a logic operation is a boolean array containing binary values of True or False. This particular case shows us logic operations of arrays with scalars. All boolean operations are performed element-wise, so each element is compared against the scalar. We can also compare arrays to arrays if their shape allows it.

arr_one = np.array([-1, 0, 1])
arr_two = np.array([1, 0, -1])
arr_one > arr_two
# array([False, False,  True])
arr_one >= arr_two
# array([False,  True,  True])
arr_one < arr_two
# array([ True, False, False])
arr_one <= arr_two
# array([ True,  True, False])
arr_one == arr_two
# array([False,  True, False])
arr_one != arr_two
array([ True, False,  True])

Truth Value of Arrays and Elements

Sometimes we need to check if an array contains any elements that are considered True in a boolean context. While the boolean value of array elements is well defined, the truth value of an entire array is not defined.

arr = np.array([0, 1])
bool(arr[0])
# False
bool(arr[1])
# True
bool(arr)
# ValueError: The truth value of an array with more 
# than one element is ambiguous. Use a.any() or a.all()

NumPy error messages are great. This one is so great that it even tells us which method we need to use to get at the truth value of an array. We can use arr.any() to find out if any of the elements evaluate to True or arr.all() to find out if all elements are True

arr_one = np.array([0, 0])
arr_one.any()
# False
arr_one.all()
# False
arr_two = np.array([0, 1])
arr_two.any()
# True
arr_two.all()
# False
arr_three = np.array([1, 1])
arr_three.any()
# True
arr_three.any()
# True

This can be useful to find out whether an array is empty

arr = np.array([])
arr.any()
# False

Logical Operations

Finally, we need to look at four more logical operations: and, or, not & xor.
Unfortunately we can’t just use the Python keywords. The reason is in the error message above: “ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()”. It is ambiguous because NumPy does not know if we want to perform the operation element-wise or if we want to perform the operation on the truth value of the array. NumPy does not try to guess which one we mean, so it throws the error. To get these logical functions we need to call some more explicit NumPy functions.

arr_one = np.array([0,1,1])
arr_two = np.array([0,0,1])
np.logical_and(arr_one, arr_two)
# array([False, False,  True])
np.logical_or(arr_one, arr_two)
# array([False,  True,  True])
np.logical_not(arr_one)
# array([ True, False, False])
np.logical_xor(arr_one, arr_two)
# array([False,  True, False])

Summary

We are now well equipped to deal with arrays. We can compare arrays with scalar values and other arrays using the the standard comparison operators. We can also perform logical operations on arrays with the special NumPy functions (logical_and, logical_or, logical_not and logcal_xor). Finally we can get two different boolean values of an arrays using arr.all and arr.any.

NumPy Array Data Type

  • Any array has a data type (dtype)
  • The dtype determines what kind of data is stored in the array
  • Not all operations work for all dtypes

Introduction to Data Types

Having a data type (dtype) is one of the key features that distinguishes NumPy arrays from lists. In lists, the types of elements can be mixed. One index of a list can contain an integer, another can contain a string. This is not the case for arrays. In an array, each element must be of the same type. This gives the array some of its efficiency, because operations can know in advance, what kind of data they will find in each element simply by looking up the data type. At the same time it makes arrays slightly less flexible, because some operations are undefined for some data types and we cannot assign any kind of data to an array. But how does NumPy decide what data type an array should have in the first place?

Guessing or Defining the dtype

So far we were able to create arrays effortlessly without knowing what dtype even means. That is because NumPy will just take a guess, what the dtype should be, based on the input it gets for the array.

arr = np.array([4, 3, 2])
arr.dtype
# dtype('int32')
arr = np.array([4, 3.0, 2])
arr.dtype
# dtype('float64')
arr = np.array([4, '3', 2])
arr.dtype
# dtype('<U11')

In the first case, each element of the list we pass to the array constructor is an integer. Therefore, NumPy decides that the dtype should be integer (32 bit integer to be precise). In the second case, one of the elements (3.0) is a floating-point number. Floats are a more complex data type in Python, which means that all other data types have to follow the more complex one. Therefore, all elements of the array are converted to floats and are stored with the dtype float64. Strings are an even more complex dtype. Because ‘3’ is a string in the final example, the dtype becomes ‘<U11’. U stands for unicode, a type of string encoding and the number indicates the length of the string. In all three cases NumPy guesses the dtype according to the content of the list. This works well most of the time but we can also explicitly define the dtype.

arr = np.array([4, 3, 2], dtype=np.float)
arr.dtype
# dtype('float64')
arr = np.array([4, 3, 2], dtype=np.str)
arr.dtype
# dtype('<U1')
arr = np.array([4, 3, 2], dtype=np.bool)
arr.dtype
# dtype('bool')

Converting arrays to other dtypes can be necessary because some operations will not work on arrays of mixed types. A dtype that is particularly problematic is the np.object dtype. It is the most flexible dtype but it can cause a lot of problems for both experts and beginners.

np.object and the Curse of Flexibility

Most dtypes are very specific. They let you know if the array contains a number (np.int, np.float) or a string (all unicode ‘U’ dtypes). Not so much np.object. It tells you that whatever is inside the array is a thing. Because everything is an object anyway. This can make an array as flexible as a list. Anything can be stored. That is also where the problems come in.

arr = np.array([[3,2,1],[2,5]])
arr.dtype
# dtype('O')  # 'O' means object
arr + 5
# TypeError: can only concatenate list (not "int") to list

Suddenly, the plus operation between an array and a scalar fails. What went wrong? Starting from the top, NumPy decides to assign the dtype of np.object to arr because the nested list entries have different lengths. Think of it this way: this array can neither be a (2, 3) nor a(2, 2) array of dtype integer. Therefore, NumPy makes it a (2,) array of dtype object. So the array contains two lists, the first one is of length 3 and the second one of length 2. NumPy generally turns anything that is more complex than a string into np.object. A list is one of those that gets turned into np.object. The error then occurs because the plus operation is not defined for a list with an integer. But that also means, that the operation will work, if the objects contained in the array so happen to work with the operation.

arr = np.array([3,2,1], dtype=np.object)
arr.dtype
dtype('O')
arr + 5
array([8, 7, 6], dtype=object)

This is one of the main problem of the np.object dtype. Operations work only sometimes and to know if an operation will work, each element has to be checked. With other dtypes, we know which operations will work just by just looking at it.

Summary

The dtype is one of the concepts that is closely related to the internal workings of NumPy. There is a lot that could be said about the details but effective beginners only need to remember a few points. First, the dtype determines what is stored in the array. All elements of an array have to conform to a specific type and dtype tells us which one. Second, NumPy guesses the dtype based on the literal data unless we specify which dtype we want. Guessing works most of the time but sometimes explicit types conversion is necessary. Third, operations that we know and love from numeric types (np.int, np.float) may not work on other types (np.str, np.obect). This is particularly annoying for beginners. If you have hard to debug errors, find out what dtype your arrays actually have.

Array Indexing with NumPy

  • Indexing is used to retrieve or change elements of a an array
  • Slice syntax (start:stop:step) gets a range of elements
  • Integer and boolean arrays can get an arbitrary set of elements

Introduction to Array Indexing

Indexing is an important feature that allows us to retrieve and reassign specific parts on an array. You probably already know the basics of indexing from Python lists and tuples. You can index into NumPy arrays the same way you index into those sequences but NumPy indexing comes with many extra features we will learn about here. First, lets look at single value indexing.

Single Value Indexing

We can use indexing to get single (scalar) values from an array. Indexing is always done with square brackets and we always start counting at 0.

import numpy as np
arr = np.arange(10,15)
arr
array([10, 11, 12, 13, 14])
arr[0]
10
arr[4]
14

Note that single value indexing does not return an array with a single entry but rather a numpy integer. To get a single value from a multi dimensional array we need to use multiple indices that are separated by commas.

arr = np.arange(20)
arr = arr.reshape((2,2,5))
arr
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9]],

       [[10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]]])
arr[0,0,1]
1
arr[1,0,4]
14

I recommend this way of indexing but you can also use multiple square brackets like you would for Python sequences.

arr
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9]],

       [[10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]]])
arr[0][0][0]
0
arr[1][0][4]
14

We can also use indexing to reassign elements of an array.

arr = np.arange(10,15)
arr
# array([10, 11, 12, 13, 14])
arr[1] = 20
arr
# array([10, 20, 12, 13, 14])

Slice Indexing

To retrieve a single value, our indices need to resolve all dimensions of the array and arrive at a single value. Whenever one dimension remains unspecified, we get an array (array view technically).

arr = np.array([[[ 0,  1,  2,  3,  4],
                 [ 5,  6,  7,  8,  9]],
                [[10, 11, 12, 13, 14],
                 [15, 16, 17, 18, 19]]])
arr[0, 1]
array([5, 6, 7, 8, 9])
arr[1]
array([[10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

To take an entire dimension we can use the colon.

arr = np.array([[[ 0,  1,  2,  3,  4],
                 [ 5,  6,  7,  8,  9]],
                [[10, 11, 12, 13, 14],
                 [15, 16, 17, 18, 19]]])
arr[0, :, 0]
array([0, 5])
arr[:, 0, 0]
array([ 0, 10])

The colon is very useful for indexing in general, because it allows us to take a slice of values instead of a single value. The syntax of the slice follows start:stop:step. If we leave out start, the slice starts at 0. If we leave out stop, it goes to the end of the dimension. If we leave out step, the step defaults to 1.

arr = np.array([[[ 0,  1,  2,  3,  4],
                 [ 5,  6,  7,  8,  9]],
                [[10, 11, 12, 13, 14],
                 [15, 16, 17, 18, 19]]])
arr[0, 0, 1:5:2]
# array([1, 3])
arr[0, 0, 1:4]
# array([1, 2, 3])
arr[0, 0, 1:]
# array([1, 2, 3, 4])
arr[0, 0, :3]
# array([0, 1, 2])
arr[0, 0, :]
# array([0, 1, 2, 3, 4])

Index Array

So far we learned that we can use integers and slices for indexing. Now we learn that we can also use arrays to index into an array. When we use an array to index, that array has to either contain integers or boolean values. Lets take a look at integer array indexing first.

arr = np.arange(10,50,3)
idc = np.arange(5)
idc.dtype
dtype('int32')
arr[idc]
array([10, 13, 16, 19, 22])
idc = np.arange(5,8)
arr[idc]
array([25, 28, 31])
idc = np.array([1,2,4])
arr[idc]
array([13, 16, 22])

Note that in the examples where we generate index arrays with arange, we could achieve the same result with a slice as shown above and save one line of code. Integer arrays are most useful when they are generated by a process that is more complicated than the arange method. One example is the np.argwhere method we will learn more about in a later post.

Boolean Array

Boolean arrays also deserve at least one post of their own but here I will give you a teaser. We only want to retrieve those values, that satisfy a larger than condition.

arr = np.array([[[ 0,  1,  2,  3,  4],
                 [10, 11, 12, 13, 14]],
                 [[5,  6,  7,  8,  9],
                 [15, 16, 17, 18, 19]]])
boolean_idc = arr > 10
boolean_idc
array([[[False, False, False, False, False],
        [False,  True,  True,  True,  True]],

       [[False, False, False, False, False],
        [ True,  True,  True,  True,  True]]])
arr[boolean_idc]
array([11, 12, 13, 14, 15, 16, 17, 18, 19])

Summary

We learned that indexing is useful to retrieve values and reassign parts of an array. There are several ways to index. First, we can use single integers to get to an element of a certain dimension. We can also use slices with the colon syntax start:stop:step to get at a sequence of elements. Furthermore, there are two advances indexing techniques, where we can use arrays containing integers or booleans to find an arbitrary collection of elements.

Broadcasting in NumPy

  • Broadcasting is triggered when an arithmetic operation is done on two arrays of different shape
  • The goal of broadcasting is to make both arrays the same shape by performing transformations on the shape of the smaller array
  • Once arrays have the same shape, the operation is applied element-wise
  • If the arrays cannot be broadcast an error is raised

Broadcasting Introduction

When we try to add two arrays together with the plus operator, addition is performed element-wise. That means, each element is added to a corresponding element is the other array. However, this only works when both arrays have the same shape. If two arrays have different shapes, a process called broadcasting tries to resolve the difference between the arrays by performing a series of transformations on the shape of the array with lower dimensionality. To understand broadcasting we need to understand the steps broadcasting performs. Lets look at a quick example.

import numpy as np
arr_one = np.array([[4, 3, 2, 5, 6, 2],
                    [30, 34, 1, 50, 60, 56],
                    [22, 34, 32, 21, 12, 6]])
arr_two = np.array([1, 10, 20, 30, 40, 50])
arr_one.shape
(3, 6)
arr_two.shape
(6,)
arrs_plus = arr_one + arr_two
arrs_plus
array([[  5,  13,  22,  35,  46,  52],
       [ 31,  44,  21,  80, 100, 106],
       [ 23,  44,  52,  51,  52,  56]])

This one works despite both arrays having different shapes, even different number of elements. This next one does not work under seemingly similar circumstances.

arr_one = np.array([[4, 3, 2, 5, 6, 2],
                    [30, 34, 1, 50, 60, 56],
                    [22, 34, 32, 21, 12, 6]])
arr_two = np.array([4, 40, 20])
arr_one.shape
(3, 6)
arr_two.shape
(3,)
arrs_plus = arr_one + arr_two
ValueError: operands could not be broadcast together with shapes (3,6) (3,)

What happened here? In the first example we add an array of shape (6,) to an array of shape (3, 6) and it works. In the second example we add an array of shape (3,) to a (3, 6) array and get an error. In both examples, the arrays have different shapes. Therefor, broadcasting is triggered and to understand what happens we need to understand the broadcasting sequence. Lets first work through the working example.

# Broadcasting rules in order
"""
Rule #1: The array with fewer dimensions is broadcast to match
Rule #2: Array shapes are aligned to the right.
         (3, 6)
           (,6)
Rule #3: All array dimensions must be equal or one
         Otherwise broadcasting fails as: ValueError
         Here 6 is equal to six, so we don't get an error and continue
Rule #4: Array dimensions are expanded in the leftward direction
         (6, 3)
         (1, 3)
Rule #5: Array dimensions of size 1 are duplicated to match.
         (6, 3)
         (6, 3)
Done. Array operation can now be executed element wise.
"""

These are the five broadcasting rules that are followed in order. They explain why operations between a (3,) and a (3, 6) array fail. After aligning both array shapes we encounter a problem. 3 does not equal 6, so rule #3 is violated and gives us the ValueError. There are several ways to make this operation work. However, it is best to first understand array indexing before delving into those. We will learn about array indexing in the next post.

Summary

If it weren’t for broadcasting we would have to manually convert the shape of arrays so that they are equal before we can perform arithmetic operations on them. Luckily, we learned that broadcasting always happens when we want to perform an operation on two arrays of different shapes. It tries to resolve the difference in shape by performing a series of steps on the smaller array. When broadcasting finishes successfully the operation can be performed element-wise. If broadcasting fails an error is raised.

NumPy Arrays and Shape

  • The same values can be stored in arrays with different shapes
  • Array methods can perform different operations depending on the array shape
  • The methods .reshape and .flatten change the shape of an array

Introducing Array Shape

Any array has a shape and the shape of an array is important for what kind of operations we can perform. Array shape is sometimes hard to imagine, even for experienced programmers so let’s just look at some code.

import numpy as np
my_array = np.array([3, 2, 5, 6, 3, 4])
my_array.shape
(6,)
my_array_reshaped = my_array.reshape((2,3))
my_array_reshaped.shape
(2, 3)
my_array_reshaped
array([[3, 2, 5],
       [6, 3, 4]])

Here we create an array with 6 elements and my_array.shape tells us that these 6 elements are arranged in a single dimension that has a length of 6. We then reshape the array with its .reshape method into an array with two rows and three columns. This doesn’t look immediately useful but imagine we did an experiment under control and experimental condition with three replicates each. You’d clearly want a structure that represents this. Also, we went from a vector to a matrix with just one line of code. The most important part of array shape is that we can perform array methods only on specific dimensions. To do so we just need to pass the axis argument.

my_array = np.array([[3, 2, 5],
                     [6, 3, 4]])
dim0_sum = my_array.sum(axis=0)
dim0_sum
array([9, 5, 9])
dim1_sum = my_array.sum(axis=1)
dim1_sum
array([10, 13])

Remember that we start out with a (2, 3) array, 2 rows and 3 columns. When we call sum(axis=0) on that array the 0th dimension is eliminated. The array goes from a (2, 3) shape to a (3, ) shape. It does so by calculating the sum across the 0th dimension. Likewise, when we pass sum(axis=1) the 1st dimension gets eliminated in the same way and the array becomes a (2, ) array. The same concept works of course for arrays of any dimension. But lets get back to array shapes. An array cannot be converted to any shape its shape and limit the shapes it can take.

my_array = np.arange(30)  # A (30,) array
my_array_reshaped = my_array.reshape((5,6))
my_array_reshaped.shape
(5, 6)
my_array_reshaped = my_array.reshape((5,7))
ValueError: cannot reshape array of size 30 into shape (5,7)

Converting from (30,) to (5, 7) didn’t work for one simple reason. 5 times 7 is 35, not 30. In other words, the new array has more elements than the original array and NumPy will not just invent new elements to make reshaping work. If the number of elements checks out, we can reshape not only to two-dimensional arrays but to any dimension.

my_array = np.arange(30)  # A (30,) array
my_array_reshaped = my_array.reshape((5, 2, 3))
my_array_reshaped.shape
(5, 2, 3)
my_array_reshaped
array([[[ 0,  1,  2],
        [ 3,  4,  5]],

       [[ 6,  7,  8],
        [ 9, 10, 11]],

       [[12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23]],

       [[24, 25, 26],
        [27, 28, 29]]])

Of course we can also reshape from higher to lower dimensions.

my_array = np.array([[3, 2, 5],
                     [6, 3, 4]])
my_array_reshaped = my_array.reshape((6,))
my_array_reshaped.shape
(6,)
my_array.shape
(2, 3)

If you want combine all dimensions into one single dimension, you can use the .flatten method.

my_array = np.arange(30)  # A (30,) array
my_array_reshaped = my_array.reshape((5, 2, 3))
my_array_flattened = my_array_reshaped.flatten()
my_array_flattened.shape
(30,)

Why we need array shapes

We saw how to manipulate array shape and how array methods can use the shape of an array. Lets think a bit about the real world usage of array shape. Let’s say you are working on an image processing project. You are lucky and the images are already pre-processed in a way that each image has 64 pixels in both dimensions. So each image is an array of shape (64, 64) but your dataset consists of 1000 images. So you want your dataset to be stored as a (1000, 64, 64) array. But then your image processing project becomes a volume processing project. So each volume has 100 slices. So you need a (1000, 100, 64, 64) array. But wait. You are actually working on video files. There are 20000 frames for each volume. So you need a (1000, 20000, 100, 64, 64) array. It is rare that you will have to go beyond five dimensions, but you can. In several fields it is very easy to end up with five dimensional arrays (think fMRI).

Summary

Here we learned that the shape of an array is useful to store high dimensional data meaningfully and to have array methods operate only on specific dimensions. The .reshape method is important to change the shape of an existing array and the .flatten method can collapse an array into a single dimension. In the next blog post we will learn about broadcasting. Broadcasting is a mechanisms that is triggered whenever we perform an arithmetic operation on two arrays of different shapes (dimensionality). If two arrays have identical shape the operation is performed element-wise. If they have different shapes broadcasting performs a series of transformations on the lower dimensional array to make both arrays identical in shape and finally perform the operation element-wise.

Arithmetic Operations in NumPy

  • NumPy arrays come with many useful methods
  • All arithmetic operations that are used on arrays are performed element-wise
  • NumPy code is almost always faster than native Python (.append is a notable exception)

NumPy arrays are so useful because they allow us to do math on them very efficiently. For example, NumPy arrays come with many useful methods. One such method is the sum method, which calculates the sum of all values in the array

import numpy as np
my_array = np.array([4, 3, 1])
my_array.sum()
8

There are many other methods like this and they are extremely useful. Here is a list of the most commonly used methods.

my_array = np.array([4, 3, 1])
my_array.sum()  # Calculate the sum array values
8
my_array.mean()  # Calculate the mean of array values
2.6666666666666665
my_array.std()  # Calculate the standard deviation of array values
1.247219128924647
my_array.max()  # Find the maximum value
4
my_array.min()  # Find the minimum value
1

To learn about all array methods you can call the dir() function on any array, which will list all its methods. Alternatively you can check out the documentation for the array https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html

Another useful property of arrays is that they do math when they appear together with any of the arithmetic operators (+, -, *, /, **, //, %).

my_array = np.array([4, 3, 1])
my_array_plus = my_array + 2
my_array_plus
array([6, 5, 3])

Here, the array appeared together with a scalar value, the single number 2. That number was added to each value. However, we can do the same thing with two arrays, if the have the same shape.

array_one = np.array([4, 3, 1])
array_two = np.array([1, 2, 4])
array_plus_array = array_one + array_two
array_plus_array
array([5, 5, 5])

In this case, addition is again performed element-wise. Each element in array_one is added to a corresponding element in array_two. The fact that the array performs useful math in this context might seem unremarkable but remember how the native Python list behaves.

list_one = [4, 3, 1]
list_two = [1, 2, 4]
list_plus_list = list_one + list_two
list_plus_list
[3, 2, 1, 1, 2, 4]
array_plus_array = np.array(list_one) + np.array(list_two)
array_plus_array
array([5, 5, 5])

If you are in full numerical computation mode this behavior of list might seem stupid to you. But remember: Python is a general purpose programming language and list is a general purpose container to store a sequence of objects. There could be anything in those lists and addition might not be a meaningful operation for those objects. This behavior always works, a list can be concatenated to another list regardless of the objects they store. That’s why we have NumPy. Python has to implement objects in a way that suits its general purpose. NumPy implements behavior in a way that we would expect while we do numerical stuff.

A word on performance

This is one of the rare occasions where it is worthwhile to talk about performance. When you are getting started, I strongly recommend against thinking too much about performance. Write functioning code first, then worry about readability, maintainability, reproducibility etc. etc. and worry about performance last (trust me on this one). But some of you will be working with large amounts of data and you will be delighted to hear that NumPy is much faster than native Python.

my_array = np.random.rand(100000)  # A large array with 100000 elements
my_list = list(my_array)
timeit sum(my_list)
18.1 ms ± 801 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
timeit my_array.sum()
90.3 µs ± 6.86 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The native Python version of sum is orders of magnitude slower than the NumPy version. You might have noticed that I created a very large array to demonstrate this. Actually the performance difference will increase with increasing array size, you can verify this for yourself. The take home message here is that whenever you can replace native Python with NumPy, you gain performance. But don’t worry about optimizing your NumPy code. One exception is the .append method, but more on that later.

Summary

We learned two essential things and one kind of interesting side-note. The first essential lesson is that arrays come with many methods that allow us to do useful math. We learned some of those methods and as you keep working with NumPy those will become second nature. The second thing we learned is that arithmetic operators are applied element-wise to arrays. This means that a scalar value is applied to each element in an array and whenever two arrays of the same shape appear together with an operator each element is applied to each corresponding element. We will learn the details of array shapes in the next blog post. Finally, we also learned that NumPy code is almost always much faster than native Python code. This is good to know. However, especially in the beginning you should focus on anything but performance.

Getting Started with NumPy

  • The array is the central NumPy object
  • Pass any sequence type object to the np.array() constructor to create an array
  • Use functions like np.zeros, np.arange and np.linspace to create arrays
  • Use np.random to create arrays with randomly generated values

NumPy is a Python package for numerical computing. Python is not specifically designed to deal with large amounts of data but NumPy can make data analysis both more efficient and readable than it is with pure Python. Without NumPy, we would simply store numbers in a list and perform operations on those numbers by looping through the list. NumPy brings us an object called the array, which is essential to anything data related in Python and most other data analysis packages in one way or another build on the NumPy array. Here we will learn several ways to create NumPy arrays but first let’s talk about installing NumPy.

Setting up NumPy

I highly recommend installing Python with a data science platform such as https://www.anaconda.com/ that comes with NumPy and other science critical packages.
To find out if you already have NumPy installed with your distribution try to import it

import numpy as np

If that does not work, try to install NumPy with the package installer for Python (pip) by going to your commdand line. There try:

pip install numpy

Finally, you can take a look at the docs for installation instructions. https://scipy.org/install.html

Three ways to create arrays

Now let’s create our first array. An array is a sequence of numbers so we can convert any Python sequence to an array. One of the most commonly used Python sequence is the list. To convert a Python list to an array we simply pass a list to the numpy array constructor

import numpy as np
my_list = [4, 2, 7, 9]
my_array = np.array(my_list)

This creates a NumPy array with the entries 4, 2, 7, 9. We can do the same with a tuple.

my_tuple = (4, 2, 7, 9)
my_array = np.array(my_list)

Of course we can also convert nested sequences to arrays and it works exactly the same way.

my_nested_list = [[4, 2, 7, 9], [3, 2, 5, 8]]
my_array = np.array(my_nested_list)

This is the first way to create arrays. Pass a sequence to the np.array constructor. The second way is to use numpy functions to create arrays. One such function is np.zeros.

zeros = np.zeros((3, 4))
np.array([[0., 0., 0., 0.],
          [0., 0., 0., 0.],
          [0., 0., 0., 0.]])

np. zeros gives us an array where each entry is 0 and it requires one argument: the shape of the array we want to get from it. Here we got an array with three rows and four columns, because we pass it the tuple (3, 4). This function is useful if you know how many values you need (the structure) but you do not know which values should be in there yet. So you can pre-initialize an all zero array and then assign the actual values to the array as you compute them. Another array creation function is called np.arange

arange= np.arange(5, 30, 2)
arange
array([ 5,  7,  9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29])

np.arange gives us a sequence starting at 5, stopping at 30 (not including 30) and going in steps of 2. This function is very useful to generate sequences that can be used to index into another array. We will learn more about indexing in a future blog post. A function very similar to np.arange is np.linspace.

linspace = np.linspace(5, 29, 13)
linspace
array([ 5.,  7.,  9., 11., 13., 15., 17., 19., 21., 23., 25., 27., 29.])

Instead of taking the step size between values, linspace takes the number of entries in the output array. Also, the final value is inclusive (29 is the final value). Finally the third way to generate numpy arrays is with the np.random module. First, lets look at np.random.randint

randint = np.random.randint(5, 30, (3,4))
array([[26, 17, 26, 24],
       [20, 16, 29, 25],
       [25, 21, 26, 26]])

This creates an array containing random integers between 5 and 30 (non-inclusive) with 3 rows and 4 columns. If you try this code at home the values of your array will (most probably) look different but the shape should be the same. Finally lets look at np.random.randn

randn = np.random.randn(4,5)  # Random numbers from normal distribution
randn
array([[-2.34229894, -1.43985814, -0.51260701, -2.58213476,  1.61196437],
       [-0.69767456, -0.0950676 , -0.22415381, -0.90219875,  0.33513859],
       [ 0.56432586, -1.62877834, -0.60056852,  1.37310251, -1.20494281],
       [-0.20589457,  1.34870661, -0.89139339, -0.40300812, -0.15703367]])

np.random.randn gives us an array with numbers randomly drawn from the standard normal distribution, a gaussian distribution with mean of 0 and variance 1. Each argument we pass to the function creates another dimension. In this case we get 4 rows and 5 columns.

Summary

We learned how to create arrays, the central NumPy object. Working with NumPy means to work with arrays and now that we know how to create them we are well prepared to get working. In the next blog post we will take a look at some of the basic arithmetic functions we can perform on arrays and show that they are both more efficient and readable than Python builtin functions.