How to organize your research data during analysis

As researchers we often have to manage the entire data lifecycle from generation to the final report. Raw data are the foundation of any analysis pipeline and it is important to store them properly. Many guidelines on data management focus on raw data but also stop there. However, raw data without analysis are pretty much worthless and there are major data management decisions to be made during analysis. Which intermediate results to save? In what format? Where? Cloud storage? Do I build a database? Every decision will affect the actual analysis code and can determine how easily errors can be found and fixed. Here I want to give a detailed overview of the different options we have to organize our data during analysis. Python is my main programming language and this will be reflected here, in that I prefer open source tools and I am not experienced with some of the commercial solutions that might exist. However, I will also cover the advantages and disadvantages of human-readable data formats and tools such as Microsoft Excel. We will get started with the raw data.

Raw data

Raw data are like a good hypothesis. Just like a hypothesis is perfect until you ruin it with data, raw data are perfect until you ruin them with a thorough analysis. Raw data are truth. This is how it happened. Therefore, raw data are never changed and we always keep multiple copies of them. Even after we have published the results of our analysis, it is a good idea to keep the raw data for at least 10 years. Depending on your funding and where you publish, you might be legally and/or ethically required to keep raw data for a defined time period.

Anything we do with our raw data during the analysis we do in read-only mode. The only thing we might want to change is the file name. However, we should check whether there is a way to automate the file naming in the program that generates them. Automating file names is preferred and avoids human error. If we receive data from collaborators we should also make sure that we are not deleting or changing data that they placed in the file name or the directory hierarchy.

From an analysis point of view, we might not need any information in the file name at all. We could generate it randomly. However, we almost always want to store important metadata in the file name because it has two major advantages: 1. The file name is human-readable. You need neither programming knowledge nor knowledge about the file format to read it. 2. The file name is readable by the operating system. This means our programs can read it even if they don’t know the file format. For example, you can use file name information to decide whether or not to load a file during analysis and save loading time.

So what do we put in the file name? The first rule is that it is better to save too much information than too little. But most file systems have limits on the file name length (I have hit it many times in Windows 10). When you choose your file name format you can ask yourself: what would I want a human to know about the files before they even open them, and what would be good to have easily accessible during analysis? You will read in a lot of guides that you must save the date and time. That is useful because it makes it easier to relate a file to your lab book. Beyond that I only recommend giving each file a unique identifier. This can be a number or a string of characters. I prefer numbers because they have the added bonus of giving the recording order at a glance. Other information in the file names is up to you.

Once you have decided which other information to put in your file names, choose one delimiter to separate items and another one for readability. I use underscores to separate items and hyphens for readability. That allows me to do things like: “001_YYYY-MM-DD_version-2.fmt”. Avoid empty spaces in file names. Whatever decision you make, you must remain consistent.
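Such a naming scheme is also easy to take apart in a script. Here is a minimal sketch, assuming a hypothetical ./raw folder with files following the format above; the date filter is just an example.

from pathlib import Path

raw_dir = Path("./raw")  # hypothetical folder containing the raw files

for file in sorted(raw_dir.glob("*.fmt")):
    # "001_2021-05-03_version-2.fmt" -> ("001", "2021-05-03", "version-2")
    file_id, date, version = file.stem.split("_")
    if date.startswith("2021"):  # use file name metadata to decide what to load
        print(f"would load file {file_id} recorded on {date} ({version})")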

When it comes to the directories, I recommend having all raw files in the same directory. I used to encode some structure in a directory hierarchy but nowadays I would put all of that into the file name. If you have a lot of files, some directory structure can help humans navigate them, but during analysis I almost always prefer a single directory for simplicity. The file name is certainly a good way to save metadata but there are other ways. In the next chapter we will discuss metadata more generally.

Metadata

At the raw data stage, the definition of metadata is clear. It is any data that describes the context of the raw data but does not fit any of the raw data formats. Images are a good example. The raw data of an image are the pixel values that make up the digital image. Metadata of the image can be anything that relates to the context where the image was taken. The time it was taken, the place it was taken, the exposure time, the model of the lens used, the temperature that day, the person who took the image, the width, the height and much, much more. As the analysis progresses, metadata may become data. Let’s say your analysis pipeline counts the number of birds in each image and you want to find out whether there are more early birds than late birds. In that case the time of day the picture was taken becomes an independent variable.

So what kind of metadata should you save? That is up to you, your research question and your field but it is usually better to save too much metadata than too little. Metadata you did not save is almost impossible to reconstruct later, so think carefully when you decide that something is not worth saving.

How to store metadata? If you are lucky, all the metadata you need might already be inside your raw data files. Most of the time, however, you will have to add some metadata manually. In that case I prefer to create a single .csv file where each row describes a single raw data file. One of the advantages of .csv files is that they can be read by humans and by scripts in a variety of programs.

The same metadata.csv file opened in four different programs. OpenOffice Calc (top left), Notepad (top right), Spyder IDE (lower left) and Notepad++ (lower right).
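Such a metadata file is also easy to load in a script. A minimal sketch, assuming the pandas library is installed and using made-up column names:

import pandas as pd

metadata = pd.read_csv("metadata.csv")      # one row per raw data file
print(metadata.head())                      # quick look at the first rows
print(metadata.loc[metadata["id"] == 1])    # the row describing raw file 1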

The id uniquely identifies the row and thereby the raw data file it refers to. A drawback of a .csv file is that it only works well for a single spreadsheet. If you require multiple related sheets – for example because you have multiple raw data sources – you might require a more complex format such as JSON, or you might even build a database. More on the reasons for and against building a database later. Next, we will assume that we found a way to process our raw data and need to decide how to store the results for analysis.

Analysis and storage of transformed raw data

Analysis of raw data most often means iterating through all raw data files and doing some computation on each. In our fictional example, we were taking images and then extracting birds from those images. Once we have extracted the birds, our unit of analysis changes. This has consequences for the metadata, which describe the raw data files. We need to relate the metadata from the raw data files to the birds, and there are two major options to do this. First, we generate a spreadsheet where each row is a bird. In the image_id column we save the unique id of the image the bird was extracted from.

The extracted birds are related to a raw image by the fact that a specific bird was extracted from a single specific image. The tables in the top left and top right contain all information but during analysis we would need to open two files and relate the information. The table at the bottom has the advantage that both tables are merged into one. This can offer convenience but it increases the storage size of the file.

This is a relational data structure. The image_id in our birds table points to a unique id in our metadata, which allows us to figure out where, when and with what exposure a given bird was photographed. This is a great way to structure data. The only downside is that we have to load two files and then merge them during the analysis. Personally, I prefer to create a single file where both tables are already merged. The downside of this data structure is that it is larger in storage size (because the metadata columns are repeated many times). Whether or not this is an issue for you depends on the total size of your data, the number of rows extracted per image and your available disk space. If you decide on the relational data structure you might also want to consider building a database, because databases are particularly well suited for relational storage. Below is a quick sketch of both options; after that, we will discuss some advantages and disadvantages of databases.
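Assuming pandas and the hypothetical files metadata.csv and birds.csv (column names made up for the bird example), the two options could look like this:

import pandas as pd

images = pd.read_csv("metadata.csv")  # one row per raw image, with an "id" column
birds = pd.read_csv("birds.csv")      # one row per bird, with an "image_id" column

# Relational structure: keep two files and merge them during the analysis
merged = birds.merge(images, left_on="image_id", right_on="id")

# Merged structure: store the combined table once and load a single file later
merged.to_csv("birds_with_metadata.csv", index=False)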

You probably don’t want to build a database. Unless you do.

A database is any structured collection of information. This means that a collection of spreadsheets as we saw above could technically be considered a database. However, a database usually implies more rigorous rules of storage and access than a spreadsheet can guarantee. These rules give databases some of their major advantages. For example, multiple people and programs can interface with a database without messing things up. If two people open the same spreadsheet in Excel and work on it, they are almost guaranteed to end up with two conflicting versions. There are many other advantages of databases. When you decide for or against a database you should anticipate whether you will be able to make use of those advantages. Here is my list; a minimal sketch of what such a database could look like follows it.

You should consider building a database if:

  • You will have to integrate multiple raw data sources.
    • Multiple raw data sources usually mean more relationships. Multiple relationships become harder to manage with spreadsheets but are the perfect use case for a relational database.
  • You expect to store a massive amount of data.
    • A database usually stores data more efficiently, which can save you storage space.
    • A database can also make access more efficient, because you don’t load the entire database but request smaller packages of information at a time. This avoids running into RAM issues.
  • You will be adding data continuously to the project over a long period of time.
    • Having a fixed logic for adding information to the database is a massive advantage over long time periods and can avoid a lot of errors.
    • During very long projects you might want to train other people to interface with the database or even hand the project over to someone else. Having the relational rules that avoid errors is very helpful, as opposed to a spreadsheet where anyone can do whatever.
  • You want to make your data available through a web interface.
    • A web server interfacing with a spreadsheet would be very inefficient.
  • Multiple people need read/write access simultaneously.
    • As mentioned above, multi-user access works much better for databases.
  • You have security concerns.
    • This can concern the physical integrity of the data or malicious/unauthorized access. Databases are not perfect but they provide better safeguards than a spreadsheet.
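For the bird example, a minimal version of such a database could look like this, using Python's built-in sqlite3 module (table and column names are made up for illustration):

import sqlite3

con = sqlite3.connect("birds.db")  # a single-file database, no server needed
con.execute("""CREATE TABLE IF NOT EXISTS images (
                   id INTEGER PRIMARY KEY,
                   taken_at TEXT,
                   exposure REAL)""")
con.execute("""CREATE TABLE IF NOT EXISTS birds (
                   id INTEGER PRIMARY KEY,
                   image_id INTEGER REFERENCES images(id),
                   species TEXT,
                   size REAL)""")
con.execute("INSERT OR IGNORE INTO images VALUES (1, '2021-05-03 06:15', 0.01)")
con.execute("INSERT OR IGNORE INTO birds VALUES (1, 1, 'blackbird', 24.5)")
con.commit()

# Request only the rows you need instead of loading everything into memory
early_birds = con.execute(
    "SELECT species, size FROM birds JOIN images ON birds.image_id = images.id "
    "WHERE images.taken_at < '2021-05-03 12:00'").fetchall()
con.close()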

Those advantages sound amazing. So why would you ever not build a database? Well, many research projects are structured in a way that you cannot make use of any of these advantages. If a single person generates all their data in one week, processes the raw data the next week and prepares a final report in the third week, a database is overkill. The rules that database architectures follow provide major advantages but they also make them less flexible. Building and maintaining a database comes with extra work that you only want to take on when you can make use of some of the advantages. Generally speaking, the larger a project is (regarding everything from the number of people involved to the data sources to the time frame), the more worthwhile a database becomes.

Before we move on to data analysis, I want to briefly mention cloud storage. Cloud storage has some advantages but most of them can also be achieved in other ways. In my opinion, the most important reason to consider cloud storage is if you regularly need access to the data from a large variety of locations that are well connected to the internet. However, if you have money to spare, a commercial cloud provider can also take over services that would be annoying to maintain yourself. On the other hand, as a researcher you may not be allowed (because of funding or institutional policies) to use certain cloud services. Other details of cloud storage are beyond the scope of this guide.

Visualization and statistical analysis

So far we extracted samples (birds) from our images and quantified features of those birds (species and size). Some additional features are given by the image metadata. There might be additional intermediate processing steps whose results you want to save. That’s up to you. For example, maybe you had to apply some filters to the images when you extracted the birds. Do you want to save these filtered images? If the filtering takes very long and you want to visually inspect the result, you probably want to save them. Otherwise, having the script that performs the filtering is sufficient.

The plots you draw are something you definitely want to save. You also want to save the results of statistical analysis. At this stage it is important to have a clear relationship between analysis scripts and their outputs. I hate looking through multiple scripts, trying to find the code that generated a specific figure I want to change. You could just try to put everything into one script. I have done this before but depending on the size of the project this becomes unwieldy very quickly and you will find yourself scrolling through thousands of lines of code. There is no perfect recipe here, so I think a practical example is in order.

A practical example for a project structure

Here I will give an example of how the bird project could look as a directory hierarchy.

The directory structure of an example project.

In the project folder itself are only Python scripts and a README.txt, where we can give information that anyone looking at our project should be aware of. Having everything else in specific folders makes it easier to locate a Python script. There is one more advantage: if you use git to version control or share your project, you can exclude data and figures from tracking. Data are often too large to push to a remote repository, or we don’t want to make unpublished data public. Therefore we can simply add the data folders we want to keep local to a .gitignore file, as shown below. However, managing your project with git is not a must and for smaller projects it does not always pay off.
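A minimal .gitignore for a project like this could simply list the folders to keep out of version control (the folder names are the ones from this example):

# .gitignore
raw/
data/
figures/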

Now for the scripts themselves. Each script should have a very well defined input and output. extract_birds.py for example loops through all raw files in ./raw and creates the file ./data/birds.csv. The other two scripts both use that same birds.csv file but they answer different questions and have different outputs. analyze_location.py creates the figure 02_figure_location.png and the statistical output location_anova.txt. On the other hand analyze_species_size.py creates 01_figure_species_size.png and species_ttest.txt.

In this example the same script visualizes the data and performs the statistical test; a sketch of such a script follows below. You could split this up further and have one script for visualization and another for statistics. I prefer to combine them because both tasks require loading the same data. On another note, I prefer to keep things like reports, papers and presentations in a separate folder. In my experience they clutter the project space.
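A rough sketch of analyze_species_size.py; the column names, output paths and the use of pandas, matplotlib and scipy are assumptions for illustration, not a fixed recipe.

import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

birds = pd.read_csv("./data/birds.csv")  # well defined input

# Visualization: size distribution per species
birds.boxplot(column="size", by="species")
plt.savefig("./figures/01_figure_species_size.png")

# Statistics: compare the sizes of two (hypothetical) species with a t-test
blackbirds = birds.loc[birds["species"] == "blackbird", "size"]
robins = birds.loc[birds["species"] == "robin", "size"]
result = stats.ttest_ind(blackbirds, robins)

with open("species_ttest.txt", "w") as f:  # well defined output
    f.write(str(result))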

Planning a project in detail is important but there are also diminishing returns. At some point the only way forward is to actually start the project. I hope you learned some useful tricks to manage your own data more effectively.

What’s in a Name?

  • To store an object, we assign it a name using the equal sign
  • Names must not start with a number and they cannot contain special characters (except for underscores)
  • When you choose names, be consistent and descriptive

Names

You have probably heard of variables. In mathematics, a variable is a placeholder for something that is not fully defined just yet. In most programming languages, we say: “We define the variable”. In Python we rarely talk about variables. We instead talk about names. We would say: “We assign a name”. Specifically, we assign a name to an object. To do so, we use the equal sign. While the vocabulary is slightly different, the result is very similar in all programming languages. Let’s look at an example.

my_name = 'Daniel'
my_name
# 'Daniel'

Here we assign the name my_name to the string object 'Daniel'. From now on we can use the name to refer to the object. We can perform operations on the name and Python will replace the name with the object it refers to.

my_name = 'Daniel'
my_name * 5
# 'DanielDanielDanielDanielDaniel'

The multiplication operator applied to a string concatenates that string multiple times. Here we get the same string five times. Note that we use the name instead of the literal object. We can also reassign the same name to a different object.

my_name = 'Daniel'
my_name
# 'Daniel'
my_name = 5 * my_name
my_name
# 'DanielDanielDanielDanielDaniel'

First, my_name refers to 'Daniel', then it refers to the result of the operation 5 * my_name, which is 'DanielDanielDanielDanielDaniel'. When we choose a name, there are very few rules that we must follow. However, there are many more rules that we definitely should follow. We take a look at the mandatory rules first.

Naming Rules

When we choose a name, there are many things to consider. But there are also some things that we are not allowed to do. For example a name cannot start with a number.

1st_name = 5
# SyntaxError: invalid syntax
first_name = 5
first_name
# 5
name_1 = 5
name_1
# 5

Elsewhere in a name, numbers are allowed. Special characters, however, are not allowed anywhere in a name.

name$one = 5
# SyntaxError: invalid syntax

The only special character that is allowed is the underscore. Underscores are conventionally used in Python to structure names visually.

division_and_multiplication = 5 / 10 * 2
division_and_multiplication
# 1.0

Besides visually structuring names, underscores also have a special meaning when they lead a name. For example, all objects in Python have special double underscore methods. We will learn all about them later.

Good Naming Practices

If you are just getting started with Python, you probably have a lot of things on your mind. As long as you follow the rules above, you can choose any name you want. However, it is very important to learn good naming practices in Python. Rule number one is consistency. Avoid wildly different ways to visually structure names within the same project. Don’t start with my_name and end with MyName. If you stay consistent within your projects you are already ahead of the curve. For beginners I recommend making names all lower case and structuring them only with underscores. Also avoid numbers: write name_one instead of name_1. And don’t be afraid of making names long. It is more important that names are descriptive than that they are short. length_of_segment is much better than los, even if there is more typing involved. Don’t be afraid to type. Be afraid of coming back to your code one year after you wrote it and trying to read completely unreadable, meaningless names.

Those are enough rules for now. If you want to adopt a perfect style right from the beginning you should read the official style guide for Python, called PEP 8. There are generally no laws about Python style but this is the most authoritative document on coding style in Python and many people reading your code will expect you to follow its guidelines. If you follow the few rules I describe above, you are already following the most important practices regarding naming.

Summary

To store objects we assign them a name. We can do that with the equal sign: the name goes on the left, the object on the right (name = object). Needless to say, we’ll be spending a lot of time assigning names. Once a name is assigned, we can use the name instead of the object. Python will replace the name with the object in any operation. There are very few mandatory rules when choosing a name. For one, names cannot start with a number. Second, special characters, except for the underscore (_), are forbidden in names. There are some more rules that are good to follow. The most important one is to be consistent. Do not arbitrarily change your style within a project. If you follow these rules you are already doing well.

The Scientists Guide to Python

  • Install Python as part of a data science platform such as Anaconda
  • Use an integrated development environment like Spyder (Installs with Anaconda)
  • Don’t reinvent the wheel, find out what already exists in your field

Python and Science

Python has been adopted with open arms by many scientific fields. A lot of its success is directly linked to the success of NumPy (the Python package for numerical computing) and the scientific Python community (the SciPy community). If you want to understand the history of this success story, I recommend reading the recent Nature Methods paper, which gives some background on the SciPy community. However, you don’t need to read the paper to use Python or any of the scientific packages.

A downside of Python is that it is a general purpose programming language. Python is used for anything you can think of: web development, GUI design, game development, data mining, machine learning, computer vision and so on. This can be intimidating to beginners, especially to scientists who have a very specific task they want to automate. Compare this situation to MATLAB. Its purpose is very specific and is even part of the name: it is the MATrix LABoratory. It deals very efficiently with matrices and is generally good at math, both of which are valuable and important to most scientists. MATLAB is specifically designed to serve the science and engineering communities. Python on the other hand is for everyone, which can be a weakness in the beginning but quickly turns into its greatest strength. There are no privileged tasks in Python; they are all handled well (as long as there is a great community maintaining the relevant packages). The purpose of this guide is to ease some of the hurdles scientists face when diving into Python and to highlight the many advantages.

The first problem we face is called package management. When we download pure Python from the official website (python.org), we get the core Python functionality. Python in itself does not include most of the features we depend on as scientists. Python’s functionality is organized in so-called packages that we need to install if they are not built into Python. Alternatively, we can install a pre-made Python distribution that includes the scientific packages. This is my preferred way of installing Python and I strongly recommend that beginners start with Anaconda.

Installing Python with Anaconda

Anaconda is a data science platform that installs Python with most of the packages we need as scientists. It is open-source and completely free. When we install Anaconda we get lots of stuff. Most importantly we get Python, NumPy, Matplotlib, Conda and Spyder. We could install those things ourselves but this way we can have them without ever worrying about package management. We already talked about core Python and NumPy. Matplotlib is a package that allows us to plot our data. Conda is a package manager that allows us to update installed packages and install new packages. Finally, Spyder is an integrated development environment (IDE). Spyder stands for Scientific PYthon Development EnviRonment and it is exactly what we need to get started.

The Anaconda navigator. A graphical user interface that helps us find the main components of Anaconda. One of those is the Spyder IDE which you can start by simply clicking launch.

An IDE is extremely useful. For example, it helps us by highlighting important parts of our code visually. By running Spyder you have just set up your own IDE. This is your kingdom now. You will be able to do amazing things here and you should celebrate this moment properly. However, as you celebrate, my duty will be to guide you through the most important parts of this graphical user interface.

Spyder (Scientific PYthon Development EnviRonment). On the left is the editor. Here we write scripts. Scripts are collections of code that are executed together, in series, when we hit the run button (green arrow). In the lower right is an IPython console. This is where scripts are executed when we run them. We can also type code here and execute it immediately.

Let’s start with the interactive console in the lower right. Here, IPython is running and awaiting your commands. You can type code and run it by hitting enter. You can use it like a fancy calculator. Try 2+2, 2-2, 2*2, 2/2, 2**2. It’s all there. The interactive console is the perfect place to try out commands and see what they do. We can define variables here and import packages. Luckily we installed Anaconda, so NumPy is already available. The conventional way to import NumPy is import numpy as np. To learn all about NumPy, find my NumPy blog series.
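For example, a first console session could look something like this (the lines starting with # show the output you would see):

2 + 2
# 4
2 ** 3
# 8
import numpy as np
np.sqrt(2)
# 1.4142135623730951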

On the left is the editor. This is our code notebook. Here we edit files that store our code. Those files are commonly called scripts. When you hit enter here, code is not immediately executed. You are just moving to a new line. To execute the code here we can hit the green arrow (play button) above. When we do so, all lines of code in the script are executed from top to bottom. When the script finishes, the objects created in the script still exist in the interactive console. This is very useful to debug the script. We can use the interactive console to take a close look at what the script did.

The Power of Scripts

Most of your work will be about writing scripts that do something useful. Many times, there are other ways to solve the same task. Many things can be done by hand, clicking through the graphical user interface of another program. Sometimes this is still the best way, but there are many advantages to writing a script.

A script can be much faster than a human, especially on repeated tasks. When we have to rename 10 files in a certain way, we might decide that it is not worth writing a script. Let’s just do it quickly by hand. But the same task could come back later, this time with 100 files that require renaming. The most important thing is to pick your battles. We cannot automate everything and will have to make tough decisions.
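If we do decide to automate it, such a renaming task often boils down to a few lines. A sketch, assuming the files sit in a hypothetical ./raw folder and just need a numbered prefix:

from pathlib import Path

raw_dir = Path("./raw")
for i, file in enumerate(sorted(raw_dir.glob("*.tif")), start=1):
    # prepend a zero-padded identifier, e.g. "birds.tif" -> "001_birds.tif"
    file.rename(file.with_name(f"{i:03d}_{file.name}"))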

Another advantage of scripts is that they are a protocol of the analysis. If we can write a script that takes care of everything and is capable of analyzing the entire dataset, we know exactly how the analysis was performed. When we analyze manually we can achieve the same by very carefully writing down everything we do on each data point. However, this can easily fail if something about the analysis becomes important afterwards and it is not part of the protocol. Scripts are also more easily shared than a protocol because the script language (in our case Python) is less ambiguous.

Sometimes even the script fails as a definitive protocol. This is the case when we use packages whose functions change. Let’s assume we use a function called fastAnalysis from a fictional package called fancyCalc. The way fastAnalysis works changed slightly between version 1.1 and 1.2. We only know what our script did if we remember which version we used at the time we ran the script. Here another advantage of scripts shows: we can pretty quickly run the same analysis again with different fancyCalc versions. If we get the same result as before, that version is probably the one we ran previously. Manual analysis is usually much harder to repeat. Imagine you spent the last 2 months analyzing 200 samples. You had to normalize each sample to its corresponding baseline. How easy would it be to do the same analysis by hand with a slightly larger baseline interval? In a script this would usually mean changing a single number. By hand it could easily lead to another 2 months of work.

The Scripting Workflow

We talked at length about the advantages of scripts, but how do we actually write one? For starters we need to know what we want to do. Then we look online for a function that we hope can achieve our goal. Then we try out the function. If it does what we want, we move on to the next task. Usually we have one pretty large, intimidating task, such as “analyze this whole data set”. This task breaks down into smaller tasks. For example, we will have to load our data, do preprocessing, extract regions of interest, quantify and so on. Those tasks break down into even smaller tasks. For preprocessing, one of those tasks might be to subtract the background. Ideally, if we make our tasks small enough we will be less intimidated and we will find a function that performs each task; the skeleton below illustrates the idea.
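This decomposition can be mirrored directly in the structure of a script; all function names and bodies here are placeholders.

def load_data(path):
    """Read the raw image files from path."""
    return []          # placeholder: return a list of images

def subtract_background(image):
    """One small, well defined preprocessing step."""
    return image       # placeholder

def extract_regions_of_interest(image):
    """Find the birds in a preprocessed image."""
    return []          # placeholder: return a list of regions

def quantify(region):
    """Measure species, size and so on for one region."""
    return {}          # placeholder

def main():
    for image in load_data("./raw"):
        image = subtract_background(image)
        for region in extract_regions_of_interest(image):
            quantify(region)

if __name__ == "__main__":
    main()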

In the age of the internet, getting started with programming is easy enough. Unfortunately, to become effective at it we will need to do some learning and get practice. Our goal is to become effective scripters, so we will need to learn three things: 1. We need to learn some basic Python. This will be easy, I promise. 2. We need to learn how to think computationally about our tasks. Only then will we be able to effectively divide entire projects into smaller, manageable tasks. 3. We need to learn how to find, evaluate and use task-specific packages quickly. Most tasks have already been solved by other people and we never want to reinvent the wheel (unless we have a vision for a really cool wheel that is much better than all the other wheels that already exist). My blog will lead you through all three steps.

What’s next?

The best thing you can do next is to start coding. Today. The number of different programming languages, distributions of those languages, integrated development environments and resources to learn all of those is immense and can lead to real decision paralysis. You can spend forever trying to decide how to start, but you already have everything you need. Install Anaconda, launch Spyder and just let loose. If you are not sure Anaconda is right for you, install a distribution you think is more suitable. Install pure Python if you are feeling adventurous and just start hacking away. If you don’t want to learn with this blog, use another resource; there are so many great ones. Just do it and get your hands dirty.

In future blog posts I will go through the three points I think are necessary to become effective scripters. I will start with basic Python and I will collect those blog posts here. I’d be happy if you want to join me.

Doing the Math with Python

First, we take a look at basic math with Python. We learn the basic arithmetic operators, parentheses and how to save the results by assigning a name.

Getting Started with NumPy

  • The array is the central NumPy object
  • Pass any sequence type object to the np.array() constructor to create an array
  • Use functions like np.zeros, np.arange and np.linspace to create arrays
  • Use np.random to create arrays with randomly generated values

NumPy is a Python package for numerical computing. Python is not specifically designed to deal with large amounts of data, but NumPy can make data analysis both more efficient and more readable than it is in pure Python. Without NumPy, we would simply store numbers in a list and perform operations on those numbers by looping through the list. NumPy brings us an object called the array, which is essential to anything data related in Python; most other data analysis packages build on the NumPy array in one way or another. Here we will learn several ways to create NumPy arrays, but first let’s talk about installing NumPy.
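To give a first impression of the difference, here is the same tiny computation with a plain list and with an array; the numbers are made up.

import numpy as np

measurements = [4.0, 2.0, 7.0, 9.0]

# Pure Python: loop over the list element by element
doubled_list = []
for value in measurements:
    doubled_list.append(value * 2)

# NumPy: one vectorized operation on the whole array
doubled_array = np.array(measurements) * 2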

Setting up NumPy

I highly recommend installing Python with a data science platform such as Anaconda (https://www.anaconda.com/) that comes with NumPy and other science-critical packages.
To find out whether you already have NumPy installed with your distribution, try to import it:

import numpy as np

If that does not work, try to install NumPy with the package installer for Python (pip) by going to your command line. There, try:

pip install numpy

Finally, you can take a look at the docs for installation instructions: https://scipy.org/install.html

Three ways to create arrays

Now let’s create our first array. An array is a sequence of numbers so we can convert any Python sequence to an array. One of the most commonly used Python sequence is the list. To convert a Python list to an array we simply pass a list to the numpy array constructor

import numpy as np
my_list = [4, 2, 7, 9]
my_array = np.array(my_list)

This creates a NumPy array with the entries 4, 2, 7, 9. We can do the same with a tuple.

my_tuple = (4, 2, 7, 9)
my_array = np.array(my_tuple)

Of course we can also convert nested sequences to arrays and it works exactly the same way.

my_nested_list = [[4, 2, 7, 9], [3, 2, 5, 8]]
my_array = np.array(my_nested_list)

This is the first way to create arrays: pass a sequence to the np.array constructor. The second way is to use NumPy functions to create arrays. One such function is np.zeros.

zeros = np.zeros((3, 4))
zeros
array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

np.zeros gives us an array where each entry is 0 and it requires one argument: the shape of the array we want from it. Here we got an array with three rows and four columns, because we passed it the tuple (3, 4). This function is useful if you already know how many values you need (the structure) but not yet which values should be in there. You can pre-initialize an all-zero array and then assign the actual values to the array as you compute them. Another array creation function is called np.arange.

arange = np.arange(5, 30, 2)
arange
array([ 5,  7,  9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29])

np.arange gives us a sequence starting at 5, stopping at 30 (not including 30) and going in steps of 2. This function is very useful to generate sequences that can be used to index into another array. We will learn more about indexing in a future blog post. A function very similar to np.arange is np.linspace.

linspace = np.linspace(5, 29, 13)
linspace
array([ 5.,  7.,  9., 11., 13., 15., 17., 19., 21., 23., 25., 27., 29.])

Instead of taking the step size between values, np.linspace takes the number of entries in the output array. Also, the final value is inclusive (29 is the final value). Finally, the third way to generate NumPy arrays is with the np.random module. First, let’s look at np.random.randint.

randint = np.random.randint(5, 30, (3, 4))
randint
array([[26, 17, 26, 24],
       [20, 16, 29, 25],
       [25, 21, 26, 26]])

This creates an array containing random integers between 5 and 30 (non-inclusive), with 3 rows and 4 columns. If you try this code at home, the values of your array will (most probably) look different but the shape should be the same. Finally, let’s look at np.random.randn.

randn = np.random.randn(4, 5)  # Random numbers from normal distribution
randn
array([[-2.34229894, -1.43985814, -0.51260701, -2.58213476,  1.61196437],
       [-0.69767456, -0.0950676 , -0.22415381, -0.90219875,  0.33513859],
       [ 0.56432586, -1.62877834, -0.60056852,  1.37310251, -1.20494281],
       [-0.20589457,  1.34870661, -0.89139339, -0.40300812, -0.15703367]])

np.random.randn gives us an array with numbers randomly drawn from the standard normal distribution, a Gaussian distribution with mean 0 and variance 1. Each argument we pass to the function creates another dimension. In this case we get 4 rows and 5 columns.

Summary

We learned how to create arrays, the central NumPy object. Working with NumPy means working with arrays, and now that we know how to create them we are well prepared to get to work. In the next blog post we will take a look at some of the basic arithmetic functions we can perform on arrays and show that they are both more efficient and more readable than Python’s built-in functions.