NumPy Array Data Type

  • Any array has a data type (dtype)
  • The dtype determines what kind of data is stored in the array
  • Not all operations work for all dtypes

Introduction to Data Types

Having a data type (dtype) is one of the key features that distinguishes NumPy arrays from lists. In lists, the types of elements can be mixed. One index of a list can contain an integer, another can contain a string. This is not the case for arrays. In an array, each element must be of the same type. This gives the array some of its efficiency, because operations can know in advance, what kind of data they will find in each element simply by looking up the data type. At the same time it makes arrays slightly less flexible, because some operations are undefined for some data types and we cannot assign any kind of data to an array. But how does NumPy decide what data type an array should have in the first place?

Guessing or Defining the dtype

So far we were able to create arrays effortlessly without knowing what dtype even means. That is because NumPy will just take a guess, what the dtype should be, based on the input it gets for the array.

arr = np.array([4, 3, 2])
arr.dtype
# dtype('int32')
arr = np.array([4, 3.0, 2])
arr.dtype
# dtype('float64')
arr = np.array([4, '3', 2])
arr.dtype
# dtype('<U11')

In the first case, each element of the list we pass to the array constructor is an integer. Therefore, NumPy decides that the dtype should be integer (32 bit integer to be precise). In the second case, one of the elements (3.0) is a floating-point number. Floats are a more complex data type in Python, which means that all other data types have to follow the more complex one. Therefore, all elements of the array are converted to floats and are stored with the dtype float64. Strings are an even more complex dtype. Because ‘3’ is a string in the final example, the dtype becomes ‘<U11’. U stands for unicode, a type of string encoding and the number indicates the length of the string. In all three cases NumPy guesses the dtype according to the content of the list. This works well most of the time but we can also explicitly define the dtype.

arr = np.array([4, 3, 2], dtype=np.float)
arr.dtype
# dtype('float64')
arr = np.array([4, 3, 2], dtype=np.str)
arr.dtype
# dtype('<U1')
arr = np.array([4, 3, 2], dtype=np.bool)
arr.dtype
# dtype('bool')

Converting arrays to other dtypes can be necessary because some operations will not work on arrays of mixed types. A dtype that is particularly problematic is the np.object dtype. It is the most flexible dtype but it can cause a lot of problems for both experts and beginners.

np.object and the Curse of Flexibility

Most dtypes are very specific. They let you know if the array contains a number (np.int, np.float) or a string (all unicode ‘U’ dtypes). Not so much np.object. It tells you that whatever is inside the array is a thing. Because everything is an object anyway. This can make an array as flexible as a list. Anything can be stored. That is also where the problems come in.

arr = np.array([[3,2,1],[2,5]])
arr.dtype
# dtype('O')  # 'O' means object
arr + 5
# TypeError: can only concatenate list (not "int") to list

Suddenly, the plus operation between an array and a scalar fails. What went wrong? Starting from the top, NumPy decides to assign the dtype of np.object to arr because the nested list entries have different lengths. Think of it this way: this array can neither be a (2, 3) nor a(2, 2) array of dtype integer. Therefore, NumPy makes it a (2,) array of dtype object. So the array contains two lists, the first one is of length 3 and the second one of length 2. NumPy generally turns anything that is more complex than a string into np.object. A list is one of those that gets turned into np.object. The error then occurs because the plus operation is not defined for a list with an integer. But that also means, that the operation will work, if the objects contained in the array so happen to work with the operation.

arr = np.array([3,2,1], dtype=np.object)
arr.dtype
dtype('O')
arr + 5
array([8, 7, 6], dtype=object)

This is one of the main problem of the np.object dtype. Operations work only sometimes and to know if an operation will work, each element has to be checked. With other dtypes, we know which operations will work just by just looking at it.

Summary

The dtype is one of the concepts that is closely related to the internal workings of NumPy. There is a lot that could be said about the details but effective beginners only need to remember a few points. First, the dtype determines what is stored in the array. All elements of an array have to conform to a specific type and dtype tells us which one. Second, NumPy guesses the dtype based on the literal data unless we specify which dtype we want. Guessing works most of the time but sometimes explicit types conversion is necessary. Third, operations that we know and love from numeric types (np.int, np.float) may not work on other types (np.str, np.obect). This is particularly annoying for beginners. If you have hard to debug errors, find out what dtype your arrays actually have.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s