The Secret Life of Data

Recently, I was having a chat with another coder in the Real Python community about the difference between literal values in code and data. During that discussion, this coder typed the following:

Data has a life cycle inside of Python. It starts off life as literals, gets turned into objects, which gives us values to work with that have properties and methods we can use on that data.

This got me thinking about the levels of indirection that are in play whenever we talk about data in a computer program. While the discussion was specific to Python, the topic isn’t, and it’s something I want to explore a bit more.

Note that this will get a bit deep and parts of it are specific to Python.

First, we need some definitions.

I Mean, Literally…

Let’s define a literal. From the aforementioned discussion, we worked through the concept of a literal, and I think a good definition is:

literal: a constant value typed directly into your source code, which is evaluated at compile or interpretation time.

For example, if you type num = 5 in your code, then the 5 is a literal. It represents the value of 5, and is constant (i.e. it never changes and will always be 5).
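If you want to see the literal the way Python's parser sees it, the standard ast module can help. This is only a small sketch: the names tree, assignment, and literal_node are mine, and it assumes Python 3.8 or later, where literals show up as Constant nodes.

```python
import ast

# Parse a one-line program without running it.
tree = ast.parse("num = 5")
assignment = tree.body[0]

# The right-hand side of the assignment is the literal 5 as the parser saw it.
literal_node = assignment.value
print(type(literal_node).__name__)  # Constant
print(literal_node.value)           # 5
```

Nothing has been assigned to num here. The code was only parsed, which is enough for the literal to be recognized as a constant value.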

Eh, it Varies…

You can compare this to a variable, which for our purposes can be defined as:

variable: a named entity of a specific type, whose value can be accessed and modified.

In the example above, num is the variable. It has a name, a type (in this case, an integer), and we can access and change the value. For example, we can write print(num) to access and display the current value, and num = 7 to change it.
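Put together as a runnable snippet, that might look like the following. The call to type() is my addition, included only to show the type mentioned above.

```python
num = 5                    # bind the name num to the literal value 5
print(num)                 # access the current value: prints 5
print(type(num).__name__)  # the type of the value: prints int

num = 7                    # change the value
print(num)                 # prints 7
```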

Indirectly Speaking

With some definitions out of the way, let’s take a closer look at the leading quote:

Data has a life cycle inside of Python. It starts off life as literals, gets turned into objects, which gives us values to work with that have properties and methods we can use on that data.

What does this actually mean? Let’s walk through a practical example, starting with the literal 5.

On its own, it’s simply the value 5.

Direct storage
No indirection

However, this value needs to live somewhere in memory. So a memory location is set aside to store this value. This location has an address, which is what we use to locate the value. Using the memory address to find the value we need represents one level of indirection. This all happens behind the scenes of your compiler or interpreter, so you never have to deal with it.

First Indirection
Indirection level 1

Next, let’s consider the expression num = 5. This creates a variable called num, which sets a memory location aside to store it. As before, this memory location has an address, which is now linked to the variable name num. This is a second level of indirection — as the programmer, you reference the variable num, which actually contains the address it references, which finally contains the value stored there.

Second indirection
Indirection level 2

As before, your language compiler or interpreter hides some of this complexity from you. Compilers and interpreters maintain symbol tables, which link variable names to the memory address holding that variable’s value. When you access the variable, the compiler or interpreter looks up the variable name in the symbol table, then goes to that address to get the required value.
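In Python, the closest thing to a symbol table that you can poke at while a program runs is a namespace dictionary, which maps names to whatever they currently refer to. The sketch below is one way to watch that happen. It assumes we run two tiny programs with exec() against a dictionary of my own called namespace, rather than inspecting the interpreter's real internals.

```python
# A dictionary that will serve as the namespace for a tiny program.
namespace = {}

# Run "num = 5" using that dictionary as its global namespace.
exec("num = 5", namespace)
print("num" in namespace)  # True
print(namespace["num"])    # 5

# Assigning again updates the entry stored under the same name.
exec("num = 7", namespace)
print(namespace["num"])    # 7
```

At the top level of an ordinary script, the built-in globals() function returns the equivalent dictionary for your own variables.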

Of course, it doesn’t necessarily stop there.

Down the Rabbit Hole

Many Python programmers use the CPython interpreter. Internally, this interpreter stores everything as an object. You can think of an object as a single entity which contains multiple pieces of information about the data it holds, such as the type of the data as well as the value.

So when you type num = 5 in CPython, num now references an object. As a simplified picture, one part of this object is a property called value, which holds the address of the memory location in which the literal value of 5 is stored.

Python indirection
Indirection level 3

This represents another level of indirection: num references the underlying CPython object, which has a property called value which holds an address where the actual value is stored.
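You can peek at some of this from inside Python itself. The sketch below uses only real built-ins (type(), id(), and sys.getsizeof()). The value property described above is a mental model rather than an attribute you can access, so it doesn't appear here, and the name other is just for illustration.

```python
import sys

num = 5

print(type(num))           # <class 'int'>: the object records its own type
print(id(num))             # in CPython, the memory address of the object
print(sys.getsizeof(num))  # the object's size in bytes

other = num                # a second name for the same object
print(other is num)        # True
print(id(other) == id(num))  # True
```

The numbers printed by id() and sys.getsizeof() vary between machines and Python versions, so treat them as something to explore rather than values to rely on.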

Function Junction

Of course, programs which just store data get boring quickly. Coders also write functions, which perform specific calculations on data. For example, if you write print(num) in Python, you are calling a function called print() and passing in the current value of the variable num. This function accepts the data you pass in and displays its value on the screen. The data passed in is called an argument. Inside the function, that data is known by the name of a parameter, which you choose when you define the function.

Let’s take a look at a Python function which simply adds 2 to whatever value is passed to it and prints the result. That function might look like this:

def add_two(data):
    print(data + 2)

In your code, you can use this function by simply typing add_two(num), which is termed calling the function. What happens then?

  • The variable num references an object which has a value property which has the address of a memory location with the value. That value is retrieved.
  • That value is passed to the function, which stores it in the parameter data.
  • The parameter data also references an object which has a value property which has the address of another memory location, in which the value passed in is located.
  • This value is now gathered, 2 is added to it, and the result is passed to the print() function, which stores it in another parameter it defined.
  • That parameter also references an object which has a value property which has the address of another memory location, in which the value passed in is located.
  • That value is then displayed on the screen.
  • When the print() function ends, the parameter it used is removed from the symbol table.
  • When the add_two() function ends, the data parameter is also removed from the symbol table.
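The coming and going of the data parameter in the list above can be watched with a small sketch. The locals() call and the data_exists flag are my additions for illustration and aren't part of the original function.

```python
def add_two(data):
    # While the function runs, the parameter name lives in its local namespace.
    print(sorted(locals()))  # ['data']
    print(data + 2)

num = 5
add_two(num)  # prints ['data'] and then 7

# Once the call has returned, the name data is gone again.
try:
    data
    data_exists = True
except NameError:
    data_exists = False

print(data_exists)  # False
```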

Whew! That’s a lot of work to do something so simple.

So why all this indirection? Why is it all so complex?

Outward Simplicity Requires Internal Complexity

Most of us aren’t old enough to remember when automobiles were first invented and started appearing on our streets. They were loud, boxy affairs, with small engines, hard wheels, and controls which varied from maker to maker. The cars themselves were relatively simple, and they required a lot of input and maintenance on the part of the driver and owner to keep moving. Even starting an early car sometimes required the driver to exit the vehicle to crank it manually. In the early days, only enthusiasts owned cars, as they were the only people willing to do what it took to make them work.

Mr and Mrs Henry Ford in his first car

Modern cars, by comparison, are much simpler to operate, but much more complex internally. Electronics have largely replaced many previously manual features on cars, from starters to windows to getting fuel into the engine. Engineering advances have improved vehicle aerodynamics, braking systems, and even pneumatic tires to help smooth the ride and increase control. Automobile dashboards can tell us how much fuel is left, how fast we are driving, and even provide entertainment for long journeys. Children in most advanced countries can learn to operate automobiles in secondary school.

The simplicity of early automobiles made them quick and (relatively) easy to manufacture, but difficult for the average user to drive. As engineering advanced and complexity was added, they became much easier to operate. The added complexity hides certain necessary operating details from the user:

  • Automatic transmissions remove the need to monitor engine speeds and change gears.
  • Anti-lock braking systems remove the need to monitor tire skidding under heavy braking.
  • Power steering and brakes allow anyone, regardless of physical strength, to operate the car.

The same progression can be seen with programming languages.

In the early days of computing, programmers had to code directly on the CPU (what is now called bare metal). One level of abstraction above this is assembly language, which assigns cryptic words like mov, jne, and lea to the actions the processor can do. These instructions are very simple, but the processor carries them out very quickly. Programmers who wanted to accomplish something more complicated needed to break their ideas down into the simple steps the CPU could do. Programming was arduous, and required a deep understanding of the underlying computer system.

Soon, languages which could break down complex ideas into CPU instructions started being developed. These languages allowed new types of thinking about computational problems and opened the task of programming to generations of new people. Programmers now needed to know less about the computer itself and more about the language. However, these languages also introduced complex new ideas and structures of their own, which needed to be translated into simpler steps for the CPU to use.
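Python lets you watch a version of this translation with the standard dis module. One caveat: what you see is bytecode for the Python virtual machine, not instructions for your CPU, so take this as an analogy rather than the real thing. The add_two() function is the one from earlier, and steps is a name I made up for the sketch.

```python
import dis

def add_two(data):
    print(data + 2)

# Collect the names of the bytecode instructions CPython compiled
# add_two() into. The exact list differs between Python versions.
steps = [instruction.opname for instruction in dis.get_instructions(add_two)]

for step in steps:
    print(step)
```

A single line of Python turns into several small load, add, call, and return steps, which is the same idea at a smaller scale.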

Do you need to know how Python stores data internally to use Python? No. No more than you need to know how an automatic transmission works to drive a car.

But if you’re ever curious, know that there’s more you can learn…