What Lies Beyond

Last week, we talked about the concept of data types, which determine how a program interprets the contents of memory. We ended that post with the example string "hello" stored in memory:

Loc 1  Loc 2  Loc 3  Loc 4  Loc 5  Loc 6
0x68   0x65   0x6c   0x6c   0x6f   ????

And a few questions about it:

  • What’s in memory location 6?
  • How do we know to stop reading the string at memory location 5?
  • What if this was the number 448,378,203,247?

Let’s start with the last question, but before we do, a disclaimer.

There is a method to this…

In the preface to The TeXbook, Donald Knuth made the following statement:

Another noteworthy characteristic of this manual is that it doesn’t always tell the truth.

Dr. Knuth goes on to write that, when concepts are first introduced, they are stated only as generalities. As the reader proceeds to more complicated and in-depth material, the reader may find that rules stated earlier are contradicted, refined, or just plain wrong. He defends this practice as one which helps readers learn new ideas. It’s much easier to modify a rule you know than it is to absorb a complicated idea with lots of exceptions.

As a teacher, I whole-heartedly agree with the good doctor on this point.

Therefore, some of the ideas discussed here are approximations: generally correct, but possibly slightly off, a bit outdated, or just wrong. That’s OK for now, as we learn more about how code works at a lower level. This isn’t meant to be accurate reference material; there’s plenty of that out there. Learn the concept, and you can refine it later.

OK, on to the ideas!

There’s a Limit

When we talked about data types last time, we looked at a chart showing the names which are usually given to various data types:

Name            Short-hand
Character       char, chr
String          str
Integer         int, long, short
Floating-point  float, real, double
Boolean         bool

We said the short-hand name of the type is also what that type is called in various programming languages. Let’s expand on that a little, and introduce a new concept: the size of a data type. We’ll focus first on the integer data types.

You may have heard the terms 32-bit or 64-bit when referring to computers or operating systems. That label describes the computer architecture, and indicates the maximum size of a memory address used on that computer or operating system. Older 32-bit systems can only address 4 gigabytes of memory at one time, because the number of distinct addresses available is:

$$2^{32} = 4{,}294{,}967{,}296 \approx 4.3 \times 10^9$$

On a 64-bit system, theoretically you can address up to 16 exabytes of memory (an exabyte is roughly 1 billion gigabytes; counted in binary units, it is $2^{60}$ bytes), because the number of distinct addresses available is:

$$2^{64} = 16 \times 2^{60} \approx 1.8 \times 10^{19}$$

What does this have to do with the size of an integer? Well, the architecture of the machine and the software which runs on it define how big a number declared as int can be. On a 32-bit system, the biggest int is 32 bits, or four bytes (recall a byte is made up of eight bits). That allows you to use numbers from 0 through 4,294,967,295.
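If you want to check the arithmetic yourself, here is a small Python sketch. The helper names `address_count` and `max_unsigned` are made up for this example, and the gigabyte and exabyte constants assume binary units (powers of two).

```python
def address_count(bits: int) -> int:
    """Number of distinct addresses that fit in an address of `bits` bits."""
    return 2 ** bits


def max_unsigned(bits: int) -> int:
    """Largest unsigned integer that fits in `bits` bits."""
    return 2 ** bits - 1


GIB = 2 ** 30  # one gigabyte, counted in binary units (assumption)
EIB = 2 ** 60  # one exabyte, counted in binary units (assumption)

print(address_count(32) // GIB)  # 4
print(address_count(64) // EIB)  # 16
print(f"{max_unsigned(32):,}")   # 4,294,967,295
```

The largest value is one less than the count because counting starts at zero.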

But wait, I hear you say. Above, we used a bigger number – 448 billion, which you said used five bytes. How did that happen?

Simple – the computer used a different integer type, called a long. A long integer is usually defined as being twice as big as a normal integer, so a long would use eight bytes instead of four.

So what happened to the extra three bytes?

Remember that you can write the decimal number 15 as 015, or 0015. The leading zeros are usually dropped when you write the number. In this case, that’s what happened with the leading bytes for our example – we just didn’t show the leading zeros. If we had, the example would have looked like this:

Loc 1  Loc 2  Loc 3  Loc 4  Loc 5  Loc 6  Loc 7  Loc 8
0x00   0x00   0x00   0x68   0x65   0x6c   0x6c   0x6f

In fact, this is how this number is actually stored in memory, in eight consecutive bytes. (Some machines store the bytes in the reverse order, but the count is still eight.)
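Here is a quick way to see those eight bytes for yourself, sketched in Python. It assumes the most significant byte comes first ("big-endian"), which matches the layout shown above. The last line prints the reverse order, which is what many common machines actually use.

```python
number = 448_378_203_247

# Pack the number into eight bytes, most significant byte first.
stored = number.to_bytes(8, byteorder="big")

print(" ".join(f"0x{b:02x}" for b in stored))
# 0x00 0x00 0x00 0x68 0x65 0x6c 0x6c 0x6f

# The last five bytes are exactly the ASCII codes for "hello".
print(stored == b"\x00\x00\x00hello")  # True

# The same number with the least significant byte first ("little-endian").
reversed_order = number.to_bytes(8, byteorder="little")
print(" ".join(f"0x{b:02x}" for b in reversed_order))
# 0x6f 0x6c 0x6c 0x65 0x68 0x00 0x00 0x00
```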

NOTE: I said a long was usually defined as being twice as large as an int. It doesn’t have to be. Depending on the system and language, it may be the same size as an int (four bytes).

In any case, it will never be smaller than an int.

There are even bigger and smaller integers you can use as well, depending on the system architecture, the programming language being used, and your particular needs. The quad is usually defined to be as big as four int values, or sixteen bytes, while a short is usually half as big, or two bytes.
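Since these sizes depend on the system, the easiest thing is to ask your own machine. This sketch uses Python’s ctypes module to report the sizes of the C language’s short, int, and long types. The numbers you see are whatever your platform uses, so they may not match mine.

```python
import ctypes

# Sizes, in bytes, of three C integer types on the machine running this code.
sizes = {
    "short": ctypes.sizeof(ctypes.c_short),
    "int": ctypes.sizeof(ctypes.c_int),
    "long": ctypes.sizeof(ctypes.c_long),
}

for name, size in sizes.items():
    print(f"{name}: {size} bytes")

# A long is never smaller than an int, and an int is never smaller than a short.
print(sizes["short"] <= sizes["int"] <= sizes["long"])  # True
```

On many 64-bit Linux and macOS systems the long comes out as eight bytes, while on 64-bit Windows it is four, the same as an int.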

So now, you should be able to answer the original question: how does the computer know which memory to look at and when to stop? It knows because the data type used (in this case, a long) stores its data in eight bytes. Since we know the starting location, the computer starts reading memory there, and once eight bytes have been read, it stops.
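To make that concrete, here is a toy model in Python. The `memory` contents and the `read_long` helper are invented for illustration, and the sketch assumes an eight-byte long with the most significant byte first. Python counts positions from zero, so the number starts at position 2 here.

```python
LONG_SIZE = 8  # bytes in a long, as assumed in this post

# A toy "memory": two unrelated bytes, the eight bytes of our number,
# then one more unrelated byte.
memory = bytes([0xAA, 0xBB,
                0x00, 0x00, 0x00, 0x68, 0x65, 0x6C, 0x6C, 0x6F,
                0xCC])


def read_long(memory: bytes, start: int) -> int:
    """Read exactly LONG_SIZE bytes beginning at `start`, then stop."""
    chunk = memory[start:start + LONG_SIZE]
    return int.from_bytes(chunk, byteorder="big")


print(f"{read_long(memory, 2):,}")  # 448,378,203,247
```

The bytes before and after the number never get read, because the type told us exactly how far to go.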

Stop! In the Name of Null

In fact, almost all the basic types we listed earlier work this way. We can update our chart to show the approximate size of each type. Note that these sizes may differ based on the system or programming language you are using:

Name            Short-hand           Size
Character       char, chr            One byte
String          str                  Who knows?
Integer         int, long, short     Two, four, or eight bytes
Floating-point  float, real, double  Four or eight bytes
Boolean         bool                 One byte
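As one data point, Python’s struct module defines "standard" sizes for several of these types when packing them into bytes. This sketch prints a few of them. These are the struct module’s conventions, not a guarantee about every language or system.

```python
import struct

# The "<" prefix asks for struct's standard sizes rather than
# whatever the local machine happens to use.
formats = {
    "char": "<c",
    "bool": "<?",
    "float": "<f",
    "double": "<d",
}

standard_sizes = {name: struct.calcsize(fmt) for name, fmt in formats.items()}

for name, size in standard_sizes.items():
    print(f"{name}: {size}")
# char: 1
# bool: 1
# float: 4
# double: 8
```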

Wait, what’s up with the String type?

Remember, strings are just sequences of individual characters. Strings can be any length you want. The following are all strings:

  • "Hello" has five characters
  • "My name is Inigo Montoya" has twenty-four characters.
  • "You killed my father" has twenty characters.
  • "Prepare to die!" has fifteen characters.

Each of these strings has a different string length, which means we can’t use the same technique as we did for integers. We know where the string begins, but we need to know when to stop.
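You can confirm those lengths with a few lines of Python. Encoding each string as ASCII gives one byte per character, so the byte count is the character count.

```python
strings = [
    "Hello",
    "My name is Inigo Montoya",
    "You killed my father",
    "Prepare to die!",
]

for s in strings:
    encoded = s.encode("ascii")  # one byte per character
    print(len(encoded), s)
# 5 Hello
# 24 My name is Inigo Montoya
# 20 You killed my father
# 15 Prepare to die!
```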

There are several techniques which allow the computer to know the length of a string, but the simplest and most widely used is a technique called null termination or zero termination.

When using null-terminated strings, the computer stores a zero after the last character of the string. When it needs to read the string from memory, it starts reading at the first character, and keeps reading until it reads a zero.

So here’s what the ASCII string "hello" really looks like in memory:

Loc 1  Loc 2  Loc 3  Loc 4  Loc 5  Loc 6
0x68   0x65   0x6c   0x6c   0x6f   0x00
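Here is a rough Python sketch of reading a null-terminated string. The `read_c_string` helper and the two extra bytes at the end of `memory` are made up for this example. Python counts positions from zero, so location 1 above is position 0 here.

```python
# "hello", the terminating zero, then two unrelated bytes.
memory = bytes([0x68, 0x65, 0x6C, 0x6C, 0x6F, 0x00, 0x21, 0x21])


def read_c_string(memory: bytes, start: int) -> str:
    """Read characters from `start` until a zero byte is found."""
    end = start
    while memory[end] != 0:  # keep going until we read a zero
        end += 1
    return memory[start:end].decode("ascii")


print(read_c_string(memory, 0))  # hello
```

Notice that the two bytes after the zero are never read. Notice also that if the zero were missing, this loop would run off the end of memory and fail, so a null-terminated string is only as safe as its terminator.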

So now we should be able to answer the first two questions:

  • What’s in memory location 6?

    If it’s a string, then memory location 6 contains a zero.

  • How do we know to stop reading the string at memory location 5?

    Because we read the zero at location 6.

Of course, there are other ways to represent strings in memory. If you want to know more, let me know in the comments.

Next time: more complicated data types!