Today, we’ll tie two previous ideas together and see how the computer deals with them simultaneously.
But first, an apology for missing the past two weeks of posts. There were a few reasons:
- My auto-publish algorithm failed for some reason on July 6.
- My main machine, a Dell XPS 13, needed to be sent in for service this week. Apparently, when the keyboard rises to meet your fingers, that’s a sign the battery is no longer holding its original shape and size.
- Because I lost my laptop, my wife generously lent me her older Surface 2 Pro, which required some rerouting of cables and setting up some software.
Now, however, I’m back and you can expect posts to be back on track.
We now return to your regularly scheduled tech post.
Storing Text in Memory
Late last year, we talked about how computers can store text in memory by using encoding schemes like ASCII or Unicode. As a reminder, these schemes map the characters with which we write to numbers the computer can store.
Here’s a small portion of the ASCII table (which also makes up the first 128 entries in the Unicode table):
Character | Number | Description |
---|---|---|
… | ||
< | 60 (0x3C) | Less Than Sign |
= | 61 (0x3D) | Equals Sign |
> | 62 (0x3E) | Greater Than Sign |
? | 63 (0x3F) | Question Mark |
@ | 64 (0x40) | At Sign |
A | 65 (0x41) | Upper-case A |
B | 66 (0x42) | Upper-case B |
C | 67 (0x43) | Upper-case C |
… | ||
a | 97 (0x61) | Lower-case a |
b | 98 (0x62) | Lower-case b |
c | 99 (0x63) | Lower-case c |
… |
You can store any word you wish in computer memory by looking up each character in the table, and storing the corresponding number. For example, if I wanted to store the word "hello", it would be stored as the following five hexadecimal numbers:
Loc 1 | Loc 2 | Loc 3 | Loc 4 | Loc 5 |
---|---|---|---|---|
0x68 | 0x65 | 0x6C | 0x6C | 0x6F |
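If you’d like to try the lookup yourself, here is a minimal sketch in Python (my choice of language for illustration; nothing above depends on it). It uses Python’s built-in ord() function to find each character’s number, and the variable names are just placeholders.

```python
# Look up the number for each character in the word "hello".
word = "hello"
codes = [ord(character) for character in word]

# Show each character with its number in decimal and hexadecimal.
for character, code in zip(word, codes):
    print(character, code, f"0x{code:02X}")

# The same lookup in one step: encode the whole word using the ASCII table.
stored = word.encode("ascii")
print(list(stored))  # [104, 101, 108, 108, 111]
```

The last line prints the decimal versions of the five hexadecimal numbers in the table above.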
Storing Large Numbers
We also talked last year about storing big numbers in memory. As a reminder, if a number is too big to fit into one memory location, you split the number into smaller parts by repeatedly dividing it by 256 (the number of different values one memory location can hold, 0 through 255) and storing the remainders until you get to zero. In essence, this converts the number to hexadecimal and stores every two hex digits in a single memory location. If this sounds confusing, take a look at the big numbers article and the hexadecimal article again.
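The divide-and-keep-the-remainder steps can be sketched in a few lines of Python. The function name split_into_bytes is made up for this example, and listing the biggest part first is an assumption on my part; real machines differ on the order they use.

```python
def split_into_bytes(number):
    """Split a non-negative whole number into parts that each fit in one memory location."""
    if number == 0:
        return [0]
    remainders = []
    while number > 0:
        remainders.append(number % 256)  # the part that fits in one location
        number = number // 256           # what is still left to store
    # The remainders come out smallest part first. Reverse them so the
    # biggest part is listed first, the way we write numbers by hand.
    remainders.reverse()
    return remainders

parts = split_into_bytes(1000)
print(parts)                          # [3, 232]
print([f"0x{p:02X}" for p in parts])  # ['0x03', '0xE8']
```

Reading the two results together gives 0x03E8, which is 1000 in hexadecimal.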
This technique works for numbers of just about any size. For example, let’s say you’re analyzing financial data. You want to convert US dollars to Japanese yen. It’s possible you may have to store the number 448,378,203,247 somewhere (that much yen is about 4 billion dollars). The computer will convert this decimal number to hexadecimal and store it in memory like this:
Loc 1 | Loc 2 | Loc 3 | Loc 4 | Loc 5 |
---|---|---|---|---|
0x68 | 0x65 | 0x6C | 0x6C | 0x6F |
Do you see it? Take a closer look at what was stored in each example above.
The computer stored the exact same data for the word "hello" as it did for the decimal number 448,378,203,247.
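You don’t have to take my word for it. The Python sketch below builds the stored form of both the word and the number and compares them. It assumes the biggest part of the number is stored first ("big" byte order), which matches the tables above.

```python
# The word, stored using the ASCII table.
text_bytes = "hello".encode("ascii")

# The number, stored in five memory locations, biggest part first.
number_bytes = (448378203247).to_bytes(5, byteorder="big")

print(text_bytes.hex())            # 68656c6c6f
print(number_bytes.hex())          # 68656c6c6f
print(text_bytes == number_bytes)  # True

# Reading the very same data back two different ways:
print(number_bytes.decode("ascii"))                 # hello
print(int.from_bytes(text_bytes, byteorder="big"))  # 448378203247
```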
So the question you should ask yourself is: How does the computer know whether the data it is working with is a number or text?
Learning to Type
The simple answer is: It doesn’t. To the computer, all data is just numbers.
Interpreting and manipulating that data is what concerns humans. Therefore, the concept of the data type was invented.
Simply put, a data type (or type) tells the computer how specific data is intended to be used. More correctly, types tell specific programs how data is intended to be used. Different programs can treat the same data differently.
Almost every programming language defines a basic set of data types which coders can use to represent their data. At a basic level, most languages support the following data types:
- Strings and characters, used to represent textual data.
- A character can store an individual letter, like ‘a’, ‘P’, or ‘?’.
- A string is simply several individual characters strung together. For example, "Hello", "My Name is Jon", and "Feh!" are all strings.
- Integers, which are positive and negative whole numbers, and zero.
- The numbers 42, -294, and 1447 are all integers.
- Floating-point numbers, which represent positive and negative rational numbers.
- The point being referred to is the decimal point, and it’s called "floating" because it can float around and appear anywhere in the number.
- The numbers -273.15, 1.618, and 2,342.19 are all floating-point numbers.
- Boolean values, which represent True and False.
- Boolean data is often the result of a comparison. For example, the expression 10 > 5 evaluates as "True".
You may see short-hand names for these data types in your readings as well. Often, the short-hand name is also what the data type is called in a particular language:
Name | Short-hand |
---|---|
Character | char , chr |
String | str |
Integer | int , long , short |
Floating-point | float , real , double |
Boolean | bool |
We’ll dig into each of these in later blog posts. For now, it’s enough to know that if you read an article where someone talks about long data, they mean integers.
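To see how one particular language names these types, here is a small Python sketch. It is only an illustration: other languages use different names, and Python happens to have no separate character type (a single character is just a string of length one).

```python
# One example value for each basic data type.
examples = ["Hello", 42, -273.15, 10 > 5]

# Ask Python what it calls the type of each value.
for value in examples:
    print(repr(value), type(value).__name__)

# A single character is still a string in Python.
print(type("a").__name__)  # str
```

The loop reports str, int, float, and bool, in that order.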
Who’s Asking?
We mentioned above that different programs can treat the same data in different ways. What one program sees as text data might be interpreted and manipulated as integer data by another.
That’s only part of the story, as data might be interpreted in many different ways even within a single program. How can this be? Let’s look at an example, using our ASCII table above.
If you look carefully, you’ll notice something interesting about the encodings representing upper and lower case letters:
Character | Number | Description |
---|---|---|
… | ||
A | 65 (0x41) | Upper-case A |
B | 66 (0x42) | Upper-case B |
C | 67 (0x43) | Upper-case C |
… | ||
a | 97 (0x61) | Lower-case a |
b | 98 (0x62) | Lower-case b |
c | 99 (0x63) | Lower-case c |
… |
Let’s take a closer look at the numbers:
- The encoding for the upper-case letter "A" is 65 (0x41), while the encoding for a lower-case "a" is 97 (0x61). Subtracting one from the other gives us 32, or 0x20.
- For the upper-case "B" and lower-case "b", the encodings are 66 and 98, which are also 32 apart.
- Same for "C" and "c": they are separated by 32.
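Three pairs checked by hand don’t tell us about the rest of the alphabet, so here is a quick Python sketch that checks all 26. It leans on Python’s built-in ord() function and the letter lists in the standard string module.

```python
import string

# Pair each upper-case letter with its lower-case partner and measure the gap.
gaps = {
    upper: ord(lower) - ord(upper)
    for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase)
}

print(gaps["A"], gaps["M"], gaps["Z"])              # 32 32 32
print(all(gap == 32 for gap in gaps.values()))      # True
```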
In fact, it’s the same for all the letters: every lower-case letter is encoded as a number 32 higher than its upper-case counterpart. So what can we do with this?
Let’s say you wanted to convert a word from lower-case into upper-case. You could do it like this (the following is not real code, but pseudo-code — more on that later as well):
for each character in word:
    if character is 'a' then
        change character to 'A'
    if character is 'b' then
        change character to 'B'
    if character is 'c' then
        change character to 'C'
    ...
While this can work, it’s not very efficient, and it’s very tedious to write.
Instead of working harder, let’s work smarter and use the fact that all lower-case characters are encoded 32 places higher than their upper-case counterparts.
To do so, we first need to interpret characters as numbers. Most languages allow you to get the ASCII encoding for a character directly (again, this is pseudo-code):
> character = 'a'
> print ASC(character)
97 (0x61)
Once we convert a lower-case letter to its ASCII encoding, all we need to do is subtract 32 from it to get the upper-case encoding.
Now we need to be able to interpret that number as a character again. Most languages provide a way to do this as well:
> character = 'a'
> print ASC(character)
97 (0x61)
> new_character = ASC(character) - 32
> print CHR(new_character)
'A'
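ASC() and CHR() are stand-in names for the pseudo-code. In Python, to pick one real language as an example, the built-in functions ord() and chr() play those roles:

```python
character = "a"

# Interpret the character as a number.
code = ord(character)
print(code, hex(code))  # 97 0x61

# Subtract 32, then interpret the result as a character again.
new_character = chr(code - 32)
print(new_character)    # A
```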
Knowing that we can interpret the same data as a number using ASC() and as text using CHR(), let’s rewrite our pseudo-code to take advantage of that:
for each character in word:
    # Is it a lower case character?
    if character is lower_case then
        # Get the encoding for this character
        encoded_character = ASC(character)
        # Get the new character
        new_character = CHR(encoded_character - 32)
        # Make the change
        change character to new_character
How this works in a real programming language depends on the actual language, but the ideas can be implemented in any language.
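As one possible translation, here is a minimal Python sketch of the same idea. The function name to_upper_case is made up for this post, and the sketch only handles the plain letters a through z from the ASCII table. Python text can’t be changed in place, so it builds a new word instead of changing the old one.

```python
def to_upper_case(word):
    """Return a copy of word with the letters a-z converted to upper case."""
    result = []
    for character in word:
        # Is it a lower-case character?
        if "a" <= character <= "z":
            # Get the encoding, subtract 32, and turn it back into a character.
            character = chr(ord(character) - 32)
        result.append(character)
    return "".join(result)

print(to_upper_case("hello"))  # HELLO
print(to_upper_case("Feh!"))   # FEH!
```

In everyday Python you would simply call the built-in "hello".upper(), which also copes with letters outside the ASCII table. The sketch is only here to show the arithmetic at work.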
Next Time…
Remember how we stored our string "hello" in memory from above:
Loc 1 | Loc 2 | Loc 3 | Loc 4 | Loc 5 | Loc 6 |
---|---|---|---|---|---|
0x68 | 0x65 | 0x6C | 0x6C | 0x6F | ??? |
For next time, ask yourself:
- How do we know to stop reading the string at memory location 5?
- What if this was the number 448,378,203,247?
- And what’s in memory location 6?
Until then, stay curious.