Scraping the Web

A while ago, a friend approached me for help with a programming problem. They crunch data for a major metropolitan city’s finance office, and wanted to discuss some coding issues they were bumping into.

I need to mention my friend is a self-taught coder. Their experience is with marketing and advertising, not coding. While they have the skills to apply what they know to their problem, they don’t think like a coder.

That’s where I come in.

For this issue, my friend used Python and the library Beautiful Soup, which makes scraping web page data easier. They were gathering a set of business names, addresses, and phone numbers from the Yellow Pages web site to match against data the city already had.

They had working code which pulled all the data into three separate Python lists:

  1. A list of business names
  2. A list of addresses
  3. A list of phone numbers

The thought was that each element in these lists would correspond to the same business entity. That is, business_name[1] would be located at the address address[1], and have the phone number phone[1].

The problem my friend faced was that, after they gathered all this data, they had two fewer phone numbers than business names or addresses. That meant there was no guarantee of alignment between the three lists.
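To make the alignment problem concrete, here is a minimal sketch using made-up data (none of these businesses or numbers come from the real scrape). It assumes just one phone number is missing, which is already enough to shift every later entry:

```python
# Hypothetical data: the second business has no phone number on the page.
names = ["Acme Plumbing", "Bob's Bakery", "City Cycles"]
addresses = ["12 Main St", "34 Oak Ave", "56 Pine Rd"]
phones = ["(555) 010-0001", "(555) 010-0003"]  # Bob's Bakery is missing

# zip() pairs items by position and stops at the shortest list,
# so City Cycles is dropped and Bob's Bakery gets City Cycles' number.
records = list(zip(names, addresses, phones))

for name, address, phone in records:
    print(name, "|", address, "|", phone)
```

Nothing in the lists themselves tells you which business lost its phone number, so the mismatch cannot be repaired after the fact.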

I’ll be honest — I don’t know the Beautiful Soup library at all. I’ve never used it before. Before we spoke, I headed to the website to familiarize myself with it, look at some example code, and see if I could find something useful.

My friend used the `find_all` method, which returns a list of every element on the page that matches a given tag and filter. On this site, all the business names were in <div> tags with a specific class. Similarly, addresses and phone numbers each had their own class. This made it simple to use `find_all`, pass in the correct class, and get a single list containing all the data with that `class`.
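A rough sketch of that approach might look like the following. The HTML snippet and the class names (`listing`, `name`, `phone`, `address`) are invented for illustration, since I no longer have the real page markup; the helper `texts_for_class` is also hypothetical. Only `BeautifulSoup`, `find_all`, and `get_text` are actual Beautiful Soup calls.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's class names are not shown in this post.
html = """
<div class="listing">
  <div class="name">Acme Plumbing</div>
  <div class="phone">(555) 010-0001</div>
  <div class="address">12 Main St</div>
</div>
<div class="listing">
  <div class="name">Bob's Bakery</div>
  <div class="address">34 Oak Ave</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def texts_for_class(soup, class_name):
    """Return the text of every <div> that carries the given class."""
    return [
        tag.get_text(strip=True)
        for tag in soup.find_all("div", class_=class_name)
    ]


names = texts_for_class(soup, "name")
addresses = texts_for_class(soup, "address")
phones = texts_for_class(soup, "phone")

# The lists come back with different lengths because one listing has no phone.
print(len(names), len(addresses), len(phones))
```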

Now while I don’t know Beautiful Soup, I do know a bit about data normalization. To my eye, there were three problems.

The first was keeping the data properly related. After talking with my friend a bit, I mentioned that I expect all the data on any given business to be stored together. In other words, the business name, address, and phone number of any given business should be grouped together in a single data structure. Beautiful Soup returned the data in separate lists, but there was no reason it had to stay that way.
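One way to picture "a single data structure" is a small record type. This is only a sketch: the `Business` class and its field names are my own invention, not something my friend's code used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Business:
    """Hypothetical record keeping one business's details together."""
    name: str
    address: str
    phone: Optional[str] = None  # None when the page lists no phone number


businesses = [
    Business("Acme Plumbing", "12 Main St", "(555) 010-0001"),
    Business("Bob's Bakery", "34 Oak Ave"),
]

# A missing phone number is now visible on the record it belongs to.
missing = [b.name for b in businesses if b.phone is None]
print(missing)
```

With this shape, a missing value stays attached to the right business instead of silently shifting everything after it.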

Of course, this led to the second problem, which was how to get all the data on a single business together. We solved that by inspecting the page a little deeper. We noticed that while the business name, address, and phone number each had their own <div class> tags, there was also a parent <div class> for each entry which encompassed them all. If we used `find_all` to find the parent class, Beautiful Soup would return one element per business, and the text of that element held all the data on the business in a single string.
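Here is a hedged sketch of the parent-element idea. As before, the markup and the `listing` class name are assumptions made up for the example; the real site used its own class names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each business sits inside one parent <div>.
html = """
<div class="listing">
  <div class="name">Acme Plumbing</div>
  <div class="phone">(555) 010-0001</div>
  <div class="address">12 Main St</div>
</div>
<div class="listing">
  <div class="name">Bob's Bakery</div>
  <div class="address">34 Oak Ave</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# One string per business: get_text() joins the text of all child tags,
# here separated by a single space with surrounding whitespace stripped.
entries = [
    parent.get_text(" ", strip=True)
    for parent in soup.find_all("div", class_="listing")
]

for entry in entries:
    print(entry)
```

Each string now belongs to exactly one business, whether or not that business has a phone number.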

Which leads us to the final problem to solve. How could we split the data in that single string into the business name, address, and phone number? To solve this last problem, I noted two things:

  1. The string always had the data in the same sequence:
    a. Business name
    b. Phone number
    c. Address
  2. The phone numbers always have the same format, namely (###) ###-####.

Since the phone numbers always look the same, they are perfect for finding with a regular expression, which is basically a fancy pattern matcher. We told Python to find the phone number in the string using the known format. Once the phone number was found, we knew everything before it was the business name, and everything after it was the address. For the two entries where there was no phone number, we could either ignore the entry, or use the fact that the address usually starts with a number to break the data apart.
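The splitting step can be sketched roughly like this. The function name `split_entry` and the sample strings are hypothetical, and the fallback for entries without a phone number relies on the assumption above that the address begins at the first digit, which would misfire on a business name containing a number.

```python
import re

# Matches the known phone format: (###) ###-####
PHONE_RE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")


def split_entry(entry):
    """Split one entry string into (name, phone, address)."""
    match = PHONE_RE.search(entry)
    if match:
        # Everything before the phone is the name; everything after, the address.
        name = entry[:match.start()].strip()
        address = entry[match.end():].strip()
        return name, match.group(), address

    # No phone number: assume the address starts at the first digit.
    digit = re.search(r"\d", entry)
    if digit:
        return entry[:digit.start()].strip(), None, entry[digit.start():].strip()

    # Nothing to split on, so treat the whole string as the name.
    return entry.strip(), None, None


print(split_entry("Acme Plumbing (555) 010-0001 12 Main St"))
print(split_entry("Bob's Bakery 34 Oak Ave"))
```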

My friend took this information and rolled with it. About a week later, I heard they had picked up regular expressions and solved the business problem. All it took was an extra tool (namely regular expressions), and thinking about the problem a little differently.