
Assignment II: More Pasta, Please! 🤌

As a Pasta Fanatic

I'm a fanatic of Italian food, but not of pizza, espresso (coffee), or gelato (ice cream). Instead, I'm crazy about pasta. There are about 200 types of pasta, each with its own geometry and history.

[Image: types of pasta]

So, here comes the question: which restaurant in Hong Kong serves the most types of pasta?

The Long and Winding Road

My web scraping target is OpenRice, which is simple and straightforward. Filtering by the Italian restaurant category gives 949 restaurants.

Step 1: Data Scraping

I select the restaurant title first. Then I add some supplementary information: the numbers of bookmarks, upvotes, and downvotes.

[Screenshot 1]

Menu information is much harder to get. I have to click into the restaurant page, and then click again into the menu page, so that is two additional moves. The menu itself is also a mess, so I use the CSS selector “poi-menu-item-info-name” to get all the dishes. I cannot select only the pasta dishes at this stage, because they do not have a unique selector of their own.

[Screenshot, 2022-10-17 17.21.52]

Besides, ParseHub has a quota of 200 pages, and the main page, restaurant pages, and menu pages all count toward it. In the end, I only got about 100 restaurants.

Step 2: Data Cleaning

Data cleaning is the most complicated part of my project. At first, the table looks like this.

[Screenshot, 2022-10-17 17.30.18]

I have to rename the headers and add an ID number, which I do in OpenRefine. Then I use Python to sort out the ID numbers and remove the duplicates.

The next step is to get the list of pasta types from Wikipedia and count the menu dishes against that list. Two nested for loops and the find function tackle the problem. Once the counting is done, the results are written to a new CSV file with Python's csv library.
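The counting step could look roughly like the sketch below. It is not the exact script from this project: the pasta list, the sample menus, the `Restaurant` column name, and the helper names `count_pasta_types` and `write_results` are all made up for illustration, and it assumes that each pasta type is counted once per restaurant. Restaurants without a text menu get -1.

```python
import csv
import io

# Hypothetical sample data. The real list comes from Wikipedia and
# the real menus come from the scraped table.
PASTA_TYPES = ["spaghetti", "penne", "linguine", "fettuccine", "ravioli"]

menus = {
    "Restaurant A": ["Spaghetti Carbonara", "Penne Arrabbiata", "Spaghetti Bolognese"],
    "Restaurant B": ["Margherita Pizza", "Tiramisu"],
    "Restaurant C": [],  # no text menu available
}


def count_pasta_types(dishes, pasta_types):
    """Count the distinct pasta types named in a menu, or return -1 if the menu is empty."""
    if not dishes:
        return -1
    found = set()
    for dish in dishes:                 # outer loop: every dish on the menu
        for pasta in pasta_types:       # inner loop: every known pasta type
            if dish.lower().find(pasta) != -1:
                found.add(pasta)
    return len(found)


def write_results(menus, pasta_types, out_file):
    """Write one row per restaurant with its pasta type count."""
    writer = csv.writer(out_file)
    writer.writerow(["Restaurant", "Pastatypes"])
    for name, dishes in menus.items():
        writer.writerow([name, count_pasta_types(dishes, pasta_types)])


# Write to an in-memory buffer here; a real run would pass
# open("some_file.csv", "w", newline="") instead.
buffer = io.StringIO()
write_results(menus, PASTA_TYPES, buffer)
print(buffer.getvalue())
```

A plain substring search like this is simple, but it can miscount when one pasta name is contained in another word, so the Wikipedia list may need some manual checking.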

Here is the final CSV.

[Screenshot, 2022-10-17 17.35.24]

Step 3: Data Analysis

Data analysis is easier. I use two SQL commands. The first is “SELECT * FROM Pasta_result ORDER BY Pastatypes DESC”, which ranks the restaurants by their number of pasta types. The second is “SELECT Pastatypes, count(Pastatypes) AS total FROM Pasta_result GROUP BY Pastatypes”, which counts how many restaurants there are for each number of pasta types.
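To show what the two queries return, here is a small sketch that runs them with Python's built-in sqlite3 module on an in-memory table. The rows and the `Restaurant` column are invented for illustration; only the table name `Pasta_result` and the column `Pastatypes` come from the queries above.

```python
import sqlite3

# In-memory database with a few made-up rows (not the real results).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Pasta_result (Restaurant TEXT, Pastatypes INTEGER)")
conn.executemany(
    "INSERT INTO Pasta_result VALUES (?, ?)",
    [
        ("Restaurant A", 4),
        ("Restaurant B", 2),
        ("Restaurant C", 2),
        ("Restaurant D", -1),  # -1 marks a restaurant without a text menu
    ],
)

# Query 1: rank restaurants by their number of pasta types.
ranked = conn.execute(
    "SELECT * FROM Pasta_result ORDER BY Pastatypes DESC"
).fetchall()

# Query 2: count how many restaurants share each number of pasta types.
distribution = conn.execute(
    "SELECT Pastatypes, count(Pastatypes) AS total "
    "FROM Pasta_result GROUP BY Pastatypes"
).fetchall()

print(ranked)
print(distribution)
conn.close()
```

The second query is the one that feeds a bar chart: each row pairs a pasta type count with the number of restaurants that have it.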

Mamma Mia! Give Me More Pasta!

The result is a bit frustrating. First, many restaurants don't have text menus, which is a bit lazy of them. These are marked with the number -1. Admittedly, I could use OCR to extract the text from the menu images, but looking through these pictures, I don't see many pasta types there either. Second, there are so few pasta types. I cannot believe the largest number is 4.

[Picture 1]

Here is the bar chart, which makes the results easier to see and compare.

[Picture 2: bar chart]

It is no surprise, though. Pizza is more popular than pasta in Asia. When it comes to Italian food, pizza is the first dish that comes to mind. Also, most of my friends know little about these pasta types, and some even treat spaghetti and pasta as the same thing.

Still, I do hope that some restaurants focus on pasta. Maybe some of the restaurants with image menus offer lots of pasta types. I guess I should look through all of them one by one, or add OCR to my Python program.

Besides pasta, the menus are another issue. Most of the restaurants on Yelp or Dianping have text menus, while most of those on OpenRice only have menu pictures uploaded by users. This is without doubt a bad design.

I have lots of improvements to make to this project, and OpenRice has some reflecting to do as well.
