analysis-gb7cea70b3_1920-97f89824

A New Way to Find the Best Restaurants

Web scraping can be useful in the digital age, as it makes it possible to collect extensive information on the internet for data analysis. In my second assignment, I went through the basic process of web scraping, data cleaning, and data analysis. It was enjoyable to play with data using Parsehub, Openrefine and SQL.

Since I am living in Fanling, I decided to conduct research on the restaurants in Fanling. Initially, my intention was to find out the gourmet center of Fanling. Using Parsehub, I collected data of more than 200 restaurants in Fanling, including restaurant names, addresses, prices, categories, likes, dislikes, and marks.

When I imported the data into Openrefine for cleaning, I found that the data crawled was quite well-structured, yet there were still some fields needed transformation. I tried to split the category into two fields as many restaurants have two levels of category. At first, I failed to make it by splitting them by space, as the two categories of each restaurant were divided by a blank line. Then I used regular expression by setting “/n” as the separator, and it worked. Another interesting task was to deal with “K” in the field “marks”. My solution was to delete “K” using replace function first, and utilize if function to switch the floats to integers. To better measure the performance of each restaurant, I also generated a new field called “net sentiment” (formula: likes – dislikes) in Openrefine.

After playing with Openrefine for a while, I realized that it was difficult to answer my questions about the gourmet center, because I did not find a way to effectively and logically group the address information. Therefore, I changed my questions as follows:
1. What are the best 10 restaurants in Fanling?
2. What types of restaurants are most common in Fanling?
3. Which cuisines are most appreciated in Fanling?

With the manageable questions in mind, the remaining procedures went smoothly. To answer the first question, a query containing “ORDER BY” could easily solve the problem. For the second question, I used “COUNT” and “GROUP” to find out the number of restaurants in each category. The logic to figure out the last question is quite similar to the second one – “GROUP” is also required, while “AVG” is needed instead of “COUNT” to show the average performance of the restaurants in different categories. I also set a filter to only take into account the cuisines with at least 8 restaurants by “HAVING”, as it could to some extent reduce the bias caused by small number of total restaurants.

Thanks for reading my journal. You can access my website here.

Similar Posts