
BTA#2 | OpenRice Tsuen Wan – Applying “Likes Rate” to Restaurant Selection

Hi there! It’s Mariana.

This week I’m handling data scraping, cleaning and analysis using ParseHub, OpenRefine, SQLite and Python. Here’s a link to my page.

Personally, OpenRice is my go-to reference when I’m trying to find somewhere to eat, especially in an area that’s unfamiliar to me.

During my very first two weeks in Hong Kong, one area I visited frequently was Tsuen Wan, which is also why I chose it as my target area for analysis.

After settling on the target area, the first thing I did was actually not to figure out what problem I wanted to solve with the analysis. Each OpenRice page contains a lot of information marked by different fonts, colors and icons… and a lot of ads as well. The site’s cluttered visual experience constantly distracted me when I was trying to find relationships between all those numbers. So I decided to first take down the data for all the major metrics appearing on the page. This also turned out to be a good exercise in organizing my thoughts when scraping data from such a “maximalist” website.

Challenge #1: The displayed metrics for some items are different


At first I tried to scrape data directly from the search result page. However, if we take a look at the items listed, we can see that the metrics shown differ between sponsored restaurants and normal restaurants, which caused the automatic identification to fail to place elements into the same column.

To solve the problem, I had to scrape some of the data from each restaurant’s detail page instead of the search result page, which required building an extra template for the detail page and adding a click action to connect the two templates. Shifting from my first method to the new one was a bit tricky, because I had to reconsider which metrics to collect from the result page and which from the detail pages.

Overall, it seems to me that scraping data from a site can be a detail-oriented process that requires a lot of patience. It’s important to observe how the data is arranged on the page and to keep going back to check whether the system has put the right data in the right place.

Challenge #2: Cleaning repetitive content

When I exported the CSV file from ParseHub, I found that three restaurants appeared in the table several times. Except for the first occurrence, the repeated rows had no data in the metric columns.

I went back to the site to find out why the three of them were included again and again in the results, and it turned out that all three were sponsored restaurants pinned to the top of each page. Every time the template repeated, the three of them were counted again. Since the repeated rows were followed by a run of blank cells, I used OpenRefine’s “Facet by blank” function to remove them.
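For anyone reproducing this step in code instead of OpenRefine, the same blank-row cleanup can be sketched in pandas. The column names and sample rows below are my own assumptions standing in for the real exported CSV:

```python
import pandas as pd

# Hypothetical sample mimicking the ParseHub export: sponsored restaurants
# reappear with blank metric columns each time the template repeats.
df = pd.DataFrame({
    "name": ["A", "B", "C", "A", "B"],
    "likes": [120, 85, 60, None, None],
    "dislikes": [10, 5, 8, None, None],
})

# Drop rows whose metric columns are all blank -- roughly what
# "Facet by blank" plus removing the matching rows does in OpenRefine.
cleaned = df.dropna(subset=["likes", "dislikes"], how="all").reset_index(drop=True)
print(len(cleaned))
```

With `how="all"`, a row is only removed when every listed metric column is blank, so legitimate rows with a single missing value survive.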

Challenge #3: Grouping, sorting and counting

Instead of going into details, I first wanted an overview of the proportions of each food type and price range in the area. Such a result can be really useful when you’re sitting at home trying to decide which area offers more choices of the food you like within your preferred price range.

To achieve this, I didn’t write the SQL directly. Instead, after pulling the needed data out of the database, I grouped it by the “type” and “avg_expense” columns, counted the rows in each group using pandas’ groupby() and count() functions, and finally used sort_values() to rank the results.
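A minimal sketch of that grouping step, assuming column names like mine (the sample rows are made up, not the real Tsuen Wan data):

```python
import pandas as pd

# Hypothetical rows standing in for the cleaned restaurant data.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "type": ["Japanese", "Cantonese", "Japanese", "Thai", "Japanese"],
    "avg_expense": ["$51-100", "$101-200", "$51-100", "$51-100", "$101-200"],
})

# Count restaurants per (type, avg_expense) group, then rank descending.
counts = (
    df.groupby(["type", "avg_expense"])["name"]
      .count()
      .sort_values(ascending=False)
)
print(counts)
```

The result is a Series indexed by the (type, price range) pairs, with the most common combination first.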

For the subsequent likes/dislikes rankings, however, I simply applied the sorting in SQL, writing ORDER BY … DESC to produce a descending ranking. The whole process of grouping, counting and sorting was quite novel to me; I spent some time learning about these functions, and they turned out to be pretty useful.
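The SQL ranking step looks roughly like this through Python’s sqlite3 module; the table schema and values here are assumptions for illustration:

```python
import sqlite3

# In-memory database with a hypothetical "restaurants" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE restaurants (name TEXT, likes INTEGER, dislikes INTEGER)")
conn.executemany(
    "INSERT INTO restaurants VALUES (?, ?, ?)",
    [("A", 120, 10), ("B", 85, 5), ("C", 200, 40)],
)

# Descending ranking by number of likes: the ORDER BY ... DESC pattern.
rows = conn.execute(
    "SELECT name, likes FROM restaurants ORDER BY likes DESC"
).fetchall()
print(rows[0])  # ('C', 200)
conn.close()
```

Swapping `likes` for `dislikes` in the ORDER BY clause gives the other ranking.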

Challenge #4: How to better understand likes/dislikes? Introducing the “Likes Rate”

The problem that I aimed to solve with the analysis was “how to use likes/dislikes smartly as a metric to pick a restaurant”.

Unlike other platforms such as Dianping, OpenRice’s rating system uses 0.5-star increments instead of 0.1-star ones. Such coarse granularity makes it hard to compare restaurants, because most of them are rated either 4 or 4.5 stars. In this case, the numbers of likes and dislikes may be a better criterion to refer to.

At the very beginning, I planned to simply rank restaurants by their number of likes or dislikes. The limitation of this method is that some popular restaurants appear at the top of both rankings, which means it’s unwise to judge a restaurant only by its raw number of likes or dislikes, regardless of their distribution. That’s when I decided to introduce a new criterion called “Likes Rate”, so that I could check the proportion of positive evaluations for each restaurant.

Likes Rate = number of likes / (number of likes + number of dislikes)

I created a new column with the same name using SQL and made a new ranking according to it. The new ranking helps identify restaurants with relatively stable performance as well as popularity. Please check out the recommendations I finalized with the help of this analysis on my site!
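A sketch of computing that ranking by the formula above in SQLite (the schema and values are hypothetical, and the cast to REAL guards against integer division):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE restaurants (name TEXT, likes INTEGER, dislikes INTEGER)")
conn.executemany(
    "INSERT INTO restaurants VALUES (?, ?, ?)",
    [("A", 120, 10), ("B", 85, 5), ("C", 200, 40)],
)

# Likes Rate = likes / (likes + dislikes), ranked descending.
rows = conn.execute(
    """
    SELECT name,
           ROUND(CAST(likes AS REAL) / (likes + dislikes), 3) AS likes_rate
    FROM restaurants
    ORDER BY likes_rate DESC
    """
).fetchall()
print(rows)
```

Note how C, the restaurant with the most likes in absolute terms, drops below A and B once the proportion of dislikes is taken into account; that is exactly the distinction the Likes Rate is meant to capture.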

So this is it! So much for this week’s assignment.

Wish you all the best and hope you like it.

Mariana
