Journal 002

Assignment 2: What’s for dinner in Tsim Sha Tsui

For this assignment, I put myself in the shoes of a tourist with a budget under $200, looking for restaurants with a relatively good reputation in Tsim Sha Tsui, the most prosperous area in Hong Kong.

When scraping the data with ParseHub, at first I only collected the number of comments, likes, and dislikes, but not the actual rating score. To get the ratings, the scraper has to enter each restaurant’s detail page. To train ParseHub to do this, you create the relative relationship and then click “no” when it asks, “Is this a next-page button?” Figuring out this logic cost me quite a lot of time; it pays to make sure each step is right and then do a trial run, otherwise the process becomes very inefficient.

The raw data scraped from the website contained 2,084 records in total. My first step was to remove the 424 records with a blank rating; I kept those with a blank secondary category, as I didn’t think it would affect my analysis. Then I deleted duplicates, treating two records as the same restaurant only when both the name and the address matched. Next, I regrouped the restaurants’ primary categories into 8 groups based on OpenRice’s cuisine classification, merging Hong Kong and Cantonese cuisine into one category and keeping Japanese cuisine separate from the other Asian cuisines because of its large number of records. A rough pandas sketch of these steps follows.
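The cleaning could be reproduced in pandas with something like the sketch below; the file name and column names are illustrative assumptions, not the actual schema from my scrape, and the category mapping shows only a fragment of the 8 groups.

```python
import pandas as pd

# Illustrative sketch; "openrice_raw.csv" and the column names
# are assumptions, not the real schema from the scrape.
df = pd.read_csv("openrice_raw.csv")            # 2,084 raw records

# Step 1: drop records with a blank rating (424 in my data),
# but keep rows whose secondary category is blank.
df = df.dropna(subset=["rating"])

# Step 2: remove duplicates, where a duplicate means BOTH the
# name and the address are identical.
df = df.drop_duplicates(subset=["name", "address"])

# Step 3: regroup the primary category into 8 broader cuisine
# groups (only a fragment of the mapping is shown here; the
# remaining OpenRice categories would need entries too).
category_map = {
    "Hong Kong Style": "HK / Cantonese",
    "Cantonese": "HK / Cantonese",
    "Japanese": "Japanese",
}
df["category_group"] = df["primary_category"].map(category_map)
```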

For the addresses, I used Nearest Neighbor Clustering with the radius set from 1 to 4 to group them. The trick here is that, since I only wanted to focus on the road each restaurant is on, I split the column and kept just the first 7 characters of the address, then used the text facet function to get the groups. This made the clustering somewhat more efficient.
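In pandas, the same prefix trick is essentially a one-liner; again, the column name is an assumption:

```python
# Keep only the first 7 characters of the address, which usually
# covers the street name, then count restaurants per street prefix.
df["street_key"] = df["address"].str.strip().str[:7]
print(df["street_key"].value_counts().head(10))
```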

Finally, I got a table with 10 columns and 1,644 rows of records. I analyzed it mainly from three perspectives, primary category, address, and price range, and created a word cloud of the secondary categories. Learning to calculate percentages with SUM(COUNT()) OVER() is what bothered me the most: COUNT() is an aggregate computed within each group, while SUM(...) OVER() is a window function applied on top of the grouped result, so the two cannot be read as one ordinary nested call, and it took me a while to see how they fit together. Moreover, to create the word cloud I first needed to install the wordcloud module in Anaconda and handle the Cantonese characters, which confused me for a long time. In the end, I watched a tutorial video and applied an existing font package to solve the problem. Both tricks are sketched below.
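To make the SUM(COUNT()) OVER() idea concrete, here is a sketch of the kind of query I mean, run through Python’s built-in sqlite3. The database, table, and column names are hypothetical, and the pattern needs an engine with window functions (SQLite 3.25 or later, for example):

```python
import sqlite3

# Percentage of restaurants per cuisine group.
# COUNT(*) is computed per group first; SUM(COUNT(*)) OVER () then
# sums those group counts across all rows of the grouped result.
con = sqlite3.connect("restaurants.db")   # hypothetical database
query = """
SELECT
    category_group,
    COUNT(*) AS n,
    ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS pct
FROM restaurants
GROUP BY category_group
ORDER BY n DESC;
"""
for row in con.execute(query):
    print(row)
```

And the font problem boils down to pointing WordCloud at a font file that contains Chinese glyphs; the font file name below is just an example, and df is the cleaned table from the earlier sketch:

```python
from wordcloud import WordCloud

# Build the word cloud from the secondary-category labels.
# font_path must point to a font with CJK glyphs, otherwise the
# characters render as empty boxes; this file name is an assumption.
text = " ".join(df["secondary_category"].dropna())
wc = WordCloud(font_path="NotoSansTC-Regular.otf",
               width=800, height=400,
               background_color="white").generate(text)
wc.to_file("secondary_category_cloud.png")
```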

During this process, I found that some cuisine types have a high rate of poor ratings yet also a high average rating. I don’t think this is informative, as the numbers of likes and dislikes are too small to be reliable, so I decided to exclude these two factors from my subsequent analysis and focus only on the average rating.

I got some interesting findings, which you can check on my website. Frankly, I enjoyed this work more than Assignment 1. Using Python and SQL for data processing gives me a greater sense of control than HTML and CSS do, and lets me know exactly where a problem lies. Also, although I had never used Python before, my limited prior knowledge of R helped, and I found Python a relatively efficient tool for data processing, with easier manipulation and clearer logic. The two languages have a lot in common, but if you asked me to choose one to go deeper with, I would definitely choose Python!
