
Popular on RED? Explore the food at Kwai Chung Plaza!

My second assignment was even more excruciating than the first. Still, it let me apply my ability to scrape, clean, and arrange data in the real world, and the process was both painful and joyful. You can visit my page by clicking here.

OpenRice was the first local app I used when I came to Hong Kong (besides LeaveHomeSafe), and it helped me learn about local food. I also often use RED, an image-based social media platform, to keep up with local information in Hong Kong. While I was browsing RED, many posts about Kwai Chung Plaza were pushed to my homepage, and my interest in the place prompted me to conduct this analysis.

1. Data scraping
After deciding on the topic, I selected the restaurants in the Kwai Fong area on OpenRice and used ParseHub to scrape each restaurant's name, address, cuisine, price, number of collections, and numbers of positive and negative ratings.

Several of the scraped rows contained URLs and ad text instead of restaurant data. They were scraped incorrectly because of special cases on the page, and I chose to deal with them later, in the data cleaning step.

2. Data cleaning
This step was the most difficult part of the whole process for me, because I am not familiar with GREL, the expression language used in OpenRefine, and had to refer to the user manual and technical reference again and again for each step.

First, I used a text filter to find all the cells containing URLs and deleted the matching rows to remove the redundant content. After that, “Edit cells > Transform” and “Edit cells > Common transforms” were the two functions I used most frequently. For instance, to remove the unit from the review counts, I used value.replace("reviews", "") to replace the unit “reviews” with an empty string, and then converted the cells to numbers so that subsequent calculations could be performed.
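For readers who prefer to see the same idea outside OpenRefine, here is a minimal Python sketch of the unit-stripping step. The function name strip_unit and the sample value are my own illustrations, not part of the original workflow.

```python
def strip_unit(raw: str, unit: str = "reviews") -> int:
    """Remove a unit label from a scraped cell and convert the rest to an integer."""
    # Same idea as the GREL call value.replace("reviews", ""),
    # followed by the "to number" common transform.
    cleaned = raw.replace(unit, "").strip()
    return int(cleaned)


# Illustrative input only; real cells may need extra handling (e.g. empty strings).
review_count = strip_unit("128 reviews")
```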

The biggest obstacle was cleaning the “collection” data, because numbers over a thousand use “K” as a unit while numbers under a thousand have no unit. How to adjust these two groups separately was a problem that took me a long time to overcome. First, I converted all the cells to numbers; the cells with a unit could not be converted and were still recognized as strings. Next, my goal was to edit only the cells that were still strings, so I used the expression if(value.type() == "string", toNumber(substring(value, 0, -1)) * 1000, value). The if function applies the edit only to values of type string and leaves the numbers unchanged. Inside it, the GREL substring function removes the last character, which is the unwanted unit “K”. After removing the unit, the remaining text (which may contain a decimal point) still has to be turned from a string into a number, so toNumber is used in this step. Finally, the result is multiplied by 1000 to get the actual number that the “K” stood for.
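The logic of that expression can be sketched in Python as follows. This is only an illustration under the same assumption as above, namely that every cell that is still a string ends with a single unit character; the function name and sample values are hypothetical.

```python
def collection_to_number(value):
    """Convert a collection count such as "1.2K" to 1200; pass plain numbers through."""
    if isinstance(value, str):
        # value[:-1] drops the trailing unit, like substring(value, 0, -1) in GREL.
        # round() turns the product into an int and avoids floating-point residue.
        return round(float(value[:-1]) * 1000)
    # Cells that were already converted to numbers are left unchanged.
    return value


# Illustrative mixed column: two cells with a unit, one without.
cleaned = [collection_to_number(v) for v in ["1.2K", 856, "3K"]]
```

Checking the type first is what lets one expression handle both groups of cells in a single pass.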

3. Data analysis
In this step, my analysis was divided into three parts.
– The variety of restaurant cuisines
– Favorable Rate of Restaurants with Different Cuisines
– Analysis of Popular Shops in Kwai Chung Plaza

In the first part, I counted the number of restaurants of each cuisine by grouping the rows by cuisine type with the COUNT and GROUP BY functions.
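A minimal sketch of this kind of query, run through Python's built-in sqlite3 module, might look like the following. The table name restaurants, its columns, and the three sample rows are made up for illustration and are not the scraped data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE restaurants (name TEXT, cuisine TEXT)")
# Made-up sample rows, only to show the shape of the query.
conn.executemany(
    "INSERT INTO restaurants VALUES (?, ?)",
    [("Shop A", "Japanese"), ("Shop B", "Thai"), ("Shop C", "Japanese")],
)

# COUNT(*) with GROUP BY yields one row per cuisine and its restaurant count.
cuisine_counts = conn.execute(
    """
    SELECT cuisine, COUNT(*) AS n
    FROM restaurants
    GROUP BY cuisine
    ORDER BY n DESC, cuisine
    """
).fetchall()
conn.close()
```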
In the second part, I planned to calculate the Favorable Rate as positive ratings / (positive ratings + negative ratings), but the calculation produced an error. So I first converted the data with CAST AS FLOAT and then calculated the average. Finally, I used the ROUND function to limit the number of decimal places.
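The post does not say what the error was. One common cause in SQL is that dividing two integer columns gives an integer result, which is what the sketch below assumes. It uses Python's sqlite3 module, and the table, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE restaurants (cuisine TEXT, positive INTEGER, negative INTEGER)"
)
# Made-up ratings for illustration only.
conn.executemany(
    "INSERT INTO restaurants VALUES (?, ?, ?)",
    [("Japanese", 3, 1), ("Japanese", 1, 3), ("Thai", 4, 0)],
)

# In SQLite, integer / integer truncates, so 3 / (3 + 1) evaluates to 0.
int_division = conn.execute("SELECT 3 / (3 + 1)").fetchone()[0]

# Casting the numerator to FLOAT forces real division before averaging.
favorable_rates = conn.execute(
    """
    SELECT cuisine,
           ROUND(AVG(CAST(positive AS FLOAT) / (positive + negative)), 2) AS rate
    FROM restaurants
    GROUP BY cuisine
    ORDER BY cuisine
    """
).fetchall()
conn.close()
```

Only one side of the division needs the cast; once the numerator is a float, the whole expression is evaluated as a real number.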
The third table contains several types of data. I used the number of collections as a measure of how popular a restaurant is, ranked the restaurants by number of collections from highest to lowest, and selected the top 15 for analysis. I then compared their average maximum price, average favourable rate, and number of collections to explore the characteristics of the restaurants in Kwai Chung Plaza.
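The ranking step can be sketched with ORDER BY and LIMIT, again through Python's sqlite3 module. The table, columns, and rows here are hypothetical, and the sketch keeps the top 2 of 3 sample rows where the real analysis kept the top 15.

```python
import sqlite3


def top_by_collections(conn: sqlite3.Connection, top_n: int):
    """Return the top_n restaurants ranked by number of collections, highest first."""
    return conn.execute(
        "SELECT name, collections FROM restaurants "
        "ORDER BY collections DESC LIMIT ?",
        (top_n,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE restaurants (name TEXT, collections INTEGER)")
# Made-up collection counts for illustration only.
conn.executemany(
    "INSERT INTO restaurants VALUES (?, ?)",
    [("Shop A", 1200), ("Shop B", 856), ("Shop C", 2500)],
)
top_shops = top_by_collections(conn, top_n=2)
conn.close()
```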

Finally, for the first time, I performed a complete data analysis and felt a strong sense of accomplishment and satisfaction when the data was successfully presented in the data frame. I hope my thoughts on handling data can be helpful to you all, and let’s struggle through our next assignment together!
