Come and find some Korean restaurants! (My struggle with scraping)
Hello, everyone! This is my journal for assignment 2. You can see my results here.
I will share my experience and reflections in the following three parts.
1. The time for scraping was much longer than I thought
I browsed the OpenRice website for a while and wrote down the properties I needed for my analysis. This step was quite easy: you simply list the numbers or text fields you want and check them off.
The next step was selecting the properties in ParseHub, which was not very difficult either. You just click on the information and ParseHub is smart enough to extract it for you. What I do need to mention is the use of buttons. I used two in my project: one for the next page and another for the detail page. Be careful when deciding which fields to scrape from the list page and which from the detail page. At first, I took the wish-list count from the list page, but it was displayed as "xxK", which is not an exact number. I then scraped it from the detail page instead and got the exact figures.
The most painful part came at the end. Waiting for the results took nearly two hours, not to mention the time wasted when ParseHub suddenly crashed twice. Also, the free version only allows 200 pages per run, so I had to repeat the whole process twice. Next time, I will start scraping earlier or try writing the scraper in Python myself.
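Just as a sketch of what a Python scraper could look like, here is a minimal example using requests and BeautifulSoup. The listing URL and the CSS selector are placeholders I made up for illustration; the real OpenRice page structure would need to be inspected first.

import time
import requests
from bs4 import BeautifulSoup

LIST_URL = "https://www.openrice.com/en/hongkong/restaurants"  # placeholder listing URL
HEADERS = {"User-Agent": "Mozilla/5.0"}  # some sites reject the default requests user agent

def scrape_list_page(page):
    """Fetch one listing page and return the detail-page links found on it."""
    resp = requests.get(LIST_URL, params={"page": page}, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # "a.poi-name" is a made-up selector; inspect the real page to find the right one
    return [a["href"] for a in soup.select("a.poi-name") if a.get("href")]

links = []
for page in range(1, 6):  # first five pages as a demo
    links.extend(scrape_list_page(page))
    time.sleep(2)  # pause between requests to be polite
print(len(links), "restaurant links collected")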
2. Cleaning the data in OpenRefine
I think OpenRefine is very user-friendly, and I did three things with it.
First, I renamed all the columns. The column names from ParseHub started with "name_" because I used the "Relative Select" function, so I removed that prefix to make the names shorter and clearer.
Next, I used the "facet" function to exclude restaurants with blanks in "good_comments" and "bad_comments". This removed the duplicated sponsored restaurants that appeared at the top of every single list page. I also checked whether there were other duplicates of the same name and location.
Third, I combined the three recommendation columns into one. Restaurants list different numbers of recommended dishes: some have five, others none, so I scraped only the first three (if any). Merging them into a single column made it easier to build the word cloud afterwards.
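For anyone who prefers doing the same cleaning in Python, here is a rough pandas equivalent of these three steps. The file name and column names (good_comments, bad_comments, name, location, rec_1 to rec_3) are assumptions based on my description above, not the exact names in my project.

import pandas as pd

df = pd.read_csv("openrice_run.csv")  # the ParseHub export (file name is an example)

# 1. Strip the "name_" prefix that Relative Select adds to column names
df.columns = [c.replace("name_", "", 1) for c in df.columns]

# 2. Drop the sponsored rows (blank comment counts), then any duplicate name + location
df = df.dropna(subset=["good_comments", "bad_comments"])
df = df.drop_duplicates(subset=["name", "location"])

# 3. Merge the three recommendation columns into one text field for the word cloud
rec_cols = ["rec_1", "rec_2", "rec_3"]
df["recommendations"] = df[rec_cols].fillna("").astype(str).agg(" ".join, axis=1).str.strip()

df.to_csv("openrice_clean.csv", index=False)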
3. Using Python and SQL for analysis
The commands I used most often were ORDER BY, GROUP BY and LIMIT. They are easy to master and extremely powerful: they keep the data clearly organised and even help shape ideas for the analysis. With these commands I noticed a possible concentration by location and category, so I counted the number of restaurants in each district and under each label.
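As an illustration, a query along these lines counts restaurants per district. The database file and the table/column names (restaurants, district) are just examples, and I run it here through sqlite3 in Python.

import sqlite3
import pandas as pd

conn = sqlite3.connect("openrice.db")  # example database file

# Count restaurants in each district, biggest first, top 10 only
per_district = pd.read_sql_query(
    """
    SELECT district, COUNT(*) AS n
    FROM restaurants
    GROUP BY district
    ORDER BY n DESC
    LIMIT 10;
    """,
    conn,
)
print(per_district)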
I also tried something fun with Python: a word cloud. I searched for tutorials over several days and tried a few sets of code before finding one that worked well for me. The code I used is below; feel free to try it (you may need to install "wordcloud" and "jieba" in your notebook first).
# Libraries needed to build the word cloud
import jieba
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import wordcloud

# Load the text file
with open("text.txt", encoding="utf-8") as f:
    s = f.read()

# Segment the Chinese text with jieba so WordCloud can count the words
ls = jieba.lcut(s)
text = " ".join(ls)

# Load the mask image that gives the word cloud its shape
img = Image.open("wordcloud_background.jpg")
mask = np.array(img)

# Build and save the word cloud
wc = wordcloud.WordCloud(
    font_path="/System/Library/Fonts/PingFang.ttc",  # a font that supports Chinese (this one ships with macOS)
    colormap="gist_heat",     # colour scheme (any matplotlib colormap name works)
    mask=mask,                # use the mask image as the shape
    width=1200,               # width/height only apply when no mask is given
    height=900,
    background_color="white",
    max_words=100)
wc.generate(text)
wc.to_file("wordcloud.png")
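If you run this in a notebook, you can also preview the result inline with matplotlib instead of opening the saved file:

plt.imshow(wc, interpolation="bilinear")  # show the generated word cloud
plt.axis("off")
plt.show()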