This week’s assignment is to use Wget to scrape a website.
Valentine’s Day is around the corner, and as usual, Valentine’s Day commercials have started to appear everywhere. I thought the best way to share an intimate moment with loved ones should be something more than an overpriced dinner at an overcrowded joint. Why not make something enjoyable yourself, something personal and sincere?
Presentation is key, even if it is just a cake. So I started looking for pictures of chocolate cakes, and the first website that came to mind was foodnetwork.com.
I am fairly new to HTML and had not used Wget before. As I browsed the website, I noticed how many pictures it holds, and I knew I had to narrow the scrape down to something specific so it would not burden my server too much. So I decided to scrape and save pictures under the chocolate cheesecake recipes and comfort food sections.
The command I used is:
wget -r -A jpeg,jpg http://www.foodnetwork.com/topics/chocolate-cheesecake-recipes.html
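In the command above, -r tells Wget to follow links recursively and -A jpeg,jpg keeps only files with those extensions. Since the goal was to keep the scrape narrow, here is a rough sketch of how the same command could be limited further. The function name, the depth, wait, and rate values, and the cake-images folder are my own illustrative assumptions, not settings from my actual run:

```shell
# Sketch only: the limits below are assumed values, not the ones I used.
fetch_cake_images() {
  # $1 = start URL, $2 = output folder (hypothetical default: cake-images)
  wget \
    --recursive --level=2 \
    --no-parent \
    --accept jpeg,jpg \
    --no-directories \
    --directory-prefix="${2:-cake-images}" \
    --wait=1 --random-wait \
    --limit-rate=200k \
    "$1"
}

# Example call (commented out so that loading this file does not start a crawl):
# fetch_cake_images http://www.foodnetwork.com/topics/chocolate-cheesecake-recipes.html
```

Here --level=2 stops the crawl two links away from the start page, --no-parent keeps it from climbing above the starting folder, and --wait with --limit-rate slows the requests down so neither the website nor my server gets hammered. If the pictures turn out to live on a different host than the pages, the crawl would probably also need host-spanning options, which I have left out here.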
A few things happened:
1) When I tried to document my progress by uploading some of the Wget log here, WordPress wouldn’t let me:
Even after I changed the file name, it still resisted:
2) It took a long time, I mean a really long time, for Wget to get the pictures: hours. It lost its connection to the website a few times in the process, so I had to start over again and again.
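For the dropped connections, Wget has options for retrying and for picking up partly downloaded files instead of starting from scratch. I did not use these in my run, so treat this as a sketch: the function name, the retry counts, the timeouts, and the wget.log file name are all assumed values.

```shell
# Sketch only: retry and timeout values are assumptions, not tested settings.
resume_cake_images() {
  # $1 = start URL; rerun the same call after a dropped connection
  wget \
    --recursive --accept jpeg,jpg \
    --continue \
    --tries=10 --waitretry=30 \
    --timeout=60 \
    --append-output=wget.log \
    "$1"
}

# Example call (commented out so that loading this file does not start a crawl):
# resume_cake_images http://www.foodnetwork.com/topics/chocolate-cheesecake-recipes.html
```

With --continue, a rerun asks the server only for the missing part of any picture that is already partly on disk, and --tries with --waitretry makes Wget retry a failed download a few times, pausing between attempts, before giving up. Writing the log with --append-output keeps one running log across reruns, though the pages themselves are still fetched again so Wget can find the links.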
Given the space and speed available, I was finally able to get some data indexed on my server and Google Drive.