class: center, middle, zubi # Common scraping challenges & my scraping library "Legs" ### .left[Roman Landenband] .left[slides [http://romansky.github.io/scraping-legs-talk](http://romansky.github.io/scraping-legs-talk)] .left[.legs[Legs] repo [https://github.com/uniformlyrandom/legs](https://github.com/uniformlyrandom/legs)] --- # Why people scrape stuff? * no public API - target has no motivation/interest in writing API service layer * API is partial - there's more then meets the API Scraping is mostly the process of turning "unstructured data" into "structured data" (HTML -> CSV) --- # Tool-belt `XPath` ```html
zTitle
item1
price1
item2
price2
//title/text() == "zTitle" (//div[@class="item"])[2]/span[@class="name"]/text() == "item2" ``` `regex` ```html
a link
replace(/^.+href="([^"]+)".+$/,"$1") = http://site.com/zubi?dubi=true ``` --- # Other tools * queues (jobs) * Selenium * PhantomJS --- # Why I wrote .legs[Legs] * had many small scraping projects, had "need" to over engineer * a very useful tool to master * has legitimate usages * its a recurring solution to recurring problem --- # .legs[Legs] trivia * refined APIs and idea over years of iterations with small projects * originally written in `JavaScript` on top of `Node.js` * had issues with formatting of non `UTF8` sites * re-written to `Scala` * open-sourced `Jul-2014` --- # Components Core- portable, declarative, `DSL` inside `Json` structures ```javascript [ { "action" : "FETCH/url", "values" : { "url" : "https://news.google.com/" }, "yields" : "s_google_news" },{ "action" : "ECHO/s_google_news" } ] ``` Rich library of helpers * `selenium` with `Phantom.js` * various `string` manipulation * can write custom add-hoc code using `Nashorn` (`JavaScript`) Jobs / orchestration * `FIFO` job queue * schedule a job * workers to poll queues for job * retry on fail, after 3 failures, defer --- # Scraping `Seret.co.il` * Let's scrape (www.seret.co.il)[www.seret.co.il] * Example (code)[https://github.com/uniformlyrandom/legs/tree/master/examples/seretcoil] * `Firebug`, `$x("..")` (works on Chrome too) --- # Closing * extending .legs[Legs] * future plans: UI- scraping, monitoring; repository of "receipts" --- class: center, middle, zubi # Questions?