Scraping Websites Using the Mechanize Gem

Web scraping (also known as web harvesting or web data extraction) is a software technique for extracting information from websites. The Mechanize library automates interaction with websites: the gem automatically stores and sends cookies, follows redirects, and can follow links and submit forms.

Form fields can be populated and submitted. Mechanize also keeps a history of the sites you have visited. It leverages Nokogiri to parse a page for the relevant forms and buttons and provides a simplified interface for manipulating a web form.

Dependencies

  • ruby 1.8.7, 1.9.2, or 1.9.3

  • Nokogiri

Getting Started with Mechanize

Let’s Fetch a Page!

First things first: make sure that you’ve required mechanize and that you’ve instantiated a new Mechanize object:

require 'rubygems'
require 'mechanize'

agent = Mechanize.new
Now we’ll use the agent we’ve created to fetch a page. Let’s fetch Google with our Mechanize agent:
page = agent.get('http://google.com/')
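
Before going further, it can help to confirm what we actually fetched. A quick sanity check, using standard Mechanize::Page accessors:

puts page.title  # contents of the page’s <title> tag
puts page.uri    # the final URI, after any redirects
puts page.code   # the HTTP response code, e.g. "200"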

Finding Links

Mechanize returns a page object whenever you get a page, post, or submit a form. When a page is fetched, the agent will parse the page and put a list of links on the page object.

Now that we’ve fetched Google’s homepage, let’s try listing all of the links:

page.links.each do |link|
  puts link.text
end
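
Each entry in that list is a Mechanize::Page::Link, so the target URL is available alongside the text. For example:

page.links.each do |link|
  # Print both the anchor text and the href it points to
  puts "#{link.text} -> #{link.href}"
end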

We can list the links, but Mechanize gives us a few shortcuts to help find a link to click on. Let’s say we want to click the link whose text is ‘News’. Normally, we would have to do this:

page = agent.page.links.find { |l| l.text == 'News' }.click

Mechanize’s link_with method does the same thing in one step:

page = agent.page.link_with(:text => 'News').click

If more than one link matches, links_with returns them all and we can pick one:

page = agent.page.links_with(:text => 'News')[1].click

We can also match on criteria other than the text, such as the href:

page = agent.page.link_with(:href => '/something').click
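
These criteria can also be combined, and values may be regular expressions as well as strings. A small sketch (the /news/ href pattern here is just an assumption for illustration):

# Match on both text and href; the /news/ pattern is a made-up example
news_link = page.link_with(:text => 'News', :href => /news/)
page = news_link.click if news_link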

Filling Out Forms

Let’s continue with our Google example. Here’s the code we have so far:

require 'rubygems'
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://google.com/')

If we pretty-print the page, we can see that there is one form named ‘f’ that has a couple of buttons and a few fields:

require 'pp'  # pp must be required on the Ruby versions listed above
pp page

Now that we know the name of the form, let’s fetch it off the page:

google_form = page.form('f')
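
If we’re not sure what the form contains, we can ask it. Mechanize::Form exposes its fields and buttons, so a quick listing looks like this:

# Print every field and button on the form
google_form.fields.each do |field|
  puts "field: #{field.name} = #{field.value.inspect}"
end
google_form.buttons.each do |button|
  puts "button: #{button.name}"
end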

Let’s take a look at the code all together:

require 'rubygems'
require 'mechanize'
require 'pp'

agent = Mechanize.new
page = agent.get('http://google.com/')
google_form = page.form('f')
google_form.q = 'ruby mechanize'
page = agent.submit(google_form)
pp page
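
Instead of pretty-printing the whole result, we usually want something specific off the results page. As a rough sketch — the 'h3 a' selector is an assumption about Google’s result markup, which changes often, so adjust it to whatever the page actually contains:

# Print the text of each result heading link; 'h3 a' is a guess at the markup
page.search('h3 a').each do |node|
  puts node.text
end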

Scraping Data

Mechanize uses Nokogiri to parse HTML. What does this mean for you? You can treat a Mechanize page like a Nokogiri object. After you have used Mechanize to navigate to the page you need to scrape, scrape it using Nokogiri methods:

agent.get('http://someurl.com/').search("p.posted")

The expression given to Mechanize::Page#search may be a CSS expression or an XPath expression:

agent.get('http://someurl.com/').search(".//p[@class='posted']")
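
Whichever form the expression takes, search returns Nokogiri node objects, so all the usual Nokogiri methods apply. For example, assuming the page really does contain p.posted elements:

agent.get('http://someurl.com/').search("p.posted").each do |node|
  puts node.text.strip                # the element’s text content
  puts node['class'] if node['class'] # attributes are available via []
end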


RailsCarma has been working with the Ruby on Rails framework since its nascent stage and has handled over 250 RoR projects. With a team of 100+ RoR developers well-versed in the latest techniques and tools, RailsCarma is well suited to help you with all your development needs. We are happy to help you with your queries; use our Contact Us page to connect with us.
