Weed Wide Web Scraping

[2020-05-18]

This is a simple demonstration of web scraping using Python 3, BeautifulSoup, and Selenium on Ubuntu 18.04, with the additional prerequisite of installing geckodriver so that Selenium can drive Firefox. You will also want your favorite web browser’s “Inspect Element” and “View Source” functionalities at hand to pinpoint exactly the HTML elements you want to extract and manipulate.
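
If you want to confirm that geckodriver is wired up correctly before going further, a minimal smoke test (assuming geckodriver is on your PATH) looks like this:

from selenium import webdriver

driver = webdriver.Firefox()  # raises if geckodriver or Firefox cannot be found
driver.get("https://example.org")
print(driver.title)
driver.quit()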

Import the functions for parsing the web pages:

from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq

Import the web automation module, along with the common keyboard keys (in our case, for pressing “Enter”):

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

Import the time module for delaying execution within the for-loop below, to give the page time to load:

import time

Import Python 3’s standard module for reading and writing CSV files:

import csv

Define the website we want to scrape data from:

url = "http://www.weedscience.org/Summary/Herbicide.aspx"

Download the webpage and parse it:

aspx = uReq(url)
soup = BeautifulSoup(aspx.read(), "html.parser")
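
As a quick sanity check that the download and parse succeeded, you can print the page title and count the dropdown containers we are about to index into (the class name comes straight from “Inspect Element”):

print(soup.title.text)
print(len(soup.findAll('div', {'class': 'rcbSlide'})))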

Extract the list of keys we want to enter into one of the webpage’s inputs:

slides = soup.findAll('div', {'class': 'rcbSlide'})
# the second rcbSlide div holds the herbicide-group list; skip its first item
items = slides[1].findAll('li', {'class': 'rcbItem'})
herbi_list = [item.text for item in items[1:]]

Instantiate our webdriver and open the page:

driver = webdriver.Firefox()
driver.get(url)
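
As an aside, if you would rather not have a Firefox window pop up, the driver can also be started headlessly. A sketch assuming a reasonably recent Selenium 3, as a drop-in for the two lines above:

from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument("--headless")  # no visible browser window
driver = webdriver.Firefox(options=options)
driver.get(url)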

Parse each results page, extract the data we need, and save it as CSV files:

for herbi in herbi_list:
    print(herbi)
    input_element = driver.find_element_by_name('ctl00$AboveSideMenu$cmbxHerbicideGroup')
    input_element.clear()
    input_element.send_keys(herbi)
    input_element.send_keys(Keys.ENTER)
    ### wait for the page to load
    time.sleep(20)  # crude; see the explicit-wait sketch below for a more robust approach
    soup = BeautifulSoup(driver.page_source, "html.parser")
    table = soup.find('table', {'id': 'ctl00_Main_RadGrid1_ctl00'}).findAll('tbody')[1]
    OUT = []
    for row in table.findAll('tr'):
        out = []
        for cell in row.select('td'):
            out.append(cell.text.replace("\n", ""))
        OUT.append(out)
    # newline="" keeps csv.writer from inserting blank rows between records
    with open("Scraping_output_" + herbi.replace(" ", "_") + ".csv", "w", newline="") as f:
        w = csv.writer(f)
        w.writerows(OUT)
driver.quit()  # quit() ends the geckodriver session; close() would only close the window
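
About that time.sleep(20): a more robust alternative is to grab the current results table before pressing Enter, then wait for it to go stale once the ASP.NET postback replaces it. A sketch using Selenium’s explicit waits, as a drop-in for the send_keys(Keys.ENTER) and sleep lines above:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

old_table = driver.find_element_by_id('ctl00_Main_RadGrid1_ctl00')
input_element.send_keys(Keys.ENTER)
# block until the old table is detached from the DOM, up to 30 seconds
WebDriverWait(driver, 30).until(EC.staleness_of(old_table))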

Output (the group names printed by the loop; each one also gets its own CSV file):

  • A (ACCase inhibitors)
  • B (ALS inhibitors)
  • C1 (Photosystem II inhibitors)
  • C2 (PSII inhibitor (Ureas and amides))
  • C3 (PSII inhibitors (Nitriles))
  • D (PSI Electron Diverter)
  • E (PPO inhibitors)
  • F1 (Carotenoid biosynthesis inhibitors)
  • F2 (HPPD inhibitors)
  • F3 (Carotenoid biosynthesis (unknown target))
  • F4 (DOXP inhibitors)
  • G (EPSP synthase inhibitors)
  • H (Glutamine synthase inhibitors)
  • I (DHP synthase inhibitors)
  • K1 (Microtubule inhibitors)
  • K2 (Mitosis inhibitors)
  • K3 (Long chain fatty acid inhibitors)
  • L (Cellulose inhibitors)
  • M (Uncouplers)
  • N (Lipid Inhibitors)
  • O (Synthetic Auxins)
  • P (Auxin transport inhibitors)
  • Z (Antimicrotubule mitotic disrupter)
  • Z (Cell elongation inhibitors)
  • Z (Nucleic acid inhibitors)
  • Z (Unknown)