g1879 / DrissionPage

Introduction


DrissionPage (a combination of "driver" and "session") is a Python-based web automation tool.
It switches seamlessly between selenium and requests,
balancing the convenience of selenium with the efficiency of requests.
It integrates common page functions behind an API that is consistent across both modes and easy to use.
It follows the POM (Page Object Model) pattern and wraps the commonly used methods of page elements, which makes it well suited to extending automation functionality.
Its usage is also concise and user-friendly: it requires little code and is approachable for novices.

project address:

Sample address: Use DrissionPage to crawl common websites and automation

**Contact Email:** g1879@qq.com

Concept and background


Idea

Concise, easy to use, extensible

Background

When a requests-based crawler faces a website that requires login, you have to analyze data packets and JS source code, construct complex requests, and often deal with anti-crawling measures such as verification codes, JS obfuscation, and signature parameters, so the barrier to entry is high. If the data is computed by JS, the computation must be reproduced. The experience is poor and development is slow.
With selenium, most of these pitfalls can be bypassed, but selenium is not efficient. This library therefore combines selenium and requests into one, switches to the appropriate mode as needs change, and provides user-friendly methods to improve development and runtime efficiency.
In addition to merging the two, the library wraps common web page functions and simplifies selenium's operations and statements. For web page automation, this reduces the attention spent on details so you can focus on implementing functionality.
The guiding principle is to keep everything simple: provide simple, direct usage and be friendly to novices.

Features


  • Concise code is the first priority.
  • Seamless switching between selenium and requests, with a shared session.
  • The two modes provide consistent APIs and a consistent user experience.
  • User-friendly page element operations that reduce the workload of page analysis and coding.
  • Common functions are integrated and optimized to better match practical needs.
  • Compatible with selenium code to facilitate project migration.
  • Encapsulated in the POM pattern for easy extension.
  • A unified file download method makes up for the shortcomings of browser downloads.
  • Simple configuration, free from tedious browser setup.

Project structure


Structure diagram

Drission Class

Manages the WebDriver object and Session object responsible for communicating with web pages; it plays the role of the driver.

MixPage Class

MixPage encapsulates the common functions of page operation. It calls the driver managed by the Drission class to access and operate the page, and it can switch between driver and session modes. The login status is synchronized automatically when switching.

DriverElement Class

The page element class in driver mode can perform operations such as clicking on the element, inputting text, modifying attributes, running js, etc., and can also search for descendant elements at its lower level.

SessionElement Class

The page element class in session mode can obtain element attribute values and search for descendant elements at its lower levels.

Simple demo


Comparison with selenium code

The following snippets implement exactly the same functionality; compare the amount of code in each:

  • Find the first element whose text contains some text with explicit wait
# Use selenium:
element = WebDriverWait(driver, 10).until(ec.presence_of_element_located((By.XPATH,'//*[contains(text(), "some text")]')))

# Use DrissionPage:
element = page('some text')
  • Jump to the first tab
# Use selenium:
driver.switch_to.window(driver.window_handles[0])

# Use DrissionPage:
page.to_tab(0)
  • Select drop-down list by text
# Use selenium:
from selenium.webdriver.support.select import Select
select_element = Select(element)
select_element.select_by_visible_text('text')

# Use DrissionPage:
element.select('text')
  • Drag and drop an element
# Use selenium:
ActionChains(driver).drag_and_drop(ele1, ele2).perform()

# Use DrissionPage:
ele1.drag_to(ele2)
  • Scroll the window to the bottom (keep the horizontal scroll bar unchanged)
# Use selenium:
driver.execute_script("window.scrollTo(document.documentElement.scrollLeft, document.body.scrollHeight);")

# Use DrissionPage:
page.scroll_to('bottom')
  • Set headless mode
# Use selenium:
options = webdriver.ChromeOptions()
options.add_argument("--headless")

# Use DrissionPage:
set_headless()
  • Get pseudo element content
# Use selenium:
text = driver.execute_script('return window.getComputedStyle(arguments[0], "::after").getPropertyValue("content");', element)

# Use DrissionPage:
text = element.after
  • Get shadow-root
# Use selenium:
shadow_element = driver.execute_script('return arguments[0].shadowRoot', element)

# Use DrissionPage:
shadow_element = element.shadow_root
  • Use xpath to get attributes or text nodes directly
# Use selenium:
Quite complicated

# Use DrissionPage:
class_name = element('xpath://div[@id="div_id"]/@class')
text = element('xpath://div[@id="div_id"]/text()[2]')

Compare with requests code

The following snippets implement exactly the same functionality; compare the amount of code in each:

  • Get element content
url ='https://baike.baidu.com/item/python'

# Use requests:
import requests
from lxml import etree
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'}
response = requests.get(url, headers = headers)
html = etree.HTML(response.text)
element = html.xpath('//h1')[0]
title = element.text

# Use DrissionPage:
page = MixPage('s')
page.get(url)
title = page('tag:h1').text

Tips: DrissionPage comes with default headers

  • Download a file
url ='https://www.baidu.com/img/flexible/logo/pc/result.png'
save_path = r'C:\download'

# Use requests:
r = requests.get(url)
with open(f'{save_path}\\img.png','wb') as fd:
   for chunk in r.iter_content():
       fd.write(chunk)
        
# Use DrissionPage:
page.download(url, save_path,'img')  # Support renaming and handle file name conflicts
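
The comment above mentions handling file name conflicts. How the library resolves them is not shown in this README; the following is only a minimal sketch of one common approach (appending a counter to the file name), using a hypothetical helper named unique_path that is not part of DrissionPage.

```python
import tempfile
from pathlib import Path


def unique_path(folder: str, name: str) -> Path:
    """Return a path inside `folder` that does not collide with an existing file.

    Hypothetical helper for illustration only: if `name` already exists,
    a numeric suffix (_1, _2, ...) is inserted before the extension.
    """
    target = Path(folder) / name
    stem, suffix = Path(name).stem, Path(name).suffix
    counter = 1
    while target.exists():
        target = Path(folder) / f"{stem}_{counter}{suffix}"
        counter += 1
    return target


# Demonstration in a temporary folder so nothing on disk is touched permanently
with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "img.png").touch()
    print(unique_path(folder, "img.png").name)  # img_1.png
```

Other policies (overwrite, skip) are equally plausible; the file_exists parameter listed later for page.download() suggests the behavior is configurable.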

Mode switch

Log in to the website with selenium, and then switch to requests to read the web page. Both will share login information.

page = MixPage()  # Create page object, default driver mode
page.get('https://gitee.com/profile')  # Visit the personal center page (not logged in, redirect to the login page)

page.ele('@id:user_login').input('your_user_name')  # Use selenium to enter the account password to log in
page.ele('@id:user_password').input('your_password\n')

page.change_mode()  # Switch to session mode
print('Title after login:', page.title,'\n')  # session mode output after login

Output:

Title after login: Personal Information - Code Cloud Gitee.com

Get and print element attributes

# Continuing from the previous code
foot = page.ele('@id:footer-left')  # find element by id
first_col = foot.ele('css:>div')  # Use the css selector to find the element in the lower level of the element (the first one)
lnk = first_col.ele('text:command learning')  # Use text content to find elements
text = lnk.text  # Get element text
href = lnk.attr('href')  # Get element attribute value

print(text, href,'\n')

# Concise chained lookup
text = page('@id:footer-left')('css:>div')('text:command learning').text
print(text)

Output:

Git command learning https://oschina.gitee.io/learn-git-branching/

Git command learning

Download a file

url ='https://www.baidu.com/img/flexible/logo/pc/result.png'
save_path = r'C:\download'
page.download(url, save_path)

Installation


pip install DrissionPage

Only Python 3.6 and above is supported, and driver mode currently supports only Chrome. It has only been tested on Windows. To use driver mode, you must download Chrome and the matching version of chromedriver. [chromedriver download]
The get_match_driver() method in the easy_set tool can automatically identify the Chrome version and download the matching driver.

Instructions


Import module

from DrissionPage import MixPage

Initialization

If you only use session mode, you can skip this section.

Before using selenium, you must configure the paths of chrome.exe and chromedriver.exe and ensure that their versions match.
In the new version, if the program finds at run time that the versions do not match, it automatically downloads the matching version and sets the path. Unless you have special needs, no manual intervention is required.

There are five ways to configure the paths:

  • Run directly, let the program automatically complete the settings (recommended)

  • Use the get_match_driver() method of the easy_set tool

  • Write the path to the ini file of this library

  • Write two paths to system variables

  • Fill in the path in the code

Automatic configuration

In the new version, no configuration is needed: just run the program. It locates chrome.exe on the system and automatically downloads the chromedriver.exe that matches its version, without any action on your part. If you need to specify which chrome.exe to use, use one of the following methods.

Use the get_match_driver() method

If you choose this method, please run the following code before using it for the first time. The program will automatically detect the chrome version installed on your computer, download the corresponding driver, and record it in the ini file.

from DrissionPage.easy_set import get_match_driver

get_match_driver()

Output:

chrome.exe path in the ini file: D:\Google Chrome\Chrome\chrome.exe 

version 75.0.3770.100 

chromedriver_win32.zip
Downloading to: D:\python\projects\DrissionPage\DrissionPage
 100% Success.

Unzip path: D:\python\projects\chromedriver.exe 

Checking availability...
Versions match; ready to use.

Then you can start using it.

If you want to use a specific chrome.exe (for example, a portable version), and specify the ini file and the save path of chromedriver.exe, you can write:

get_match_driver(ini_path ='ini file path', save_path ='save path', chrome_path='chrome path')

Tips: When you specify chrome_path, the program writes this path to the INI file after successful detection.

Use set_paths() method

If the previous method fails, you can download chromedriver.exe yourself, and then run the following code to record the path to the ini file.

from DrissionPage.easy_set import set_paths
driver_path ='D:\\chrome\\chromedriver.exe' # Your chromedriver.exe path, if not filled in, it will be searched in system variables
chrome_path ='D:\\chrome\\chrome.exe' # Your chrome.exe path, if not filled in, it will be searched in system variables
set_paths(driver_path, chrome_path)

This method also checks whether the chrome and chromedriver versions match, and displays:

Checking availability...
Versions match; ready to use.

or

An exception occurred:
Message: session not created: Chrome version must be between 70 and 73
  (Driver info: chromedriver=73.0.3683.68 (47787ec04b6e38e22703e856e101e840b65afe72),platform=Windows NT 10.0.19631 x86_64)
You can run easy_set.get_match_driver() to download the matching version automatically.
Or download it yourself from: https://chromedriver.chromium.org/downloads

After passing the check, you can use the driver mode normally.
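
To make the idea of "matching versions" concrete, here is a rough sketch that compares only the major version numbers of two version strings. This is an assumption made for illustration: the README does not describe how the library performs its check (the error message above comes from actually starting a session), and the helper names major_version and versions_match are hypothetical.

```python
def major_version(version: str) -> int:
    """Extract the major version, e.g. '75.0.3770.100' -> 75."""
    return int(version.split(".")[0])


def versions_match(chrome_version: str, driver_version: str) -> bool:
    """Assumed rule for illustration: the major versions must be equal."""
    return major_version(chrome_version) == major_version(driver_version)


# Version strings taken from the sample outputs above
print(versions_match("75.0.3770.100", "75.0.3770.90"))  # True
print(versions_match("75.0.3770.100", "73.0.3683.68"))  # False
```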

In addition to the above two paths, this method can also set the following paths:

debugger_address  # Debug browser address, such as: 127.0.0.1:9222
download_path  # Download file path
tmp_path  # Temporary folder path
user_data_path  # User data path
cache_path  # cache path

Tips:

  • Different projects may require different versions of chrome and chromedriver. You can also save multiple ini files and use them as needed.
  • It is recommended to use a portable version of chrome and set the path manually, so that browser upgrades do not cause a mismatch with the chromedriver version.
  • It is recommended to set the debugger_address when debugging the project and use the manually opened browser to debug, saving time and effort.

Other methods

If you don't want to use the ini file (for example, when you want to package the project), you can add the two paths above to the system environment variables, or set them in code. The next section shows how to do the latter.

Create the driver object Drission

This step is optional. If you want to get started quickly, you can skip this section; the MixPage object will create a Drission object automatically.

Drission objects manage the driver and session objects. When multiple pages work together, the Drission object is passed around so that multiple page classes can control the same browser or Session object. It can be created directly from the configuration in an ini file, or the configuration can be passed in during initialization.

# Create from the default ini file
drission = Drission()

# Create by other ini files
drission = Drission(ini_path ='D:\\settings.ini')

# Create without ini files
drission = Drission(read_file = False)

To manually pass in the configuration (ignore the ini file):

from DrissionPage.config import DriverOptions

# Create a driver configuration object, read_file = False means not to read the ini file
do = DriverOptions(read_file = False)

# Set the path, if it has been set in the system variable, it can be ignored
do.set_paths(chrome_path ='D:\\chrome\\chrome.exe',
              driver_path ='D:\\chrome\\chromedriver.exe')

# Settings for s mode
session_options = {'headers': {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6)'}}

# Proxy settings, optional
proxy = {'http': '127.0.0.1:1080','https': '127.0.0.1:1080'}

# Pass in the configuration. driver_or_options and session_or_options are optional; pass the one for the mode you intend to use
drission = Drission(driver_or_options=do, session_or_options=session_options, proxy=proxy)

The usage of DriverOptions and SessionOptions is detailed below.

Use page object MixPage

The MixPage page object encapsulates common web page operations and realizes the switch between driver and session modes. MixPage must control a Drission object and use its driver or session. If it is not passed in, MixPage will create one by itself (using the incoming configuration information or reading from the default ini file).

Tips: When multiple objects work together, you can pass the Drission object in one MixPage to another, so that multiple objects can share login information or operate the same page.

Create Object

There are three ways to create the object: the simple way, passing in a Drission object, and passing in configuration. Choose according to your needs.

# Simple creation method, automatically create Drission objects with ini file default configuration
page = MixPage()
page = MixPage('s')

# Create by passing in the Drission object
page = MixPage(drission)
page = MixPage(drission, mode='s', timeout=5)  # session mode, waiting time is 5 seconds (default 10 seconds)

# Incoming configuration information, MixPage internally creates Drission according to the configuration
page = MixPage(driver_options=DriverOption, session_options=SessionOption)  # default d mode

Visit a website

# Default mode
page.get(url)
page.post(url, data, **kwargs)  # Only session mode has post method

# Specify the number of retries and interval
page.get(url, retry=5, interval=0.5)

Tips: If there is an error in the connection, the program will automatically retry twice. The number of retries and the waiting interval can be specified.
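
The retry behavior described above can be pictured with a small generic helper. This is only a sketch of the general retry-with-interval pattern, not the library's code: the name call_with_retry and the default interval of 1 second are assumptions, while the default of 2 retries follows the tip above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(func: Callable[[], T], retry: int = 2, interval: float = 1.0) -> T:
    """Call `func`; if it raises, retry up to `retry` more times.

    `interval` is the pause in seconds between attempts. The last
    exception is re-raised when all attempts fail.
    """
    attempts = retry + 1
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(interval)
    # Only reachable when retry is negative, so no attempt was made
    raise ValueError("retry must be >= 0")


# A stand-in for a flaky connection: fails twice, then succeeds
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(call_with_retry(flaky, retry=5, interval=0.01))  # ok
```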

Switch mode

Switch between s and d modes. Cookies and the URL being visited are synchronized automatically when switching.

page.change_mode(go=False)  # go=False means the new mode does not navigate to the current URL

Tips: When using a method unique to a certain mode, it will automatically jump to that mode.
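
This README does not show how cookies are copied between the two modes. As an illustration of the idea only, the sketch below converts cookies in the list-of-dicts shape that selenium's get_cookies() returns into a standard-library CookieJar (the base class of the jar used by requests). The function name driver_cookies_to_jar is hypothetical and this is not DrissionPage's implementation.

```python
from http.cookiejar import Cookie, CookieJar


def driver_cookies_to_jar(driver_cookies):
    """Build a CookieJar from selenium-style cookie dicts.

    Each dict is expected to have 'name' and 'value', and may have
    'domain', 'path', 'secure' and 'expiry'.
    """
    jar = CookieJar()
    for item in driver_cookies:
        domain = item.get("domain", "")
        expiry = item.get("expiry")
        jar.set_cookie(Cookie(
            version=0,
            name=item["name"],
            value=item["value"],
            port=None,
            port_specified=False,
            domain=domain,
            domain_specified=bool(domain),
            domain_initial_dot=domain.startswith("."),
            path=item.get("path", "/"),
            path_specified=True,
            secure=item.get("secure", False),
            expires=expiry,
            discard=expiry is None,  # session cookie when no expiry is given
            comment=None,
            comment_url=None,
            rest={},
        ))
    return jar


# Made-up cookie for demonstration
selenium_style = [
    {"name": "sessionid", "value": "abc123", "domain": ".example.com", "path": "/", "secure": True},
]
jar = driver_cookies_to_jar(selenium_style)
print([(c.name, c.value, c.domain) for c in jar])  # [('sessionid', 'abc123', '.example.com')]
```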

Page properties

page.url  # currently visited url
page.mode  # current mode
page.drission  # Drission object currently in use
page.driver  # WebDriver object currently in use
page.session  # Session object currently in use
page.cookies  # Get cookies information
page.html  # Page source code
page.title  # Current page title

# d mode unique:
page.tabs_count  # Return the number of tab pages
page.tab_handles  # Return the list of handles of all tabs
page.current_tab_num  # Return the serial number of the current tab page
page.current_tab_handle  # Return the handle of the current tab page

Page operation

When calling a method that only belongs to d mode, it will automatically switch to d mode. See APIs for detailed usage.

page.set_cookies()  # set cookies
page.get_cookies()  # Get cookies, which can be returned by list or dict
page.change_mode()  # Switch mode, it will automatically copy cookies
page.cookies_to_session()  # Copy cookies from WebDriver object to Session object
page.cookies_to_driver()  # Copy cookies from Session object to WebDriver object
page.get(url, retry, interval,
         **kwargs)  # Use get to access the web page, you can specify the number of retries and the interval
page.ele(loc_or_ele, timeout)  # Get the first element, node or attribute that meets the conditions
page.eles(loc_or_ele, timeout)  # Get all eligible elements, nodes or attributes
page.download(url, save_path, rename, file_exists, **kwargs)  # download file
page.close_driver()  # Close the WebDriver object
page.close_session()  # Close the Session object

# s mode unique:
page.post(url, data, retry, interval,
          **kwargs)  # To access the webpage in post mode, you can specify the number of retries and the interval

# d mode unique:
page.wait_ele(loc_or_ele, mode, timeout)  # Wait for the element to be deleted from the DOM, displayed, or hidden
page.run_script(js, *args)  # Run js statement
page.create_tab(url)  # Create and locate a tab page, which is at the end
page.to_tab(num_or_handle)  # Jump to tab page
page.close_current_tab()  # Close the current tab page
page.close_other_tabs(num_or_handles)  # Close other tabs
page.to_frame(iframe)  # Switch into an iframe
page.screenshot(path)  # Page screenshot
page.scroll_to_see(element)  # Scroll until an element is visible
page.scroll_to(mode,
               pixel)  # Scroll the page as indicated by the parameters; the scroll mode can be: 'top', 'bottom', 'rightmost', 'leftmost', 'up', 'down', 'left', 'right', 'half'
page.refresh()  # refresh the current page
page.back()  # Browser back
page.set_window_size(x, y)  # Set the browser window size, maximize by default
page.check_page()  # Check whether the page meets expectations
page.chrome_downloading()  # Get the list of files that chrome is downloading
page.process_alert(mode, text)  # Process the prompt box
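
As a rough picture of what a mode-based scroll method might do internally, the sketch below maps some of the scroll_to() modes listed above to JavaScript statements. The 'bottom' statement is the one from the selenium comparison earlier; the other statements, the function name scroll_script, and the default of 300 pixels are assumptions, and the 'half' mode is left out because its meaning is not described here.

```python
def scroll_script(mode: str = "bottom", pixel: int = 300) -> str:
    """Return a JS statement for a scroll mode (illustrative mapping only)."""
    keep_x = "document.documentElement.scrollLeft"  # keep horizontal position
    keep_y = "document.documentElement.scrollTop"   # keep vertical position
    scripts = {
        "top": f"window.scrollTo({keep_x}, 0);",
        "bottom": f"window.scrollTo({keep_x}, document.body.scrollHeight);",
        "leftmost": f"window.scrollTo(0, {keep_y});",
        "rightmost": f"window.scrollTo(document.body.scrollWidth, {keep_y});",
        "up": f"window.scrollBy(0, -{pixel});",
        "down": f"window.scrollBy(0, {pixel});",
        "left": f"window.scrollBy(-{pixel}, 0);",
        "right": f"window.scrollBy({pixel}, 0);",
    }
    if mode not in scripts:
        raise ValueError(f"unsupported scroll mode: {mode!r}")
    return scripts[mode]


print(scroll_script("bottom"))
print(scroll_script("up", 100))
```

With selenium, a string like this would be passed to driver.execute_script().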

Use of cookies

MixPage supports obtaining and setting cookies. The specific usage methods are as follows:

page.cookies # Return cookies in dictionary form, only cookies available for the current domain name will be returned
page.get_cookies(as_dict=False) # Return the cookies available for the current domain name in the form of a list, each cookie contains all the detailed information
page.get_cookies(all_domains=True) # Return all cookies in list form, only s mode is valid
page.set_cookies(cookies) # Set cookies, you can pass in RequestsCookieJar, list, tuple, str, dict

Tips:

  • After setting cookies in d mode, you must refresh the page to see the effect.

  • The s mode can set cookies in the ini file, SessionOptions, and configuration dictionary, which can be passed in when MixPage is initialized. The d mode can only be set with the set_cookies() function.
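
Since set_cookies() is described as accepting several input types, a normalization step of some kind is implied. The sketch below shows one way str, dict, list, and tuple inputs could be reduced to a single list-of-dicts shape using only the standard library. The function name normalize_cookies is hypothetical, RequestsCookieJar is not handled, and the library's real handling may differ.

```python
from http.cookies import SimpleCookie


def normalize_cookies(cookies):
    """Convert str, dict, list or tuple input into a list of cookie dicts."""
    if isinstance(cookies, str):
        # Parse a header-style string such as "a=1; b=2"
        parsed = SimpleCookie()
        parsed.load(cookies)
        return [{"name": key, "value": morsel.value} for key, morsel in parsed.items()]
    if isinstance(cookies, dict):
        # Treat a dict as a plain name -> value mapping
        return [{"name": key, "value": value} for key, value in cookies.items()]
    if isinstance(cookies, (list, tuple)):
        # Assume each item is already a cookie dict; copy to avoid side effects
        return [dict(item) for item in cookies]
    raise TypeError(f"unsupported cookies type: {type(cookies).__name__}")


print(normalize_cookies("a=1; b=2"))
# [{'name': 'a', 'value': '1'}, {'name': 'b', 'value': '2'}]
print(normalize_cookies({"token": "xyz"}))
# [{'name': 'token', 'value': 'xyz'}]
```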

Find element

ele() returns the first eligible element, and eles() returns a list of all eligible elements. You can use these two functions under the page object or element object to find subordinate elements.

page.eles() and element.eles() search and return a list of all elements that meet the conditions.
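
To give a feel for how query strings such as 'tag:h1' or '@id:user_login' relate to selenium-style locators, here is a simplified sketch that translates the prefixes seen in this README's examples into (by, value) tuples. It is not the library's parser: the function name str_to_loc is hypothetical, the ':' after an attribute name is assumed to mean a substring match, relative css selectors like '>div' are passed through unchanged, and quotes inside the text are not escaped.

```python
def str_to_loc(query: str):
    """Translate a query string into a selenium-style (by, value) tuple."""
    if query.startswith("xpath:"):
        return "xpath", query[len("xpath:"):]
    if query.startswith("css:"):
        return "css selector", query[len("css:"):]
    if query.startswith("tag:"):
        return "xpath", f"//{query[len('tag:'):]}"
    if query.startswith("text:"):
        text = query[len("text:"):]
        return "xpath", f'//*[contains(text(), "{text}")]'
    if query.startswith("@") and ":" in query:
        # '@id:user_login' -> attribute name 'id', value 'user_login'
        name, value = query[1:].split(":", 1)
        return "xpath", f'//*[contains(@{name}, "{value}")]'
    # No prefix: search by text, as in page('some text')
    return "xpath", f'//*[contains(text(), "{query}")]'


print(str_to_loc("some text"))       # ('xpath', '//*[contains(text(), "some text")]')
print(str_to_loc("tag:h1"))          # ('xpath', '//h1')
print(str_to_loc("@id:user_login"))  # ('xpath', '//*[contains(@id, "user_login")]')
```

The strings "xpath" and "css selector" are the values of selenium's By.XPATH and By.CSS_SELECTOR constants.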

Description:

  • The element search timeout is 10 seconds by default, and it stops waiting when it times out or finds an element. You can also set it as needed.
  • You can find elements with a query string or a selenium native loc tuple (these also work in s mode).
  • The query string has 7 methods such as @