Packages

  • package root

    This is the documentation for the splish library.

    Package Information

    The splish package contains a single class splish.Splash with methods for interacting with a ScrapingHub Splash instance.

    If you haven’t hit the above link yet and are unfamiliar with Splash, the TL;DR is that it’s an alternative to Selenium: a full browser that executes JavaScript. The rendering engine is based on Qt WebKit, and Splash instances have a REST API that provides a ton of flexibility when needed and ease of use for more casual scraping tasks.

    You can get it up and running locally with Docker via:

    sudo docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash

    If you've built the source and run sbt packInstall, you can start playing with splash on the command line via ~/local/bin/splash-main. Here's the help:

    splash 1.0
    Usage: splash [options] url
    
      url                    the URL to scrape
      -r, --render html      request action; one of 'html', 'json' or 'har'
      --help                 prints this usage text
      -w, --wait <value>     How long to wait (in seconds) after loading the page (to allow js onX scripts to run). Default is 2 seconds
      -t, --timeout <value>  Overall page/connection timeout. Defaults to 30 seconds
      -h, --host <value>     Splash instance host name or IP address (defaults to localhost)
      -p, --port <value>     Splash instance port if not the default (8050)
      -u, --user <value>     Splash username (if authentication is required). Default is no authentcation.
      -p, --pass <value>     Splash password (if authentication is required). Default is no authentication.
      -s, --ssl              Use an SSL connection to the Splash instance? (defaults to false)

    The first thing we need to do is make a connection to the server:

    import splish.Splash
    
    val s = Splash()
    
    println(s)
    
    ## Splash(localhost,8050,null,null,false)
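
    The printed form above is the usual case-class rendering: the fields in declaration order, with the defaults filled in. As a rough sketch of how named arguments override individual defaults, here is a stand-in case class. `SplashSettings` and the host name `splash.example.com` are hypothetical; the class only mirrors the constructor signature documented for `splish.Splash` and makes no network calls.

    ```scala
    // Hypothetical stand-in that mirrors the documented Splash constructor signature.
    // It only holds settings; it does not talk to a Splash server.
    final case class SplashSettings(
      splashHost: String = "localhost",
      splashPort: Integer = 8050,
      splashUser: String = null,
      splashPassword: String = null,
      useSSL: Boolean = false
    )

    object SplashSettingsDemo {
      // All defaults, analogous to `Splash()`.
      val local: SplashSettings = SplashSettings()

      // Override only what differs; the remaining defaults are kept.
      val remote: SplashSettings =
        SplashSettings(splashHost = "splash.example.com", useSSL = true)

      def main(args: Array[String]): Unit = {
        println(local)  // SplashSettings(localhost,8050,null,null,false)
        println(remote) // SplashSettings(splash.example.com,8050,null,null,true)
      }
    }
    ```

    The same named-argument style should apply to the real constructor, for example when only the host and `useSSL` differ from the defaults.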

    We can test that connection and get some other information as well:

    println(
      "Server is up? " + s.isActive() + "\n" +
      "What's the server version? " + s.version()("splash") + "\n" +
      "How long has the server been up? " + s.performanceStatistics()("cputime").num
    )
    
    ## Server is up? true
    ## What's the server version? "3.2"
    ## How long has the server been up? 68.84

    The library makes use of [uJson](http://www.lihaoyi.com/upickle/#uJson) for more complex return types. A few methods return a requests [Response](https://github.com/lihaoyi/requests-scala/blob/master/requests/src/requests/Model.scala#L235-L276) object, because the result of a call to the more dynamic endpoints is unknowable at call time: Splash lets you use Lua to perform complex page interaction, and the Lua interface can return images, plaintext, HTML or JSON.

    The classic use case for Splash is to feed it a URL and get HTML back after it’s had time to process any javascript. The URL in the following example relies on javascript to add content to the page:

    val html = s.renderHTML("https://rud.is/splash-js-test.html")
    
    println(html)
    
    ## <html><head>
    ##     <title>Test</title>
    ##   </head>
    ##   <body onload="addElements()">
    ##     <p>This is a Splash test page.
    ##     <p><span id="target">This won't be here if javascript is disabled</span>
    ##     <script>
    ##       function addElements() {
    ##         document.getElementById("target").innerHTML = "This won't be here if javascript is disabled" ;
    ##       }
    ##     </script>
    ##
    ##
    ## </body></html>

    Here’s what that looks like just using the requests library:

    import requests._
    
    val res = requests.get("https://rud.is/splash-js-test.html")
    
    println(res.text)
    
    ## <html>
    ##   <head>
    ##     <title>Test</title>
    ##   </head>
    ##   <body onload="addElements()">
    ##     <p>This is a Splash test page.
    ##     <p><span id="target"></span>
    ##     <script>
    ##       function addElements() {
    ##         document.getElementById("target").innerHTML = "This won't be here if javascript is disabled" ;
    ##       }
    ##     </script>
    ##   </body>
    ## </html>
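
    To make the difference between the two outputs concrete, here is a small sketch that pulls the text out of the `target` span. The two input strings are copied from the outputs above; the `SpanTextSketch` object and its regex are illustrative only (a regex is enough for this fixed snippet, but it is not a general HTML parser).

    ```scala
    object SpanTextSketch {
      // Matches the text inside <span id="target">...</span>.
      private val TargetSpan = """<span id="target">(.*?)</span>""".r

      // Returns the span text if the span is present, None otherwise.
      def targetText(html: String): Option[String] =
        TargetSpan.findFirstMatchIn(html).map(_.group(1))

      // Lines copied from the Splash-rendered output and the plain requests output.
      val rendered: String =
        """<span id="target">This won't be here if javascript is disabled</span>"""
      val raw: String = """<span id="target"></span>"""

      def main(args: Array[String]): Unit = {
        println(targetText(rendered)) // Some(This won't be here if javascript is disabled)
        println(targetText(raw))      // Some()
      }
    }
    ```

    The span exists in both documents; only the Splash-rendered one has had its content filled in by the `onload` handler.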

    Most of the other Splash API endpoints have corresponding methods in the library (the image-oriented ones are on the TODO list). We can get the same page in both Splash JSON:

    println(s.renderJSON("https://rud.is/splash-js-test.html", responseBody = true, html = true))
    
    ## {"title":"Test","requestedUrl":"https://rud.is/splash-js-test.html","url":"https://rud.is/splash-js-test.html","geometry":[0,0,1024,768],"html":"\n    Test\n  \n  \n    <p>This is a Splash test page.\n    <p>This won't be here if javascript is disabled\n    \n  \n\n"}

    and HAR formats:

    println(s.renderHAR("https://rud.is/splash-js-test.html?1", responseBody = true))
    
    ## {"log":{"browser":{"version":"602.1","comment":"PyQt 5.9, Qt 5.9.1","name":"QWebKit"},"pages":[{"pageTimings":{"_onPrepareStart":286,"_onStarted":1,"onContentLoad":285,"onLoad":286},"id":"1","title":"Test","startedDateTime":"2018-08-18T22:27:57.550751Z"}],"version":"1.2","creator":{"version":"3.2","name":"Splash"},"entries":[{"pageref":"1","time":121,"timings":{"connect":-1,"blocked":-1,"send":0,"ssl":-1,"receive":1,"dns":-1,"wait":120},"request":{"url":"https://rud.is/splash-js-test.html?2","headers":[{"value":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/602.1 (KHTML, like Gecko) splash Version/9.0 Safari/602.1","name":"User-Agent"},{"value":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8","name":"Accept"}],"queryString":[{"value":"","name":"2"}],"method":"GET","httpVersion":"HTTP/1.1","headersSize":188,"cookies":[],"bodySize":-1},"cache":{},"response":{"headers":[{"value":"nginx/1.13.9","name":"Server"},{"value":"Sat, 18 Aug 2018 22:20:16 GMT","name":"Date"},{"value":"text/html","name":"Content-Type"},{"value":"Sat, 18 Aug 2018 21:49:08 GMT","name":"Last-Modified"},{"value":"chunked","name":"Transfer-Encoding"},{"value":"keep-alive","name":"Connection"},{"value":"Accept-Encoding","name":"Vary"},{"value":"W/\"5b789454-15d\"","name":"ETag"},{"value":"Sun, 19 Aug 2018 21:49:08 GMT","name":"Expires"},{"value":"max-age=84532","name":"Cache-Control"},{"value":"max-age=31536000; includeSubDomains; preload","name":"Strict-Transport-Security"},{"value":"SAMEORIGIN","name":"X-Frame-Options"},{"value":"<3","name":"X-Powered-By"},{"value":"frame-ancestors 'self', default-src * 'self' data: 'unsafe-inline' 'unsafe-eval'; report-uri 'https://hrbrmstr.report-uri.com/r/d/csp/reportOnly';","name":"Content-Security-Policy"},{"value":"default-src * 'self' data: 'unsafe-inline' 'unsafe-eval'; report-uri 'https://hrbrmstr.report-uri.com/r/d/csp/reportOnly';","name":"X-Content-Security-Policy"},{"value":"default-src * 'self' data: 'unsafe-inline' 'unsafe-eval'; 
report-uri 'https://hrbrmstr.report-uri.com/r/d/csp/reportOnly';","name":"X-WebKit-CSP"},{"value":"1; mode=block","name":"X-XSS-Protection"},{"value":"nosniff","name":"X-Content-Type-Options"},{"value":"gzip","name":"Content-Encoding"}],"ok":true,"redirectURL":"","httpVersion":"HTTP/1.1","bodySize":349,"cookies":[],"status":200,"content":{"encoding":"base64","mimeType":"text/html","text":"PGh0bWw+CiAgPGhlYWQ+CiAgICA8dGl0bGU+VGVzdDwvdGl0bGU+CiAgPC9oZWFkPgogIDxib2R5IG9ubG9hZD0iYWRkRWxlbWVudHMoKSI+CiAgICA8cD5UaGlzIGlzIGEgU3BsYXNoIHRlc3QgcGFnZS48L3A+CiAgICA8cD48c3BhbiBpZD0idGFyZ2V0Ij48L3NwYW4+PC9wPgogICAgPHNjcmlwdD4KICAgICAgZnVuY3Rpb24gYWRkRWxlbWVudHMoKSB7CiAgICAgICAgZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQoInRhcmdldCIpLmlubmVySFRNTCA9ICJUaGlzIHdvbid0IGJlIGhlcmUgaWYgamF2YXNjcmlwdCBpcyBkaXNhYmxlZCIgOwogICAgICB9ICAgIAogICAgPC9zY3JpcHQ+CiAgPC9ib2R5Pgo8L2h0bWw+Cg==","size":349},"headersSize":971,"statusText":"OK","url":"https://rud.is/splash-js-test.html?2"},"_splash_processing_state":"finished","startedDateTime":"2018-08-18T22:27:57.552544Z"}]}}
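
    With `responseBody = true`, the response body in the HAR output above is base64-encoded (note the `"encoding":"base64"` field). A minimal sketch of decoding it with the JDK follows; `HarBodySketch` is a hypothetical helper, and `prefix` is just the first 20 characters of the `text` value shown above.

    ```scala
    import java.nio.charset.StandardCharsets
    import java.util.Base64

    object HarBodySketch {
      // Decode a base64-encoded HAR "content.text" value into a UTF-8 string.
      def decodeBody(base64Text: String): String =
        new String(Base64.getDecoder.decode(base64Text), StandardCharsets.UTF_8)

      // First 20 characters of the "text" value in the output above.
      val prefix: String = "PGh0bWw+CiAgPGhlYWQ+"

      def main(args: Array[String]): Unit =
        println(decodeBody(prefix)) // prints "<html>", a newline, then "  <head>"
    }
    ```

    Decoding the full value gives back the raw page source, that is, the HTML as served, before any JavaScript has run.
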
    Definition Classes
    root
  • package splish
    Definition Classes
    root
  • Splash
  • SplashMain
splish

Splash

case class Splash(splashHost: String = "localhost", splashPort: Integer = 8050, splashUser: String = null, splashPassword: String = null, useSSL: Boolean = false) extends Product with Serializable

A class to facilitate access to a Splash instance.

Splash is a JavaScript rendering service: a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and Qt 5. The (Twisted) Qt reactor makes the service fully asynchronous, which lets it take advantage of WebKit concurrency via the Qt main loop.

splashHost

the host name or IP address of the Splash instance. Defaults to "localhost"

splashPort

the port number the Splash instance is running on. Defaults to 8050

splashUser

the username to use if authentication is enabled in the Splash instance. Keep null for no authentication

useSSL

if true, the connection to the Splash instance will be made over HTTPS
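
How these settings might combine into an endpoint address can be sketched as follows. The `baseUrl` and `endpoint` helpers are hypothetical and not part of splish; they only assume the usual `scheme://host:port` layout, with `useSSL` selecting `https`. `render.html` is the endpoint name used by the Splash HTTP API.

```scala
object SplashUrlSketch {
  // Build the base address of a Splash instance from the connection settings.
  def baseUrl(host: String = "localhost", port: Int = 8050, useSSL: Boolean = false): String = {
    val scheme = if (useSSL) "https" else "http"
    s"$scheme://$host:$port"
  }

  // Append an endpoint name such as "render.html" to a base address.
  def endpoint(base: String, name: String): String = s"$base/$name"

  def main(args: Array[String]): Unit = {
    println(baseUrl())                                   // http://localhost:8050
    println(endpoint(baseUrl(useSSL = true), "render.html")) // https://localhost:8050/render.html
  }
}
```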

Example:

    import splish.Splash
    Splash().renderHTML("https://www.scala-lang.org/")

See also

The Splash Official API Documentation

Instance Constructors

  1. new Splash(splashHost: String = "localhost", splashPort: Integer = 8050, splashUser: String = null, splashPassword: String = null, useSSL: Boolean = false)

    splashHost

    the host name or IP address of the Splash instance. Defaults to "localhost"

    splashPort

    the port number the Splash instance is running on. Defaults to 8050

    splashUser

    the username to use if authentication is enabled in the Splash instance. Keep null for no authentication

    useSSL

    if true, the connection to the Splash instance will be made over HTTPS

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  5. def clone(): AnyRef
    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @native() @throws( ... )
  6. def debugInfo(): Js

    Retrieve debug-level information for the Splash instance

    returns

    the Splash debug-level information (a ujson.Js parsed object)

  7. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  8. def execute(luaSource: String, timeout: Double = 30, allowedDomains: String = null, proxy: String = null, filters: String = null, luaArgs: Map[String, String] = null): Response

    Execute a custom rendering script and return a result.

    The "render" endpoints cover many common use cases, but are occasionally insufficient for a given task. This method wraps the API endpoint that lets the caller write custom Splash Scripts.

    These are complete Lua scripts that must include a function main(splash, args) ... end in the body. See the sibling method run() for an equivalent method that provides that boilerplate for you.

    luaSource

    The browser automation script. See the Splash Scripts Tutorial for more information.

    timeout

    A timeout (in seconds) for the render (defaults to 30). By default, the maximum allowed value for the timeout is 90 seconds.

    allowedDomains

    (String) Comma-separated list of allowed domain names. If present, Splash won’t load anything from domains that are not in this list, or from subdomains of domains that are not in this list.

    proxy

    Proxy profile name or proxy URL. A proxy URL should have the following format: "[protocol://][user:password@]proxyhost[:port]"

    filters

    (String) Comma-separated list of request filter names.

    luaArgs

    (Map[String,String]) Additional arguments to be passed to the Lua script. These will be available in a splash.args table.

    returns

    a Response object since there's no way for the function to know what the script will return. You will need to process the text element.
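
    As a sketch of what a complete script for execute() might look like, the following builds a Lua source string containing the required main function, plus a matching luaArgs map. The target URL is only an example, and the commented-out call assumes splish on the classpath and a reachable Splash instance. `splash:go`, `splash:wait` and `splash:html` come from Splash's Lua scripting API.

    ```scala
    object ExecuteScriptSketch {
      // A complete Splash script: execute() expects the main function to be present.
      val luaSource: String =
        """function main(splash, args)
          |  assert(splash:go(args.url))
          |  assert(splash:wait(0.5))
          |  return splash:html()
          |end""".stripMargin

      // Extra arguments for the script (read above as args.url).
      val luaArgs: Map[String, String] =
        Map("url" -> "https://rud.is/splash-js-test.html")

      // With splish available and a Splash instance running, the call would be:
      //   val res = splish.Splash().execute(luaSource, luaArgs = luaArgs)
      //   println(res.text)
    }
    ```

    Because this script returns `splash:html()`, the text of the response would be HTML; a script returning a Lua table would produce JSON instead, which is why the method hands back the raw Response.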

  9. def finalize(): Unit
    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  10. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  11. def history(): Js

    Retrieve information about requests/responses for the pages loaded by the Splash instance

    returns

    the Splash history information (a ujson.Js parsed object)

  12. def isActive(): Boolean

    Test whether the Splash instance is responding

    returns

    true if the Splash server could be reached and responded affirmatively.

  13. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  14. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  15. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  16. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  17. def performanceStatistics(): Js

    Retrieve performance-related statistics for the running Splash instance

    returns

    the Splash performance statistics (a ujson.Js parsed object)

  18. def renderHAR(urlToRender: String, responseBody: Boolean = false, baseURL: String = null, timeout: Double = 30, resourceTimeout: Double = 0, wait: Double = 0, proxy: String = null, viewport: String = null, js: String = null, jsSource: String = null, filters: String = null, allowedDomains: String = null, allowedContentTypes: String = null, forbiddenContentTypes: String = null, images: Boolean = true): Js

    Return information about Splash interaction with a website in HAR format. It includes information about requests made, responses received, timings, headers, etc.

    Currently this endpoint doesn’t expose raw request contents; only meta-information like headers and timings is available. Response content is included when responseBody is true.

    urlToRender

    The url to render (required)

    responseBody

    If true then response content is included in the HAR records; the default is false

    returns

    parsed JSON

    See also

    renderHTML() for documentation on the additional parameters

    The Official Splash API documentation for render.har endpoint
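
    As an illustration of working with the timing data in a HAR entry, here is a small sketch. `HarTimingSketch` is a hypothetical helper operating on a plain Map rather than on the parsed ujson value; the numbers are copied from the example entry in the package overview. In the HAR format, a phase that does not apply is reported as -1 and is skipped when totalling.

    ```scala
    object HarTimingSketch {
      // Phases that add up to an entry's total time. "ssl" is left out because
      // the HAR format counts it as part of "connect".
      private val phases = Seq("blocked", "dns", "connect", "send", "wait", "receive")

      // Sum the phases, skipping any that are missing or reported as -1.
      def totalTime(timings: Map[String, Int]): Int =
        phases.flatMap(timings.get).filter(_ >= 0).sum

      // Timings copied from the example entry in the package overview.
      val example: Map[String, Int] = Map(
        "connect" -> -1, "blocked" -> -1, "send" -> 0, "ssl" -> -1,
        "receive" -> 1, "dns" -> -1, "wait" -> 120
      )

      def main(args: Array[String]): Unit =
        println(totalTime(example)) // 121
    }
    ```

    The result agrees with the `"time":121` field of that entry.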

  19. def renderHTML(urlToRender: String, baseURL: String = null, timeout: Double = 30, resourceTimeout: Double = 0, wait: Double = 0, proxy: String = null, viewport: String = null, js: String = null, jsSource: String = null, filters: String = null, allowedDomains: String = null, allowedContentTypes: String = null, forbiddenContentTypes: String = null, images: Boolean = true): String

    Return the HTML of the javascript-rendered page as a String

    urlToRender

    The url to render (required)

    baseURL

    The base URL to render the page with. The base HTML content is fetched from the URL given in urlToRender, while resources referenced by relative URLs in that HTML are fetched using the URL given in baseURL as the base.

    timeout

    A timeout (in seconds) for the render (defaults to 30). By default, the maximum allowed value for the timeout is 90 seconds.

    resourceTimeout

    A timeout (in seconds) for individual network requests.

    wait

    Time (in seconds) to wait for updates after the page is loaded (defaults to 0). Increase this value if you expect pages to contain setInterval/setTimeout JavaScript calls, because with wait=0 their callbacks won’t be executed. A non-zero wait is also required for PNG and JPEG rendering when doing full-page rendering.

    proxy

    Proxy profile name or proxy URL. A proxy URL should have the following format: "[protocol://][user:password@]proxyhost[:port]"

    viewport

    (String) View width and height (in pixels) of the browser viewport to render the web page. Format is “<width>x<height>”, e.g. "800x600". Default value is "1024x768".

    js

    JavaScript profile name.

    jsSource

    JavaScript code to be executed in page context.

    filters

    (String) Comma-separated list of request filter names.

    allowedDomains

    (String) Comma-separated list of allowed domain names. If present, Splash won’t load anything from domains that are not in this list, or from subdomains of domains that are not in this list.

    allowedContentTypes

    (String) Comma-separated list of allowed content types. If present, Splash will abort any request if the response’s content type doesn’t match any of the content types in this list. Wildcards are supported using Python's fnmatch syntax.

    forbiddenContentTypes

    (String) Comma-separated list of forbidden content types. If present, Splash will abort any request if the response’s content type matches any of the content types in this list. Wildcards are supported using Python's fnmatch syntax.

    images

    Whether to download images.

    returns

    String containing HTML content

    See also

    The Official Splash API documentation for render.html endpoint
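
    The interplay of allowedContentTypes and forbiddenContentTypes can be sketched locally. This is not how Splash implements the check (that happens server-side with Python's fnmatch); `ContentTypeFilterSketch` is a hypothetical illustration that supports only the `*` and `?` wildcards. A request is aborted when its content type matches none of the allowed patterns, or when it matches any forbidden pattern.

    ```scala
    import java.util.regex.Pattern

    object ContentTypeFilterSketch {
      // Translate a simplified fnmatch pattern (only * and ?) into a regex.
      private def globToRegex(glob: String): String =
        glob.toList.map {
          case '*' => ".*"
          case '?' => "."
          case c   => Pattern.quote(c.toString)
        }.mkString

      // True if the content type matches at least one comma-separated pattern.
      private def matchesAny(contentType: String, patterns: String): Boolean =
        patterns.split(",").map(_.trim).filter(_.nonEmpty)
          .exists(p => contentType.matches(globToRegex(p)))

      // Abort when an allow-list is set and nothing matches,
      // or when a forbid-list is set and something matches.
      def shouldAbort(contentType: String,
                      allowed: Option[String] = None,
                      forbidden: Option[String] = None): Boolean =
        allowed.exists(a => !matchesAny(contentType, a)) ||
          forbidden.exists(f => matchesAny(contentType, f))
    }
    ```

    For instance, `shouldAbort("image/png", forbidden = Some("image/*"))` is true, while `shouldAbort("text/html", allowed = Some("text/*"))` is false.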

  20. def renderJSON(urlToRender: String, responseBody: Boolean = false, html: Boolean = false, png: Boolean = false, jpeg: Boolean = false, iframes: Boolean = false, script: Boolean = false, console: Boolean = false, history: Boolean = false, har: Boolean = false, baseURL: String = null, timeout: Double = 30, resourceTimeout: Double = 0, wait: Double = 0, proxy: String = null, viewport: String = null, js: String = null, jsSource: String = null, filters: String = null, allowedDomains: String = null, allowedContentTypes: String = null, forbiddenContentTypes: String = null, images: Boolean = true): Js

    Return a parsed, JSON-encoded dictionary with information about JavaScript-rendered webpage. It can include HTML, PNG and other information, based on arguments passed.

    urlToRender

    The url to render (required)

    responseBody

    If true then response content is included in the HAR records; the default is false

    html

    Whether to include HTML in output

    png

    Whether to include PNG in output

    jpeg

    Whether to include JPEG in output

    iframes

    Whether to include information about child frames in output

    script

    Whether to include the result of the executed javascript final statement in output

    console

    Whether to include the executed javascript console messages in output

    history

    Whether to include the history of requests/responses for webpage main frame

    har

    Whether to include HAR in output. If this option is true the result will contain the same data as renderHAR() provides under har key.

    See also

    renderHTML() for documentation on the additional parameters

    The Official Splash API documentation for render.json endpoint
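
    In the Splash HTTP API, the boolean options of render.json are sent as query parameters with the values 1 and 0. How splish builds its requests internally is not shown here, so the following is only a sketch of that mapping; `RenderJsonQuerySketch` and its helpers are hypothetical, and only a few of the flags are included.

    ```scala
    import java.net.URLEncoder

    object RenderJsonQuerySketch {
      // Splash expects 1/0 rather than true/false for boolean options.
      private def flag(b: Boolean): String = if (b) "1" else "0"

      // Percent-encode a value for use in a query string.
      private def enc(s: String): String = URLEncoder.encode(s, "UTF-8")

      // Build a query string for a subset of the render.json options.
      def query(url: String,
                html: Boolean = false,
                png: Boolean = false,
                har: Boolean = false): String = {
        val params = Seq(
          "url" -> url, "html" -> flag(html), "png" -> flag(png), "har" -> flag(har)
        )
        params.map { case (k, v) => s"$k=${enc(v)}" }.mkString("&")
      }
    }
    ```

    Calling `query("https://rud.is/splash-js-test.html", html = true)` yields `url=https%3A%2F%2Frud.is%2Fsplash-js-test.html&html=1&png=0&har=0`.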

  21. def reset(): Js

    Run Python garbage collector in the Splash instance and clear internal WebKit caches.

    returns

    information about the number of objects freed and the status of the Splash instance

  22. def run(luaSource: String, timeout: Double = 30, allowedDomains: String = null, proxy: String = null, filters: String = null, luaArgs: Map[String, String] = null): Response

    Execute a custom rendering script and return a result.

    This is nearly identical to execute(), but it provides the boilerplate function main(splash, args) ... end for you.

    luaSource

    The browser automation script. See the Splash Scripts Tutorial for more information.

    timeout

    A timeout (in seconds) for the render (defaults to 30). By default, the maximum allowed value for the timeout is 90 seconds.

    allowedDomains

    (String) Comma-separated list of allowed domain names. If present, Splash won’t load anything from domains that are not in this list, or from subdomains of domains that are not in this list.

    proxy

    Proxy profile name or proxy URL. A proxy URL should have the following format: "[protocol://][user:password@]proxyhost[:port]"

    filters

    (String) Comma-separated list of request filter names.

    luaArgs

    (Map[String,String]) Additional arguments to be passed to the Lua script. These will be available in a splash.args table.

    returns

    a Response object since there's no way for the function to know what the script will return. You will need to process the text element.
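
    The relationship between run() and execute() can be sketched as a simple wrapping step. Whether the wrapping happens in the library or on the Splash server is not stated here, so `RunBoilerplateSketch` is only a hypothetical illustration of the idea: a script body passed to run() corresponds to the same body placed inside main for execute().

    ```scala
    object RunBoilerplateSketch {
      // Wrap a script body in the main function that execute() requires.
      def wrapInMain(body: String): String =
        s"function main(splash, args)\n$body\nend"

      // A body as it might be passed to run(): no main function around it.
      val body: String =
        """assert(splash:go(args.url))
          |return splash:html()""".stripMargin

      def main(args: Array[String]): Unit =
        println(wrapInMain(body))
    }
    ```

    Inside the body, `splash` and `args` are still available, since they are the parameters of the surrounding main function.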

  23. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  24. def version(): Js

    Retrieve the version information from a running Splash instance

    returns

    the Splash version information (a ujson.Js parsed object)

  25. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  26. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  27. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @native() @throws( ... )

Inherited from Serializable

Inherited from Serializable

Inherited from Product

Inherited from Equals

Inherited from AnyRef

Inherited from Any
