case class Splash(splashHost: String = "localhost", splashPort: Integer = 8050, splashUser: String = null, splashPassword: String = null, useSSL: Boolean = false) extends Product with Serializable
A class to facilitate access to a Splash instance.
Splash is a JavaScript rendering service. It’s a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and Qt 5. The Twisted Qt reactor makes the service fully asynchronous, which lets it take advantage of WebKit concurrency via the Qt main loop.
- splashHost
the host name or IP address of the Splash instance. Defaults to "localhost"
- splashPort
the port number the Splash instance is running on. Defaults to 8050
- splashUser
the username to use if authentication is enabled in the Splash instance. Keep null for no authentication
- useSSL
if true, the connection to the Splash instance will be made over HTTPS
import splish.Splash
Splash().renderHTML("https://www.scala-lang.org/")
Instance Constructors
- new Splash(splashHost: String = "localhost", splashPort: Integer = 8050, splashUser: String = null, splashPassword: String = null, useSSL: Boolean = false)
- splashHost
the host name or IP address of the Splash instance. Defaults to "localhost"
- splashPort
the port number the Splash instance is running on. Defaults to 8050
- splashUser
the username to use if authentication is enabled in the Splash instance. Keep null for no authentication
- useSSL
if true, the connection to the Splash instance will be made over HTTPS
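A minimal sketch of building clients from the constructor signature above. The remote host name, port, user name, and the `SPLASH_PASSWORD` environment variable are placeholders chosen for illustration, not values defined by the library.

```scala
import splish.Splash

object ConnectExample {
  // All defaults: localhost, port 8050, no authentication, no SSL.
  val local: Splash = Splash()

  // Hypothetical remote instance reached over HTTPS with authentication.
  // Host, port, and credentials are placeholders.
  val remote: Splash = Splash(
    splashHost = "splash.example.internal",
    splashPort = 8443,
    splashUser = "scraper",
    splashPassword = sys.env.getOrElse("SPLASH_PASSWORD", ""),
    useSSL = true
  )

  def main(args: Array[String]): Unit = {
    // Case-class fields are readable, which is handy for logging.
    println(s"local  -> ${local.splashHost}:${local.splashPort}")
    println(s"remote -> ${remote.splashHost}:${remote.splashPort} (SSL: ${remote.useSSL})")
  }
}
```

Named arguments keep the call readable and let the remaining parameters fall back to their defaults.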
Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##(): Int
- Definition Classes
- AnyRef → Any
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- def clone(): AnyRef
- Attributes
- protected[java.lang]
- Definition Classes
- AnyRef
- Annotations
- @native() @throws( ... )
- def debugInfo(): Js
Retrieve debug-level information for the Splash instance.
- returns
the Splash debug-level information (a parsed ujson.Js object)
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def execute(luaSource: String, timeout: Double = 30, allowedDomains: String = null, proxy: String = null, filters: String = null, luaArgs: Map[String, String] = null): Response
Execute a custom rendering script and return a result.
The "render" endpoints cover many common use cases but are occasionally insufficient for a given task. This API endpoint lets the caller write custom Splash Scripts.
These are complete Lua scripts that must include a function main(splash, args) ... end in the body. See the sibling method run() for an equivalent method that provides that boilerplate for you.
- luaSource
The browser automation script. See the Splash Scripts Tutorial for more information.
- timeout
A timeout (in seconds) for the render (defaults to 30). By default, the maximum allowed value for the timeout is 90 seconds.
- allowedDomains
(String) Comma-separated list of allowed domain names. If present, Splash won’t load anything from domains that are not in this list, nor from subdomains of domains that are not in this list.
- proxy
Proxy profile name or proxy URL. A proxy URL should have the following format: "[protocol://][user:password@]proxyhost[:port]"
- filters
(String) Comma-separated list of request filter names.
- luaArgs
(Map[String,String]) Additional arguments to be passed to the Lua script. These will be available in a splash.args table.
- returns
a Response object, since there's no way for the function to know what the script will return. You will need to process the text element.
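A minimal sketch of calling execute() with a complete script, assuming a Splash instance on localhost:8050. The argument name `target_url` is made up for this example, and reading the body through `text` on the returned Response is an assumption based on the description above.

```scala
import splish.Splash

object ExecuteExample {
  // A complete Splash script: execute() expects main to be defined by the caller.
  // `target_url` is a hypothetical argument name supplied through luaArgs.
  val script: String =
    """function main(splash, args)
      |  assert(splash:go(splash.args.target_url))
      |  assert(splash:wait(0.5))
      |  return splash:html()
      |end""".stripMargin

  def main(args: Array[String]): Unit = {
    val response = Splash().execute(
      luaSource = script,
      timeout = 60,
      luaArgs = Map("target_url" -> "https://www.scala-lang.org/")
    )
    // This script returns HTML, so the body is treated as plain text.
    // Assumption: `text` is the accessor exposed by the requests Response.
    println(response.text.take(200))
  }
}
```

Because the script decides what is returned (HTML here, but possibly JSON or an image), the caller has to interpret the response body accordingly.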
- def finalize(): Unit
- Attributes
- protected[java.lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
- final def getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def history(): Js
Retrieve information about requests/responses for the pages loaded by the Splash instance.
- returns
the Splash history information (a parsed ujson.Js object)
- def isActive(): Boolean
Test whether the Splash instance is responding.
- returns
true if the Splash server could be reached and responded affirmatively.
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- def performanceStatistics(): Js
Retrieve performance-related statistics for the running Splash instance.
- returns
the Splash performance statistics (a parsed ujson.Js object)
- def renderHAR(urlToRender: String, responseBody: Boolean = false, baseURL: String = null, timeout: Double = 30, resourceTimeout: Double = 0, wait: Double = 0, proxy: String = null, viewport: String = null, js: String = null, jsSource: String = null, filters: String = null, allowedDomains: String = null, allowedContentTypes: String = null, forbiddenContentTypes: String = null, images: Boolean = true): Js
Return information about Splash interaction with a website in HAR format. It includes information about requests made, responses received, timings, headers, etc.
Currently this endpoint doesn’t expose raw request contents; only meta-information like headers and timings is available. Response content is included when responseBody is true.
- urlToRender
The URL to render (required)
- responseBody
If true, response content is included in the HAR records. The default is false.
- returns
parsed JSON
- See also
renderHTML() for documentation on the additional parameters
The Official Splash API documentation for render.har endpoint
- def renderHTML(urlToRender: String, baseURL: String = null, timeout: Double = 30, resourceTimeout: Double = 0, wait: Double = 0, proxy: String = null, viewport: String = null, js: String = null, jsSource: String = null, filters: String = null, allowedDomains: String = null, allowedContentTypes: String = null, forbiddenContentTypes: String = null, images: Boolean = true): String
Return the HTML of the JavaScript-rendered page as a String.
- urlToRender
The URL to render (required)
- baseURL
The base URL to render the page with. Base HTML content will be fetched from the URL given in the urlToRender argument, while relatively referenced resources in the HTML used to render the page are fetched using the URL given in the baseURL argument as the base.
- timeout
A timeout (in seconds) for the render (defaults to 30). By default, the maximum allowed value for the timeout is 90 seconds.
- resourceTimeout
A timeout (in seconds) for individual network requests.
- wait
Time (in seconds) to wait for updates after the page is loaded (defaults to 0). Increase this value if you expect pages to contain setInterval/setTimeout JavaScript calls, because with wait=0 the callbacks of setInterval/setTimeout won’t be executed. A non-zero wait is also required for PNG and JPEG rendering when doing full-page rendering.
- proxy
Proxy profile name or proxy URL. A proxy URL should have the following format: "[protocol://][user:password@]proxyhost[:port]"
- viewport
(String) View width and height (in pixels) of the browser viewport to render the web page. Format is "<width>x<height>", e.g. "800x600". Default value is "1024x768".
- js
JavaScript profile name.
- jsSource
JavaScript code to be executed in page context.
- filters
(String) Comma-separated list of request filter names.
- allowedDomains
(String) Comma-separated list of allowed domain names. If present, Splash won’t load anything from domains that are not in this list, nor from subdomains of domains that are not in this list.
- allowedContentTypes
(String) Comma-separated list of allowed content types. If present, Splash will abort any request if the response’s content type doesn’t match any of the content types in this list. Wildcards are supported using Python's fnmatch syntax.
- forbiddenContentTypes
(String) Comma-separated list of forbidden content types. If present, Splash will abort any request if the response’s content type matches any of the content types in this list. Wildcards are supported using Python's fnmatch syntax.
- images
Whether to download images.
- returns
String containing HTML content
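A short sketch of how the parameters above might be combined for a JavaScript-heavy page, assuming a Splash instance on localhost:8050. The URL and the specific values are illustrative only.

```scala
import splish.Splash

object RenderHtmlExample {
  // Placeholder page; any URL that builds its content with JavaScript would do.
  val url: String = "https://www.scala-lang.org/"

  def main(args: Array[String]): Unit = {
    val html: String = Splash().renderHTML(
      urlToRender = url,
      wait = 2.0,             // give setTimeout/setInterval callbacks time to run
      resourceTimeout = 10.0, // give up on any single request after 10 seconds
      viewport = "1280x800",  // "<width>x<height>" in pixels
      images = false          // skip image downloads to speed up the render
    )
    println(s"${html.length} characters of rendered HTML")
  }
}
```

Keeping `wait` well below `timeout` leaves room for the page load itself within the overall render budget.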
- def renderJSON(urlToRender: String, responseBody: Boolean = false, html: Boolean = false, png: Boolean = false, jpeg: Boolean = false, iframes: Boolean = false, script: Boolean = false, console: Boolean = false, history: Boolean = false, har: Boolean = false, baseURL: String = null, timeout: Double = 30, resourceTimeout: Double = 0, wait: Double = 0, proxy: String = null, viewport: String = null, js: String = null, jsSource: String = null, filters: String = null, allowedDomains: String = null, allowedContentTypes: String = null, forbiddenContentTypes: String = null, images: Boolean = true): Js
Return a parsed, JSON-encoded dictionary with information about the JavaScript-rendered webpage. It can include HTML, PNG and other information, based on the arguments passed.
- urlToRender
The URL to render (required)
- responseBody
If true, response content is included in the HAR records. The default is false.
- html
Whether to include HTML in output
- png
Whether to include PNG in output
- jpeg
Whether to include JPEG in output
- iframes
Whether to include information about child frames in output
- script
Whether to include the result of the executed JavaScript final statement in output
- console
Whether to include the executed JavaScript console messages in output
- history
Whether to include the history of requests/responses for the webpage main frame
- har
Whether to include HAR in output. If this option is true, the result will contain the same data as renderHAR() provides under the har key.
- See also
renderHTML() for documentation on the additional parameters
The Official Splash API documentation for render.json endpoint
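As a rough illustration of the output flags, the sketch below asks for the rendered HTML, console messages, and HAR data in one call. It assumes a Splash instance on localhost:8050, and the URL is a placeholder.

```scala
import splish.Splash

object RenderJsonExample {
  // Placeholder page used only for illustration.
  val url: String = "https://www.scala-lang.org/"

  def main(args: Array[String]): Unit = {
    val result = Splash().renderJSON(
      urlToRender = url,
      html = true,    // include the rendered HTML
      console = true, // include JavaScript console messages
      har = true,     // include HAR data under the "har" key
      wait = 1.0
    )
    // `result` is a parsed ujson value; printing it shows the whole document.
    println(result)
  }
}
```

Flags left at their default of false are simply omitted from the result, so requesting only what is needed keeps the response small.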
- def reset(): Js
Run Python garbage collector in the Splash instance and clear internal WebKit caches.
- returns
information about the number of objects freed and the status of the Splash instance
- def run(luaSource: String, timeout: Double = 30, allowedDomains: String = null, proxy: String = null, filters: String = null, luaArgs: Map[String, String] = null): Response
Execute a custom rendering script and return a result.
This is nearly identical to execute(), but it provides the boilerplate function main(splash, args) ... end for you.
- luaSource
The browser automation script. See the Splash Scripts Tutorial for more information.
- timeout
A timeout (in seconds) for the render (defaults to 30). By default, the maximum allowed value for the timeout is 90 seconds.
- allowedDomains
(String) Comma-separated list of allowed domain names. If present, Splash won’t load anything from domains that are not in this list, nor from subdomains of domains that are not in this list.
- proxy
Proxy profile name or proxy URL. A proxy URL should have the following format: "[protocol://][user:password@]proxyhost[:port]"
- filters
(String) Comma-separated list of request filter names.
- luaArgs
(Map[String,String]) Additional arguments to be passed to the Lua script. These will be available in a splash.args table.
- returns
a Response object, since there's no way for the function to know what the script will return. You will need to process the text element.
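A minimal sketch of run() with only the body of the script, assuming a Splash instance on localhost:8050. As before, `target_url` is a made-up argument name and `text` on the Response is an assumption.

```scala
import splish.Splash

object RunExample {
  // Only the body of main: run() is documented as supplying the wrapper.
  // `target_url` is a hypothetical argument name supplied through luaArgs.
  val body: String =
    """assert(splash:go(splash.args.target_url))
      |assert(splash:wait(0.5))
      |return {title = splash:evaljs("document.title")}""".stripMargin

  def main(args: Array[String]): Unit = {
    val response = Splash().run(
      luaSource = body,
      luaArgs = Map("target_url" -> "https://www.scala-lang.org/")
    )
    // The script returns a Lua table, which Splash serialises as JSON text.
    println(response.text)
  }
}
```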
- final def synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
- def version(): Js
Retrieve the version information from a running Splash instance.
- returns
the Splash version information (a parsed ujson.Js object)
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @native() @throws( ... )
This is the documentation for the splish library.
Package Information
The splish package contains a single class splish.Splash with methods for interacting with a ScrapingHub Splash instance.
If you haven’t followed the above link yet and are unfamiliar with Splash, the TL;DR is that it’s an alternative to Selenium in that it’s a full browser and executes JavaScript. The full rendering engine is based on Qt WebKit, and Splash instances have a REST API that provides a lot of flexibility when needed and ease of use for more casual scraping tasks.
You can get it up and running locally with Docker via:
If you've built the source and run sbt packInstall, you can start playing with splash on the command line via ~/local/bin/splash-main. Here's the help:
The first thing we need to do is make a connection to the server.
We can test that connection and get some other information as well:
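A minimal sketch of both steps, assuming a Splash instance is listening on localhost:8050. The `statusLine` helper is introduced here only for illustration and is not part of the library.

```scala
import splish.Splash

object HealthCheckExample {
  // Hypothetical helper that turns the health flag into a log line.
  def statusLine(active: Boolean): String =
    if (active) "Splash is up" else "Splash did not respond"

  def main(args: Array[String]): Unit = {
    val splash = Splash() // defaults: localhost, port 8050

    val active = splash.isActive()
    println(statusLine(active))

    if (active) {
      // Both calls return parsed ujson values, printed here as-is.
      println(s"version: ${splash.version()}")
      println(s"performance: ${splash.performanceStatistics()}")
    }
  }
}
```

Checking isActive() first avoids issuing further requests against an instance that is not reachable.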
The library makes use of [uJson](http://www.lihaoyi.com/upickle/#uJson) for more complex return types, and a few methods return a requests [Response](https://github.com/lihaoyi/requests-scala/blob/master/requests/src/requests/Model.scala#L235-L276) object, because the result of a call to the more dynamic endpoints is unknowable at call time (Splash allows you to use Lua to perform complex page interaction, and you can return images, plain text, HTML or JSON via the Lua interface).
The classic use case for Splash is to feed it a URL and get HTML back after it’s had time to process any JavaScript. The URL in the following example relies on JavaScript to add content to the page:
Here’s what that looks like just using the requests library:
Most of the other Splash API endpoints have corresponding methods in the library (the image-oriented ones are on the TODO list). We can get the same page in both Splash JSON:
and HAR formats: