URLParser is a custom search command designed to parse URLs. Because it relies on the new chunked protocol, URLParser is compatible with Splunk 6.4.0 and above.
... | urlparser [field=fieldname] [listname="*|iana|mozilla|..."] [mode=[simple|extended]]
URLParser is a community-supported app. Compared to UTBox, URLParser is faster, extracts more fields, and is easier to use.
URLParser will extract the following fields from the submitted URLs:
The field url_subdomain_parts can also be processed by the Splunk spath command to access individual parts of the subdomain (url_subdomain.1, url_subdomain.2, ...).
The command signature is the following:
... | urlparser [field=fieldname] [mode=[simple|extended]] [listname="listname1|listname2|..."]
All arguments are optional. The default values are:
* field: url
* mode: extended
* listname: mozilla
The simplest way to call urlparser is as follows:
... | urlparser
In the previous example, urlparser automatically works with the field 'url', loads the 'mozilla' suffix list, and performs an 'extended' extraction of the fields.
This example demonstrates the parsing of a 'complex' URL and how the Splunk spath command can be used to leverage the url_subdomain_parts field.
| stats count
| fields - count
| eval url = "hTTp://je@n:pass:w@rd@images.www.gOOGle.Co.uk:256/iDNex.php?var=CALue32&ouech=gros#pouet"
| urlparser
| spath input=url_subdomain_parts
| transpose
This simple example also illustrates that the case of the input URL is left unchanged by URLParser, which is fundamental when working with URLs containing Base64 data, for example (exfiltration scenarios and the like). Users who want to normalize URLs to lower case can easily do so with Splunk's eval command and its lower() function.
It is a good habit to filter URLs before sending them to urlparser, to avoid empty url fields or url fields set to '-' (often seen in proxy logs).
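To illustrate why case preservation matters, here is a minimal Python sketch. It is not part of URLParser, and the payload and URL are made up for illustration. It shows that lowercasing a Base64 value changes what it decodes to.

```python
import base64

# Hypothetical payload, as it might be smuggled out in a query string.
secret = b"secret-data"
encoded = base64.b64encode(secret).decode("ascii")
url = "http://example.com/index.php?d=" + encoded

# With the original case, the payload decodes back to the secret.
original_value = url.split("d=", 1)[1]
decoded_original = base64.b64decode(original_value)

# After lowercasing the whole URL, the same field decodes to other bytes.
lowered_value = url.lower().split("d=", 1)[1]
decoded_after_lowering = base64.b64decode(lowered_value)

print(decoded_original == secret)
print(decoded_after_lowering == secret)
```

The second comparison no longer holds, which is why any lower() normalization should be applied only to fields where case carries no information.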
... | search url=* url!="-" | urlparser
In some situations, using the stats command to deduplicate repeated URLs can be desirable.
URLParser is also accessible as a scripted lookup. This is useful in situations where the custom search command cannot be used, for example when building a data model. The scripted lookup is slower than the custom search command.
... | eval list="iana|mozilla" | lookup urlparser_lookup url list
To pass a string argument to a scripted lookup, a small trick is needed, as illustrated in the previous example: the lists to use are set to 'iana' and 'mozilla' by a preliminary call to the Splunk eval command.
URLParser focuses on URL parsing only. In short, computing the Shannon entropy of a word, whether it is a domain name or not, is not part of the process of parsing a URL.
The mode option accepts two values, 'simple' or 'extended', so its usage is straightforward:
If an unrecognized value is submitted, the default mode 'extended' is used.
The 'simple' mode only calls Python's urlparse() function to extract basic elements from URLs. The 'extended' mode extracts many more elements, such as the TLD, the subdomain, and the domain without the TLD.
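As a rough idea of what urlparse() returns, the sketch below runs the Python 3 standard library version (urllib.parse.urlparse) on the URL used earlier. Using Python 3 is an assumption here, and how URLParser maps these values onto its output fields is not shown. Note that the standard library itself lowercases the scheme, while the other components keep their original case.

```python
from urllib.parse import urlparse

url = ("hTTp://je@n:pass:w@rd@images.www.gOOGle.Co.uk:256"
       "/iDNex.php?var=CALue32&ouech=gros#pouet")
parts = urlparse(url)

# Basic components exposed by the standard library parser.
basic = {
    "scheme": parts.scheme,
    "netloc": parts.netloc,
    "path": parts.path,
    "query": parts.query,
    "fragment": parts.fragment,
}
for name, value in basic.items():
    print(name, "=", value)
```

Nothing here splits the host into subdomain, domain and TLD. That step requires a suffix list, which is what the 'extended' mode adds.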
The listname option specifies one or more lists of known TLDs to load. URLParser ships with two default lists, the IANA list and the Mozilla Public Suffix List, but users can define their own custom lists to either complement or replace the default lists. Multiple lists can be loaded by separating their names with "|" (pipe).
Examples:
There is no limit to the number of lists one can load, and TLDs present in multiple lists are loaded only once (the underlying logic is a boolean OR).
List files are stored under the application directory ($APP_DIR/suffix_lists) and must be named following this syntax: suffix_list_<name lowercase>.dat
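The "boolean OR" behavior can be pictured as a set union. The following is a minimal sketch under that assumption, not URLParser's actual loader. The function name merge_suffix_lists and the tiny sample lists are hypothetical.

```python
def merge_suffix_lists(listname, lists_by_name):
    """Union the suffix sets named in a pipe-separated listname string."""
    merged = set()
    for name in listname.split("|"):
        # A set union keeps each suffix once, however many lists contain it.
        merged |= lists_by_name[name.strip().lower()]
    return merged


# Hypothetical, heavily truncated list contents.
lists_by_name = {
    "iana": {"com", "uk", "fr"},
    "mozilla": {"com", "uk", "co.uk"},
}

merged = merge_suffix_lists("iana|mozilla", lists_by_name)
print(sorted(merged))  # ['co.uk', 'com', 'fr', 'uk']
```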
Examples:
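The naming convention can be sketched as a small helper that turns a list name into the expected file path. The helper name suffix_list_path is hypothetical, and "$APP_DIR" is kept as a literal placeholder rather than resolved.

```python
from pathlib import PurePosixPath


def suffix_list_path(app_dir, name):
    """Build the expected path of a suffix list file from its list name."""
    filename = f"suffix_list_{name.lower()}.dat"
    return PurePosixPath(app_dir) / "suffix_lists" / filename


# Names as they could appear in a listname="..." argument.
paths = [str(suffix_list_path("$APP_DIR", n)) for n in "iana|Mozilla".split("|")]
for p in paths:
    print(p)
```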
This section describes the format expected for the content of a custom list:
Example:
// This is my custom list
pouet
*.yata
!coco.yata
Line 1: defines "pouet" as a TLD.
www.domain.pouet: TLD=pouet, Domain=domain.pouet
Line 2: defines that everything directly under ".yata" is part of the TLD.
www.domain.cw.yata: TLD=cw.yata, Domain=domain.cw.yata
www.domain.hehe.yata: TLD=hehe.yata, Domain=domain.hehe.yata
Line 3: defines an exception for the .yata TLD: coco.yata is NOT a TLD.
www.domain.coco.yata: TLD=yata, Domain=coco.yata
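The three rule types above (plain, wildcard, exception) can be sketched in a few lines of Python. This is an illustrative matcher written for this example, not URLParser's own code. The function names parse_rules and split_domain are hypothetical, and falling back to the last label when no rule matches is an assumption.

```python
def parse_rules(text):
    """Read a custom list, skipping blank lines and // comments."""
    rules = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("//"):
            rules.add(line.lower())
    return rules


def split_domain(hostname, rules):
    """Return (tld, domain) for hostname, trying the longest suffix first."""
    labels = hostname.lower().split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        parent = ".".join(labels[i + 1:])
        if "!" + candidate in rules:
            # Exception rule: the candidate is a domain, its parent is the TLD.
            return parent, candidate
        if candidate in rules or (parent and "*." + parent in rules):
            # Plain or wildcard rule: the candidate itself is the TLD.
            domain = ".".join(labels[i - 1:]) if i > 0 else None
            return candidate, domain
    # Assumed fallback when no rule matches: treat the last label as the TLD.
    domain = ".".join(labels[-2:]) if len(labels) > 1 else None
    return labels[-1], domain


rules = parse_rules("""
// This is my custom list
pouet
*.yata
!coco.yata
""")

for host in ("www.domain.pouet", "www.domain.cw.yata", "www.domain.coco.yata"):
    print(host, split_domain(host, rules))
```

With this custom list, the sketch reproduces the TLD and Domain values given for each example hostname above.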
These tests give only an indication of performance. They were run on a MacBook Pro over a sample dataset of proxy logs with Splunk 6.5.1.
URLParser (scripted lookup)
search url!=- url=* | head 200000 | eval list="mozilla|iana"| lookup urlparser_lookup url list
This search has completed and has returned 5,123 results by scanning 204,129 events in 81.6 seconds
URLParser (custom search command)
search url!=- url=* | head 200000 | urlparser listname="mozilla|iana"
This search has completed and has returned 5,123 results by scanning 204,129 events in 26.91 seconds
As a reference point for comparison, here are the results with UTBox:
search url!=- url=* | head 200000 | eval list="*" | lookup ut_parse_extended_lookup url list
This search has completed and has returned 5,123 results by scanning 204,129 events in 83.123 seconds
URLParser execution logs can be found under $SPLUNK_HOME/var/log/splunk/urlparser.log