Regexp on URLs with Ruby

In Ruby, the usual way to get the components of a URL is the URI module. It lives in the standard library rather than in the core, so you have to require it first:

require 'uri'

uri = URI("https://hello.asdas.com/foo/asdf/asda?aaa=bbb&a=b")
uri.path # "/foo/asdf/asda"

If you are only going to call URI once, you can skip the variable:

require 'uri'
URI("https://hello.asdas.com/foo/asdf/asda?aaa=bbb&a=b").path # "/foo/asdf/asda"
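The same object exposes the other parts of the URL too. Here is a minimal sketch; the URL is the one from above with a made-up `#fragment` added, and the `components` hash is just an illustrative name:

```ruby
require 'uri'

# Same URL as above, with a fragment added for illustration.
uri = URI("https://hello.asdas.com/foo/asdf/asda?aaa=bbb&a=b#fragment")

# Collect the pieces URI has already parsed for us.
components = {
  scheme:   uri.scheme,
  host:     uri.host,
  path:     uri.path,
  query:    uri.query,
  fragment: uri.fragment
}

# The query string can be split into key/value pairs as well.
params = URI.decode_www_form(uri.query)
```

Note that URI returns the query and fragment without their leading `?` and `#`.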

You can also do it with a regexp. This one removes everything before the path instead of capturing the path itself, simply because I think that's easier. Note that the result keeps the query string and loses the leading slash. Here's how it looks:

url = "https://hello.asdas.com/foo/asdf/asda?aaa=bbb&a=b"
url.gsub(/^http(s)?:\/\/(([a-z]+)|([\.]+))+\//, "") # foo/asdf/asda?aaa=bbb&a=b
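The pattern above only accepts lowercase letters and dots in the host, so a URL with digits, hyphens or a port is returned unchanged. A looser variant, as a sketch (the `host_prefix` name and the second URL are made up for the example):

```ruby
url = "https://hello.asdas.com/foo/asdf/asda?aaa=bbb&a=b"

# \A anchors at the start of the string (^ would also match after a newline),
# and [^/]+ accepts any host characters up to the first slash.
host_prefix = %r{\Ahttps?://[^/]+/}

# sub is enough here: an anchored pattern can match at most once.
rest  = url.sub(host_prefix, "")
other = "http://my-host1.example.com:8080/a/b".sub(host_prefix, "")
```

Here `rest` is the same string the original `gsub` produces, and `other` comes out as `"a/b"`.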

A better approach is a single pattern that captures every part of the URL as a group:

url = "https://hello.asdas.com/foo/asdf/asda?aaa=bbb&a=b&#fragment"
/^(https?:\/\/)?([^\/]*)([^?#]*)([^#]*)(.*)$/.match(url)
# <MatchData
# "https://hello.asdas.com/foo/asdf/asda?aaa=bbb&a=b&#fragment"
# 1:"https://"
# 2:"hello.asdas.com"
# 3:"/foo/asdf/asda"
# 4:"?aaa=bbb&a=b&"
# 5:"#fragment"
# >
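If you want the groups as plain variables, `MatchData#captures` returns them as an array that can be destructured. A small sketch, with variable names chosen for the example:

```ruby
url = "https://hello.asdas.com/foo/asdf/asda?aaa=bbb&a=b&#fragment"
pattern = /^(https?:\/\/)?([^\/]*)([^?#]*)([^#]*)(.*)$/

# captures returns groups 1..5 in order, without the full match.
scheme, authority, path, query, fragment = pattern.match(url).captures

# The scheme group is optional, so it is nil when the URL has no scheme.
no_scheme = pattern.match("hello.asdas.com/foo").captures
```

The downside is that you have to remember which position holds which part.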

We can do even better with named captures:

m = /^(?<scheme>https?:\/\/)?(?<authority>[^\/]*)(?<path>[^?#]*)(?<query>[^#]*)(?<fragment>.*)$/.match(url)

m.named_captures
# {
#   "scheme"=>"https://",
#   "authority"=>"hello.asdas.com",
#   "path"=>"/foo/asdf/asda",
#   "query"=>"?aaa=bbb&a=b&",
#   "fragment"=>"#fragment"
# }
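Named groups can also be read one at a time with `m[:path]` or `m["path"]`. As a rough sketch, here is the pattern wrapped in a hypothetical `split_url` helper (not part of Ruby or any library) that also strips the leading `?` and `#`. It assumes Ruby 2.5 or newer for `delete_prefix`:

```ruby
# Hypothetical helper for illustration only.
def split_url(url)
  pattern = %r{\A(?<scheme>https?://)?(?<authority>[^/]*)(?<path>[^?#]*)(?<query>[^#]*)(?<fragment>.*)\z}
  m = pattern.match(url)
  return nil unless m

  # MatchData accepts the group name as a symbol or a string.
  {
    scheme:    m[:scheme],
    authority: m[:authority],
    path:      m[:path],
    query:     m[:query].delete_prefix("?"),
    fragment:  m[:fragment].delete_prefix("#")
  }
end

parts = split_url("https://hello.asdas.com/foo/asdf/asda?aaa=bbb&a=b&#fragment")
```

This is still much more permissive than the URI module, which validates what it parses, so treat it as a convenience rather than a replacement.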

Resources