      1 <!--
      2 Copyright (C) Daniel Stenberg, <daniel@haxx.se>, et al.
      3 
      4 SPDX-License-Identifier: curl
      5 -->
      6 
      7 # URL syntax and their use in curl
      8 
      9 ## Specifications
     10 
     11 The official "URL syntax" is primarily defined in these two different
     12 specifications:
     13 
     14  - [RFC 3986](https://datatracker.ietf.org/doc/html/rfc3986) (although URL is called
     15    "URI" in there)
     16  - [The WHATWG URL Specification](https://url.spec.whatwg.org/)
     17 
     18 RFC 3986 is the earlier one, and curl has always tried to adhere to that one
     19 (since it shipped in January 2005).
     20 
     21 The WHATWG URL spec was written later, is incompatible with RFC 3986 and
     22 changes over time.
     23 
     24 ## Variations
     25 
     26 URL parsers as implemented in browsers, libraries and tools usually opt to
     27 support one of the mentioned specifications. Bugs, differences in
     28 interpretations and the moving nature of the WHATWG spec do however make it
     29 unlikely that multiple parsers treat URLs the same way.
     30 
     31 ## Security
     32 
     33 Due to the inherent differences between URL parser implementations, it is
     34 considered a security risk to mix different implementations and assume the
     35 same behavior.
     36 
     37 For example, if you use one parser to check if a URL uses a good hostname or
     38 the correct auth field, and then pass on that same URL to a *second* parser,
     39 there is always a risk it treats the same URL differently. There is no right
     40 and wrong in URL land, only differences of opinions.
     41 
     42 libcurl offers a separate API to its URL parser for this reason, among others.
     43 
     44 Applications may at times find it convenient to allow users to specify URLs
     45 for various purposes and that string would then end up fed to curl. Getting a
     46 URL from an external untrusted party and using it with curl brings several
     47 security concerns:
     48 
     49 1. If you have an application that runs as or in a server application, getting
     50    an unfiltered URL can trick your application into accessing a local resource
     51    instead of a remote resource. Protecting yourself against localhost accesses
     52    is hard when accepting user provided URLs.
     53 
     54 2. Such custom URLs can access other ports than you planned as port numbers
     55    are part of the regular URL format. The combination of a local host and a
     56    custom port number can allow external users to play tricks with your local
     57    services.
     58 
     59 3. Such a URL might use other schemes than you thought of or planned for.
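
The three concerns above can be turned into explicit checks. The following is a minimal sketch, not a complete defense: the allowed schemes and ports are hypothetical policy values, and it uses Python's `urllib.parse` only to illustrate the idea. As noted earlier in this section, a check is only reliable when it is done by the same parser that later performs the transfer, so a real application using libcurl would do the equivalent checks through libcurl's URL API.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}          # hypothetical policy
ALLOWED_PORTS = {80, 443}                    # hypothetical policy
DEFAULT_PORTS = {"http": 80, "https": 443}   # used when the URL has no port


def check_untrusted_url(url: str) -> bool:
    """Return True only if the URL passes the scheme, port and host checks."""
    try:
        parts = urlsplit(url)
        port = parts.port                    # raises ValueError on a bad port
    except ValueError:
        return False
    scheme = parts.scheme
    host = parts.hostname
    # Concern 3: unexpected schemes.
    if scheme not in ALLOWED_SCHEMES or not host:
        return False
    # Concern 2: unexpected ports.
    if port is None:
        port = DEFAULT_PORTS[scheme]
    if port not in ALLOWED_PORTS:
        return False
    # Concern 1: local resources.
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # A name rather than a literal IP address. It may still resolve to a
        # local address, which this sketch does not detect.
        return True
    return not (addr.is_loopback or addr.is_private or addr.is_link_local)
```

Note the limitation in the last branch: a hostname that resolves to a local address passes this check, which is one reason protecting against localhost accesses is hard.
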
     60 
     61 ## "RFC 3986 plus"
     62 
     63 curl recognizes a URL syntax that we call "RFC 3986 plus". It is based on
     64 the well established RFC 3986 to make sure previously written command lines
     65 and scripts that use curl keep working.
     66 
     67 curl's URL parser allows a few deviations from the spec in order to
     68 inter-operate better with URLs that appear in the wild.
     69 
     70 ### Spaces
     71 
     72 A URL provided to curl cannot contain spaces. They need to be provided URL
     73 encoded to be accepted in a URL by curl.
     74 
     75 An exception to this rule: `Location:` response headers that indicate to a
     76 client where a resource has been redirected to, sometimes contain spaces. This
     77 is a violation of RFC 3986 but is fine in the WHATWG spec. curl handles these
     78 by re-encoding them to `%20`.
     79 
     80 ### Non-ASCII
     81 
     82 Byte values in a provided URL that are outside of the printable ASCII range
     83 are percent-encoded by curl.
     84 
     85 ### Multiple slashes
     86 
     87 An absolute URL always starts with a "scheme" followed by a colon. For all the
     88 schemes curl supports, the colon must be followed by two slashes according to
     89 RFC 3986 but not according to the WHATWG spec, which allows any number of
     90 slashes from one upward.
     91 
     92 curl allows one, two or three slashes after the colon to still be considered a
     93 valid URL.
     94 
     95 ### "scheme-less"
     96 
     97 curl supports "URLs" that do not start with a scheme. This is not supported by
     98 any of the specifications. This is a shortcut to entering URLs that was
     99 supported by browsers early on and has been mimicked by curl.
    100 
    101 Based on what the hostname starts with, curl "guesses" what protocol to use:
    102 
    103  - `ftp.` means FTP
    104  - `dict.` means DICT
    105  - `ldap.` means LDAP
    106  - `imap.` means IMAP
    107  - `smtp.` means SMTP
    108  - `pop3.` means POP3
    109  - all other means HTTP
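
The guessing rule in the list above can be sketched as a simple prefix lookup. The function name is hypothetical, and matching the prefix case-insensitively is an assumption of this sketch rather than something stated above.

```python
# Hostname prefixes and the scheme guessed for each, as listed above.
PREFIX_TO_SCHEME = (
    ("ftp.", "ftp"),
    ("dict.", "dict"),
    ("ldap.", "ldap"),
    ("imap.", "imap"),
    ("smtp.", "smtp"),
    ("pop3.", "pop3"),
)


def guess_scheme(hostname: str) -> str:
    """Guess a scheme for a scheme-less URL from the start of its hostname."""
    lowered = hostname.lower()   # assumption: the match ignores case
    for prefix, scheme in PREFIX_TO_SCHEME:
        if lowered.startswith(prefix):
            return scheme
    return "http"                # everything else is treated as HTTP
```

The trailing dot is part of each prefix, so a hostname such as `ftpserver.example.com` falls through to HTTP.
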
    110 
    111 ### Globbing letters
    112 
    113 The curl command line tool supports "globbing" of URLs. It means that you can
    114 create ranges and lists using `[N-M]` and `{one,two,three}` sequences. The
    115 letters used for this (`[]{}`) are reserved in RFC 3986 and can therefore not
    116 legitimately be part of such a URL.
    117 
    118 They are however not reserved or special in the WHATWG specification, so
    119 globbing can mess up such URLs. Globbing can be turned off for such occasions
    120 (using `--globoff`).
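
To make the idea of globbing concrete, here is a simplified sketch of how `[N-M]` and `{one,two,three}` sequences expand into several URLs. It is not curl's implementation: it handles only plain numeric ranges and comma-separated lists, and the order in which combinations are produced is a choice of this sketch.

```python
import itertools
import re

# Matches a numeric range like [1-3] or a list like {one,two,three}.
TOKEN = re.compile(r"\[(\d+)-(\d+)\]|\{([^{}]*)\}")


def expand_glob(url: str) -> list[str]:
    """Expand numeric ranges and lists in a URL into the URLs they stand for."""
    literals = []   # literal text that precedes each glob token
    choices = []    # the alternatives each glob token expands to
    pos = 0
    for match in TOKEN.finditer(url):
        literals.append(url[pos:match.start()])
        if match.group(3) is not None:
            choices.append(match.group(3).split(","))
        else:
            low, high = int(match.group(1)), int(match.group(2))
            choices.append([str(n) for n in range(low, high + 1)])
        pos = match.end()
    tail = url[pos:]
    results = []
    # One output URL per combination of alternatives.
    for combo in itertools.product(*choices):
        pieces = [lit + alt for lit, alt in zip(literals, combo)]
        results.append("".join(pieces) + tail)
    return results
```

A URL without any glob sequence comes back unchanged as a single-element list. An IPv6 address in brackets is left alone by this sketch because it does not look like a numeric range.
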
    121 
    122 # URL syntax details
    123 
    124 A URL may consist of the following components - many of them are optional:
    125 
    126     [scheme][divider][userinfo][hostname][port number][path][query][fragment]
    127 
    128 Each component is separated from the following component with a divider
    129 character or string.
    130 
    131 For example, this could look like:
    132 
    133     http://user:password@www.example.com:80/index.html?foo=bar#top
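
As an illustration, the example URL can be split into these components with Python's standard `urllib.parse`. This is not curl's parser, and as the Security section explains, two parsers may disagree on unusual input; for this plain URL the split is uncontroversial.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://user:password@www.example.com:80/index.html?foo=bar#top")

# Map each component name used above to the value the parser extracted.
components = {
    "scheme": parts.scheme,
    "user": parts.username,
    "password": parts.password,
    "hostname": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "query": parts.query,
    "fragment": parts.fragment,
}

for name, value in components.items():
    print(f"{name:9} {value}")
```
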
    134 
    135 ## Scheme
    136 
    137 The scheme specifies the protocol to use. A curl build can support a few or
    138 many different schemes. You can limit what schemes curl should accept.
    139 
    140 curl supports the following schemes on URLs specified to transfer. They are
    141 matched case insensitively:
    142 
    143 `dict`, `file`, `ftp`, `ftps`, `gopher`, `gophers`, `http`, `https`, `imap`,
    144 `imaps`, `ldap`, `ldaps`, `mqtt`, `pop3`, `pop3s`, `rtmp`, `rtmpe`, `rtmps`,
    145 `rtmpt`, `rtmpte`, `rtmpts`, `rtsp`, `smb`, `smbs`, `smtp`, `smtps`, `telnet`,
    146 `tftp`
    147 
    148 When the URL is specified to identify a proxy, curl recognizes the following
    149 schemes:
    150 
    151 `http`, `https`, `socks4`, `socks4a`, `socks5`, `socks5h`, `socks`
    152 
    153 ## Userinfo
    154 
    155 The userinfo field can be used to set username and password for
    156 authentication purposes in this transfer. The use of this field is discouraged
    157 since it often means passing around the password in plain text and is thus a
    158 security risk.
    159 
    160 URLs for IMAP, POP3 and SMTP also support *login options* as part of the
    161 userinfo field. They are appended after the password, separated from it by a
    162 semicolon.
    163 
    164 ## Hostname
    165 
    166 The hostname part of the URL contains the address of the server that you want
    167 to connect to. This can be the fully qualified domain name of the server, the
    168 local network name of the machine on your network or the IP address of the
    169 server or machine represented by either an IPv4 or IPv6 address (within
    170 brackets). For example:
    171 
    172     http://www.example.com/
    173 
    174     http://hostname/
    175 
    176     http://192.168.0.1/
    177 
    178     http://[2001:1890:1112:1::20]/
    179 
    180 ### "localhost"
    181 
    182 Starting in curl 7.77.0, curl uses loopback IP addresses for the name
    183 `localhost`: `127.0.0.1` and `::1`. It does not resolve the name using the
    184 resolver functions.
    185 
    186 This is done to make sure the host accessed is truly the localhost - the local
    187 machine.
    188 
    189 ### IDNA
    190 
    191 If curl was built with International Domain Name (IDN) support, it can also
    192 handle hostnames using non-ASCII characters.
    193 
    194 When built with libidn2, curl uses the IDNA 2008 standard. This is equivalent
    195 to the WHATWG URL spec, but differs from certain browsers that use IDNA 2003
    196 Transitional Processing. The two standards have a huge overlap but differ
    197 slightly, perhaps most famously in how they deal with the
    198 [German "double s"](https://en.wikipedia.org/wiki/%c3%9f)
    199 ([LATIN SMALL LETTER SHARP S](https://codepoints.net/U+00DF)).
    200 
    201 When WinIDN is used, curl uses IDNA 2003 Transitional Processing, like the rest
    202 of Windows.
    203 
    204 ## Port number
    205 
    206 If there is a colon after the hostname, that should be followed by the port
    207 number to use, in the range 1 - 65535. curl also supports a blank port number
    208 field - but only if the URL starts with a scheme.
    209 
    210 If the port number is not specified in the URL, curl uses a default port
    211 number based on the provided scheme:
    212 
    213 DICT 2628, FTP 21, FTPS 990, GOPHER 70, GOPHERS 70, HTTP 80, HTTPS 443,
    214 IMAP 143, IMAPS 993, LDAP 389, LDAPS 636, MQTT 1883, POP3 110, POP3S 995,
    215 RTMP 1935, RTMPS 443, RTMPT 80, RTSP 554, SCP 22, SFTP 22, SMB 445, SMBS 445,
    216 SMTP 25, SMTPS 465, TELNET 23, TFTP 69
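
The fallback from an explicit port to a scheme default can be sketched as follows. The table holds only a handful of the entries listed above, the function name is hypothetical, and the parsing is done with Python's `urllib.parse` rather than curl's own parser.

```python
from urllib.parse import urlsplit

# A small subset of the default ports listed above.
DEFAULT_PORTS = {
    "ftp": 21,
    "ftps": 990,
    "http": 80,
    "https": 443,
    "mqtt": 1883,
    "smtp": 25,
    "tftp": 69,
}


def effective_port(url: str) -> int:
    """Return the port in the URL, or the scheme's default if none is given."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    # Raises KeyError for schemes missing from this abbreviated table.
    return DEFAULT_PORTS[parts.scheme]
```
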
    217 
    218 # Scheme specific behaviors
    219 
    220 ## FTP
    221 
    222 The path part of an FTP request specifies the file to retrieve and from which
    223 directory. If the file part is omitted then libcurl downloads the directory
    224 listing for the directory specified. If the directory is omitted then the
    225 directory listing for the root / home directory is returned.
    226 
    227 FTP servers typically put the user in its "home directory" after login, which
    228 then differs between users. To explicitly specify the root directory of an FTP
    229 server, start the path with double slash `//` or `/%2f` (2F is the hexadecimal
    230 value of the ASCII code for the slash).
    231 
    232 ## FILE
    233 
    234 When a `FILE://` URL is accessed on Windows systems, it can be crafted in a
    235 way so that Windows attempts to connect to a (remote) machine when curl wants
    236 to read or write such a path.
    237 
    238 curl only allows the hostname part of a FILE URL to be one out of these three
    239 alternatives: `localhost`, `127.0.0.1` or blank ("", zero characters).
    240 Anything else makes curl fail to parse the URL.
    241 
    242 ### Windows-specific FILE details
    243 
    244 curl accepts that the FILE URL's path starts with a "drive letter". That is a
    245 single letter `a` to `z` followed by a colon or a pipe character (`|`).
    246 
    247 The Windows operating system itself converts some file accesses to perform
    248 network accesses over SMB/CIFS, through several different file path patterns.
    249 This way, a `file://` URL passed to curl *might* be converted into a network
    250 access inadvertently and unknowingly to curl. This is a Windows feature curl
    251 cannot control or disable.
    252 
    253 ## IMAP
    254 
    255 The path part of an IMAP request not only specifies the mailbox to list or
    256 select, but can also be used to check the `UIDVALIDITY` of the mailbox, to
    257 specify the `UID`, `SECTION` and `PARTIAL` octets of the message to fetch and
    258 to specify what messages to search for.
    259 
    260 A top level folder list:
    261 
    262     imap://user:password@mail.example.com
    263 
    264 A folder list on the user's inbox:
    265 
    266     imap://user:password@mail.example.com/INBOX
    267 
    268 Select the user's inbox and fetch message with `uid = 1`:
    269 
    270     imap://user:password@mail.example.com/INBOX/;UID=1
    271 
    272 Select the user's inbox and fetch the first message in the mail box:
    273 
    274     imap://user:password@mail.example.com/INBOX/;MAILINDEX=1
    275 
    276 Select the user's inbox, check the `UIDVALIDITY` of the mailbox is 50 and
    277 fetch message 2 if it is:
    278 
    279     imap://user:password@mail.example.com/INBOX;UIDVALIDITY=50/;UID=2
    280 
    281 Select the user's inbox and fetch the text portion of message 3:
    282 
    283     imap://user:password@mail.example.com/INBOX/;UID=3/;SECTION=TEXT
    284 
    285 Select the user's inbox and fetch the first 1024 octets of message 4:
    286 
    287     imap://user:password@mail.example.com/INBOX/;UID=4/;PARTIAL=0.1024
    288 
    289 Select the user's inbox and check for NEW messages:
    290 
    291     imap://user:password@mail.example.com/INBOX?NEW
    292 
    293 Select the user's inbox and search for messages containing "shadows" in the
    294 subject line:
    295 
    296     imap://user:password@mail.example.com/INBOX?SUBJECT%20shadows
    297 
    298 Searching via the query part of the URL `?` is a search request for the
    299 results to be returned as message sequence numbers (`MAILINDEX`). It is
    300 possible to make a search request for results to be returned as unique ID
    301 numbers (`UID`) by using a custom curl request via `-X`. `UID` numbers are
    302 unique per session (and multiple sessions when `UIDVALIDITY` is the same). For
    303 example, if you are searching for `"foo bar"` in header+body (`TEXT`) and you
    304 want the matching `MAILINDEX` numbers returned then you could search via URL:
    305 
    306     imap://user:password@mail.example.com/INBOX?TEXT%20%22foo%20bar%22
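
Search URLs like this one need the spaces and quotes in the search criteria percent-encoded. A minimal sketch of building them with Python's `urllib.parse.quote` follows; the helper name and its parameters are hypothetical.

```python
from urllib.parse import quote


def imap_search_url(base: str, mailbox: str, criteria: str) -> str:
    """Build an IMAP search URL with the criteria percent-encoded."""
    # safe="" makes quote() encode every reserved character in the criteria.
    return f"{base}/{quote(mailbox)}?{quote(criteria, safe='')}"


url = imap_search_url(
    "imap://user:password@mail.example.com", "INBOX", 'TEXT "foo bar"'
)
print(url)
```
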
    307 
    308 If you want matching `UID` numbers you have to use a custom request:
    309 
    310     imap://user:password@mail.example.com/INBOX -X "UID SEARCH TEXT \"foo bar\""
    311 
    312 For more information about IMAP commands please see RFC 9051. For more
    313 information about the individual components of an IMAP URL please see RFC 5092.
    314 
    315 * Note old curl versions would `FETCH` by message sequence number when `UID`
    316 was specified in the URL. That was a bug fixed in 7.62.0, which added
    317 `MAILINDEX` to `FETCH` by mail sequence number.
    318 
    319 ## LDAP
    320 
    321 The path part of an LDAP request can be used to specify the Distinguished
    322 Name, Attributes, Scope, Filter and Extension for an LDAP search. Each field
    323 is separated by a question mark and when that field is not required an empty
    324 string with the question mark separator should be included.
    325 
    326 Search for the `DN` as `My Organization`:
    327 
    328     ldap://ldap.example.com/o=My%20Organization
    329 
    330 The same search but only return `postalAddress` attributes:
    331 
    332     ldap://ldap.example.com/o=My%20Organization?postalAddress
    333 
    334 Search for an empty `DN` and request information about the
    335 `rootDomainNamingContext` attribute for an Active Directory server:
    336 
    337     ldap://ldap.example.com/?rootDomainNamingContext
    338 
    339 For more information about the individual components of an LDAP URL please see
    340 [RFC 4516](https://datatracker.ietf.org/doc/html/rfc4516).
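
The way the fields are joined with question marks can be sketched like this. The helper and its parameter names are hypothetical, the Extension field is left out for brevity, and dropping trailing empty fields is a choice of this sketch that matches the examples above.

```python
from urllib.parse import quote


def ldap_url(host: str, dn: str = "", attributes=(), scope: str = "",
             search_filter: str = "") -> str:
    """Build an LDAP URL from its DN, attributes, scope and filter fields."""
    fields = [
        quote(dn, safe="=,"),                 # keep DN punctuation readable
        ",".join(attributes),                 # attributes are comma-separated
        scope,
        quote(search_filter, safe="=()&|!*"),
    ]
    # Drop unused fields at the end; empty fields in the middle must stay so
    # that later fields keep their position.
    while fields and fields[-1] == "":
        fields.pop()
    return f"ldap://{host}/" + "?".join(fields)
```

With only a scope given, for example `ldap_url("ldap.example.com", "o=My Organization", scope="sub")`, the empty attributes field is kept and the result contains two consecutive question marks.
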
    341 
    342 ## POP3
    343 
    344 The path part of a POP3 request specifies the message ID to retrieve. If the
    345 ID is not specified then a list of waiting messages is returned instead.
    346 
    347 ## SCP
    348 
    349 The path part of an SCP URL specifies the path and file to retrieve or
    350 upload. The file is taken as an absolute path from the root directory on the
    351 server.
    352 
    353 To specify a path relative to the user's home directory on the server, prepend
    354 `~/` to the path portion.
    355 
    356 ## SFTP
    357 
    358 The path part of an SFTP URL specifies the file to retrieve or upload. If the
    359 path ends with a slash (`/`) then a directory listing is returned instead of a
    360 file. If the path is omitted entirely then the directory listing for the root
    361 / home directory is returned.
    362 
    363 ## SMB
    364 The path part of an SMB request specifies the file to retrieve and from what
    365 share and directory or the share to upload to and as such, may not be omitted.
    366 If the username is embedded in the URL then it must contain the domain name
    367 and as such, the backslash must be URL encoded as `%5c` (`%2f` is the forward slash).
    368 
    369 When uploading to SMB, the size of the file needs to be known ahead of time,
    370 meaning that you cannot upload a file passed to curl over a pipe like stdin.
    371 
    372 curl supports SMB version 1 only.
    373 
    374 ## SMTP
    375 
    376 The path part of an SMTP request specifies the hostname to present during
    377 communication with the mail server. If the path is omitted, then libcurl
    378 attempts to resolve the local computer's hostname. However, this may not
    379 return the fully qualified domain name that is required by some mail servers
    380 and specifying this path allows you to set an alternative name, such as your
    381 machine's fully qualified domain name, which you might have obtained from an
    382 external function such as gethostname or getaddrinfo.
    383 
    384 The default SMTP port is 25. Some servers use port 587 as an alternative.
    385 
    386 ## RTMP
    387 
    388 There is no official URL spec for RTMP so libcurl uses the URL syntax supported
    389 by the underlying librtmp library. It has a syntax where it wants a
    390 traditional URL, followed by a space and a series of space-separated
    391 `name=value` pairs.
    392 
    393 While space is not typically a "legal" letter, libcurl accepts it. When a
    394 user wants to pass in a `#` (hash) character it is treated as a fragment and
    395 it gets cut off by libcurl if provided literally. You have to escape it by
    396 providing it as backslash and its ASCII value in hexadecimal: `\23`.