set regex pattern for CURIE and add URL, fixes #400 #406
sierra-moxon wants to merge 3 commits into 1.4
sierra-moxon commented on Mar 16, 2023
- The regexes are fairly permissive, since CURIE syntax in the wild may not conform to prefix best practices.
- The ticket called for a URL type. If we want a semantic URI type (where the URI can be the subject of a triple in RDF), we should be much more restrictive. When we have a specific use case for this, I'd be happy to refine it.
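To make "fairly permissive" concrete, here is a minimal sketch of what a permissive CURIE-style check could look like. The pattern `CURIE_LIKE` and the helper `is_curie_like` are hypothetical illustrations, not the pattern committed in this PR.

```python
import re

# Hypothetical permissive pattern (an assumption, not the one in this PR):
# a prefix starting with a letter or underscore, a colon, then any
# non-empty reference that contains no whitespace.
CURIE_LIKE = re.compile(r"^[A-Za-z_][A-Za-z0-9_.-]*:\S+$")


def is_curie_like(value: str) -> bool:
    """Return True if value has the loose prefix:reference shape."""
    return CURIE_LIKE.match(value) is not None


for candidate in ["PMID:123", "CHEBI:15377", "foo", "PMID: 123"]:
    print(candidate, is_curie_like(candidate))
```

A pattern this loose also accepts strings such as `http://arax.ncats.io`, because `http` has the shape of a prefix. That is the trade-off of staying permissive.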
edeutsch left a comment
Okay with me in principle, but I don't think the regular expression is helpful. Is there a "standard" regexp for URLs out there on Stack Overflow or somewhere?
description: >-
externalDocs:
  url: https://www.ietf.org/rfc/rfc3987.txt
pattern: ^(http(s)?:\/\/.)?(www\.)?\S+$
Doesn't this regular expression match anything without a space? Is that really helpful? Testing this regexp:
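#!/usr/bin/env python3
import re

# Candidate strings: one real URL, three non-URLs, and one URL containing a space
inputs = [ 'http://arax.ncats.io', 'foo', '@*$&@#', 'PMID:123', 'http://peptideatlas.org/tmp/hello world.txt' ]

# "text" rather than "input", to avoid shadowing the built-in input()
for text in inputs:
    match = re.search(r'^(http(s)?:\/\/.)?(www\.)?\S+$', text)
    if match:
        print(f"MATCHES {text}")
    else:
        print(f" x {text}")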
yields
MATCHES http://arax.ncats.io
MATCHES foo
MATCHES @*$&@#
MATCHES PMID:123
x http://peptideatlas.org/tmp/hello world.txt
Only the last one fails.
Yet if you paste that into your browser, it works!
Maybe this is useful with a more restrictive regular expression?
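As a rough sketch of what "more restrictive" could mean, the pattern below requires an http(s) scheme and a plausible host. `STRICT_URL` is a hypothetical pattern written for illustration only: it is neither a standard URL regex nor the one proposed in this PR, and it deliberately ignores cases such as IPv6 hosts and userinfo.

```python
import re

# Hypothetical stricter pattern (an assumption, not from this PR):
# http or https scheme, a host of letters/digits/dots/hyphens,
# an optional port, and an optional path/query/fragment without whitespace.
STRICT_URL = re.compile(r"^https?://[A-Za-z0-9.-]+(:\d+)?([/?#]\S*)?$")

samples = [
    "http://arax.ncats.io",
    "foo",
    "@*$&@#",
    "PMID:123",
    "http://peptideatlas.org/tmp/hello world.txt",
]

# Map each sample to whether the stricter pattern accepts it
results = {sample: bool(STRICT_URL.match(sample)) for sample in samples}
for sample, ok in results.items():
    print(f"{'MATCHES' if ok else '      x'} {sample}")
```

With this sketch only `http://arax.ncats.io` matches. The URL containing a space is still rejected, so a stricter pattern does not by itself address the browser-compatibility point above.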
My first thought, too, was to find a more comprehensive one. The more restrictive one I found and tested failed the line-length linting on this repo, and before trying to break it across many lines, I asked around a bit for best practice on this. The feedback I got was that a very restrictive regex will mean constant tweaking to accommodate "in the wild" implementations of URLs and CURIEs. However, if we want a much more restrictive URI (not URL) type, we can do that.
vdancik left a comment
I don't think specifying a regex in the specification is necessary.