Regular expression to find URLs within a string

Question 1

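Does anyone know of a regular expression I could use to find URLs within a string? I have found many regular expressions on Google for checking whether an entire string is a URL, but I need to be able to search a whole string for URLs. For example, I would like to be able to find www.google.com and http://yahoo.com in the following string: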

Hello www.google.com World http://yahoo.com

I am not looking for specific URLs in the string. I am looking for all of the URLs in the string, which is why I need a regex.

Question 2

This is the one I use:

(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?

It works for me and should work for you too.
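
A minimal Python sketch of applying this pattern to the string from the question. Python and the variable names are assumptions, since the answer does not name a language. Because the pattern contains capturing groups, `re.findall` would return tuples of groups rather than whole URLs, so the sketch uses `re.finditer` and reads `group(0)`.

```python
import re

# Pattern copied from the answer above; a scheme (http, ftp or https) is required.
URL_PATTERN = re.compile(
    r"(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))"
    r"([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"
)

text = "Hello www.google.com World http://yahoo.com"

# group(0) is the full matched URL, regardless of the capturing groups.
urls = [m.group(0) for m in URL_PATTERN.finditer(text)]
print(urls)  # ['http://yahoo.com']
```

Since the pattern requires a scheme, www.google.com is not reported for this input.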

Question 3

I guess no regex is perfect for this use. I found a pretty solid one here:

/(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[A-Z0-9+&@#\/%=~_|$])/igm

Some differences / advantages compared to the others posted here:

It does not match email addresses.
It does match localhost:12345.
It does not detect something like moo.com without http or www.

See here for examples.

Question 4

import re

text = """The link of this question: /programming/6038061/regular-expression-to-find-urls-within-a-string
Also there are some urls: www.google.com, facebook.com, http://test.com/method?param=wasd
The code below catches all urls in text and returns urls in list."""

# Raw string so the backslashes reach the regex engine unchanged
urls = re.findall(r'(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', text)
print(urls)

Output:

[
    '/programming/6038061/regular-expression-to-find-urls-within-a-string', 
    'www.google.com', 
    'facebook.com',
    'http://test.com/method?param=wasd'
]

Question 5

None of the solutions provided here solved the problems / use cases I had.

What I have provided here is the best I have found / made so far. I will update it when I find new edge cases that it does not handle.

\b
  #Word cannot begin with special characters
  (?<![@.,%&#-])
  #Protocols are optional, but take them with us if they are present
  (?<protocol>\w{2,10}:\/\/)?
  #Domains have to be of a length of 1 chars or greater
  ((?:\w|\&\#\d{1,5};)[.-]?)+
  #The domain ending has to be between 2 to 15 characters
  (\.([a-z]{2,15})
       #If no domain ending we want a port, only if a protocol is specified
       |(?(protocol)(?:\:\d{1,6})|(?!)))
\b
#Word cannot end with @ (made to catch emails)
(?![@])
#We accept any number of slugs, given we have a char after the slash
(\/)?
#If we have endings like ?=fds include the ending
(?:([\w\d\?\-=#:%@&.;])+(?:\/(?:([\w\d\?\-=#:%@&;.])+))*)?
#The last char cannot be one of these symbols .,?!,- exclude these
(?<![.,?!-])

Question 6

I think this regex pattern handles precisely what you want:

/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/

Here is an example snippet that extracts the URLs:

// The Regular Expression filter
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";

// The Text you want to filter for urls
$text = "The text you want  /programming/6038061/regular-expression-to-find-urls-within-a-string to filter goes here.";

// Find every URL in the text; the results are stored in $matches
preg_match_all($reg_exUrl, $text, $matches);
var_dump($matches);

Question 7

None of the answers above match Unicode characters in the URL, for example: http://google.com?query=đức+filan+đã+search

As a solution, this one should work:

(ftp:\/\/|www\.|https?:\/\/){1}[a-zA-Z0-9u00a1-\uffff0-]{2,}\.[a-zA-Z0-9u00a1-\uffff0-]{2,}(\S*)

Question 8

If you have to be strict in selecting links, I would go for:

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

For more information, read this:

An Improved Liberal, Accurate Regex Pattern for Matching URLs

Question 9

I found this, which covers most sample links, including subdirectory parts.

The regex is:

(?:(?:https?|ftp):\/\/|\b(?:[a-z\d]+\.))(?:(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))?\))+(?:\((?:[^\s()<>]+|(?:\(?:[^\s()<>]+\)))?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))?

Question 10

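If you have a URL pattern, you should be able to search for it in your string. Just make sure the pattern does not have ^ and $ marking the beginning and end of the URL string. So if P is the pattern for a URL, look for matches of P.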
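
A small Python sketch of the difference. The pattern `https?://\S+` is only a hypothetical stand-in for P, not one proposed in this thread; the point is what the anchors do, not the quality of the pattern.

```python
import re

# Hypothetical validation-style pattern: ^ and $ force the whole string to be a URL.
ANCHORED = re.compile(r"^https?://\S+$")
# The same pattern P with the anchors removed, suitable for searching inside text.
UNANCHORED = re.compile(r"https?://\S+")

text = "Hello www.google.com World http://yahoo.com"

anchored_hit = ANCHORED.search(text)  # None: the text as a whole is not a URL
found = UNANCHORED.findall(text)      # every substring that matches P
print(anchored_hit, found)            # None ['http://yahoo.com']
```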

Question 11

I found URLs in a string using the regex below.

/(http|https)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/

Question 12

Here is a somewhat more optimized regular expression:

(?:(?:(https?|ftp|file):\/\/|www\.|ftp\.)|([\w\-_]+(?:\.|\s*\[dot\]\s*[A-Z\-_]+)+))([A-Z\-\.,@?^=%&amp;:\/~\+#]*[A-Z\-\@?^=%&amp;\/~\+#]){2,6}?

Here is a test with data: https://regex101.com/r/sFzzpY/6

Question 13

Short and simple. I have not tested it in JavaScript code yet, but it looks like it will work:

((http|ftp|https):\/\/)?(([\w.-]*)\.([\w]*))

Code on regex101.com

Question 14

I use this regex:

/((\w+:\/\/\S+)|(\w+[\.:]\w+\S+))[^\s,\.]/ig

It works fine for many URLs, such as http://google.com, https://dev-site.io:8080/home?val=1&count=100, www.regexr.com, localhost:8080/path, ...
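
A hedged Python sketch of this pattern in use. The answer uses JavaScript literal syntax, so expressing the `i` flag as `re.IGNORECASE` and the sample text are assumptions. It illustrates how the trailing `[^\s,\.]` keeps a comma or period that directly follows a URL out of the match.

```python
import re

# Pattern from the answer above, with the JavaScript /i flag expressed as re.IGNORECASE.
URL_PATTERN = re.compile(r"((\w+:\/\/\S+)|(\w+[\.:]\w+\S+))[^\s,\.]", re.IGNORECASE)

# Hypothetical sample text; the commas directly follow the URLs.
text = "Visit http://google.com, www.regexr.com, or localhost:8080/path today"

# The pattern has capturing groups, so take group(0) for the whole match.
urls = [m.group(0) for m in URL_PATTERN.finditer(text)]
print(urls)  # ['http://google.com', 'www.regexr.com', 'localhost:8080/path']
```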

Question 15

This is a slight improvement on / adjustment to (depending on what you need) Rajeev's answer:

([\w\-_]+(?:(?:\.|\s*\[dot\]\s*[A-Z\-_]+)+))([A-Z\-\.,@?^=%&amp;:/~\+#]*[A-Z\-\@?^=%&amp;/~\+#]){2,6}?

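See here for an example of what it does and does not match.

I removed the check for "http" etc. because I wanted to catch URLs without it. I added a little to the regex to catch some obfuscated URLs (i.e. where the user writes [dot] instead of "."). Finally, I replaced "\w" with "A-Z" and "{2,3}" to reduce false positives such as v2.0 and "moo.0dd".

Any improvements on this are welcome.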

Question 16

This is a bit too simplistic, but here is a working method:

[localhost|http|https|ftp|file]+://[\w\S(\.|:|/)]+

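I tested it in Python, and as long as the string being parsed has a space before and after the URL and none inside the URL (which I have never seen before), it should be fine.

Here is an online IDE demonstrating it.

However, here are some benefits of using it:

It recognises file: and localhost, as well as IP addresses
It will never match without them
It does not mind unusual characters such as # or - (see the URL of this post)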
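
A short Python sketch of the pattern above on a hypothetical sample string. One thing worth knowing when reading it: the leading `[...]` is a character class rather than a list of alternatives, so any run of the characters listed in it is accepted before `://`, not only the five words.

```python
import re

# Pattern from the answer above. The leading [...] is a character class, so any
# run of the characters listed in it is accepted before "://".
URL_PATTERN = re.compile(r"[localhost|http|https|ftp|file]+://[\w\S(\.|:|/)]+")

# Hypothetical sample text with whitespace around every URL.
text = "see file://tmp/a.txt and http://localhost:8000/x and www.google.com"

# No capturing groups, so findall returns the whole matches.
urls = URL_PATTERN.findall(text)
print(urls)  # ['file://tmp/a.txt', 'http://localhost:8000/x']
```

www.google.com is skipped because it has no `://`.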

Question 17

The regex provided by @JustinLevene did not have proper escape sequences on the backslashes. I have updated it to be correct and added a condition to match the FTP protocol as well. It matches all URLs with or without a protocol, and with or without "www".

Code: ^((http|ftp|https):\/\/)?([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])?

Example: https://regex101.com/r/uQ9aL4/65

Question 18

Improved

Detects URLs like these:

https://www.example.pl
http://www.example.com
www.example.pl
example.com
http://blog.example.com
http://www.example.com/product
http://www.example.com/products?id=1&page=2
http://www.example.com#up
http://255.255.255.255
255.255.255.255
http://www.site.com:8008

Regex:

/^(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)+[\w\-\._~:/?#[\]@!\$&'\(\)\*\+,;=.]+$/gm

Question 19

I wrote one myself:

let regex = /([\w+]+\:\/\/)?([\w\d-]+\.)*[\w-]+[\.\:]\w+([\/\?\=\&\#]?[\w-]+)*\/?/gm

It works on all of the following domains:

https://www.facebook.com
https://app-1.number123.com
http://facebook.com
ftp://facebook.com
http://localhost:3000
localhost:3000/
unitedkingdomurl.co.uk
this.is.a.url.com/its/still=going?wow
shop.facebook.org
app.number123.com
app1.number123.com
app-1.numbEr123.com
app.dashes-dash.com
www.facebook.com
facebook.com
fb.com/hello_123
fb.com/hel-lo
fb.com/hello/goodbye
fb.com/hello/goodbye?okay
fb.com/hello/goodbye?okay=alright
Hello www.google.com World http://yahoo.com
https://www.google.com.tr/admin/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
https://google.com.tr/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
http://google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
ftp://google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
www.google.com.tr/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
www.google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
drive.google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
https://www.example.pl
http://www.example.com
www.example.pl
example.com
http://blog.example.com
http://www.example.com/product
http://www.example.com/products?id=1&page=2
http://www.example.com#up
http://255.255.255.255
255.255.255.255

You can see how it performs here on regex101 and adjust it as needed.

Question 20

I use the logic of finding text between two dots or periods.

The regex below works fine in Python:

(?<=\.)[^}]*(?=\.)

Question 21

Matching a URL in a text should not be so complex:

(?:(?:(?:ftp|http)[s]*:\/\/|www\.)[^\.]+\.[^ \n]+)

https://regex101.com/r/wewpP1/2

Question 22

I used this:

^(https?:\\/\\/([a-zA-Z0-9]+)(\\.[a-zA-Z0-9]+)(\\.[a-zA-Z0-9\\/\\=\\-\\_\\?]+)?)$

Question 23

(?:vnc|s3|ssh|scp|sftp|ftp|http|https)\:\/\/[\w\.]+(?:\:?\d{0,5})|(?:mailto|)\:[\w\.]+\@[\w\.]+

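If you need an explanation of each part, try regexr[.]com, where you will get a good explanation of every character.

The pattern is split by "|", i.e. "OR", because not every usable URI contains "//". This is where you can build up the list of schemes or conditions you would like to match.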
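
A minimal Python sketch of how the two branches behave; the sample text is hypothetical and Python is an assumption, since the answer does not name a language. The first branch handles `scheme://` URIs and the second handles `mailto:`-style URIs that have no `//`.

```python
import re

# Pattern from the answer above: the first branch needs "scheme://", the second
# branch covers "mailto:"-style URIs that have no "//".
URI_PATTERN = re.compile(
    r"(?:vnc|s3|ssh|scp|sftp|ftp|http|https)\:\/\/[\w\.]+(?:\:?\d{0,5})"
    r"|(?:mailto|)\:[\w\.]+\@[\w\.]+"
)

# Hypothetical sample text.
text = (
    "Contact mailto:admin@example.com or see "
    "https://example.com:8080 and ssh://host.example.org"
)

# No capturing groups are used, so findall returns the full matches.
uris = URI_PATTERN.findall(text)
print(uris)
# ['mailto:admin@example.com', 'https://example.com:8080', 'ssh://host.example.org']
```

Adding another scheme is then a matter of extending the alternation in the relevant branch.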

Question 24

I used the C# Uri class, and it works well with IP addresses and localhost.

public static bool CheckURLIsValid(string url)
{
    Uri returnURL;

    // Valid only if it parses as an absolute URI with an http or https scheme
    return Uri.TryCreate(url, UriKind.Absolute, out returnURL)
        && (returnURL.Scheme == Uri.UriSchemeHttp || returnURL.Scheme == Uri.UriSchemeHttps);
}
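
For readers working in Python, a roughly analogous check can be sketched with the standard library's `urllib.parse.urlparse`. This is an assumption-laden sketch rather than a port: `urlparse` performs little validation of its own, and the helper name `is_http_url` and the whitespace tokenisation are hypothetical choices made for illustration.

```python
from urllib.parse import urlparse


def is_http_url(candidate: str) -> bool:
    """Return True if candidate parses as an absolute http(s) URL (hypothetical helper)."""
    try:
        parts = urlparse(candidate)
    except ValueError:
        return False
    # Require an http/https scheme and a non-empty host part.
    return parts.scheme in ("http", "https") and bool(parts.netloc)


text = "Hello www.google.com World http://yahoo.com http://localhost:3000"

# Split on whitespace and keep only the tokens that pass the check.
urls = [token for token in text.split() if is_http_url(token)]
print(urls)  # ['http://yahoo.com', 'http://localhost:3000']
```

As with the C# version, bare hosts such as www.google.com are rejected because they have no scheme.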

Question 25

I liked Stefan Henze's solution, but it would pick up 34.56. It is too general, and I have unparsed HTML. There are 4 anchors for a URL:

www,

http:\ (and co),

. followed by letters and then /,

or letters, . and one of these: https://ftp.isc.org/www/survey/reports/current/bynum.txt .

I used a lot of information from this thread. Thank you all.

"(((((http|ftp|https|gopher|telnet|file|localhost):\\/\\/)|(www\\.)|(xn--)){1}([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(([\\w_-]{2,200}(?:(?:\\.[\\w_-]+)*))((\\.[\\w_-]+\\/([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(\\.((org|com|net|edu|gov|mil|int|arpa|biz|info|unknown|one|ninja|network|host|coop|tech)|(jp|br|it|cn|mx|ar|nl|pl|ru|tr|tw|za|be|uk|eg|es|fi|pt|th|nz|cz|hu|gr|dk|il|sg|uy|lt|ua|ie|ir|ve|kz|ec|rs|sk|py|bg|hk|eu|ee|md|is|my|lv|gt|pk|ni|by|ae|kr|su|vn|cy|am|ke))))))(?!(((ttp|tp|ttps):\\/\\/)|(ww\\.)|(n--)))"

The above solves pretty much everything except a string like "eurls:www.google.com,facebook.com,http://test.com/", which it returns as a single string. Tbh idk why I added gopher etc. Proof R code:

library(stringr)  # provides str_extract_all

if(T){
  wierdurl<-vector()
  wierdurl[1]<-"https://JP納豆.例.jp/dir1/納豆 "
  wierdurl[2]<-"xn--jp-cd2fp15c.xn--fsq.jp "
  wierdurl[3]<-"http://52.221.161.242/2018/11/23/biofourmis-collab"
  wierdurl[4]<-"https://12000.org/ "
  wierdurl[5]<-"  https://vg-1.com/?page_id=1002 "
  wierdurl[6]<-"https://3dnews.ru/822878"
  wierdurl[7]<-"The link of this question: /programming/6038061/regular-expression-to-find-urls-within-a-string
  Also there are some urls: www.google.com, facebook.com, http://test.com/method?param=wasd
  The code below catches all urls in text and returns urls in list. "
  wierdurl[8]<-"Thelinkofthisquestion:/programming/6038061/regular-expression-to-find-urls-within-a-string
  Alsotherearesomeurls:www.google.com,facebook.com,http://test.com/method?param=wasd
  Thecodebelowcatchesallurlsintextandreturnsurlsinlist. "
  wierdurl[9]<-"Thelinkofthisquestion:/programming/6038061/regular-expression-to-find-urls-within-a-stringAlsotherearesomeurlsZwww.google.com,facebook.com,http://test.com/method?param=wasdThecodebelowcatchesallurlsintextandreturnsurlsinlist."
  wierdurl[10]<-"1facebook.com/1res"
  wierdurl[11]<-"1facebook.com/1res/wat.txt"
  wierdurl[12]<-"www.e "
  wierdurl[13]<-"is this the file.txt i need"
  wierdurl[14]<-"xn--jp-cd2fp15c.xn--fsq.jpinspiredby "
  wierdurl[15]<-"[xn--jp-cd2fp15c.xn--fsq.jp/inspiredby "
  wierdurl[16]<-"xnto--jpto-cd2fp15c.xnto--fsq.jpinspiredby "
  wierdurl[17]<-"fsety--fwdvg-gertu56.ffuoiw--ffwsx.3dinspiredby "
  wierdurl[18]<-"://3dnews.ru/822878 "
  wierdurl[19]<-" http://mywebsite.com/msn.co.uk "
  wierdurl[20]<-" 2.0http://www.abe.hip "
  wierdurl[21]<-"www.abe.hip"
  wierdurl[22]<-"hardware/software/data"
  regexstring<-vector()
  regexstring[2]<-"(http|ftp|https)://([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?"
  regexstring[3]<-"/(?:(?:https?|ftp|file):\\/\\/|www\\.|ftp\\.)(?:\\([-A-Z0-9+&@#\\/%=~_|$?!:,.]*\\)|[-A-Z0-9+&@#\\/%=~_|$?!:,.])*(?:\\([-A-Z0-9+&@#\\/%=~_|$?!:,.]*\\)|[A-Z0-9+&@#\\/%=~_|$])/igm"
  regexstring[4]<-"[a-zA-Z0-9\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]?"
  regexstring[5]<-"((http|ftp|https)\\:\\/\\/)?([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?"
  regexstring[6]<-"((http|ftp|https):\\/\\/)?([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?"
  regexstring[7]<-"(http|ftp|https)(:\\/\\/)([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?"
  regexstring[8]<-"(?:(?:https?|ftp|file):\\/\\/|www\\.|ftp\\.)(?:\\([-A-Z0-9+&@#/%=~_|$?!:,.]*\\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*(?:\\([-A-Z0-9+&@#/%=~_|$?!:,.]*\\)|[A-Z0-9+&@#/%=~_|$])"
  regexstring[10]<-"((http[s]?|ftp):\\/)?\\/?([^:\\/\\s]+)((\\/\\w+)*\\/)([\\w\\-\\.]+[^#?\\s]+)(.*)?(#[\\w\\-]+)?"
  regexstring[12]<-"http[s:/]+[[:alnum:]./]+"
  regexstring[9]<-"http[s:/]+[[:alnum:]./]+" #in DLpages 230
  regexstring[1]<-"[[:alnum:]-]+?[.][:alnum:]+?(?=[/ :])" #in link_graphs 50
  regexstring[13]<-"^(?!mailto:)(?:(?:http|https|ftp)://)(?:\\S+(?::\\S*)?@)?(?:(?:(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[0-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,})))|localhost)(?::\\d{2,5})?(?:(/|\\?|#)[^\\s]*)?$"
  regexstring[14]<-"(((((http|ftp|https):\\/\\/)|(www\\.)|(xn--)){1}([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(([\\w_-]+(?:(?:\\.[\\w_-]+)*))((\\.((org|com|net|edu|gov|mil|int)|(([:alpha:]{2})(?=[, ]))))|([\\/]([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?))))(?!(((ttp|tp|ttps):\\/\\/)|(ww\\.)|(n--)))"
  regexstring[15]<-"(((((http|ftp|https|gopher|telnet|file|localhost):\\/\\/)|(www\\.)|(xn--)){1}([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(([\\w_-]{2,200}(?:(?:\\.[\\w_-]+)*))((\\.[\\w_-]+\\/([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(\\.((org|com|net|edu|gov|mil|int|arpa|biz|info|unknown|one|ninja|network|host|coop|tech)|(jp|br|it|cn|mx|ar|nl|pl|ru|tr|tw|za|be|uk|eg|es|fi|pt|th|nz|cz|hu|gr|dk|il|sg|uy|lt|ua|ie|ir|ve|kz|ec|rs|sk|py|bg|hk|eu|ee|md|is|my|lv|gt|pk|ni|by|ae|kr|su|vn|cy|am|ke))))))(?!(((ttp|tp|ttps):\\/\\/)|(ww\\.)|(n--)))"
    }

for(i in wierdurl){#c(7,22)
  for(c in regexstring[c(15)]) {
    print(paste(i,which(regexstring==c)))
    print(str_extract_all(i,c))
  }
}

Question 26

This is the best one.

NSString *urlRegex = @"(http|ftp|https|www|gopher|telnet|file)(://|.)([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?";

Question 27

This is the simplest one. It works fine for me.

%(http|ftp|https|www)(://|\.)[A-Za-z0-9-_\.]*(\.)[a-z]*%

Question 28

This is simple.

Use this pattern: \b((ftp|https?)://)?([\w-\.]+\.(com|net|org|gov|mil|int|edu|info|me)|(\d+\.\d+\.\d+\.\d+))(:\d+)?(\/[\w-\/]*(\?\w*(=\w+)*[&\w-=]*)*(#[\w-]+)*)?

It matches any link that contains:

Allowed protocols: http, https and ftp

Allowed domains: *.com, *.net, *.org, *.gov, *.mil, *.int, *.edu, *.info and *.me, or an IP

Allowed ports: true

Allowed parameters: true

Allowed hashes: true