LeetCode: Web Crawler

Posted on August 5, 2019 (updated July 26, 2020) by braindenny

Similar Problems:

- CheatSheet: Leetcode For Code Interview
- CheatSheet: Common Code Problems & Follow-ups

Tag: #bfs, #dfs, #graph

Given a url startUrl, implement a web crawler to crawl all links that are under the same hostname as startUrl. Your crawler should:

- Start from the page: startUrl
- Call HtmlParser.getUrls(url) to get all urls from the webpage of a given url.
- Not crawl the same link twice.
- Only crawl the links that are under the same hostname as startUrl.

For example, the hostname of http://example.org/foo/bar is example.org. For simplicity's sake, you may assume all urls use the http protocol without any port specified.

Example 1:

Input:
urls = [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.google.com",
  "http://news.yahoo.com/us"
]
edges = [[2,0],[2,1],[3,2],[3,1],[0,4]]
startUrl = "http://news.yahoo.com/news/topics/"

Output:
[
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.yahoo.com/us"
]

Explanation: Edges show how these urls connect with each other.
1. startUrl http://news.yahoo.com/news/topics/ (index 2) links to:
   -> http://news.yahoo.com/news (index 1)
   -> http://news.yahoo.com (index 0)
2. http://news.yahoo.com (index 0) links to:
   -> http://news.yahoo.com/us (index 4)

Example 2:

Input:
urls = [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.google.com"
]
edges = [[0,2],[2,1],[3,2],[3,1],[3,0]]
startUrl = "http://news.google.com"

Output: ["http://news.google.com"]

Explanation: The startUrl links to all of the other pages, but none of them share its hostname.

Constraints:

- 1 <= urls.length <= 1000
- 1 <= urls[i].length <= 300
- Hostname labels may contain only the ASCII letters 'a' through 'z' (in a case-insensitive manner), the digits '0' through '9', and the hyphen-minus character ('-').
- The hostname may not start or end with the hyphen-minus character ('-'). See: https://en.wikipedia.org/wiki/Hostname#Restrictions_on_valid_hostnames
- You may assume there are no duplicates in the url library.

Github: code.dennyzhang.com
Credits To: leetcode.com

Leave me comments if you have better ways to solve it.

Solution: BFS

```python
## https://code.dennyzhang.com/web-crawler
## Basic Ideas: BFS
##
## For the return, no sorting would be required
##
## Complexity: Time O(n), Space O(n)

import collections
from typing import List

# """
# This is HtmlParser's API interface.
# You should not implement it, or speculate about its implementation
# """
# class HtmlParser(object):
#     def getUrls(self, url):
#         """
#         :type url: str
#         :rtype List[str]
#         """

class Solution:
    def getDomainName(self, url: str) -> str:
        # The hostname runs from just after "http://" to the next '/',
        # or to the end of the string if there is no path.
        start = len("http://")
        end = start + 1
        while end < len(url):
            if url[end] == '/':
                break
            end += 1
        return url[start:end]

    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        domain = self.getDomainName(startUrl)
        # BFS: visit each same-hostname url exactly once
        visited = {startUrl: True}
        queue = collections.deque([startUrl])
        res = [startUrl]
        while queue:
            url = queue.popleft()
            # Enqueue unvisited neighbours that stay on the same hostname
            for url2 in htmlParser.getUrls(url):
                if url2 not in visited and self.getDomainName(url2) == domain:
                    visited[url2] = True
                    queue.append(url2)
                    res.append(url2)
        return res
```
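To sanity-check the crawler locally, the hidden HtmlParser can be faked from the urls/edges arrays of Example 1. The FakeHtmlParser class below is a hypothetical test helper of my own, not LeetCode's actual API, and it assumes the Solution class defined above is in scope:

```python
import collections
from typing import List

class FakeHtmlParser:
    """Hypothetical local stand-in for LeetCode's hidden HtmlParser."""

    def __init__(self, urls: List[str], edges: List[List[int]]):
        # Build an adjacency map: url -> list of urls it links to
        self.graph = collections.defaultdict(list)
        for src, dst in edges:
            self.graph[urls[src]].append(urls[dst])

    def getUrls(self, url: str) -> List[str]:
        return self.graph[url]

urls = [
    "http://news.yahoo.com",
    "http://news.yahoo.com/news",
    "http://news.yahoo.com/news/topics/",
    "http://news.google.com",
    "http://news.yahoo.com/us",
]
edges = [[2, 0], [2, 1], [3, 2], [3, 1], [0, 4]]
parser = FakeHtmlParser(urls, edges)
print(Solution().crawl("http://news.yahoo.com/news/topics/", parser))
# Prints the four news.yahoo.com urls; the order may differ from the
# expected output above, since the answer is accepted in any order.
```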
Solution: DFS

```python
## https://code.dennyzhang.com/web-crawler
## Basic Ideas: DFS
##
## For the return, no sorting would be required
##
## Complexity: Time O(n), Space O(n)

from typing import List

# """
# This is HtmlParser's API interface.
# You should not implement it, or speculate about its implementation
# """
# class HtmlParser(object):
#     def getUrls(self, url):
#         """
#         :type url: str
#         :rtype List[str]
#         """

class Solution:
    def getDomainName(self, url: str) -> str:
        # The hostname runs from just after "http://" to the next '/',
        # or to the end of the string if there is no path.
        start = len("http://")
        end = start + 1
        while end < len(url):
            if url[end] == '/':
                break
            end += 1
        return url[start:end]

    def dfs(self, startUrl, domain, visited, htmlParser, res):
        # Stop if already visited or not under the target hostname
        if startUrl in visited or self.getDomainName(startUrl) != domain:
            return
        res.append(startUrl)
        visited[startUrl] = True
        for url in htmlParser.getUrls(startUrl):
            self.dfs(url, domain, visited, htmlParser, res)

    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        domain = self.getDomainName(startUrl)
        res: List[str] = []
        visited = {}
        self.dfs(startUrl, domain, visited, htmlParser, res)
        return res
```
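As an aside, the hand-rolled getDomainName scan could be swapped for the standard library. Below is a minimal sketch using urllib.parse.urlparse, which is safe under the problem's guarantee of http-only urls with no port; note that .hostname also lowercases the host, which is consistent with the problem's case-insensitive hostname labels:

```python
from urllib.parse import urlparse

def get_hostname(url: str) -> str:
    # urlparse understands full urls, so no manual scanning is needed;
    # .hostname drops any port and lowercases the host.
    return urlparse(url).hostname

assert get_hostname("http://news.yahoo.com/news/topics/") == "news.yahoo.com"
```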