LeetCode: Web Crawler

Posted on August 5, 2019 (updated July 26, 2020) by braindenny

Web Crawler



Similar Problems:

  • CheatSheet: Leetcode For Code Interview
  • CheatSheet: Common Code Problems & Follow-ups
  • Tag: #bfs, #dfs, #graph

Given a url startUrl, implement a web crawler to crawl all links that are under the same hostname as startUrl.

Your crawler should:

  • Start from the page: startUrl
  • Call HtmlParser.getUrls(url) to get all urls from the webpage of a given url.
  • Do not crawl the same link twice.
  • Only crawl the links that are under the same hostname as startUrl.

[Figure: an example url broken into its components, with hostname example.org]

As shown in the example url above, the hostname is example.org. For simplicity's sake, you may assume all urls use the http protocol without any port specified.
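
Since every url starts with http:// and carries no port, the hostname can be pulled out by splitting on '/': it is the third piece. A minimal illustration (the hostname helper below is my own, not part of the problem's API):

def hostname(url: str) -> str:
    # "http://example.org/path" splits into ["http:", "", "example.org", "path"]
    return url.split('/')[2]

assert hostname("http://news.yahoo.com/news/topics/") == "news.yahoo.com"
assert hostname("http://example.org") == "example.org"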

Example 1:

Input:
urls = [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.google.com",
  "http://news.yahoo.com/us"
]
edges = [[2,0],[2,1],[3,2],[3,1],[0,4]]
startUrl = "http://news.yahoo.com/news/topics/"
Output: [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.yahoo.com/us"
]
Explanation:
Edges show how these urls connect with each other.
1. startUrl http://news.yahoo.com/news/topics/ (index 2) links to:
     -> http://news.yahoo.com/news (index 1)
     -> http://news.yahoo.com      (index 0). 
2. http://news.yahoo.com (index 0) links to:
     -> http://news.yahoo.com/us   (index 4). 

Example 2:

Input: 
urls = [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.google.com"
]
edges = [[0,2],[2,1],[3,2],[3,1],[3,0]]
startUrl = "http://news.google.com"
Output: ["http://news.google.com"]
Explanation: The startUrl links to all other pages that do not share the same hostname.

Constraints:

  • 1 <= urls.length <= 1000
  • 1 <= urls[i].length <= 300
  • Hostname labels may contain only the ASCII letters 'a' through 'z' (in a case-insensitive manner), the digits '0' through '9' and the hyphen-minus character ('-').
  • The hostname may not start or end with the hyphen-minus character ('-').
  • See: https://en.wikipedia.org/wiki/Hostname#Restrictions_on_valid_hostnames
  • You may assume there are no duplicates in the url library.

Github: code.dennyzhang.com

Credits To: leetcode.com

Leave me comments if you have better ways to solve it.


  • Solution: bfs
## https://code.dennyzhang.com/web-crawler
## Basic Ideas: BFS
##
## No sorting is required for the returned list
##
## Complexity: Time O(n), Space O(n)
# """
# This is HtmlParser's API interface.
# You should not implement it, or speculate about its implementation
# """
#class HtmlParser(object):
#    def getUrls(self, url):
#        """
#        :type url: str
#        :rtype List[str]
#        """

import collections
from typing import List

class Solution:
    def getDomainName(self, url):
        # Hostname is everything between "http://" and the next '/'
        start = len("http://")
        end = start+1
        while end < len(url):
            if url[end] == '/':
                break
            end += 1
        return url[start:end]

    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        domain = self.getDomainName(startUrl)
        # BFS
        visited = {startUrl:True}
        queue = collections.deque()
        queue.append(startUrl)
        res = [startUrl]
        while queue:
            url = queue.popleft()
            # enqueue unvisited links under the same hostname
            for url2 in htmlParser.getUrls(url):
                if url2 not in visited and self.getDomainName(url2) == domain:
                    visited[url2] = True
                    queue.append(url2)
                    res.append(url2)
        return res
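
To sanity-check the BFS locally, HtmlParser can be stubbed out with the urls and edges arrays from Example 1, where an edge [a, b] means page a links to page b. A minimal mock, assuming the class and variable names below (they are mine, not LeetCode's):

import collections
from typing import List

class MockHtmlParser:
    # Hypothetical stand-in for the judge's HtmlParser, built from urls/edges
    def __init__(self, urls: List[str], edges: List[List[int]]):
        self.urls = urls
        self.index = {url: i for i, url in enumerate(urls)}
        self.links = collections.defaultdict(list)
        for src, dst in edges:
            self.links[src].append(dst)

    def getUrls(self, url: str) -> List[str]:
        return [self.urls[dst] for dst in self.links[self.index[url]]]

urls = ["http://news.yahoo.com", "http://news.yahoo.com/news",
        "http://news.yahoo.com/news/topics/", "http://news.google.com",
        "http://news.yahoo.com/us"]
edges = [[2, 0], [2, 1], [3, 2], [3, 1], [0, 4]]
parser = MockHtmlParser(urls, edges)
print(Solution().crawl("http://news.yahoo.com/news/topics/", parser))
# Prints the four news.yahoo.com urls; any order is accepted by the judge.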

  • Solution: dfs
## https://code.dennyzhang.com/web-crawler
## Basic Ideas: DFS
##
## No sorting is required for the returned list
##
## Complexity: Time O(n), Space O(n)
# """
# This is HtmlParser's API interface.
# You should not implement it, or speculate about its implementation
# """
#class HtmlParser(object):
#    def getUrls(self, url):
#        """
#        :type url: str
#        :rtype List[str]
#        """

from typing import List

class Solution:
    def getDomainName(self, url):
        # Hostname is everything between "http://" and the next '/'
        start = len("http://")
        end = start+1
        while end < len(url):
            if url[end] == '/':
                break
            end += 1
        return url[start:end]

    def dfs(self, startUrl, domain, visited, htmlParser, res):
        # skip urls already visited or outside the target hostname
        if startUrl in visited or self.getDomainName(startUrl) != domain:
            return
        res.append(startUrl)
        visited[startUrl] = True
        for url in htmlParser.getUrls(startUrl):
            self.dfs(url, domain, visited, htmlParser, res)

    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        domain = self.getDomainName(startUrl)
        res = []
        visited = {}
        self.dfs(startUrl, domain, visited, htmlParser, res)
        return res
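
One caveat with the recursive version: urls.length can be up to 1000, so a pathological chain of pages can brush against Python's default recursion limit (1000 frames). An explicit stack sidesteps that; a minimal iterative sketch of the same DFS (the class name is hypothetical, the crawl signature matches the solutions above):

from typing import List

class SolutionIterativeDFS:
    # Hypothetical stack-based variant of the recursive DFS above
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        domain = startUrl.split('/')[2]  # hostname, given the http:// guarantee
        visited = {startUrl}
        stack = [startUrl]
        res = []
        while stack:
            url = stack.pop()
            res.append(url)
            for nxt in htmlParser.getUrls(url):
                if nxt not in visited and nxt.split('/')[2] == domain:
                    visited.add(nxt)
                    stack.append(nxt)
        return res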