Remove Headers, Footers, External Links and their related data #508
-
Hi, thanks for this great work. I have been playing around with this to crawl webpages and get content in markdown format, which can then be provided to LLMs for grounding. But when I use it to fetch news articles, I get lots of unnecessary data: headers, footers, nav options, and links to other articles. More than 50% of the output is not the actual content. Yes, I could use LLMExtraction, but that would increase the bill tremendously, since each input is 10-20 web articles and each article runs around 5000-7000 tokens. I saw there is an option to provide elements, but I want the crawler to be generic so it works on any website, so I don't have any fixed elements that I know will reliably remove the header, nav, and footer. Is there any way the Playwright extractor can get just the actual content of the page? I understand this may not be perfect, but I want to reduce the extra content as much as possible while staying generic across all websites.

P.S.: I would love to contribute to this project. I am not very experienced on the JS/TS side, but I am pretty confident on the Python side of things. Is there a Discord/Slack/Telegram group I can join to discuss how to contribute?
Replies: 19 comments
-
I have the same need too.

```python
def clean_content(content):
    # Find the start of the content (first # title)
    start_index = content.find('#')
    if start_index == -1:
        return ""  # No title found
    # Find the end of the content (next ## title)
    end_index = content.find('##', start_index + 1)
    if end_index == -1:
        # If no ## title found, return until the end
        return content[start_index:]
    else:
        return content[start_index:end_index].strip()

article['markdown_content'] = asyncio.run(crawl_url(article['source_url']))
article['markdown_content'] = clean_content(article['markdown_content'])
```
-
I too started with this, but the problem is that often the title is not h2 (`##`); sometimes it is h1 (`#`) or h3 (`###`), etc. Also, the ads, signup banners, and other links may carry `##` headings of their own, causing all that other content to be read too. I think it's difficult to deterministically get the content from generic pages without LLMs.
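To make that failure mode concrete, here is a small sketch (with a made-up markdown snippet) showing how the `#`-based slice latches onto a signup banner's heading instead of the article title:

```python
def clean_content(content):
    # Naive approach: slice from the first "#" heading to the next "##" heading
    start = content.find('#')
    if start == -1:
        return ""
    end = content.find('##', start + 1)
    return content[start:] if end == -1 else content[start:end].strip()

page = (
    "## Sign up for our newsletter\n"   # banner heading appears before the title
    "# The Actual Article Title\n"
    "Real article text here.\n"
    "## Related Articles\n"
    "* Another story\n"
)

# The slice starts at the banner heading, not at the article title,
# so boilerplate leaks into the "cleaned" output.
print(clean_content(page))
```

Because `find('#')` matches the first `#` character anywhere, any `##` banner above the real title wins, which is exactly the problem described above.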
-
Agreed.
-
@syed-al Thanks for using Crawl4ai! Would you please share the URL you're trying to get the best markdown out of? There are a bunch of things I can share to help you achieve that; producing as much AI-friendly data as possible without using an LLM has been one of our goals. Give me the link and I'll give it a try.

I'd also be more than happy to invite you to become one of our collaborators. Would you kindly share your email address? I can then send you an invitation to our Discord channel. Perhaps this could be one of the areas where you can help us and we can work together.
-
I also need to remove unnecessary content. You should add parameters such as HTML selectors for excluding specific HTML tags like `header`, `footer`, and similar.
-
Clean the HTML first and then convert the HTML to markdown. |
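The "clean first, convert after" idea can be sketched with nothing but the standard library (an illustration of the approach, not Crawl4ai's implementation): drop `header`/`footer`/`nav` subtrees, then convert whatever text survives.

```python
from html.parser import HTMLParser

# Tags whose whole subtree is treated as boilerplate
# (an illustrative list; tune it for the sites you crawl)
SKIP_TAGS = {"header", "footer", "nav", "aside", "form", "script", "style"}

class BoilerplateStripper(HTMLParser):
    """Collect text content while skipping anything inside boilerplate tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # >0 while we are inside a boilerplate subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def strip_boilerplate(html: str) -> str:
    parser = BoilerplateStripper()
    parser.feed(html)
    return "\n".join(parser.chunks)

html = """
<html><body>
  <header><a href="/">Home</a> | <a href="/news">News</a></header>
  <article><h1>Title</h1><p>The main story text.</p></article>
  <footer>(c) Example News</footer>
</body></html>
"""
print(strip_boilerplate(html))
```

In practice you would feed the cleaned HTML (rather than the raw text) into a markdown converter such as Html2Text, so headings and links in the article body are preserved.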
-
@syed-al @mentaLwz @QuangTQV There are multiple flags and parameters you can use to control the level of data cleaning you need. Please share your URL with me and I'll give you an example. In the meantime, here is one example that I think will answer your question (this will be available in 0.3.72):

```python
async def main():
    async with AsyncWebCrawler(headless=True, sleep_on_close=True) as crawler:
        url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
        result = await crawler.arun(
            url=url,
            # bypass_cache=True,
            word_count_threshold=10,
            excluded_tags=['form'],           # Optional - default is None; adds more control over content extraction for markdown
            exclude_external_links=False,     # Default is True
            exclude_social_media_links=True,  # Default is True
            exclude_external_images=True,     # Default is False
            # social_media_domains=["facebook.com", "twitter.com", "instagram.com", ...]  # You can add more domains; default supported domains are in config.py
            html2text={
                "escape_dot": False,
                # Add more options here
            },
        )
        # Save markdown to file
        with open(os.path.join(__data__, "mexico_places.md"), "w") as f:
            f.write(result.markdown)
        print("Done")
```

Here you can see some flags and parameters that allow you to focus on the content relevant to what you're searching for. This is a website about places to visit in central Mexico.

Another thing you can see here is the concept of the word count threshold. By setting this to a number like 10, you are excluding any HTML blocks that contain text with fewer than 10 words. This is a very useful way of removing unnecessary text, and you can tune it depending on your needs.

Finally, our markdown generator uses Html2Text, and you can override some of its standard parameters to use it in an advanced way.

@QuangTQV I already shared the same answer on your other issue; perhaps I'll close that issue and we can continue here. Thanks!
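The word-count-threshold idea is easy to illustrate in isolation. This is a toy sketch of the concept, not Crawl4ai's actual implementation:

```python
def apply_word_count_threshold(blocks, threshold=10):
    """Keep only text blocks with at least `threshold` words; short
    nav labels, button captions, and cookie notices fall away."""
    return [b for b in blocks if len(b.split()) >= threshold]

blocks = [
    "Home",
    "Subscribe now",
    "Central Mexico is full of colonial towns, volcanic peaks and food "
    "markets that reward slow travel far more than a rushed itinerary.",
]

# Only the long body-text block survives the filter
print(apply_word_count_threshold(blocks, threshold=10))
```

The trade-off: a high threshold also drops legitimate short content such as one-line captions, so tune it per use case.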
-
Here is the sample URL to get the content: It is a news article with a lot of links, extra information, other articles' details, etc., which I want to ignore so I only get the main content of the article. The main article is around 30 lines, but the entire markdown I get back is more than 600-700 lines.

```python
# Create an instance of AsyncWebCrawler
async with AsyncWebCrawler(verbose=False) as crawler:
    # Run the crawler on a URL
    results = await crawler.arun_many(
        urls=urls,
    )
```

For now I am using the basic usage without any extra flags; I will try the flags you mentioned.

Also, my email for the Discord invitation: abdksyed@gmail.com
-
Could you provide an example with this URL? https://tiki.vn/search?q=gi%C3%A0y%20adidas
-
@QuangTQV For pages with repetitive patterns, I suggest using the JSON CSS extraction. Let's look at the following code, which crawls one page and turns it into a list of JSON objects. Your case does not require the extraction of the markdown.

```python
async def main():
    print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
    schema = {
        "name": "Tiki Shoes",
        "baseSelector": ".CatalogProducts__Wrapper-sc-1r8ct7c-0 > div",
        "fields": [
            {
                "name": "image",
                "selector": "picture > img",
                "type": "attribute",
                "attribute": "src",
            },
            {
                "name": "price",
                "selector": ".price-discount__price",
                "type": "text",
            },
            {
                "name": "brand",
                "selector": ".above-product-name-info",
                "type": "text",
            },
            {
                "name": "description",
                "selector": "h3",
                "type": "text",
            },
        ],
    }
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
        result = await crawler.arun(
            url="https://tiki.vn/search?q=gi%C3%A0y%20adidas",
            extraction_strategy=extraction_strategy,
            bypass_cache=True,
            # delay_before_return_html=1,
            wait_for="css:.CatalogProducts__Wrapper-sc-1r8ct7c-0 picture > img",  # Important: this makes sure the dynamic data is loaded and available
            magic=True,
        )
        items = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(items)} items")
        print(json.dumps(items[0], indent=2))
        with open(os.path.join(__data__, "tiki_shoes.json"), "w") as f:
            f.write(result.extracted_content)
```

Output:

```json
[{
    "price": "971.000\u20ab",
    "brand": "BITI'S",
    "description": "Gi\u00e0y Th\u1ec3 Thao Nam - N\u1eef Biti's Hunter X - 2K22 - Midnight III DSUH00502DEN (\u0110en)"
}, ...]
```
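Once the structured JSON is extracted, downstream cleanup is plain Python. As an example, here is a small sketch (the `parse_vnd` helper is hypothetical, assuming the `971.000\u20ab` price format shown in the output above) that normalizes the price strings to integers:

```python
def parse_vnd(price: str) -> int:
    """Convert a Vietnamese price string like '971.000\u20ab' (dot as the
    thousands separator, dong sign as suffix) to an integer number of dong."""
    digits = "".join(ch for ch in price if ch.isdigit())
    return int(digits) if digits else 0

items = [
    {
        "price": "971.000\u20ab",
        "brand": "BITI'S",
        "description": "Gi\u00e0y Th\u1ec3 Thao Nam - N\u1eef Biti's Hunter X",
    }
]
for item in items:
    item["price_vnd"] = parse_vnd(item["price"])

print(items[0]["price_vnd"])  # 971000
```

With numeric prices, the extracted list is ready for filtering, sorting, or loading into a database.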
-
I want to extract from all URLs; this is just an example URL. How can I remove the redundant data?
-
@syed-al I just added a new heuristic function capable of producing a much better markdown, which I call "fit markdown." It is used as follows. The output is much cleaner and I really like it; it's good for pages like the example you shared, and it will soon be released in the new version.

```python
async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            word_count_threshold=10,
        )
        # Save markdown to file
        with open(os.path.join(__data__, "mexico_places.md"), "w") as f:
            f.write(result.fit_markdown)
        print("Done")
```

Everything remains the same; you simply pick up the `fit_markdown` property from the result instead of `markdown`.
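Crawl4ai's actual fit-markdown heuristic isn't shown in this thread, but the general idea of scoring blocks by content density can be sketched like this (a toy scoring function for illustration, not the library's code):

```python
def fit_score(block: str) -> float:
    """Toy content-density score: long blocks with few links look like
    body text; blocks that are mostly links look like navigation."""
    words = block.split()
    if not words:
        return 0.0
    # Rough proxy for "link-ness": markdown links and bare URLs
    link_ratio = sum(w.startswith("http") or w.startswith("[") for w in words) / len(words)
    return len(words) * (1.0 - link_ratio)

blocks = [
    "[Home](/) [News](/news) [Subscribe](/signup)",
    "The quiet highland towns of central Mexico reward travelers who "
    "linger: markets open at dawn, and the best meals are found on foot.",
]

# The nav block scores 0; the prose block wins
best = max(blocks, key=fit_score)
print(best[:30])
```

A production heuristic would combine more signals (tag depth, punctuation density, text-to-markup ratio), but the ranking principle is the same.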
-
@unclecode Can you help me?
-
Hi @unclecode, when will the new version be released on pip? Also, still waiting for the Discord invite: abdksyed@gmail.com
-
@syed-al Tomorrow, Friday. Sorry for the delay; I will send the invitation before the weekend, and you are most welcome.
-
@QuangTQV No worries. Let me first understand what exactly you're looking for. You said you want to do this with all URLs. What do you mean by "all URLs"? When you're using the
-
I want to create a chatbot for any website, so I need to crawl the website's content and then use RAG. What I need is to save costs, and the more junk content I can eliminate while crawling, the better. And of course, each website has a different layout, so I can't use a fixed regex. |
-
@syed-al Hi, sorry for the delay; the last two weeks have been very hectic, but all is good now. I just sent the invitation link to you. Looking forward to seeing you on the other side, to work on this smart FIT markdown ;)
-
I had the same problem: news articles coming back with 600-700 lines when the actual content is 30 lines, with headers, footers, related articles, and ad slots all mixed in. I ended up building a dedicated content extraction tool that focuses on identifying and keeping only the main content. On news sites specifically, it strips the output down to just the article text plus headings. It is open source (Go, Apache 2.0): https://github.com/Easonliuliang/purify

Might save you the step of using an LLM to clean up the extracted data.