Remove Headers, Footers, External Links and their related data #508
-
Hi, thanks for this great work. I have been playing around with this to crawl webpages and get content in markdown format, which can then be provided to LLMs for grounding. But when I use it to fetch news articles, I get lots of unnecessary data: headers, footers, nav options, and links to other articles. More than 50% of the output is not the actual content. Yes, I could use LLMExtraction, but that would increase the bill tremendously, since each input is 10-20 web articles and each article runs around 5000-7000 tokens. I saw there is an option to provide elements, but I want the crawler to be generic so it works on any website, so I don't have any fixed elements that I know will reliably remove the header, nav, and footer. Is there any way the Playwright extractor can get just the actual content of the page? I understand this may not be perfect, but I want to reduce the extra content as much as possible while staying generic across all websites.

P.S.: I would love to contribute to this project. I am not very experienced on the JS/TS side, but I am pretty confident on the Python side of things. Is there a Discord/Slack/Telegram group I can join to discuss how to contribute?
Replies: 19 comments
-
I have the same need too.

```python
def clean_content(content):
    # Find the start of the content (first # title)
    start_index = content.find('#')
    if start_index == -1:
        return ""  # No title found
    # Find the end of the content (next ## title)
    end_index = content.find('##', start_index + 1)
    if end_index == -1:
        # If no ## title found, return until the end
        return content[start_index:]
    else:
        return content[start_index:end_index].strip()

article['markdown_content'] = asyncio.run(crawl_url(article['source_url']))
article['markdown_content'] = clean_content(article['markdown_content'])
```
-
I too started with this, but the problem is that often the title is not h2 (`##`); sometimes it is h1 (`#`) or h3 (`###`), etc. Also, the ads, signup banners, and other links may carry `##` headings of their own, causing all that other content to be read too. I think it's difficult to deterministically get the content from generic pages without LLMs.
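To make that failure mode concrete, here is a small sketch (with a made-up markdown snippet) showing how the `#`-based slice latches onto a signup banner's heading instead of the article title:

```python
def clean_content(content):
    # Naive approach: slice from the first "#" heading to the next "##" heading
    start = content.find('#')
    if start == -1:
        return ""
    end = content.find('##', start + 1)
    return content[start:] if end == -1 else content[start:end].strip()

page = (
    "## Sign up for our newsletter\n"   # banner heading appears before the title
    "# The Actual Article Title\n"
    "Real article text here.\n"
    "## Related Articles\n"
    "* Another story\n"
)

# The slice starts at the banner heading, not at the article title,
# so boilerplate leaks into the "cleaned" output.
print(clean_content(page))
```

Because `find('#')` matches the first `#` character anywhere, any `##` banner above the real title wins, which is exactly the problem described above.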
-
Agreed.
-
@syed-al Thanks for using Crawl4ai! Would you please share the URL you're trying to get the best markdown out of? There are a bunch of things I can share to help you achieve that; producing as much AI-friendly data as possible without using an LLM has been one of our goals. Give me the link and I'll give it a try.

I'd also be more than happy to invite you to become one of our collaborators. Would you kindly share your email address? I can then send you an invitation to our Discord channel. Perhaps this could be one of the areas where you can help us and we can work together.
-
I also need to remove unnecessary content. You should add parameters such as HTML selectors for excluding specific HTML tags like `header`, `footer`, and similar.
-
Clean the HTML first and then convert the HTML to markdown. |
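The "clean first, convert after" idea can be sketched with nothing but the standard library (an illustration of the approach, not Crawl4ai's implementation): drop `header`/`footer`/`nav` subtrees, then convert whatever text survives.

```python
from html.parser import HTMLParser

# Tags whose whole subtree is treated as boilerplate
# (an illustrative list; tune it for the sites you crawl)
SKIP_TAGS = {"header", "footer", "nav", "aside", "form", "script", "style"}

class BoilerplateStripper(HTMLParser):
    """Collect text content while skipping anything inside boilerplate tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # >0 while we are inside a boilerplate subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def strip_boilerplate(html: str) -> str:
    parser = BoilerplateStripper()
    parser.feed(html)
    return "\n".join(parser.chunks)

html = """
<html><body>
  <header><a href="/">Home</a> | <a href="/news">News</a></header>
  <article><h1>Title</h1><p>The main story text.</p></article>
  <footer>(c) Example News</footer>
</body></html>
"""
print(strip_boilerplate(html))
```

In practice you would feed the cleaned HTML (rather than the raw text) into a markdown converter such as Html2Text, so headings and links in the article body are preserved.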
-
@syed-al @mentaLwz @QuangTQV There are multiple flags and parameters you can use to control the level of data cleaning you need. Please share your URL with me and I'll give you an example. In the meantime, here is one example that I think will answer your question (this will be available in 0.3.72):

```python
async def main():
    async with AsyncWebCrawler(headless=True, sleep_on_close=True) as crawler:
        url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
        result = await crawler.arun(
            url=url,
            # bypass_cache=True,
            word_count_threshold=10,
            excluded_tags=['form'],           # Optional - default is None; adds more control over content extraction for markdown
            exclude_external_links=False,     # Default is True
            exclude_social_media_links=True,  # Default is True
            exclude_external_images=True,     # Default is False
            # social_media_domains=["facebook.com", "twitter.com", "instagram.com", ...]  # You can add more domains; default supported domains are in config.py
            html2text={
                "escape_dot": False,
                # Add more options here
            },
        )
        # Save markdown to file
        with open(os.path.join(__data__, "mexico_places.md"), "w") as f:
            f.write(result.markdown)
        print("Done")
```

Here you can see some flags and parameters that allow you to focus on the content relevant to what you're searching for. This is a website about places to visit in central Mexico.

Another thing you can see here is the concept of the word count threshold. By setting this to a number like 10, you are excluding any HTML blocks that contain text with fewer than 10 words. This is a very useful way of removing unnecessary text, and you can tune it depending on your needs.

Finally, our markdown generator uses Html2Text, and you can override some of its standard parameters to use it in an advanced way.

@QuangTQV I already shared the same answer on your other issue; perhaps I'll close that issue and we can continue here. Thanks!
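The word-count-threshold idea is easy to illustrate in isolation. This is a toy sketch of the concept, not Crawl4ai's actual implementation:

```python
def apply_word_count_threshold(blocks, threshold=10):
    """Keep only text blocks with at least `threshold` words; short
    nav labels, button captions, and cookie notices fall away."""
    return [b for b in blocks if len(b.split()) >= threshold]

blocks = [
    "Home",
    "Subscribe now",
    "Central Mexico is full of colonial towns, volcanic peaks and food "
    "markets that reward slow travel far more than a rushed itinerary.",
]

# Only the long body-text block survives the filter
print(apply_word_count_threshold(blocks, threshold=10))
```

The trade-off: a high threshold also drops legitimate short content such as one-line captions, so tune it per use case.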
-
Here is the sample URL to get the content: It is a news article with a lot of links, extra information, other articles' details, etc., which I want to ignore so I only get the main content of the article. The main article is around 30 lines, but the entire markdown I get back is more than 600-700 lines.

```python
# Create an instance of AsyncWebCrawler
async with AsyncWebCrawler(verbose=False) as crawler:
    # Run the crawler on a URL
    results = await crawler.arun_many(
        urls=urls,
    )
```

For now I am using the basic usage without any extra flags; I will try the flags you mentioned.

Also, my email for the Discord invitation: abdksyed@gmail.com
-
Could you provide an example with this URL? https://tiki.vn/search?q=gi%C3%A0y%20adidas
-
@QuangTQV For pages with repetitive patterns, I suggest using the JSON CSS extraction. Let's look at the following code, which crawls one page and turns it into a list of JSON objects. Your case does not require the extraction of the markdown.

```python
async def main():
    print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
    schema = {
        "name": "Tiki Shoes",
        "baseSelector": ".CatalogProducts__Wrapper-sc-1r8ct7c-0 > div",
        "fields": [
            {
                "name": "image",
                "selector": "picture > img",
                "type": "attribute",
                "attribute": "src",
            },
            {
                "name": "price",
                "selector": ".price-discount__price",
                "type": "text",
            },
            {
                "name": "brand",
                "selector": ".above-product-name-info",
                "type": "text",
            },
            {
                "name": "description",
                "selector": "h3",
                "type": "text",
            },
        ],
    }
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
        result = await crawler.arun(
            url="https://tiki.vn/search?q=gi%C3%A0y%20adidas",
            extraction_strategy=extraction_strategy,
            bypass_cache=True,
            # delay_before_return_html=1,
            wait_for="css:.CatalogProducts__Wrapper-sc-1r8ct7c-0 picture > img",  # Important: this makes sure the dynamic data is loaded and available
            magic=True,
        )
        items = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(items)} items")
        print(json.dumps(items[0], indent=2))
        with open(os.path.join(__data__, "tiki_shoes.json"), "w") as f:
            f.write(result.extracted_content)
```

Output:

```json
[{
    "price": "971.000\u20ab",
    "brand": "BITI'S",
    "description": "Gi\u00e0y Th\u1ec3 Thao Nam - N\u1eef Biti's Hunter X - 2K22 - Midnight III DSUH00502DEN (\u0110en)"
}, ...]
```
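Once the structured JSON is extracted, downstream cleanup is plain Python. As an example, here is a small sketch (the `parse_vnd` helper is hypothetical, assuming the `971.000\u20ab` price format shown in the output above) that normalizes the price strings to integers:

```python
def parse_vnd(price: str) -> int:
    """Convert a Vietnamese price string like '971.000\u20ab' (dot as the
    thousands separator, dong sign as suffix) to an integer number of dong."""
    digits = "".join(ch for ch in price if ch.isdigit())
    return int(digits) if digits else 0

items = [
    {
        "price": "971.000\u20ab",
        "brand": "BITI'S",
        "description": "Gi\u00e0y Th\u1ec3 Thao Nam - N\u1eef Biti's Hunter X",
    }
]
for item in items:
    item["price_vnd"] = parse_vnd(item["price"])

print(items[0]["price_vnd"])  # 971000
```

With numeric prices, the extracted list is ready for filtering, sorting, or loading into a database.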
-
I want to extract from all URLs; this is just an example URL. How can I remove the redundant data?
-
@syed-al I just added a new heuristic function capable of producing a much better markdown, which I call "fit markdown." It is used as follows. The output is much cleaner and I really like it; it's good for pages like the example you shared, and it will soon be released in the new version.

```python
async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            word_count_threshold=10,
        )
        # Save markdown to file
        with open(os.path.join(__data__, "mexico_places.md"), "w") as f:
            f.write(result.fit_markdown)
        print("Done")
```

Everything remains the same; you simply pick up the `fit_markdown` property from the result instead of `markdown`.
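Crawl4ai's actual fit-markdown heuristic isn't shown in this thread, but the general idea of scoring blocks by content density can be sketched like this (a toy scoring function for illustration, not the library's code):

```python
def fit_score(block: str) -> float:
    """Toy content-density score: long blocks with few links look like
    body text; blocks that are mostly links look like navigation."""
    words = block.split()
    if not words:
        return 0.0
    # Rough proxy for "link-ness": markdown links and bare URLs
    link_ratio = sum(w.startswith("http") or w.startswith("[") for w in words) / len(words)
    return len(words) * (1.0 - link_ratio)

blocks = [
    "[Home](/) [News](/news) [Subscribe](/signup)",
    "The quiet highland towns of central Mexico reward travelers who "
    "linger: markets open at dawn, and the best meals are found on foot.",
]

# The nav block scores 0; the prose block wins
best = max(blocks, key=fit_score)
print(best[:30])
```

A production heuristic would combine more signals (tag depth, punctuation density, text-to-markup ratio), but the ranking principle is the same.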
-
@unclecode Can you help me?
-
Hi @unclecode, when will the new version be released on pip? Also, still waiting for the Discord invite: abdksyed@gmail.com
-
@syed-al Tomorrow, Friday. Sorry for the delay; I will send the invitation before the weekend, and you are most welcome.
-
@QuangTQV No worries. Let me first understand what exactly you're looking for. You said you want to do this with all URLs. What do you mean by "all URLs"? When you're using the
-
I want to create a chatbot for any website, so I need to crawl the website's content and then use RAG. What I need is to save costs, and the more junk content I can eliminate while crawling, the better. And of course, each website has a different layout, so I can't use a fixed regex. |
-
@syed-al Hi, sorry for the delay; the last two weeks have been very hectic, but all is good now. I just sent the invitation link to you. Looking forward to seeing you on the other side, to work on this smart FIT markdown ;)
-
I had the same problem: news articles coming back with 600-700 lines when the actual content is 30 lines, with headers, footers, related articles, and ad slots all mixed in. I ended up building a dedicated content extraction tool that focuses on identifying and keeping only the main content. On news sites specifically, it strips the output down to just the article text plus headings. It is open source (Go, Apache 2.0): https://github.com/Easonliuliang/purify

Might save you the step of using an LLM to clean up the extracted data.