Remove Headers, Footers, External Links and their related data

Hi, thanks for this great work.

I have been playing around with this to crawl webpages and get their content in Markdown format, which can then be given to LLMs for grounding. But when I use it to fetch, say, news articles, I get lots of unnecessary data such as headers, footers, nav options, and links to other articles. More than 50% of the output is not the actual article content.

Yes, I can use LLMExtraction, but that will increase the bill tremendously, as the input is typically 10-20 web articles, each around 5000-7000 tokens. I saw that one option is to provide elements, but I want the crawler to be generic and work on any website, so I don't have any fixed elements that I know will reliably remove header, nav, and footer information. Is there any way for the Playwright extractor to get just the actual content of the webpage? I understand this may not be perfect, but I want to reduce the extra content as much as possible while staying generic across all websites.
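To make the request concrete, here is a rough sketch of the kind of generic filtering I have in mind. It is not this project's API, just a standalone illustration using Python's standard `html.parser`. The names `SKIP_TAGS`, `MainTextExtractor`, and `extract_main_text` are made up for this example, and it assumes the site uses semantic HTML5 tags (`<header>`, `<nav>`, `<footer>`, `<aside>`), which many sites do not.

```python
from html.parser import HTMLParser

# Tags whose whole subtree is treated as boilerplate.
# This set is an assumption for illustration; tune it per use case.
SKIP_TAGS = {"header", "footer", "nav", "aside", "script", "style", "noscript", "form"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any SKIP_TAGS element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # how many SKIP_TAGS elements are currently open
        self.chunks = []     # kept text fragments, in document order

    def handle_starttag(self, tag, attrs):
        # Only boilerplate tags affect the depth, so unclosed <p> or <li>
        # elements inside a skipped region cannot leave us stuck in skip mode.
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Return the text outside boilerplate elements, one fragment per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    page = (
        "<html><body>"
        "<header><nav><a href='/'>Home</a></nav></header>"
        "<article><h1>Title</h1><p>Body text.</p></article>"
        "<footer>Copyright</footer>"
        "</body></html>"
    )
    print(extract_main_text(page))
```

This only catches boilerplate marked up with semantic tags. Sites that build their header and footer out of plain `<div>` elements, or that list related articles inside the main content, would need something extra, such as a link-density heuristic.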

P.S.: I would love to contribute to this project. I am not very experienced on the JS/TS side, but I am pretty confident on the Python side of things. Is there a Discord/Slack/Telegram group I can join to discuss how to contribute?

Remove Headers, Footers, External Links and their related data #181
