Remove Headers, Footers, External Links and their related data #181

@syed-al

Hi, thanks for this great work.

I have been using this to crawl webpages and get their content in Markdown format, which I then provide to LLMs for grounding. But when I use it on, say, news articles, I get a lot of unnecessary data such as headers, footers, navigation options, and links to other articles. More than 50% of the output is not the actual article content.

Yes, I can use LLMExtraction, but that would increase the bill tremendously, since the input is around 10-20 web articles, each ranging around 5000-7000 tokens. I saw that one option is to provide elements, but I want the crawler to be generic and work on any website, so I don't have a fixed set of elements that I know will reliably remove header, nav, and footer information. Is there any way for the Playwright extractor to get just the actual content of the webpage? I understand this may not be perfect, but I want to reduce the extra content as much as possible while staying generic across all websites.
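
To make the request concrete, here is a minimal sketch of the kind of generic, site-agnostic filtering I have in mind. It is not this project's API: `SKIP_TAGS`, `MainTextExtractor`, and `extract_main_text` are hypothetical names, and the sketch assumes pages mark their chrome with semantic tags such as `<header>`, `<nav>`, and `<footer>`, which many sites do not. It uses only Python's standard-library `html.parser`.

```python
from html.parser import HTMLParser

# Assumption: these tags usually hold site chrome, not article text.
SKIP_TAGS = {"header", "footer", "nav", "aside", "script", "style", "noscript", "form"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any tag listed in SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside at least one skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_main_text(html: str) -> str:
    """Return the page text with header/nav/footer-style blocks dropped."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = (
        "<html><body><header><a href='/'>Home</a></header>"
        "<nav>Menu</nav><article><h1>Title</h1><p>Body text.</p></article>"
        "<footer>Copyright</footer></body></html>"
    )
    print(extract_main_text(sample))
```

This would obviously miss chrome that is built from plain `<div>` elements, and an unclosed `<nav>` would cause everything after it to be skipped, so it is only a starting point. A heuristic such as dropping blocks that are mostly link text might help with "related articles" sections, but I have not tried that.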

P.S. I would love to contribute to this project. I am not very experienced on the JS/TS side, but I am pretty confident on the Python side of things. Is there a Discord/Slack/Telegram group I can join to discuss how to contribute?
