
Evaluation notebooks for NLP and Ecology domains [#23] #24

Open
MikeACedric wants to merge 2 commits into main from feature/restructure-llm-as-a-judge

Conversation

@MikeACedric (Collaborator) commented Mar 27, 2026

This PR references issue [#23] and includes the following changes:

Repository Cleanup:

  • Removed unused files and folders to streamline the project structure.

New Notebooks:
Added 4 new notebooks (2 per domain) covering single-report and batch evaluation (a minimal batch-flow sketch follows this description):

  • NLP — single report & batch evaluation notebooks
  • Ecology — single report & batch evaluation notebooks

Documentation:

  • Updated docs to include step-by-step instructions for running the new notebooks.
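
As a rough orientation for the single-report vs. batch distinction above, here is a minimal sketch of the batch flow. It is not the actual notebook code: the `evaluate_report` stub, the directory layout, and the score keys are all illustrative assumptions.

```python
# Hypothetical batch-evaluation driver; `evaluate_report` stands in for
# whatever single-report evaluation the notebooks actually perform.
import json
from pathlib import Path

def evaluate_report(text: str) -> dict:
    # Placeholder: a real implementation would call the judge model and
    # return per-dimension scores such as {"breadth": 4, "depth": 3, ...}.
    return {"breadth": 0, "depth": 0, "rigor": 0, "gap": 0, "innovation": 0}

def batch_evaluate(report_dir: str) -> list[dict]:
    """Evaluate every Markdown report in a directory and collect the scores."""
    results = []
    for path in sorted(Path(report_dir).glob("*.md")):
        scores = evaluate_report(path.read_text(encoding="utf-8"))
        results.append({"report": path.name, **scores})
    return results

if __name__ == "__main__":
    # The single-report notebooks correspond to one evaluate_report call;
    # the batch notebooks map it over a whole directory of reports.
    print(json.dumps(batch_evaluate("reports/nlp"), indent=2))
```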

@MikeACedric self-assigned this Mar 27, 2026

<h2 align="center">LLM-as-a-Judge (LLMJ) Framework</h2>

The **LLM-as-a-Judge** is a robust, Python-based evaluation toolkit designed for objective and configurable analysis of Large Language Model (LLM)-generated research reports. It enables researchers to quantify report quality across key dimensions like `breadth`, `depth`, `rigor`, `gap`, and `innovation`, with pointwise rubrics implemented via **[YESciEval](https://yescieval.readthedocs.io/)**, a framework for scientific evaluation of LLM outputs.
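
As a rough illustration of the pointwise setup described above, the sketch below scores a single report once per dimension. This is a minimal sketch, not the framework's code: the rubric wording, the `call_llm` callable, and the 1-5 scale are assumptions made for this example, while the real rubric definitions come from YESciEval.

```python
# Illustrative pointwise-rubric judge. The rubric texts and the 1-5 scale
# below are assumptions for this sketch, not the YESciEval definitions.
from typing import Callable

RUBRICS = {
    "breadth": "Does the report cover the relevant subtopics of the area?",
    "depth": "Does the report analyse each subtopic in sufficient detail?",
    "rigor": "Are claims supported by evidence and sound reasoning?",
    "gap": "Does the report identify open problems in the literature?",
    "innovation": "Does the report propose novel directions or ideas?",
}

def judge_report(report: str, call_llm: Callable[[str], str]) -> dict[str, int]:
    """Score one report pointwise: one judge-model call per rubric dimension."""
    scores = {}
    for dimension, question in RUBRICS.items():
        prompt = (
            f"You are an expert reviewer. {question}\n"
            "Answer with a single integer from 1 (poor) to 5 (excellent).\n\n"
            f"Report:\n{report}"
        )
        scores[dimension] = int(call_llm(prompt).strip())
    return scores
```

Any chat-completion client can be passed in as `call_llm`; "pointwise" here means each dimension is scored independently for one report, rather than by comparing two reports against each other.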
Member

@MikeACedric Please note: LLM-as-a-judge is a general term for an evaluation paradigm that is predominant in the AI community these days. It is not a toolkit! The framework here is YESciEval. I would ask you to give some thought to this description text and avoid copying descriptions from earlier versions of this repository that do not make sense.

Member

Here the description should say what is being evaluated and how it is being evaluated. For the latter part, it can read as follows: "we adopt the LLM-as-a-judge paradigm w.r.t. the x implemented evaluation categories in the YESciEval library, which includes y evaluation rubrics."

