-
Notifications
You must be signed in to change notification settings - Fork 2
Expand file tree
/
Copy path02-Week2.Rmd
More file actions
553 lines (343 loc) · 19.8 KB
/
02-Week2.Rmd
File metadata and controls
553 lines (343 loc) · 19.8 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
# Week 2: Corpus Construction and Modeling Digital Humanities Data
***2/4/2021 Thursday***
Prepare for Class:
* Flanders, Julia and Fotis Jannidis. "Data Modeling in a Digital Humanities Context." In *The Shape of Data in the Digital Humanities: Modeling Texts and Text-based Resources*, edited by Julia Flanders and Fotis Jannidis, 3-25. New York: Routledge, 2019. [Course Google Drive]
* Underwood, Ted. "Algorithmic Modeling: Or, Modeling Data We Do Not Yet Understand." In *The Shape of Data in the Digital Humanities: Modeling Texts and Text-based Resources*, edited by Julia Flanders and Fotis Jannidis, 250-263. New York: Routledge, 2019. [Course Google Drive]
* Martin Weisser. “Understanding Corpus Design.” In *Practical Corpus Linguistics: An Introduction to Corpus‐Based Language Analysis*, edited by Martin Weisser, 29-42. London: Wiley (2015). [Course Google Drive]
* Watch eScriptorium training video #1, “Introduction to eScriptorium.”:
[](https://www.youtube.com/watch?v=N0hSNC3YvD4){target="_blank"}
[https://www.youtube.com/watch?v=N0hSNC3YvD4](https://www.youtube.com/watch?v=N0hSNC3YvD4){target="_blank"}
In class:
* Weekly reading report.
* Collaboratory assignment on digital text production (bring your computers!).
## most recent OCR transcription model for eScriptorium
Download it [here](./2021IslamicateWorldCourse_files/arabPers-WithDiffTypefaces.mlmodel)
## where do I find eScriptorium?
The UMD instance of eScriptorium can be found here: [http://simorgh.umd.edu:8080/](http://simorgh.umd.edu:8080/){target="_blank"}
## Getting started with eScriptorium 0.6.9
Tutorial written by [Alix Chague](https://escripta.hypotheses.org/440){target="_blank"}, and translated by Jonathan Allen.
### Additional Documentation:
- Video demo of part of the eScriptorium features: [https://escripta.hypotheses.org/escriptorium-video-gallery](https://translate.google.com/website?sl=auto&tl=en&u=https://escripta.hypotheses.org/escriptorium-video-gallery){target="_blank"}
- Source code of eScriptorium: [https://gitlab.inria.fr/scripta/escriptorium](https://translate.google.com/website?sl=auto&tl=en&u=https://gitlab.inria.fr/scripta/escriptorium){target="_blank"}
### Prerequisites:
- Secure an account on an eScriptorium instance;
- Or install a local instance of eScriptorium ( with Docker or without Docker ).
### Summary
- 1. Step by step
- 1.1. Connection to eScriptorium
- 1.2. Create a new document
- 1.2.1. Point terminology
- 1.2.2. Instructions
- 1.3. Upload images
- 1.3.1. Access the interface
- 1.3.2. Import local files
- 1.3.3. Import images from PDF file
- 1.3.4. Use an IIIF manifesto
- 1.4. Manually annotate documents
- 1.4.1. Access the interface
- 1.4.2. Annotate segments and areas on the image
- 1.4.3. Annotate the transcript
- 1.4.4. Aparté on the articulation of *baselines* , polygons
and transcriptions
- 1.4.5. Change the order of segments
- 1.4.6. Semantic annotation
- 1.5. Import annotations
- 1.5.1. Import structured annotations in XML
- 1.5.2. Import annotations from plain text
- 1.6. Automatically annotate documents
- 1.6.1. Instructions
- 1.6.2. Setting up segmentation
- 1.7. Train models
- 1.8. Export data
- 2. Manage a collaborative annotation campaign
- 2.1. Create a user group
- 2.2. Share a document with another user or with a group
- 2.3. Share a template with another user or with a group
### 1. Step by step
#### 1.1. Logging into eScriptorium
You log into eScriptorium with an individual account made up of a login,
a password and an email address. An account is created by the user after
receiving an invitation or by the site administrator:
{width="5.208333333333333in"
height="1.2777777777777777in"}
*Login form*
Once logged on, you will see the dashboard which allows you to manage
all the documents created by and shared with the account. On first
login, this dashboard is empty:
{width="5.208333333333333in"
height="1.8194444444444444in"}
#### 1.2. Create a new document
##### 1.2.1. Point terminology
- A \"**document**\" is a collection of images forming a whole unit;
- A \" **document-part**\" is an image\--a page belonging to a
document.
##### 1.2.2. Instructions
To create a document, click on the "**Create new**" button; a new page
is displayed as well as an input form:
{width="5.208333333333333in"
height="2.486111111111111in"}
The document \"description\" tab is where you fill in the metadata.
The \"Name\" field is mandatory. The others are optional and can be
completed later. We will come back to the "Ontology" section later.
Once the information has been entered, click on "Create" to create the
document; a message is then displayed: "Document created successfully!\"
You can still modify the data entered: the \"Create\" button has been
replaced by an \"Update\" button.
{width="5.208333333333333in"
height="0.8194444444444444in"}
To return to this form from the dashboard, click on the "Edit" icon in
front of the document.
#### 1.3. Upload images
##### 1.3.1. Access the interface
The "Images" interface is used to manage most of the tasks related to
automatic image processing and imports and exports.
To access the image loading interface, click on the "Images" tab.
{width="6.5in" height="3.1944444444444446in"}
There are several ways to import images to the platform.
##### 1.3.2. Import local files
Simply load the files using drag-and-drop or click in the box to open
the file explorer. **Note:** you must wait until **all** the images are
loaded before refreshing the page!
{width="6.5in" height="4.083333333333333in"}
##### 1.3.3. Import images from PDF file
Click on the Import button, then Images (PDF) option to load a PDF file
and automatically extract the images. Please note, only the images are
imported. If the PDF contains a text layer corresponding to the
transcription, it is not imported.
{width="6.049451006124235in"
height="3.632784339457568in"}
##### 1.3.4. Use an IIIF manifesto
Click on the Import button, then the Images (IIIF) option, then enter
the url of an IIIF manifesto. All the images associated with the
manifesto are then copied locally, as well as the metadata, which is
visible in the "Description" tab.
Example:
[https://gallica.bnf.fr/iiif/ark:/12148/btv1b53026595r/manifest.json](https://gallica.bnf.fr/iiif/ark:/12148/btv1b53026595r/manifest.json){target="_blank"}
{width="6.5in" height="4.083333333333333in"}
*Demonstration of importing images through an IIIF manifesto.*
#### 1.4. Manually annotate documents
##### 1.4.1. Access the interface
Manual annotation is necessary to generate *ground truth data* used to
train models, or to correct the result of these operations. It can also
be used as part of an annotation campaign that does not make use of
Kraken models (eScriptorium only serves as an input environment).
The *document-parts* edit tab allows you to create or modify annotation
elements manually. It is accessed by clicking on the "Edit" tab at the
top of the page or by clicking on the "Edit" button on each "map"
representing a part-document. There are several viewing panes in this
tab corresponding to each possible editing task.
{width="5.208333333333333in"
height="1.7916666666666667in"}
{width="1.0416666666666667in"
height="1.4027777777777777in"}
Another way to access the \"Edit\" tab
- \"Image source\" allows you to display the source image;
- "Segmentation" displays the segmentation editing pane;
- \"Transcription\" displays the editing pane of the transcription in
diplomatic view;
- \"Text\" is used to display the editing pane of the transcription in
text mode.
{width="6.5in"
height="2.9027777777777777in"}
##### 1.4.2. Annotate segments and areas on the image
The segment editing pane allows you to perform several essential
operations:
- Drawing *baselines* corresponding to the locations of the text on
the image in two ways:
- Free route (not recommended);
- Point by point plot.
- Adding points on a line (select it then double-click at the point\'s
location);
- Moving one or more point (s) by a simple select-drag;
- Deleting one or more point (s) on a line;
- Cutting one or more lines using the scissors tool;
- Joining one or more lines;
- Configuring the reading direction of a line;
- Activating the calculation of the polygons associated with each line
(this task is automatic and managed asynchronously without any
action on the part of the user after the first use).
In this pane, it is also possible to create regions (or *zones* ) and
associate segments to them. A segment located within the perimeter of a
region is therefore not automatically associated with it.
All the operations that can be carried out in this pane are detailed in
a help window (to display it, click on \"?\"). It will thus be noted
that there are many shortcuts making it possible to simplify all these
operations.
##### 1.4.3. Annotate the transcript
Only when there are segments and masks on the image is it possible to
use the functionalities of the "Transcription" and "Text" tabs.
To add or modify a transcription associated with a segment from the
"Transcription" pane, click on the corresponding zone. An input window
is displayed. To record a transcription, press "Enter": the
transcription interface automatically displays the input field for the
next segment.
As you type in the "Transcription" pane, the crenellated areas are
replaced by text and the content of the "Text" pane changes. It is
therefore also possible to modify the text from the "Text" pane and even
to copy and paste several lines at once.
{width="6.5in" height="2.5833333333333335in"}
##### 1.4.4. A note on the articulation of baselines, polygons and transcriptions**
The *baseline* is the central element for storing information in the
eScriptorium database. So :
- It is possible to modify a *baseline* (remove / move / add points)
without impacting the transcription;
- The polygon is always calculated from the *baseline*, including
during training
- It is possible to modify the polygon manually (not recommended)
without this impacting the transcription.
{width="5.208333333333333in"
height="2.3194444444444446in"}
During the training of the Kraken models, the calculation of the
polygons is reset: the user therefore has an interest in not intervening
on the polygons and on the contrary to ensure that the *baselines* are
drawn in such a way that the automatically generated polygons are
correct. . Moreover, when manually entering ground truth data, care
should be taken to transcribe only what is inside the polygon.
##### 1.4.5. Changing the order of segments
The playback order of the segments is calculated automatically. You can
display the order number of each segment from the "Segmentation" pane by
clicking on "Toggle ordering display (L)" or from the "Text" pane where
the lines are displayed in the following order.
It is possible to modify this order from the "Text" pane by clicking
"Toggle sorting mode." A simple drag and drop of the lines is then
enough to carry out the modification.
**Note:** it is advisable to ensure the quality of the segmentation
before changing the order of the lines because adding or deleting lines
systematically restarts the calculation of this order, overwriting
manual modifications in the process.
{width="6.5in" height="2.5833333333333335in"}
##### 1.4.6. Semantic annotation
It is possible to associate labels (or *tags* ) with segments and
regions by following an ontology pre-defined by the user from the
"Description" tab. There are default *tags* , but it is possible to add
some using the input fields (click on \"+\" then \"Update\" to add the
new *tag* to the list) or to delete (uncheck the box in front of the
*tag* then click on \"Update\").
From the "Segmentation" pane, select an area or a segment, then click on
"Set the type on all selected lines / regions (T)" and choose the
appropriate *tag* . The color of the area or line changes.
It is possible to apply a *tag* to several regions or segments at once:
all you have to do is select several (Ctrl + click-drag or keep Ctrl
pressed and click the targeted segments / regions).
{width="6.5in" height="3.5in"}
#### 1.5. Import annotations
##### 1.5.1. Import structured annotations in XML
It is possible to import segments and / or a transcript generated
outside of eScriptorium. To do this, in the "Images" tab, click on the
"Import / Transcription (XML)" option.
A form appears allowing you to specify a name to identify the version to
import and to load the file to import. This can be an ALTO XML file, a
PAGE XML file or a ZIP containing several ALTO or PAGE XML files.
It is not necessary to select beforehand which document-parts are
concerned by the import: the link is made automatically according to the
information contained in the XML files.
{width="5.208333333333333in"
height="3.5694444444444446in"}
**Please note**, after importing segments generated without Kraken /
eScriptorium, it is essential to **reset the polygons ( *masks* )**
before using these documents to train a transcription model.
##### 1.5.2. Import annotations from plain text
It is possible to manually import a transcription from a plain text
version using the "Text" pane of the "Edit" tab. Each line break then
indicates the passage to a new segment. It is therefore necessary to
ensure that the reading order of the lines matches.
#### 1.6. Automatically annotate documents
##### 1.6.1. Instructions
Automatic document annotation is managed from the "Images" tab.
- Select the images to annotate;
- Click on "Segment" (for the detection of baselines, polygons and /
or areas) or on "Transcribe" (for transcription);
- A form is displayed: it allows you to load a Kraken model or to use
a model already available in eScriptorium and, for segmentation, to
configure the annotation.
##### 1.6.2. Setting up segmentation
- \"Segmentation steps\" allows you to manage what level of
segmentation is carried out:
- "Line and Regions" leads to the generation of zones (provided
that the model used has been trained for this), baselines and
associated polygons.
- "Line Baselines and Masks" leads to the generation of baselines
and associated polygons.
- \"Only line Masks\" does not require loading a model; this
option allows you to reset on the fly the calculation of the
polygons associated with the baselines already present on the
images.
- "Regions" makes it possible to keep intact the baselines and
polygons already present on the images and to generate only the
zones.
- \"Text direction\" is used to indicate the reading direction of the
lines.
{width="3.1944444444444446in" height="3.375in"}
#### 1.7. Train models
##### 1.7.1. Start a workout
Training of Kraken models is started from the "Images" tab; it is
tracked from the "Models" tab (inactive as long as no model other than
the default model is associated with the document).
Access to training functions may be subject to authorization. It is
managed from the administration interface.
To train a model, all the mobilized images / annotations must be in the
same document.
- Select the share documents containing the ground truth;
- Click on \"Train\" then select the type of model to train:
\"Segment\" for a segmentation model, \"Recognizer\" for a
transcription model;
- Fill out the form then click on \"Train\";
- The training task is started, sometimes you have to wait before this
is displayed.
When the training is finished, the user receives a notification and can
access the final model from the "Models" tab. It is then possible to
download the model or use it on the document.
##### 1.7.2. Refine a model or start from scratch?
The training configuration form allows you to choose:
- If you want to start from scratch (simply indicate the name to give
to the new model that will be created);
- Or if you want to refine a model (it can be loaded from the file
manager, or be already present in the list of models associated with
the current document).
**Caution:** in version 0.6.9, modifying the name of the model to train
is only taken into account in the case of a model starting from zero. In
order not to overwrite an already existing model, make sure to download
the model locally, rename it according to the name desired for the new
model, then load it into eScriptorium from the form.
#### 1.8. Export annotations
The export of annotations is carried out from the "Images" tab.
- Select the relevant part-documents;
- Click on \"Export\" then fill out the form:
- Specify the version of the transcription to export;
- Specify the export format (\"Alto\" for XML ALTO, \"Pagexml\"
for XML PAGE, or \"Text\" for plain text);
- Check \"Include Images\" if you also want to export the images;
- Click on Export and save the generated ZIP file.
### 2. Manage a collaborative annotation campaign
#### 2.1. Create a user group (admin)
It is possible to create user groups from the administrator\'s dashboard
(provided you have the associated rights). These groups are used to
define work teams or to assign certain rights to a set of users.
It will soon be possible for a user with sufficient rights to create a
team and invite other users to it.
#### 2.2. Share a document with another user or with a group
A user can share a document with multiple users, including outside of
the group (s) to which it belongs. From the \"Description\" tab:
- Click on the \"Share\" button
- Enter the name of the user with whom to share the document, or check
his name in the list that appears
- To confirm, click on \"Share\"
{width="6.5in" height="3.1944444444444446in"}
#### 2.3. Share a template with another user or with a group
A template is associated with a document and not with a user. To share a
model with another user, there are two possibilities:
- Download the model from the "Models" tab and send the model via
another exchange system (e-mails, etc.);
- Share with the user the document to which the template is
associated. It is then up to the other user to download the model
and then load it into the document where he wishes to apply it.