The regular expression [^a-zA-Z], which we used to avoid embedded instances of "the", implies that there must be some single (although non-alphabetic) character before the "the", so the pattern misses "the" when it begins a line. We can avoid this by specifying that before the "the" we require either the beginning-of-line or a non-alphabetic character, and after it either a non-alphabetic character or the end-of-line:
grep -E '(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)' wizard_of_oz
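To see the effect of the two boundary alternatives, the pattern can be run on a few test lines. This is only a minimal sketch: the sample lines and the names `the_pattern` and `sample_lines` are made up for illustration and are not taken from the wizard_of_oz file. The right-hand alternative uses `$`, the end-of-line anchor.

```shell
# Pattern: "the" or "The" bounded on the left by beginning-of-line or a
# non-letter, and on the right by a non-letter or end-of-line.
the_pattern='(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)'

# Hypothetical test lines (illustration only).
sample_lines() {
  printf '%s\n' \
    'The road was long' \
    'She walked on the' \
    'other people were there' \
    'they saw "the" sign'
}

# -c prints the number of matching lines instead of the lines themselves.
sample_lines | grep -E -c "$the_pattern"
```

The first line matches through the beginning-of-line alternative, the second through the end-of-line alternative, and the fourth because the quotation marks are non-alphabetic. The third line is rejected because "other" and "there" have a letter next to the "the".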
The process we just went through was based on fixing two kinds of errors: false positives, strings that we incorrectly matched like "other" or "there", and false negatives, strings that we incorrectly missed, like "The". Addressing these two kinds of errors comes up again and again in implementing speech and language processing systems. Reducing the overall error rate for an application thus involves two antagonistic efforts:
• Increasing precision (minimizing false positives)
• Increasing recall (minimizing false negatives)
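The two quantities can be made concrete with a small calculation. The sketch below assumes the standard definitions, which the passage above does not spell out: precision = TP / (TP + FP) and recall = TP / (TP + FN), where TP counts correct matches. The function name `precision_recall` and the counts passed to it are hypothetical.

```shell
# Compute precision and recall from three counts.
#   $1 = true positives  (strings correctly matched)
#   $2 = false positives (strings incorrectly matched)
#   $3 = false negatives (strings incorrectly missed)
precision_recall() {
  awk -v tp="$1" -v fp="$2" -v fn="$3" 'BEGIN {
    printf "precision=%.2f recall=%.2f\n", tp / (tp + fp), tp / (tp + fn)
  }'
}

# Hypothetical counts: 3 correct matches, 1 spurious match, 2 misses.
precision_recall 3 1 2
```

With these counts, removing the one spurious match would raise precision without changing recall, while recovering the two misses would raise recall without changing precision.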
Some aliases for common ranges, which can be used mainly to save typing: