Extract all emails from a text document
I have a document that contains text and HTML tags, and there are a lot of tags like <label>[email protected]</label>
How do I extract all the emails from this document using Linux commands?
I tried using grep -e "[a-zA-Z0-9._]\+@[a-zA-Z]\+.[a-zA-Z]\+"
but it didn't work.
The following is an example of such a document:
<tbody><tr class="d_gh d2l-table-row-first" header=""><th class="d_gs d2l-table-cell-first" rowspan="1" colspan="1"><input
class="d2l-checkbox float_l" type="checkbox" title="Select all rows" onclick="UI.GC('z_k').g_sa(this.checked)"
name="z_k_cb_sa"></th><th scope="col" class="d_hch d_gw d_gl"><d2l-table-col-sort-button data-d2l-table-sort-field="LastName"
data-d2l-table-next-sort-dir="asc" title="Sort by Last Name" nosort="">Last
Name</d2l-table-col-sort-button>, <d2l-table-col-sort-button data-d2l-table-sort-field="FirstName"
data-d2l-table-next-sort-dir="asc" title="Sort by First Name" nosort="">First Name</d2l-table-col-sort-button></th><th
scope="col" class="d_hch d_gl">Email Address</th><th scope="col" class="d_hch d_gl"><d2l-table-col-sort-button
data-d2l-table-sort-field="RoleName" data-d2l-table-next-sort-dir="asc" title="Sort by Role"
desc="">Role</d2l-table-col-sort-button></th><th scope="col" class="d_hch d_gl
d2l-table-cell-last"><label>Type</label></th></tr><tr><td class="d_gd_sel d2l-table-cell-first"
style="white-space:nowrap;"><input class="d2l-checkbox" type="checkbox" title="Select Nida" name="SystemContactsGrid_cb"
value="2" onclick="UI.GC('z_k').g_sr('2')"></td><th scope="row" class="d_ich">Ahmed, Nida</th><td
class="d_gn"><label>[email protected]</label></td><td><label>Student</label></td><td class="d_gn d2l-table-cell-last"><label>Internal
Email</label></td></tr><tr><td class="d_gd_sel d2l-table-cell-first" style="white-space:nowrap;"><input class="d2l-checkbox"
type="checkbox" title="Select Milen" name="SystemContactsGrid_cb" value="3" onclick="UI.GC('z_k').g_sr('3')"></td><th
scope="row" class="d_ich">Andic, Milena</th><td
class="d_gn"><label>[email protected]</label></td><td><label>Student</label></td><td class="d_gn
d2l-table-cell-last"><label>Internal Email</label></td></tr><tr><td class="d_gd_sel d2l-table-cell-first"
style="white-space:nowrap;"><input class="d2l-checkbox" type="checkbox" title="Select Anthony" name="SystemContactsGrid_cb"
value="4" onclick="UI.GC('z_k').g_sr('4')"></td><th scope="row" class="d_ich">Macdonald, Anthony</th><td
class="d_gn"><label>[email protected]</label></td><td><label>Student</label></td><td class="d_gn
d2l-table-cell-last"><label>Internal Email</label></td></tr><tr><td class="d_gd_sel d2l-table-cell-first"
style="white-space:nowrap;"><input class="d2l-checkbox" type="checkbox" title="Select" name="SystemContactsGrid_cb
The output of the Linux shell script or command should be
[email protected]
[email protected]
[email protected]
which are the email addresses that o
Generally it's not a good idea to process an HTML file as plain text with regular expressions.
Try an XML-aware tool such as xmllint instead:
xmllint --xpath "//label/text()" file
Please note that the input file must be well-formed markup; the one provided in the example is not.
Example:
<body>
<label>[email protected]</label>
<label>[email protected]</label>
</body>
xmllint --xpath "//label/text()" file
Outputs:
[email protected]
[email protected]
Please also note that it will output the text of every label element, not only the emails. If your example were well-formed, it would output "Student" as well. But this should get you going.
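To drop the non-email labels, one option is to pipe the result through grep with -o, which prints only the matched part of each line instead of the whole line. The following is a minimal sketch, not a full email validator: the function name extract_emails is made up for illustration, the pattern is a deliberately loose approximation of an email address, and the sample addresses are placeholders rather than data from the question.

```shell
# Hypothetical helper: read text on stdin and print only the email-like
# substrings, one per line. The pattern is a loose approximation of an
# email address, not a complete validator.
extract_emails() {
    grep -oE '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
}

# Placeholder input that mimics the label text xmllint would print.
printf '%s\n' 'alice@example.com' 'Student' 'Internal Email' 'bob@example.org' | extract_emails
```

Assuming the file is well-formed, the two steps can be combined as xmllint --xpath "//label/text()" file | extract_emails.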