Searching for specialized patterns using grep in a JSON file
How can I grep only the "created_at": entries that come right after a }, and a newline, as in the excerpt below?
"hashtags": [],
"urls": []
},
"created_at": "Wed Oct 19 22:19:42 +0000 2016",
"retweeted": false,
"coordinates": null,
"in_reply_to_user_id_str": null,
"source": "<a href=\"http://tweetlogix.com\" rel=\"nofollow\">Tweetlogix</a>",
"in_reply_to_status_id_str": null,
"in_reply_to_screen_name": null,
"in_reply_to_user_id": null,
"place": null,
"retweet_count": 0,
"id_str": "788867246953201664"
},
{
"favorited": false,
"contributors": null,
"truncated": false,
"text": "Reddit Exposes Hillary Clinton Staff Trying To Frame Assange As \u2018Pedo\u2019 https://t.co/KNj14p8QqN via @yournewswire",
"possibly_sensitive": false,
"is_quote_status": false,
"in_reply_to_status_id": null,
"user": {
"follow_request_sent": false,
"has_extended_profile": false,
"profile_use_background_image": true,
"time_zone": "Eastern Time (US & Canada)",
Initially, I was using
grep -wirnE 'Wed Oct 19 2(1:[0-5][0-9]:[0-5][0-9]|2:([0-2][0-9]:[0-5][0-9]|30:00)) .* 2016' * > results_created_at
and then
wc -l results_created_at
to count the number of tweets created in that specific time range. However, it turns out that profile images or users may also have been created in that time range. So how can I adapt my initial grep command so that it only matches tweets?
I have looked at many of the tweets in my files, and in all of them a }, and a newline are followed by "created_at":, with the text a few lines later.
Adding -z to your grep options makes grep treat the input as records terminated by a NUL byte (\0) rather than by a newline. A JSON file normally contains no NUL bytes, so each file becomes one giant record in which newlines are ordinary characters. You cannot write \n in the pattern, but you can bridge the newline with anything that matches it, such as .* or \s*. Since .* is greedy and would run from the first }, to the last "created_at" in the file, \s* is the safer choice here.
Next, add -o to have grep output only what actually matched; otherwise it prints the whole file, since the file is now essentially one giant line.
Be careful with grep's -c option as a replacement for wc -l: it counts matching records, not individual matches, so together with -z it reports at most 1 per file. To get the number of matches, count the NUL bytes that terminate each -o match instead.
This translates to the following command:
grep -wirnEzo '},\s*"created_at"' * | tr -cd '\0' | wc -c
Expanding on this to include your previous pattern as well (with the .* before 2016 replaced by [^"]* so that the match cannot run past the closing quote of the timestamp), we get:
grep -wirnEzo '},\s*"created_at":\s*"Wed Oct 19 2(1:[0-5][0-9]:[0-5][0-9]|2:([0-2][0-9]:[0-5][0-9]|30:00)) [^"]* 2016"' * | tr -cd '\0' | wc -c
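To check the multi-line approach on something small, here is a rough sketch that assumes GNU grep. The sample file, its contents and the variable names are made up for illustration; the extra flags from the question (-w, -i, -r, -n) are left out for brevity. Because -c counts matching records and -z turns the whole file into one record, the sketch counts the NUL-terminated matches printed by -o, and it bridges the newline with \s* rather than a greedy .* so that each tweet is matched separately.

```shell
# Hypothetical sample: two tweets, each with a nested user object.
tmp=$(mktemp -d)
cat > "$tmp/sample.json" <<'EOF'
[
  {
    "user": {
      "created_at": "Wed Oct 19 21:05:00 +0000 2016",
      "screen_name": "example_user"
    },
    "entities": {
      "urls": []
    },
    "created_at": "Wed Oct 19 22:19:42 +0000 2016",
    "id_str": "1"
  },
  {
    "user": {
      "created_at": "Wed Oct 19 21:30:00 +0000 2016",
      "screen_name": "another_user"
    },
    "entities": {
      "urls": []
    },
    "created_at": "Thu Oct 20 09:00:00 +0000 2016",
    "id_str": "2"
  }
]
EOF

# Timestamp pattern from the question; [^"]* keeps the match inside the string.
ts='"created_at":\s*"Wed Oct 19 2(1:[0-5][0-9]:[0-5][0-9]|2:([0-2][0-9]:[0-5][0-9]|30:00)) [^"]* 2016"'

# Line-based count: also picks up the user-level created_at lines.
line_count=$(grep -cE "$ts" "$tmp/sample.json")

# Record-based count: -z reads the file as one record, -o prints each match
# followed by a NUL byte, and tr/wc count those NUL bytes.
tweet_count=$(grep -ozE "},\s*$ts" "$tmp/sample.json" | tr -cd '\0' | wc -c)

echo "line-based count:  $line_count"
echo "tweet-level count: $tweet_count"
rm -rf "$tmp"
```

On this sample the line-based count should come out as 3 (one tweet plus two user timestamps) and the tweet-level count as 1. This only holds if a tweet's "created_at" always directly follows a }, as in the files described in the question; a JSON-aware tool would be more robust if that layout varies.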