Read, edit, and write a text file line-wise using Ruby
Solution 1:
In general, there's no way to make arbitrary edits in the middle of a file. It's not a deficiency of Ruby. It's a limitation of the file system: Most file systems make it easy and efficient to grow or shrink the file at the end, but not at the beginning or in the middle. So you won't be able to rewrite a line in place unless its size stays the same.
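To see the limitation concretely, here is a minimal sketch that writes into the middle of a small scratch file. The file name prefix and its contents are made up for illustration. The write replaces existing bytes; nothing is shifted to make room:

```ruby
require 'tempfile'

# Scratch file with made-up contents, purely for illustration.
demo = Tempfile.new('overwrite_demo')
demo.binmode
demo.write("abc\ndef\n")
demo.close

File.open(demo.path, 'r+b') do |f|
  f.seek(1)       # position just after the "a"
  f.write('XYZ')  # overwrites "bc\n"; nothing is inserted
end

overwrite_result = File.binread(demo.path)
p overwrite_result
demo.unlink
```

The file ends up as "aXYZdef\n": the three bytes after the "a", including the first newline, were replaced, so the two lines have merged into one.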
There are two general models for modifying a bunch of lines. If the file is not too large, just read it all into memory, modify it, and write it back out. For example, adding "Kilroy was here" to the beginning of every line of a file:
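path = '/tmp/foo'
# Read every line into memory and prepend the text to each one.
lines = IO.readlines(path).map do |line|
  'Kilroy was here ' + line
end
# Reopen the file for writing (this truncates it) and write the lines back.
File.open(path, 'w') do |file|
  file.puts lines
end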
Although simple, this technique has a danger: If the program is interrupted while writing the file, you'll lose part or all of it. It also needs to use memory to hold the entire file. If either of these is a concern, then you may prefer the next technique.
You can, as you note, write to a temporary file. When done, rename the temporary file so that it replaces the input file:
require 'tempfile'
require 'fileutils'
path = '/tmp/foo'
temp_file = Tempfile.new('foo')
begin
  # Stream the input one line at a time into the temporary file.
  File.open(path, 'r') do |file|
    file.each_line do |line|
      temp_file.puts 'Kilroy was here ' + line
    end
  end
  temp_file.close
  # Replace the input file with the finished temporary file.
  FileUtils.mv(temp_file.path, path)
ensure
  temp_file.close
  temp_file.unlink
end
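Since the rename (FileUtils.mv) is atomic as long as the temporary file and the input file are on the same file system, the rewritten input file will pop into existence all at once. If the program is interrupted, either the file will have been rewritten, or it will not. There's no possibility of it being partially rewritten. (If the two paths are on different file systems, FileUtils.mv falls back to copying the file and deleting the source, and that guarantee no longer holds.)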
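A rename can only be atomic when source and target live on the same file system. One way to arrange that is to create the temporary file in the target's own directory instead of the system temp directory, which is where Tempfile.new puts it by default. The following is a minimal sketch under that assumption; the helper name prefix_lines is hypothetical, and the program needs write access to the target's directory:

```ruby
require 'tempfile'

# Hypothetical helper: prepend +prefix+ to every line of the file at +path+.
# The temp file is created in the target's own directory so that the final
# File.rename stays on one file system.
def prefix_lines(path, prefix)
  temp_file = Tempfile.new(File.basename(path), File.dirname(path))
  begin
    File.foreach(path) { |line| temp_file.write(prefix + line) }
    temp_file.close
    File.rename(temp_file.path, path)
  ensure
    temp_file.close
    temp_file.unlink # harmless if the rename already moved the file
  end
end
```

Calling prefix_lines('/tmp/foo', 'Kilroy was here ') would then do the same job as the code above.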
The ensure clause is not strictly necessary: The file will be deleted when the Tempfile instance is garbage collected. However, that could take a while. The ensure block makes sure that the tempfile gets cleaned up right away, without having to wait for it to be garbage collected.
Solution 2:
If you want to overwrite a file line by line, you'll have to ensure the new line has the same length as the original line. If the new line is longer, part of it will be written over the next line. If the new line is shorter, the remainder of the old line just stays where it is. The tempfile solution is really much safer. But if you're willing to take a risk:
File.open('test.txt', 'r+') do |f|
  old_pos = 0
  f.each do |line|
    f.pos = old_pos # this is the 'rewind'
    f.print line.gsub('2010', '2011')
    old_pos = f.pos
  end
end
If the line size does change, this is a possibility:
File.open('test.txt', 'r+') do |f|
  out = ""
  f.each do |line|
    out << line.gsub(/myregex/, 'blah')
  end
  f.pos = 0
  f.print out
  f.truncate(f.pos)
end
Solution 3:
Just in case you are using Rails or Facets, or you otherwise depend on Rails' ActiveSupport, you can use the atomic_write extension to File:
require 'active_support/core_ext/file/atomic' # not needed inside a Rails app
File.atomic_write('path/file') do |file|
  file.write('your content')
end
Behind the scenes, this will create a temporary file which it will later move to the desired path, taking care of closing the file for you.
It further clones the file permissions of the existing file or, if there isn't one, of the current directory.
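For readers without ActiveSupport, the behaviour described above can be approximated in plain Ruby. This is only a rough sketch, not ActiveSupport's actual implementation: the name atomic_write_sketch is hypothetical, it copies permissions only from an existing file (not from the directory), and it assumes the target's directory is writable:

```ruby
require 'tempfile'

# Hypothetical stand-in, NOT ActiveSupport's implementation: yields a temp
# file, then renames it over +path+, copying the old file's permission bits.
def atomic_write_sketch(path)
  temp_file = Tempfile.new(File.basename(path), File.dirname(path))
  begin
    yield temp_file
    temp_file.close
    # Tempfile creates files with mode 0600, so restore the original mode.
    File.chmod(File.stat(path).mode & 0o7777, temp_file.path) if File.exist?(path)
    File.rename(temp_file.path, path)
  ensure
    temp_file.close
    temp_file.unlink
  end
end

# Line-wise use: stream the old file into the new one.
# atomic_write_sketch('path/file') do |out|
#   File.foreach('path/file') { |line| out.write(line.upcase) }
# end
```

Because the old file stays untouched until the rename, the block can keep reading from it while writing the replacement.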
Solution 4:
You can write in the middle of a file, but you have to be careful to keep the string you write the same length as the one it overwrites; otherwise you will overwrite some of the text that follows. Here is an example using IO#seek. IO::SEEK_CUR makes the offset relative to the current position of the file pointer, which is at the end of the line that was just read. The +1 is for the CR character at the end of the line, which is not counted in line.length when Ruby reads a file with CRLF line endings in text mode on Windows.
look_for = "bbb"
replace_with = "xxxxx"
File.open(DATA, 'r+') do |file|
  file.each_line do |line|
    if (line[look_for])
      file.seek(-(line.length + 1), IO::SEEK_CUR)
      file.write line.gsub(look_for, replace_with)
    end
  end
end
__END__
aaabbb
bbbcccddd
dddeee
eee
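After it has run, the data at the end of the script looks like this, which is probably not what you had in mind. The replacement is two characters longer than the search string, so it overwrote the start of the next line: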
aaaxxxxx
bcccddd
dddeee
eee
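One way to guard against this kind of clobbering is to refuse any replacement that changes the line's byte length, and to remember each line's starting offset instead of computing it from line.length. The following is a minimal sketch; the helper name replace_in_place is hypothetical, and it assumes the search and replacement strings are plain ASCII. The file is opened in binary mode so that no newline translation can throw the offsets off:

```ruby
# Hypothetical helper: replace +look_for+ with +replace_with+ in place, but
# only when that leaves the line's byte length unchanged.
def replace_in_place(path, look_for, replace_with)
  File.open(path, 'r+b') do |file|   # binary mode: no newline translation
    line_start = file.pos
    while (line = file.gets)
      if line.include?(look_for)
        new_line = line.gsub(look_for, replace_with)
        if new_line.bytesize != line.bytesize
          raise ArgumentError, 'replacement would change the line length'
        end
        file.seek(line_start)        # back to the first byte of this line
        file.write(new_line)
      end
      line_start = file.pos          # the next line starts here
    end
  end
end
```

With the sample data above, replace_in_place(path, 'bbb', 'xxx') rewrites both matching lines, while replace_in_place(path, 'bbb', 'xxxxx') raises before anything is written.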
With that caveat in mind, this technique is much faster than the classic 'read and write to a new file' method. See these benchmarks on a 1.7 GB file of music data. For the classic approach I used Wayne's temporary-file technique. The benchmark uses the .bmbm method so that caching of the file doesn't play a big role. Tests were run with MRI Ruby 2.3.0 on Windows 7. I checked that the strings were actually replaced by both methods.
require 'benchmark'
require 'tempfile'
require 'fileutils'
look_for = "Melissa Etheridge"
replace_with = "Malissa Etheridge"
very_big_file = 'D:\Documents\muziekinfo\all.txt'.gsub('\\','/')
# In-place replacement: seek back to the start of the line and overwrite it.
def replace_with file_path, look_for, replace_with
  File.open(file_path, 'r+') do |file|
    file.each_line do |line|
      if (line[look_for])
        file.seek(-(line.length + 1), IO::SEEK_CUR)
        file.write line.gsub(look_for, replace_with)
      end
    end
  end
end
# Classic approach: write every line to a temporary file, then move it over the original.
def replace_with_classic path, look_for, replace_with
  temp_file = Tempfile.new('foo')
  File.foreach(path) do |line|
    if (line[look_for])
      temp_file.write line.gsub(look_for, replace_with)
    else
      temp_file.write line
    end
  end
  temp_file.close
  FileUtils.mv(temp_file.path, path)
ensure
  temp_file.close
  temp_file.unlink
end
Benchmark.bmbm do |x|
  x.report("adapt ") { 1.times {replace_with very_big_file, look_for, replace_with}}
  x.report("restore ") { 1.times {replace_with very_big_file, replace_with, look_for}}
  x.report("classic adapt ") { 1.times {replace_with_classic very_big_file, look_for, replace_with}}
  x.report("classic restore") { 1.times {replace_with_classic very_big_file, replace_with, look_for}}
end
Which gave
Rehearsal ---------------------------------------------------
adapt 6.989000 0.811000 7.800000 ( 7.800598)
restore 7.192000 0.562000 7.754000 ( 7.774481)
classic adapt 14.320000 9.438000 23.758000 ( 32.507433)
classic restore 14.259000 9.469000 23.728000 ( 34.128093)
----------------------------------------- total: 63.040000sec
user system total real
adapt 7.114000 0.718000 7.832000 ( 8.639864)
restore 6.942000 0.858000 7.800000 ( 8.117839)
classic adapt 14.430000 9.485000 23.915000 ( 32.195298)
classic restore 14.695000 9.360000 24.055000 ( 33.709054)
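So the in-place replacement was about 4 times faster in wall-clock time (roughly 8 seconds versus 32 to 34 seconds). Note that the search and replacement strings in this benchmark have the same length, which is what makes the in-place approach safe here.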