Fixing a semi-redundant text file, best practice opinions wanted


@tobi.fp


almost 8 years ago
First a little background: Link to attachments

I have a lot of text generated by a voice-to-text application (I do not know its name, since I do not have physical access to it, but I do have access to the live output). I am mining this data in real time. The output looks like the first attachment: some parts are very clean, and some are very redundant.

I have now written a Python program that cleans up the text (attachment two). The catch is that it only works on a lot of text at a time, e.g. my backups, which hold hundreds of megabytes of pure text. When the text arrives in real time, it is hard to process just a few lines at once, since the semi-redundancy spans 15-25 lines (as you can see in attachment one).

The software works on the bigger files, and I am now trying to rewrite the code so it works with the live output.

Since I am a self-taught programmer, I was wondering if anybody could share how they would approach the job.

My approach is as follows (also see attachment two, although I am bad at commenting, so I do not know how much you would get out of it):
1. Open the file (plain text) and wait until 25 lines have been written to it
2. Read 25 lines into a list, let's call it MasterList
3. Run the clean-up functions on MasterList (see the note below)
4. Print lines 10-14 to the cleaned-up file (the first time, print lines 0-14)
5. Shift lines 5-24 of MasterList to the beginning of MasterList, so that they now have indices 0-19
6. Read 5 new lines into MasterList, or wait until 5 new lines are ready
7. Go back to #3
--> Note regarding #3: The cleanup functions do the following:
* Compare lines using fuzzy string matching (fuzzywuzzy) and delete duplicate or semi-duplicate lines
* Check whether the first word of a sentence is the same as the last word of the previous sentence, and if so delete the last word of the previous sentence
* Smaller fixes to make the text look clean
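
To make the idea concrete, here is a minimal sketch of the two main cleanup steps applied line by line instead of in batches of five. It is not the code from attachment two. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for fuzzywuzzy, and the names `similarity`, `clean_stream`, `trim_overlap`, the threshold of 85 and the rule "keep the longer of two near-duplicates" are all assumptions made for illustration.

```python
from collections import deque
from difflib import SequenceMatcher


def similarity(a, b):
    """Score two strings from 0 to 100 (stand-in for a fuzzywuzzy ratio)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def clean_stream(lines, window=25, threshold=85):
    """Yield lines with near-duplicates inside a sliding window collapsed.

    Assumption: when two lines are near-duplicates, the longer one is the
    more complete transcription, so it replaces the shorter one.
    """
    pending = deque()  # lines that may still be superseded by a later one
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        for i, old in enumerate(pending):
            if similarity(line, old) >= threshold:
                if len(line) > len(old):
                    pending[i] = line  # keep the longer variant
                break
        else:
            pending.append(line)  # no near-duplicate found in the window
        if len(pending) > window:
            yield pending.popleft()  # oldest line can no longer change
    yield from pending  # flush what is left when the input ends


def trim_overlap(lines):
    """Drop the last word of a line when the next line starts with it."""
    prev = None
    for line in lines:
        if prev is not None:
            prev_words, words = prev.split(), line.split()
            if prev_words and words and prev_words[-1].lower() == words[0].lower():
                prev = " ".join(prev_words[:-1])
            if prev:
                yield prev
        prev = line
    if prev:
        yield prev
```

Both functions are generators, so they can be chained over any iterable of lines, for example `for line in trim_overlap(clean_stream(open_file)): print(line)`. The trade-off is latency: a line is only emitted once `window` newer distinct lines have arrived, so a smaller window gives faster output but catches less of the redundancy.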

My questions are: Would you go about it in a completely different way? Maybe machine learning? Would another language be better suited? Are there any libraries, or even existing software, that already do this?

If you do read my code, I am also eager to learn from my mistakes. If you see something stupid I am doing, criticism (even harsh criticism, if you feel like bashing me) is very welcome.

Thank you ever so much for your time.
Tags
python string matching fuzzywuzzy