I am working in a code for doing some statistical analysis in kmers. It is kind of over-underrepresented kmers in dna sequences. The statistical implementation takes in consideration all sub-words that are component inside the kmer/n-grams/n-tuples in analysis, like dimers (two letters words), trimers (three letters words), tetramers… I will called simply words1 and words2
I need help to implement a function or functions to help me out to:
1- receive as input two dna sequences (string1 and string2) and one alphabet (ACGT).
2- Compare the characters present in string2 that are present in string1 and get the index of each character present.
I have this function for that:
def get_shared_idxs(word1, word2): """ > kmer = "ACGCAT" > get_shared_idxs(kmer, "AA") [0, 4] """ shared = [] for i, c in enumerate(word1): if c in word2: shared.append(i) return shared
2- Get a combinations of sub-words, for example:
list_sub_words = ['AC','AG','AC','AA','AT','CG','CC','CA','CT', 'GC','GA','GT','CA','CT','AT']
. I have this function to get all the subwords(words2) I need
def get_kmer_combo(kmer, alphabet, repeats=1): """ > alphabet = "ACGT" > kmer = "ACGCAT" >get_kmer_combo(kmer, alphabet, repeats=1) ['AC', 'AG', 'AC', 'AA', 'AT', 'CG', 'CC', 'CA', 'CT', 'GC', 'GA', 'GT', 'CA', 'CT', 'AT'] """ return ["".join(c) for c in combinations(kmer, r=repeats)]
3- Then I compare both words (words1 and words2) and find in words2 the gap positions represented by index missing in words2 that are not in word1. For example, word2 = “AA”, indices that are in word1 [0,4], so to AA x ACGCAT -> A CGC A It miss indices [1,2,3]. I have this function:
def get_missing_idxs(word1, present_idxs): kmer_idxs = [i for i in range(len(word1))] return set(kmer_idxs).difference(set(present_idxs))
# for "AA"
missing_index = [i for i in get_missing_idxs(kmer, [0,4]) if i < max([0,4])]
[1, 2, 3]
3- Then with the missing indexes I can generate generate combinations with the alphabet letters (according to the length of the missing positions in the case of “AA” = [1, 2, 3] ) and use it to fulfill the gaps or the missing positions in string2 (AA to get
AnnnA = [AAAAA, AAACA, AAAGA, AAATA,AACAA...]
until return the 4^3 combinations.
4- If for example the indices are AC = [0,1] I check for this consecutive indices
def is_consecutive(indices): """ > is_consecutive([0,2,3]), is_consecutive([2,3,4]) (False, True) """ return sorted(indices) == list(range(min(indices), max(indices) + 1))
If the indices are consecutives I return the word2 as it is or I do not need fill it.
I could hard code some functions, but I want to generalize the code.Because I need all duplets, triplets, quadruplets and pentamers, from the word1 as show in the examples below.
Take a look in some examples:
# number of combinations is the number of sub-words needed
**string1** = ACGCAT
**indices** = [0,1,2,3,4,5]
**alphabet** = "ACGT"
**string2** **indices** `number of combinations`
AC [0,1] 1
AG [0,2] 4 # AAG ACG AGG ATG
AC [0,3] 16 # AAAC AACC AAGC AATC ...
AA [0,4] 64
AT [0,5] 256
CA [1,4] 16
CT [1,5] 64
CG [1,2] 1
CC [1,3] 4
GC [2,3] 1
GA [2,4] 4
GT [2,5] 16
CA [1,4] 16
CT [1,5] 64
AC [0,3] 16
ACG [0,1,2] 1
ACC [0,1,3] 4
ACA [0,1,4] 16
ACT [0,1,5] 64
AGC [0,2,3] 4
AGA [0,2,4] 16
AGT [0,2,5] 64
ACA [0,1,4] 16
ACT [0,1,5] 64
AAT [0,4,5] 64
CGC [1,2,3] 1
CGA [1,2,4] 4
CGT [1,2,5] 16
CnA [1,3,4] 4
CCT [1,3,5] 16
CAT [1,4,5] 16
GCA [2,3,4] 1
GCT [2,3,5] 4
GAT [2,4,5] 4
CAT [3,4,5] 1
ACGC [0,1,2,3] 1
ACGA [0,1,2,4] 4
ACGT [0,1,2,5] 16
ACCA [0,1,3,4] 4
ACCT [0,1,3,5] 16
ACAT [0,1,4,5] 16
AGCA [0,2,3,4] 4
AGCT [0,2,3,5] 16
AGAT [0,2,4,5] 16
ACAT [0,1,4,5] 16
CGCA [1,2,3,4] 1
CGCT [1,2,3,5] 4
CGAT [1,2,4,5] 4
CCAT [1,3,4,5] 4
GCAT [2,3,4,5] 1
ACGCA [0,1,2,3,4] 1
ACGCT [0,1,2,3,5] 4
ACGAT [0,1,2,4,5] 4
ACCAT [0,1,3,4,5] 4
AGCAT [0,2,3,4,5] 4
CGCAT [1,2,3,4,5] 1
If you guys have some tips and want to share it, all help will be very well accepted.
Thank you for your time and help.
Observation:
The “ns” were inserted in the string2 are just illustrative and they are not in the real sequences. They are the missing positions or gaps I need to fulfill with the letters of the alphabet.
string2 indices number of combinations(4^n)
AC [0,1] 1 # consecutive indices return AC
AnG [0,2] 4 # AAG,ACG,AGG,ATG
AnnC [0,3] 16 #AAAC AACC AAGC AATC ACAC ACCC...
AnnnA [0,4] 64 # AAAAA AAACA AAAGA AAATA ...
AnnnnT [0,5] 256 # AAAAAT AAAACT AAAAGT AAAATT...
CnnA [1,4] 16
CnnnT [1,5] 64
CG [1,2] 1
CnC [1,3] 4
GC [2,3] 1
GnA [2,4] 4
GnnT [2,5] 16
CnnA [1,4] 16
CnnnT [1,5] 64
AnnC [0,3] 16
ACG [0,1,2] 1
ACnC [0,1,3] 4
ACnnA [0,1,4] 16
ACnnnT [0,1,5] 64
AnGC [0,2,3] 4
AnGnA [0,2,4] 16
AnGnnT [0,2,5] 64
ACnnA [0,1,4] 16
ACnnnT [0,1,5] 64
AnnnAT [0,4,5] 64
CGC [1,2,3] 1
CGnA [1,2,4] 4
CGnnT [1,2,5] 16
CnCA [1,3,4] 4
CnCnT [1,3,5] 16
CnnAT [1,4,5] 16
GCA [2,3,4] 1
GCnT [2,3,5] 4
GnAT [2,4,5] 4
CAT [3,4,5] 1
ACGC [0,1,2,3] 1
ACGnA [0,1,2,4] 4
ACGnnT [0,1,2,5] 16
ACnCA [0,1,3,4] 4
ACnCnT [0,1,3,5] 16
ACnnAT [0,1,4,5] 16
AnGCA [0,2,3,4] 4
AnGCnT [0,2,3,5] 16
AnGnAT [0,2,4,5] 16
ACnnAT [0,1,4,5] 16
CGCA [1,2,3,4] 1
CGCnT [1,2,3,5] 4
CGnAT [1,2,4,5] 4
CnCAT [1,3,4,5] 4
GCAT [2,3,4,5] 1
ACGCA [0,1,2,3,4] 1
ACGCnT [0,1,2,3,5] 4
ACGnAT [0,1,2,4,5] 4
ACnCAT [0,1,3,4,5] 4
AnGCAT [0,2,3,4,5] 4
CGCAT [1,2,3,4,5] 1